The SAD Instrument
Status: Instrument validation is proven by gates. The theoretical framing is motivated but not yet empirically grounded.
SAD captures post-RoPE Q/K/V tensors from inside the model's native attention forward, then recomputes both softmax and linear attention in fp32. The cosine distance between per-head outputs produces a scalar trajectory over generation steps --- one time series per (layer, head) pair, which we treat as a delay-coordinate embedding.
SAD is not a truth detector. It is a dynamical systems probe that reconstructs per-head attractor structure. What you ask about that structure is a separate question.
End-to-end pipeline
1. Model loading
The model is loaded with attn_implementation="eager" (a hard requirement) in its native dtype (fp16 for gate verification). The KV cache is disabled (use_cache=False). The model revision is pinned in gate fixtures for reproducibility. Only Mistral-7B-Instruct-v0.2 is currently supported; other families earn registry entries after their gates pass.
2. Registry lookup and adapter installation
get_family_config(model.config) reads model.config.architectures[0] and looks up the corresponding ModelFamilyConfig in the registry. For Mistral, this returns a Tier A config with adapter_factory=MistralAdapter. The InstrumentManager installs the adapter on every attention layer, replacing each module's forward method with a verbatim upstream copy containing capture callbacks.
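The lookup described above can be sketched as follows. This is a minimal illustration, not the project's code: the names get_family_config, ModelFamilyConfig, adapter_factory, and MistralAdapter come from the text, while the tier field, the REGISTRY dict, and the fail-closed KeyError for unregistered architectures are assumptions made for the example.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Callable


class MistralAdapter:
    """Placeholder for the real adapter (hypothetical empty stand-in)."""


@dataclass(frozen=True)
class ModelFamilyConfig:
    tier: str                  # assumed field name for the "Tier A" label
    adapter_factory: Callable  # builds the per-layer adapter


# Assumed registry layout: architecture string -> family config.
REGISTRY = {
    "MistralForCausalLM": ModelFamilyConfig(tier="A", adapter_factory=MistralAdapter),
}


def get_family_config(config):
    """Look up the family config from config.architectures[0]."""
    arch = config.architectures[0]
    if arch not in REGISTRY:
        # Assumption: unregistered families are rejected rather than guessed.
        raise KeyError(f"no registry entry for architecture {arch!r}")
    return REGISTRY[arch]


# Usage with a stand-in for model.config:
mistral_cfg = get_family_config(SimpleNamespace(architectures=["MistralForCausalLM"]))
```

Keying the registry on the architecture string keeps the lookup independent of the checkpoint name, so pinned revisions of the same family resolve to the same entry.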
3. Per-step capture during generation
During model.generate(), each forward pass fires the patched forward for every attention layer. At each layer, insertion point 1 calls the capture callback with post-RoPE query_states, key_states, value_states.
Step accounting is handled by a LogitsProcessor injected into the generation loop via LogitsProcessorList. The processor increments the manager's step_idx after each forward pass completes across all layers. Each generation step produces exactly num_layers records.
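The step accounting above can be illustrated without loading a model. The classes below are hypothetical stand-ins: StepCounter mirrors the `__call__(input_ids, scores)` signature of transformers.LogitsProcessor but does not subclass it, so the sketch runs without the library, and simulate_generation only imitates the order of events inside model.generate().

```python
class InstrumentManager:
    """Hypothetical minimal stand-in: holds step_idx and the record list."""

    def __init__(self):
        self.step_idx = 0
        self.records = []

    def capture(self, layer_idx, per_head_delta):
        # Called once per attention layer during a forward pass.
        self.records.append((self.step_idx, layer_idx, per_head_delta))


class StepCounter:
    """Duck-typed logits processor that only advances the step counter."""

    def __init__(self, manager):
        self.manager = manager

    def __call__(self, input_ids, scores):
        # Logits processors run after the forward pass, so every layer has
        # already recorded under the current step_idx by this point.
        self.manager.step_idx += 1
        return scores  # logits pass through unchanged


def simulate_generation(num_layers, num_steps):
    """Imitate generate(): one forward pass over all layers, then processors."""
    manager = InstrumentManager()
    counter = StepCounter(manager)
    for _ in range(num_steps):
        for layer_idx in range(num_layers):
            manager.capture(layer_idx, per_head_delta=[0.0])
        counter(input_ids=None, scores=None)
    return manager


manager = simulate_generation(num_layers=4, num_steps=3)
```

In a real run the processor would be passed to model.generate() inside a LogitsProcessorList, as the text describes.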
4. SAD delta computation
Inside the capture callback, for each layer at each step:
- Q/K/V are cloned, detached, and upcast to fp32
- GQA expansion: if num_kv_heads != num_q_heads, K and V are expanded via repeat_interleave (Mistral-7B: 8 KV heads expanded to 32)
- Newest-token query slice: q_last = q_fp32[:, :, -1:, :]
- Softmax path: scaled dot-product attention in fp32. \( \text{scores} = q \cdot K^T / \sqrt{d_k} \), softmax, matmul with V
- Linear path: ELU+1 feature map on Q and K. Accumulated \( S = K^T V \) via einsum. Normalized by \( z = \sum K_{\text{mapped}} \)
- Per-head cosine distance: \( 1 - \cos(\text{softmax}_h, \text{linear}_h) \) for each head
The result is a StepRecord(step_idx, layer_idx, per_head_delta) appended to the record list.
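The steps above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the instrument's code: NumPy stands in for the tensor library (np.repeat along the head axis plays the role of repeat_interleave), the function name sad_delta and the eps guard are invented for the example, and the linear-path denominator \( \phi(q) \cdot z \) is the standard linear-attention normalization, which the text implies but does not spell out.

```python
import math

import numpy as np


def elu_plus_one(x):
    # ELU(x) + 1: x + 1 for x > 0, exp(x) otherwise (strictly positive).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def sad_delta(q, k, v, eps=1e-8):
    """q: (B, Hq, T, D); k, v: (B, Hkv, T, D). Returns (B, Hq) distances."""
    # Upcast to fp32 before any arithmetic.
    q, k, v = (np.asarray(t, dtype=np.float32) for t in (q, k, v))

    # GQA expansion: repeat each KV head so K and V match the query heads.
    n_q, n_kv = q.shape[1], k.shape[1]
    if n_kv != n_q:
        k = np.repeat(k, n_q // n_kv, axis=1)
        v = np.repeat(v, n_q // n_kv, axis=1)

    q_last = q[:, :, -1:, :]  # newest-token query, shape (B, H, 1, D)

    # Softmax path: scaled dot-product attention for the newest token.
    scores = q_last @ np.swapaxes(k, -1, -2) / math.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out_soft = (weights @ v)[:, :, 0, :]  # (B, H, D)

    # Linear path: ELU+1 feature map, S = K^T V, normalizer z = sum of mapped K.
    q_m = elu_plus_one(q_last)[:, :, 0, :]          # (B, H, D)
    k_m = elu_plus_one(k)                           # (B, H, T, D)
    s = np.einsum("bhtd,bhte->bhde", k_m, v)        # (B, H, D, D)
    z = k_m.sum(axis=2)                             # (B, H, D)
    num = np.einsum("bhd,bhde->bhe", q_m, s)
    den = np.einsum("bhd,bhd->bh", q_m, z)[..., None]
    out_lin = num / (den + eps)

    # Per-head cosine distance between the two outputs.
    dot = (out_soft * out_lin).sum(axis=-1)
    norms = np.linalg.norm(out_soft, axis=-1) * np.linalg.norm(out_lin, axis=-1)
    return 1.0 - dot / (norms + eps)
```

A useful sanity check on any implementation of this step: with a single token in context, both paths reduce to returning V for that token, so the distance should be approximately zero for every head.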
5. Serialization
After generation, records are packed into a RawSampleRecord with provenance metadata and written to gzipped JSONL. Raw records are immutable: they are never modified after writing.
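A minimal sketch of this serialization step, using only the standard library. The StepRecord fields follow the text; the RawSampleRecord fields (sample_id, provenance, steps), the one-sample-per-line layout, and the helper names write_raw and read_raw are assumptions for illustration. Opening the file in exclusive-create mode is one possible way to enforce write-once behaviour, not necessarily the one the project uses.

```python
import gzip
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StepRecord:
    step_idx: int
    layer_idx: int
    per_head_delta: tuple  # one float per head


@dataclass(frozen=True)
class RawSampleRecord:
    sample_id: str    # hypothetical field
    provenance: dict  # hypothetical field, e.g. pinned revision and dtype
    steps: tuple      # tuple of StepRecord


def write_raw(path, samples):
    """Write one JSON object per line; refuse to overwrite an existing file."""
    # Mode "xt" raises FileExistsError if the path already exists.
    with gzip.open(path, "xt", encoding="utf-8") as fh:
        for sample in samples:
            fh.write(json.dumps(asdict(sample)) + "\n")


def read_raw(path):
    """Read the gzipped JSONL back as a list of plain dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Frozen dataclasses guard against in-memory mutation, and the exclusive-create mode guards the file on disk; downstream analysis then reads the dicts without touching the raw file.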
6. Downstream signal processing
Analysis operates on serialized records, never during inference:
- Aggregation: uniform mean across layers and heads per step, producing a per-token delta series. Raises on non-contiguous step_idx (fail-closed).
- Finite differences: first, second, third differences of the delta series.
- Permutation entropy: per-(layer, head) PE on first-differenced SAD trajectories. Bandt-Pompe ordinal patterns (D=3, tau=1) with tie exclusion. Eligibility minimum: 2*D! points.
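The three analysis steps above can be sketched in plain Python. The function names are hypothetical, and several details are assumptions the text leaves open: records are taken as (step_idx, layer_idx, per_head_delta) tuples, the 2*D! eligibility minimum is applied to the length of the series passed in, ineligible series return None, and the entropy is normalized by log(D!) so that it falls in [0, 1].

```python
import math
from collections import Counter, defaultdict


def aggregate(records):
    """Uniform mean over all layers and heads per step; fail closed on gaps."""
    by_step = defaultdict(list)
    for step_idx, _layer_idx, per_head_delta in records:
        by_step[step_idx].extend(per_head_delta)
    steps = sorted(by_step)
    if not steps or steps != list(range(steps[0], steps[0] + len(steps))):
        raise ValueError("empty or non-contiguous step_idx")
    return [sum(by_step[s]) / len(by_step[s]) for s in steps]


def diff(series, order=1):
    """n-th finite difference; each order shortens the series by one."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def permutation_entropy(x, D=3, tau=1, normalize=True):
    """Bandt-Pompe permutation entropy with tied windows excluded."""
    if len(x) < 2 * math.factorial(D):
        return None  # below the eligibility minimum (assumed behaviour)
    counts = Counter()
    for i in range(len(x) - (D - 1) * tau):
        window = [x[i + j * tau] for j in range(D)]
        if len(set(window)) < D:
            continue  # tie exclusion: skip windows with equal values
        pattern = tuple(sorted(range(D), key=window.__getitem__))
        counts[pattern] += 1
    total = sum(counts.values())
    if total == 0:
        return None  # every window contained a tie
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(math.factorial(D)) if normalize else h
```

For a per-(layer, head) trajectory, the call would be permutation_entropy(diff(trajectory, 1)). With D=3 there are 6 possible ordinal patterns, so the eligibility minimum works out to 12 points.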
Scope limitations
- Cache-off only. use_cache=False is a method definition, not a performance choice. Generalization to cache-on inference is unverified.
- Mistral only. Other families earn their entries after passing Gates 0 and 1.
- Single sequence. The instrument pipeline assumes B=1.
- Eager attention only. SDPA and Flash Attention are incompatible with the forward-replacement adapter.
- fp16 for gates, q8 minimum for production. Quantization below q8 introduces dequantization artifacts as a confound.