Gate 2 --- Stability
Status: Proven by gates. Passes on Mistral-7B-Instruct-v0.2.
50 consecutive generations with full instrumentation and JSONL serialization. Zero VRAM creep (0.0 MiB spread, limit 16 MiB). CPU RSS growth 0.7 MiB (limit 128 MiB). Provenance round-trip validated.
See Gate Discipline for the stability methodology and I/O modules for the JSONL serialization pipeline.
Protocol
2 warmup generations followed by 50 measured generations on a fixed long-form probe prompt, with full instrumentation and JSONL serialization active. Each generation: 100 tokens, greedy decode, cache off.
Warmup discipline
The first 2 generations exercise the full pipeline but are excluded from gate metrics and JSONL output. This prevents CUDA lazy initialization and JIT compilation from inflating the baseline.
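The warmup-then-measure structure can be sketched as follows. This is a minimal illustration, not the project's harness: `run_stability_loop` and the `generate_once` callback are hypothetical names, and the callback stands in for one fully instrumented generation (100 tokens, greedy decode, cache off).

```python
from typing import Callable, List

N_WARMUP = 2     # warmup generations, per the protocol above
N_MEASURED = 50  # measured generations, per the protocol above


def run_stability_loop(
    generate_once: Callable[[int], dict],  # hypothetical: runs one generation
    n_warmup: int = N_WARMUP,
    n_measured: int = N_MEASURED,
) -> List[dict]:
    """Run warmup generations, discard them, and return only measured samples."""
    for i in range(n_warmup):
        # The full pipeline runs, but the result is dropped on purpose:
        # it feeds neither the gate metrics nor the JSONL output.
        generate_once(i)

    samples = []
    for i in range(n_measured):
        samples.append(generate_once(n_warmup + i))
    return samples
```

Keeping the warmup inside the same function that produces the measured samples makes it hard to serialize a warmup record by accident.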
Baseline and thresholds
VRAM baseline: max(vram[:3]) --- the maximum post-cleanup allocation across the first 3 measured samples. Using the max rather than the mean guards against a low first measurement pulling the baseline down. If early samples are unstable, increase warmup --- never the threshold.
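A small sketch of the baseline rule, assuming `vram` is a list of post-cleanup allocations in MiB, one per measured sample. The function name `vram_baseline` and the sample values below are illustrative, not taken from the test suite.

```python
from typing import Sequence


def vram_baseline(vram: Sequence[float], window: int = 3) -> float:
    """Return max(vram[:window]), the baseline described above."""
    if len(vram) < window:
        raise ValueError(f"need at least {window} samples, got {len(vram)}")
    return max(vram[:window])


# Hypothetical readings: the first measured sample comes in 4 MiB low.
readings = [13820.0, 13824.0, 13824.0, 13824.0]
baseline = vram_baseline(readings)
```

With these made-up readings the max-based baseline is 13824.0 MiB, so the fourth sample shows no growth. A mean-based baseline would sit at about 13822.7 MiB and report a small phantom increase for the same data.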
Gate 2 uses fixed thresholds (engineering limits, not calibrated from observed behavior):
| Metric | Threshold | Observed |
|---|---|---|
| VRAM spread (max - min) | <= 16 MiB | 0.0 MiB |
| CPU RSS growth (rss[-1] - rss[0]) | <= 128 MiB | 0.7 MiB |
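The two threshold checks in the table can be expressed as below. This is a sketch under assumptions: `check_gate2` is a hypothetical helper, and the inputs are assumed to be per-sample byte counts collected after cleanup.

```python
from typing import Dict, Sequence

MIB = 1024 * 1024
VRAM_SPREAD_LIMIT_MIB = 16.0   # fixed engineering limit from the table
RSS_GROWTH_LIMIT_MIB = 128.0   # fixed engineering limit from the table


def check_gate2(vram_bytes: Sequence[int], rss_bytes: Sequence[int]) -> Dict[str, object]:
    """Compute VRAM spread and CPU RSS growth and compare them to the limits."""
    vram_spread_mib = (max(vram_bytes) - min(vram_bytes)) / MIB
    rss_growth_mib = (rss_bytes[-1] - rss_bytes[0]) / MIB
    return {
        "vram_spread_mib": vram_spread_mib,
        "rss_growth_mib": rss_growth_mib,
        "passed": (
            vram_spread_mib <= VRAM_SPREAD_LIMIT_MIB
            and rss_growth_mib <= RSS_GROWTH_LIMIT_MIB
        ),
    }
```

Note that the RSS check compares only the last and first samples, so it targets sustained growth over the run rather than transient spikes in the middle.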
Memory measurement
VRAM: After each generation, the test performs explicit cleanup (del output, records; then mgr.reset(), gc.collect(), torch.cuda.synchronize(), torch.cuda.empty_cache()). The gate metric is torch.cuda.memory_allocated() read after this cleanup. torch.cuda.reset_peak_memory_stats() is called before each generation so that peak statistics are per-sample.
CPU RSS: psutil.Process(os.getpid()).memory_info().rss post-cleanup. The growth check catches monotonic Python-side leaks (accumulating objects, uncollected records, buffer growth) that would not appear in VRAM.
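The post-cleanup readings could be taken roughly as follows. This is a hedged sketch: `cleanup_and_measure` is a hypothetical helper, the caller is assumed to have already dropped its own references (the `del output, records` and `mgr.reset()` steps), and the optional imports are only there so the snippet also runs on machines without `torch`, CUDA, or `psutil`.

```python
import gc
import os
from typing import Optional, Tuple

try:
    import torch
except ImportError:  # allow the sketch to run without PyTorch installed
    torch = None

try:
    import psutil
except ImportError:  # allow the sketch to run without psutil installed
    psutil = None


def cleanup_and_measure() -> Tuple[Optional[int], Optional[int]]:
    """Return (vram_bytes, rss_bytes) after cleanup; None where unavailable."""
    gc.collect()  # free Python-side garbage before reading any counters

    vram = None
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()   # wait for pending kernels to finish
        torch.cuda.empty_cache()   # release cached blocks back to the driver
        vram = torch.cuda.memory_allocated()

    rss = None
    if psutil is not None:
        rss = psutil.Process(os.getpid()).memory_info().rss

    return vram, rss
```

Reading both counters in one place, after the same cleanup sequence, keeps the VRAM and RSS series aligned sample by sample.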
Provenance checks
The second Gate 2 test reads back the written JSONL and validates every record:
- Schema fields: schema_version == 1, record_type == "raw", correct model ID, capture_tier == "A", benchmark == "gate2_stability", num_tokens_generated == 100, layers_hooked == [0..31]
- Record count: Exactly 50 (warmup excluded)
- Sample ID uniqueness: No duplicates
- Per-step/per-layer bijection: Same structural invariant as Gate 0 --- each step has exactly one entry per layer, no duplicates, no gaps
- StepRecord reconstruction: First per_step entry passed to StepRecord(**entry) to verify schema compatibility between write and read paths
- Delta range \( [0, 2] \): Every per_head_delta value must be finite and within \( [0.0, 2.0] \). Cosine distance is bounded by this range. Values outside indicate NaN, Inf, or numerical corruption
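Two of the checks above, the per-step/per-layer bijection and the delta range, can be sketched as follows. The record layout is an assumption made for illustration: each `per_step` entry is taken to be a dict with `step`, `layer`, and `per_head_delta` keys, which may not match the actual schema. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


def check_bijection(entries: List[dict], num_steps: int, layers: Sequence[int]) -> bool:
    """True if every (step, layer) pair appears exactly once, with no extras."""
    seen = Counter((e["step"], e["layer"]) for e in entries)
    expected = {(s, l) for s in range(num_steps) for l in layers}
    has_duplicates = any(count > 1 for count in seen.values())
    has_gaps = bool(expected - set(seen))
    has_extras = bool(set(seen) - expected)
    return not (has_duplicates or has_gaps or has_extras)


def check_delta_range(entries: Iterable[dict], lo: float = 0.0, hi: float = 2.0) -> bool:
    """True if every per-head delta is finite and within [lo, hi]."""
    for entry in entries:
        for delta in entry["per_head_delta"]:
            # isfinite rejects NaN and +/-Inf before the range comparison
            if not math.isfinite(delta) or not (lo <= delta <= hi):
                return False
    return True
```

The explicit `math.isfinite` test matters because a plain range comparison already fails for NaN, but checking finiteness first makes the intent clear and keeps the two failure modes easy to report separately if needed.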