weeyuga-benchmarks-public/methodology.md
slobodanmargetic988 a18db6a3da B3 staging seed — 21 runs + catalogue v1.0-draft + methodology + README
Initial population of the weeyuga-benchmarks-public archive (PRIVATE
staging visibility — flips public after Miljan + Stevan security audit
sign-off per Sloba's 17:34Z dispatch).

Contents:
- README.md       — public-facing intro (warns staging state, schema overview, citation pattern, license split)
- LICENSE         — CC-BY-4.0 default (auto-init from Gitea)
- catalogue.json  — schema_version=1.0-draft (locked once Tomas ratifies); 21 benchmarks indexed, 13 complete + 8 meta-only
- methodology.md  — mirror of WeeyugaWeb docs/BENCHMARKS/HARNESS.md (canonical methodology)
- runs/<id>/run.jsonl|run.log|run.md|metadata.json — packaged copies of every run in WeeyugaWeb docs/BENCHMARKS/runs/*

Run set covers:
- Mission 1 (2026-04-28/29): pavilion-weeyuga-v1 + reconstructed v3 (96 calls, 16 models routed via weeyuga :11435)
- Predator trio (2026-05-04): granite-4.1-8B + gemma-4-E4B-it + qwen3.5-9B
- Predator qwen rerun (2026-05-04): qwen3.5-9B think500/nothink + qwen3-14B feasibility
- A3B campaign (2026-05-04/05): pavilion-a3b + predator-a3b NGL matrix + ctx sweep + NGL+ctx 2D + NGL=6 deep dive
- VPS50 CPU matrix + gemma-e4b CPU lane (2026-05-04/05)

Visibility GATE: this repo stays private until Miljan G1-G4 audit and
Stevan G3 credential audit both green. After sign-off, single API call
flips visibility=public, anonymous read on, push-protection requires
auth, issues moderate by default.

No raw IPs, no SSH user@host strings, no /Users/ paths, no whisper
transcripts in any of these files. Hardware names (pavilion, predator,
vps50) are intentional and fine to share.

Builder: WeeyugaWeb/scripts/benchmarks/build_catalogue.py (deterministic,
idempotent, ~5s wall on 21 runs).
Publish flow: WeeyugaWeb/scripts/benchmarks/publish_bench_run.py
(builds packaged dirs, regenerates catalogue, optional --push to mirror
into this repo, optional --deploy stub for cicd rsync).

Owner: mac/benchmark-tester-ben (Ben).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 19:46:01 +02:00


BENCHMARK HARNESS — Weeyuga cluster

Owner: mac/benchmark-tester-ben (Ben). Ping me on the bus with Ben — benchmark X and I'll either point at an existing harness output or queue a measurement.

Purpose: the reproducible measurement surface for the cluster. Anyone who reads v1_BASELINE.md (or any future baseline) needs to know exactly how the numbers were produced — same git SHA, same prompts, same env, same telemetry tagging — so cross-run diffs are actually comparable.

Companion docs:

  • docs/BENCHMARKS/v1_BASELINE.md — the first matrix (per-node × per-engine × per-model × per-prompt) of cold/warm latency, tokens/sec, p50/p95.
  • docs/MONITORING_RUNBOOK.md (Luka) — query side, how to filter benchmark traffic out (metadata.test:true).
  • coordination/HARDWARE_INVENTORY.md — node spec ground truth.
  • docs/architecture/elasticsearch/templates/index-weeyuga-telemetry.json — ground envelope schema (metadata: flattened is where my tagging goes).

1. What I measure (the metric set)

Every run captures these fields. Anything missing means the run is discarded — partial numbers are not numbers.

Per-call (one prompt → one inference call)

| Field | Type | Source | Notes |
| --- | --- | --- | --- |
| first_delta_ms | int | client wall-clock between POST and first SSE chunk arriving | TTFT; most user-perceived |
| total_duration_ms | int | client wall-clock between POST and final SSE chunk | full request span |
| prompt_tokens | int | response body usage | from /v1/chat/completions usage.prompt_tokens |
| completion_tokens | int | response body usage | for llama.cpp this includes reasoning tokens; for Ollama (think:false bypass) it's answer-only |
| tokens_per_sec | float | completion_tokens / ((total_duration_ms - first_delta_ms) / 1000) | excludes prefill/queue time |
| finish_reason | enum | response body | stop / length / error; anything other than stop is a flagged run |
| backend | enum | response header X-Weeyuga-Backend if present, else inferred from URL | ollama / llamacpp / brain-routed |
| cell_id | str | harness | <node>:<engine>:<model> (e.g. mac:llamacpp:qwen3.5:0.8b) |
| prompt_id | enum | harness | P-EASY / P-MEDIUM / P-HARD (frozen; see §3) |
| run_idx | int | harness | 0..N within a (cell,prompt) cell |
| phase | enum | harness | cold (first run after model-load) / warm (subsequent) |
| error | str? | harness | populated on exception; the run is recorded but excluded from p50/p95 |
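
As a minimal sketch of the tokens_per_sec formula in the table above: the function name and the zero-window guard are assumptions for illustration, not taken from the harness source.

```python
from typing import Optional


def tokens_per_sec(completion_tokens: int, first_delta_ms: int,
                   total_duration_ms: int) -> Optional[float]:
    """Decode throughput: completion tokens over the time after the first chunk."""
    decode_ms = total_duration_ms - first_delta_ms
    if decode_ms <= 0:
        # Hypothetical guard: the whole answer arrived in one chunk,
        # so there is no decode window to divide by.
        return None
    return completion_tokens / (decode_ms / 1000)


# Values from the cold P-EASY record in the §6.1 example ledger.
print(round(tokens_per_sec(2, 2810, 2840), 1))  # 66.7
```

Because the denominator excludes first_delta_ms, prefill and queue time never inflate or deflate the throughput figure.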

Per-batch (one cell = one cold + N warm)

| Field | Source |
| --- | --- |
| cold.first_delta_ms | the single cold run |
| cold.total_duration_ms | the single cold run |
| warm.first_delta_ms.p50 / .p95 | percentile across N warm runs |
| warm.total_duration_ms.p50 / .p95 | percentile across N warm runs |
| warm.tokens_per_sec.mean | mean across N warm runs |
| warm.success_rate | (N - error_count) / N |
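
The per-batch fields could be derived from the warm per-call records roughly as follows. This is a sketch under assumptions: the document does not say which percentile method the aggregator uses, so linear interpolation between closest ranks is assumed here, and the function names are hypothetical.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Percentile by linear interpolation between closest ranks (assumed method)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples")
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def summarize_warm(calls):
    """Per-batch warm summary for a non-empty list of warm call records.

    Errored calls lower success_rate but are excluded from the percentiles.
    """
    ok = [c for c in calls if c.get("error") is None]
    summary = {"warm.success_rate": len(ok) / len(calls)}
    for field in ("first_delta_ms", "total_duration_ms"):
        samples = [c[field] for c in ok]
        summary[f"warm.{field}.p50"] = percentile(samples, 50)
        summary[f"warm.{field}.p95"] = percentile(samples, 95)
    summary["warm.tokens_per_sec.mean"] = mean(c["tokens_per_sec"] for c in ok)
    return summary
```

With this interpolation and N=5, p95 lands between the two slowest runs, so a single slow call moves it substantially.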

Per-run (one harness invocation = many cells × many prompts)

| Field | Source |
| --- | --- |
| benchmark_run_id | uuid4() generated at harness start |
| git_sha | git rev-parse HEAD |
| git_dirty | True if git status --porcelain non-empty |
| harness_version | scripts/benchmarks/HARNESS_VERSION constant (bump on shape change) |
| started_at_utc / finished_at_utc | wall-clock |
| host | socket.gethostname() (where the harness is driven from) |
| load_avg_start / load_avg_end | os.getloadavg() snapshot |
| env_route | WEEYUGA_INFERENCE_ROUTE if set |
| env_llamacpp_url | WEEYUGA_QWEN35_LLAMACPP_URL if set |
| cells_planned | the cell list from cells.yaml after target-availability filtering |

These metadata fields are written once at the top of the JSONL ledger as a single meta record, then every subsequent line is a per-call record.
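
A meta record with the start-of-run fields listed above might be assembled like this. It is a sketch, not the harness code: build_meta_record and the injectable git helper are hypothetical names, and finished_at_utc / load_avg_end are left out because they are only known at the end of the run.

```python
import json
import os
import socket
import subprocess
import uuid
from datetime import datetime, timezone


def _git(*args):
    """Run a git command and return stripped stdout ('' if git fails)."""
    try:
        out = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return ""


def build_meta_record(harness_version, cells_planned, git=_git):
    """Collect the start-of-run metadata written as line 1 of the ledger."""
    return {
        "type": "meta",
        "benchmark_run_id": str(uuid.uuid4()),
        "git_sha": git("rev-parse", "HEAD"),
        # Non-empty porcelain output means uncommitted changes.
        "git_dirty": bool(git("status", "--porcelain")),
        "harness_version": harness_version,
        "started_at_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "host": socket.gethostname(),
        "load_avg_start": list(os.getloadavg()),
        "env_route": os.environ.get("WEEYUGA_INFERENCE_ROUTE"),
        "env_llamacpp_url": os.environ.get("WEEYUGA_QWEN35_LLAMACPP_URL"),
        "cells_planned": list(cells_planned),
    }


if __name__ == "__main__":
    print(json.dumps(build_meta_record("1", ["mac:ollama:qwen3.5:0.8b"])))
```

Passing the git helper as a parameter keeps the function testable outside a repository; a real harness would likely treat an empty git_sha as a hard failure, since §11 rules out numbers without a SHA.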

I do not measure GPU memory peak, CPU utilization, or network bytes in v1 — those need on-target instrumentation (nvidia-smi / top sampling / pcap) which adds harness complexity and another moving part. v2 will add per-target system-load samplers driven by SSH from the harness host. Today these are surfaced via Luka's Cluster Health Overview dashboard if the run window matters.


2. Run patterns

Run patterns vary along the three axes below. Every cell runs each pattern that v1 enables by default; opt out per-cell via cells.yaml flags.

2.1 Cold vs warm

  • Cold: model is forced out of memory before the first call.
    • Ollama: POST /api/generate {"model":"<name>","keep_alive":0} with empty prompt to trigger unload, then a 2-second pause, then the measured call.
    • llama.cpp: server keeps the model resident across requests by design — there is no per-request unload. A "cold" llama.cpp run is captured by sending the very first request after a fresh llama-server start; on Pavilion + Mac the server is a long-lived daemon, so cold-after-restart is the only true cold, and we record the warm-from-now-on number for these. Marked cold_kind: process_warm in the JSONL.
  • Warm: the call is preceded by another call to the same model on the same engine within the last 60 s.
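
The Ollama cold-start step described above could look like the sketch below. The function names, the injectable send/sleep hooks, and the 10-second timeout are assumptions; only the endpoint, the keep_alive: 0 payload, the empty prompt, and the 2-second pause come from the text.

```python
import json
import time
import urllib.request


def build_unload_request(base_url, model):
    """POST /api/generate with keep_alive=0 and an empty prompt."""
    body = json.dumps({"model": model, "prompt": "", "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def force_cold(base_url, model, send=urllib.request.urlopen,
               sleep=time.sleep, pause_s=2.0):
    """Ask Ollama to unload the model, then pause before the measured call."""
    response = send(build_unload_request(base_url, model), timeout=10)
    # urlopen returns a closable response; injected fakes may return None.
    if hasattr(response, "close"):
        response.close()
    sleep(pause_s)
```

A caller would run force_cold("http://127.0.0.1:11434", "qwen3.5:0.8b") immediately before the single cold call of each (cell, prompt).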

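The harness runs 1 cold + 5 warm per (cell, prompt). 5 is the v1 N: small enough that the run finishes in < 30 min for the v1 cell matrix, large enough for a stable p50. With only five samples, p95 is dominated by the slowest warm run, so read it as a worst-case indicator rather than a tail estimate.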

Bump N to 20 only when investigating a regression — it's expensive and adds nothing to baseline shape.

2.2 Single-thread vs N-parallel

v1: single-thread only. Parallel-capacity tests are v2 and need extra coordination, for two reasons. Parallel inference on the GTX 1050 (only 4 GB VRAM, with qwen3.5:0.8b at ~1.2 GB) thrashes hard. And the Mac's M1 unified memory is shared with the OS, so running 4 parallel inference calls during Sloba's work day is exactly the kind of "you didn't tell me you were going to do that" event that costs trust.

When v2 lands, parallel runs go into v1_BASELINE_PARALLEL.md (or v2_BASELINE.md if I bump the baseline shape).

2.3 Local vs cross-node-routed

v1: local only. Each cell measures the engine on its own node — the harness drives the call directly to that node's :11434 (Ollama) or :11436 (llama.cpp).

v2 will add cross-node-routed cells: the harness drives the call to the brain (https://cluster.weeyuga.com) with WEEYUGA_INFERENCE_ROUTE=<node> set, the brain forwards to the node, and we measure routing overhead = cross_node.duration_ms - local.duration_ms per (cell, prompt).
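
The routing-overhead definition above reduces to a per-key subtraction. A small sketch, assuming both inputs map (cell_id, prompt_id) to a duration in ms (for instance the warm p50); the function name and input shape are hypothetical.

```python
def routing_overhead_ms(local, cross_node):
    """Map (cell_id, prompt_id) -> cross-node duration minus local duration, in ms.

    Keys present on only one side are skipped, since there is nothing to compare.
    """
    shared = local.keys() & cross_node.keys()
    return {key: cross_node[key] - local[key] for key in shared}
```

A negative value would suggest noise or a mismatched run rather than a genuinely faster routed path, so it is worth flagging.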

Atlas owns the routing knob; coordinate before adding cross-node cells (the brain may need to honor a benchmark header — see §5).


3. Frozen canonical prompts

Three prompts. They never change once shipped — diffing across runs depends on the input being byte-identical.

If you need a new prompt for an investigation, add it as P-NEW1/P-NEW2 to scripts/benchmarks/prompts.yaml; do NOT edit the existing three. Future-Ben will thank you.

P-EASY:
  intent: trivial — single-token response space, near-zero work
  prompt: |
    hi
  max_tokens: 64

P-MEDIUM:
  intent: bounded structured task — 4 sentences on a known topic
  prompt: |
    Explain in 4 sentences why the sky appears blue at noon.
  max_tokens: 512

P-HARD:
  intent: open-ended creative — 200-word generation
  prompt: |
    Write a 200-word story about a fisherman who discovers a coin from a sunken empire.
  max_tokens: 1024

Why these three:

  • P-EASY is the trivial-bypass test. Mac's Phase 0+1 routing classifies sub-3-word prompts to Ollama think:false bypass; on llama.cpp the same prompt eats reasoning budget and shows the "empty bubble" phone bug Atlas called out in the Phase 2 broadcast. P-EASY is how we keep that regression visible.
  • P-MEDIUM is Bane's Phase 2 smoke prompt verbatim. We already have a sanity reference: Mac M1 llama.cpp 30.5 s, Pavilion GTX 1050 llama.cpp 37.8 s. New runs landing far from those values are a flag.
  • P-HARD stresses the answer side — completion_tokens is the dominant axis, so this is where tokens-per-sec across hardware is most legible. The 200-word target is loose; finish_reason=stop vs length is a separate dimension we record.

The trio spans the prompt space cheaply: expected outputs of about 1, ~50, and ~250 tokens, a range of more than two orders of magnitude.
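
One way to make the byte-identical requirement checkable is to fingerprint each prompt and compare against digests stored with an earlier run. This is a sketch, not an existing harness feature: the function names are hypothetical, and the trailing newline on each prompt is an assumption based on the YAML literal block style shown above.

```python
import hashlib

# Prompt text as a YAML "|" block would load it (trailing newline assumed).
FROZEN_PROMPTS = {
    "P-EASY": "hi\n",
    "P-MEDIUM": "Explain in 4 sentences why the sky appears blue at noon.\n",
    "P-HARD": ("Write a 200-word story about a fisherman who discovers "
               "a coin from a sunken empire.\n"),
}


def prompt_fingerprints(prompts):
    """SHA-256 of each prompt's UTF-8 bytes, keyed by prompt_id."""
    return {pid: hashlib.sha256(text.encode("utf-8")).hexdigest()
            for pid, text in prompts.items()}


def changed_prompts(prompts, recorded):
    """Prompt IDs whose current digest differs from the recorded one."""
    current = prompt_fingerprints(prompts)
    return sorted(pid for pid, digest in current.items()
                  if recorded.get(pid) != digest)
```

Storing the digests in the meta record would let the aggregator refuse to diff two runs whose prompts differ by even one byte.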


4. Cell matrix

Cells are declared in scripts/benchmarks/cells.yaml. Each cell is a (node, engine, model) triple plus availability flags.

v1 cells (the matrix v1_BASELINE.md captures):

| Cell ID | Node | Engine | Model | Endpoint | Available? |
| --- | --- | --- | --- | --- | --- |
| mac:ollama:qwen3.5:0.8b | Mac M1 | Ollama | qwen3.5:0.8b | http://127.0.0.1:11434 | |
| mac:ollama:qwen2.5-coder:0.5b | Mac M1 | Ollama | qwen2.5-coder:0.5b | http://127.0.0.1:11434 | if pulled |
| mac:ollama:qwen2.5-coder:1.5b | Mac M1 | Ollama | qwen2.5-coder:1.5b | http://127.0.0.1:11434 | if pulled |
| mac:llamacpp:qwen3.5:0.8b | Mac M1 | llama.cpp | qwen3.5:0.8b Q4_K_M | http://127.0.0.1:11436 | if llama-server running |
| pavilion:ollama:qwen3.5:0.8b | Pavilion | Ollama | qwen3.5:0.8b | http://10.8.0.3:11434 | via WG |
| pavilion:llamacpp:qwen3.5:0.8b | Pavilion | llama.cpp | qwen3.5:0.8b Q4_K_M | http://10.8.0.3:11436 | via WG |
| predator:ollama:qwen3.5:0.8b | Predator | Ollama | qwen3.5:0.8b | http://10.8.0.7:11434 | if pulled |
| predator:llamacpp:qwen3.5:0.8b | Predator | llama.cpp | (pending) | http://10.8.0.7:11436 | pending Trinity Job B |
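
Model names contain colons themselves (qwen3.5:0.8b), so a cell ID has to be split on the first two colons only. A minimal sketch; Cell and parse_cell_id are hypothetical names.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    node: str
    engine: str
    model: str


def parse_cell_id(cell_id: str) -> Cell:
    """Split '<node>:<engine>:<model>', keeping any colons inside the model name."""
    parts = cell_id.split(":", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed cell_id: {cell_id!r}")
    return Cell(*parts)


print(parse_cell_id("mac:llamacpp:qwen3.5:0.8b"))
```

Splitting with a maxsplit of 2 keeps the model tag intact; a plain split(":") would yield four parts for every cell in the table.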

Skipped in v1:

  • cicd Ollama — cicd is the brain host. Driving inference load on it directly risks affecting Sloba's mobile chat path. Add only on explicit Sam dispatch.
  • qwen3:4b / qwen3:9b / qwen3:35b-a3b on any node — heavier models, need their own measurement pass with longer N and longer windows. Queued for v2.

The harness probes each cell's availability with a 1-second HEAD/GET health check before running. Unavailable cells are recorded as skipped: <reason> in the JSONL, NOT silently dropped.
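
The probe-then-record behaviour could be sketched as below. The record shape (a phase of "skipped" plus a skipped reason) is an assumption pieced together from this section and §6.1, and probe_cell with its injectable fetch hook is a hypothetical name.

```python
import urllib.request


def _http_get_status(url, timeout):
    """GET the URL and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def probe_cell(cell_id, endpoint, fetch=_http_get_status, timeout_s=1.0):
    """Return None if the cell is reachable, else a 'skipped' ledger record."""
    try:
        status = fetch(endpoint, timeout_s)
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        reason = f"unreachable: {exc}"
    else:
        if status == 200:
            return None
        reason = f"http {status}"
    return {"cell_id": cell_id, "phase": "skipped", "skipped": reason}
```

Returning a record instead of raising lets the caller append it to the JSONL, so an unavailable cell stays visible in the ledger.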


5. Telemetry tagging — metadata.test=true + metadata.benchmark_run_id

Hard rule (per Sam's kickoff): every benchmark call that flows through the brain or any path that emits to weeyuga-telemetry-* must be tagged so Luka's dashboards can filter benchmark traffic out of production graphs.

v1 convention (proposed — see §5.4 for coordination)

For events landing in weeyuga-telemetry-* (ground envelope):

{
  "...": "...",
  "metadata": {
    "test":               true,
    "benchmark_run_id":   "<uuid4>",
    "benchmark_cell_id":  "mac:llamacpp:qwen3.5:0.8b",
    "benchmark_prompt_id": "P-MEDIUM",
    "benchmark_phase":    "warm",
    "benchmark_run_idx":  3,
    "harness_version":    "1"
  }
}

metadata is flattened in the index template (Nemanja weeyuga-mappings-common v2.1 + index-weeyuga-telemetry.json), so the keys above index without a mapping change. That's the intended extension hatch.

For events landing in weeyuga-logs-* / behavioral indices, the equivalent goes under labels.test=true (also flattened).

5.1 Mode A — direct-engine (default for v1)

The harness drives calls directly to the engine (:11434, :11436), which does not emit ground envelope. No telemetry tag is needed because no telemetry is generated. Pure empirical measurement, zero brain side-effect.

v1 baseline runs in Mode A only. Cleanest, fastest, lowest coordination cost.

5.2 Mode B — brain-routed (v2)

When v2 adds cross-node routing measurement, the harness drives calls to https://cluster.weeyuga.com and the brain forwards. Brain emits ground envelope on every dispatch.

Convention proposed for Atlas + Luka:

  • Harness sends headers:
    • X-Weeyuga-Test: true
    • X-Weeyuga-Benchmark-Run-Id: <uuid>
    • X-Weeyuga-Benchmark-Cell-Id: <id>
    • X-Weeyuga-Benchmark-Prompt-Id: <id>
  • Brain copies header values into metadata.test, metadata.benchmark_run_id, etc. on every emitted envelope for this request.
  • Luka's dashboards default-filter metadata.test:false OR NOT metadata.test:*. A "show benchmark traffic" toggle flips the filter.

Status: proposed by Ben 2026-04-28; awaiting Atlas + Luka ratification before v2 lands. Until then, v1 stays Mode A.
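
To make the proposed header convention concrete, here is a sketch of both sides: the harness building the headers and the brain copying them into metadata. The header names come from the proposal above; the function names and the case-insensitive lookup are assumptions, and none of this reflects the brain's actual code.

```python
HEADER_TO_METADATA = {
    "X-Weeyuga-Test": "test",
    "X-Weeyuga-Benchmark-Run-Id": "benchmark_run_id",
    "X-Weeyuga-Benchmark-Cell-Id": "benchmark_cell_id",
    "X-Weeyuga-Benchmark-Prompt-Id": "benchmark_prompt_id",
}


def benchmark_headers(run_id, cell_id, prompt_id):
    """Harness side: headers attached to every Mode B request."""
    return {
        "X-Weeyuga-Test": "true",
        "X-Weeyuga-Benchmark-Run-Id": run_id,
        "X-Weeyuga-Benchmark-Cell-Id": cell_id,
        "X-Weeyuga-Benchmark-Prompt-Id": prompt_id,
    }


def metadata_from_headers(headers):
    """Brain side: copy benchmark headers into envelope metadata keys."""
    lowered = {name.lower(): value for name, value in headers.items()}
    metadata = {}
    for header, key in HEADER_TO_METADATA.items():
        value = lowered.get(header.lower())
        if value is None:
            continue
        # metadata.test is a boolean in the §5 envelope example.
        metadata[key] = (value.lower() == "true") if key == "test" else value
    return metadata
```

Requests without these headers produce an empty dict, which matches the NOT metadata.test:* half of the proposed dashboard filter.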

5.3 Local ledger (always emitted, regardless of mode)

Independent of brain telemetry, the harness writes docs/BENCHMARKS/runs/<benchmark_run_id>.jsonl on the harness host (Mac). One JSON object per line; first line is the meta record (§1 per-run fields), subsequent lines are per-call records.

The local ledger is the canonical source for v1_BASELINE.md and any aggregation. Brain telemetry is a nice-to-have for cross-correlation in Kibana but is NOT load-bearing on the baseline numbers.

5.4 Coordination ratification needed

This convention is currently unilateral from Ben. Before Mode B ships:

  • Atlas confirms the brain copies X-Weeyuga-Test* headers into metadata.* on every envelope without breaking existing emit paths.
  • Luka adds the default metadata.test filter to all prod-facing dashboards (Cluster Health Overview / Mobile Chat Activity / Agent Telemetry / Error Funnel / Cluster Connectivity) and confirms the toggle works.
  • Nemanja ratifies that metadata: flattened is the right place (vs. extending actor mapping) — leaning yes per his §3.4 use of flattened for forward-compat.

A separate transcript dispatch carries this proposal. Until all three ack, v1 baseline runs Mode A only and leaves brain telemetry untouched.


6. Output format

6.1 Per-run JSONL ledger

docs/BENCHMARKS/runs/<benchmark_run_id>.jsonl

Line 1 — meta record (per-run fields, §1). Lines 2..N — call records (per-call fields, §1) plus a phase: "skipped" line for any cell that failed availability.

Example (truncated):

{"type":"meta","benchmark_run_id":"4e2a...","git_sha":"e2d6a6d","git_dirty":true,"harness_version":"1","started_at_utc":"2026-04-29T02:00:00Z","host":"slobodan-mac","load_avg_start":[1.2,1.5,1.4],"cells_planned":["mac:ollama:qwen3.5:0.8b","mac:llamacpp:qwen3.5:0.8b","pavilion:ollama:qwen3.5:0.8b","pavilion:llamacpp:qwen3.5:0.8b"]}
{"type":"call","cell_id":"mac:llamacpp:qwen3.5:0.8b","prompt_id":"P-EASY","phase":"cold","run_idx":0,"first_delta_ms":2810,"total_duration_ms":2840,"prompt_tokens":11,"completion_tokens":2,"tokens_per_sec":66.7,"finish_reason":"stop","backend":"llamacpp","error":null}
{"type":"call","cell_id":"mac:llamacpp:qwen3.5:0.8b","prompt_id":"P-EASY","phase":"warm","run_idx":0,"first_delta_ms":120,"total_duration_ms":150,"prompt_tokens":11,"completion_tokens":2,"tokens_per_sec":66.7,"finish_reason":"stop","backend":"llamacpp","error":null}

Aggregator (scripts/benchmarks/aggregate.py) reads the JSONL and emits the v1_BASELINE.md table. Re-running the aggregator on the same JSONL is deterministic and idempotent.
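
Reading a ledger back is the first step of any aggregation. The sketch below is not the contents of aggregate.py; load_ledger is a hypothetical helper that checks the meta-first rule and groups call records by (cell_id, prompt_id, phase).

```python
import json
from collections import defaultdict


def load_ledger(lines):
    """Return (meta, grouped_calls) from an iterable of JSONL lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records or records[0].get("type") != "meta":
        raise ValueError("first ledger line must be the meta record")
    meta, calls = records[0], records[1:]
    grouped = defaultdict(list)
    for call in calls:
        # Skipped-cell lines may carry no prompt_id, so use .get().
        key = (call["cell_id"], call.get("prompt_id"), call["phase"])
        grouped[key].append(call)
    return meta, dict(grouped)


sample = [
    '{"type":"meta","benchmark_run_id":"4e2a...","harness_version":"1"}',
    '{"type":"call","cell_id":"mac:llamacpp:qwen3.5:0.8b","prompt_id":"P-EASY",'
    '"phase":"cold","run_idx":0,"first_delta_ms":2810,"total_duration_ms":2840}',
    '{"type":"call","cell_id":"mac:llamacpp:qwen3.5:0.8b","prompt_id":"P-EASY",'
    '"phase":"warm","run_idx":0,"first_delta_ms":120,"total_duration_ms":150}',
]
meta, grouped = load_ledger(sample)
print(meta["benchmark_run_id"], sorted(grouped))
```

Because the function only reads and groups, calling it twice on the same file gives the same result, in line with the idempotence requirement on the aggregator.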

6.2 Markdown table shape (v1_BASELINE.md)

Per cell, two stacked tables: cold and warm. Per (cell, prompt), one row with first_delta_ms, total_duration_ms, completion_tokens, tokens_per_sec. Across-prompts summary at the end of each cell's section. The cell matrix appears as a top-level summary table (TTFT-warm-p50 only) above the per-cell detail.

This shape stays fixed across baselines so v1 → v2 → v3 diffs are mechanical. Adding a new metric goes at the end of the per-cell table; never reorder existing columns.


7. Regression thresholds

A run is flagged on the bus when, vs. the most recent baseline:

| Metric | Threshold | Severity |
| --- | --- | --- |
| warm.first_delta_ms.p50 (any cell) | ≥ 30% slower | regression: bus heads-up |
| warm.total_duration_ms.p50 (any cell) | ≥ 30% slower | regression: bus heads-up |
| warm.success_rate (any cell) | < 0.95 | red flag: investigate before publishing |
| warm.tokens_per_sec.mean (any cell) | ≥ 30% lower | regression: bus heads-up |
| Improvement | ≥ 30% on any of the above | wins: bus heads-up publish |

30% is a deliberately wide threshold for v1 because run-to-run variance on shared hardware (Mac M1 also runs Sloba's work) can easily be 15-20%. Tighten when N is bumped from 5 to 20.

A regression doesn't auto-block a release; it triggers the operator question "is this a real regression or a load-day blip?" and prompts a re-run with N=20.
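
The threshold table can be applied per cell with a comparison like the one below. This is a sketch under assumptions: classify is a hypothetical name, both inputs are per-cell summaries keyed by the per-batch field names from §1, and "30%" is read as a change relative to the baseline value.

```python
THRESHOLD = 0.30
HIGHER_IS_WORSE = ("warm.first_delta_ms.p50", "warm.total_duration_ms.p50")
LOWER_IS_WORSE = ("warm.tokens_per_sec.mean",)


def classify(baseline, current):
    """Return a list of (metric, verdict) flags for one cell."""
    flags = []
    if current["warm.success_rate"] < 0.95:
        flags.append(("warm.success_rate", "red flag"))
    for metric in HIGHER_IS_WORSE + LOWER_IS_WORSE:
        change = (current[metric] - baseline[metric]) / baseline[metric]
        # Normalize so that a positive value always means "got worse".
        worse = change if metric in HIGHER_IS_WORSE else -change
        if worse >= THRESHOLD:
            flags.append((metric, "regression"))
        elif worse <= -THRESHOLD:
            flags.append((metric, "improvement"))
    return flags
```

An empty list means nothing crossed a threshold; any flag is a prompt for the re-run with N=20, not an automatic block.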


8. Reproducibility checklist

Before a baseline run is published, every entry in this checklist must hold. If any fails, the JSONL is still shipped, but v1_BASELINE.md is annotated unreliable: <reason> instead of being presented as canonical.

  • git status clean OR every dirty file documented in the run metadata (e.g. "ignored: M mobile/macos/Flutter/GeneratedPluginRegistrant.swift")
  • git rev-parse HEAD recorded
  • Harness version recorded
  • Each target's engine version recorded:
    • Ollama: curl http://<node>:11434/api/version
    • llama.cpp: curl http://<node>:11436/health (if exposed) or record the b-build from the operator config
  • Each target's model digest recorded:
    • Ollama: curl http://<node>:11434/api/tags | jq '.models[]|select(.name=="<name>").digest'
    • llama.cpp: GGUF SHA256 from operator config (Bane's Pavilion install records this in his message; Mac TBD)
  • Wall-clock window logged in heads-up message on bus
  • No competing benchmark runs going (only one harness across cluster at a time — even a different model on a different node — to keep the network noise floor predictable)
  • Sloba's prime-time avoided OR explicit authorization on the bus (Sam dispatch, or "go ahead" from Sloba in chat)

The harness itself enforces a subset:

  • Refuses to run if git status shows changes the operator hasn't acknowledged via --allow-dirty.
  • Refuses to run if --for-publication is set and any cell health check fails.
  • Records the start-time load average and refuses to start if getloadavg()[0] > 4.0 unless --force-load is set (the Mac is too busy and numbers will be noisy).
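
The three enforced checks could be expressed as one pure function that returns the reasons to refuse. A sketch: preflight_errors and its parameter names are hypothetical, while the flags and the 4.0 load ceiling mirror the list above.

```python
def preflight_errors(git_dirty, load_avg_1m, failed_cells,
                     allow_dirty=False, for_publication=False,
                     force_load=False, max_load=4.0):
    """Return a list of refusal reasons; an empty list means the run may start."""
    errors = []
    if git_dirty and not allow_dirty:
        errors.append("working tree is dirty; pass --allow-dirty to acknowledge")
    if for_publication and failed_cells:
        errors.append("health check failed for: " + ", ".join(sorted(failed_cells)))
    if load_avg_1m > max_load and not force_load:
        errors.append(f"load average {load_avg_1m:.2f} > {max_load}; "
                      "pass --force-load to override")
    return errors
```

The caller would pass os.getloadavg()[0] as load_avg_1m and exit non-zero if the list is non-empty.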

9. Workflow — running a baseline

# 1. Check the bus for heads-up window collisions
cd <path-to>/WeeyugaWeb   # repo root on the harness host
tail -50 coordination/CLAUDE_TRANSCRIPT.md

# 2. Post heads-up
# (write coordination/messages/<utc>Z-benchmark-tester-ben-baseline-window.md
#  + transcript entry, commit + push)

# 3. Health-check targets
python3 scripts/benchmarks/run_harness.py --probe

# 4. Smoke (1 cell, 1 prompt, 1 run) to validate end-to-end
python3 scripts/benchmarks/run_harness.py --smoke --cells mac:llamacpp:qwen3.5:0.8b

# 5. Full v1 baseline
python3 scripts/benchmarks/run_harness.py --full --cells-yaml scripts/benchmarks/cells.yaml --prompts-yaml scripts/benchmarks/prompts.yaml

# 6. Aggregate to markdown
python3 scripts/benchmarks/aggregate.py docs/BENCHMARKS/runs/<run-id>.jsonl > docs/BENCHMARKS/v1_BASELINE.md

# 7. Commit + push the JSONL ledger AND the markdown together (per-run commit)
git add docs/BENCHMARKS/runs/<run-id>.jsonl docs/BENCHMARKS/v1_BASELINE.md
git commit -m "benchmark: v1 baseline run <run-id>"
git push

# 8. Post bus message linking the result + a 1-paragraph framing for Janie

Subsequent baselines follow the same flow with the harness writing a different <run-id>.jsonl per invocation. Old ledgers are preserved forever — they're the audit trail for "did this number move because of a code change or a load-day blip."


10. Coordination contract

| Who | What I owe them | What they owe me |
| --- | --- | --- |
| Sam | Per-deliverable transcript entries + weekly Mon "regressions/improvements" digest | Cross-cutting decisions; spawn coordination |
| Nemanja | Metric set ratification; cell matrix sanity-check; field-naming convention review | Authoritative ground envelope schema; ratifying metadata.* extension |
| Atlas | Header convention proposal; Mode-B test traffic scoped + visible | Brain copies X-Weeyuga-Test* → metadata.*; informs me on emit-path changes that affect harness |
| Luka | Heads-up before any run (so his dashboards aren't read during noisy windows) | Default metadata.test:false filter on prod dashboards; "show benchmark traffic" toggle; query-side help |
| Bane / Viktor | Heads-up before Pavilion / Predator runs; idle-coordination on long runs | Engine-version + model-digest reads on demand; infra-stability heads-up |
| Pablo / Filip | Heads-up if a measurement window overlaps their device-test windows | Awareness of when the harness is generating mobile-shape traffic |
| Janie | Raw numbers + a 1-paragraph framing per run (what's interesting here) | Storytelling: turning numbers into Janie blog posts |
| Sloba | Numbers when asked; standing offer | Authorization for prime-time runs; prompt freezes (don't change P-EASY/P-MEDIUM/P-HARD without ack) |

11. Hard rules I commit to

  1. metadata.test=true (or Mode-A direct-engine) on every benchmark call. No silent benchmark traffic in production dashboards. Ever.
  2. Reproducibility metadata is not optional. Numbers without git SHA + env + load-avg + harness version are deleted, not shipped.
  3. Frozen prompts. P-EASY / P-MEDIUM / P-HARD never change once v1 ships. New prompts get new IDs.
  4. No prime-time runs without bus heads-up. Pavilion runs go between 02:00-05:00Z by default unless authorized otherwise. Mac runs that take more than 30 s of cumulative load coordinate with whatever Sloba's doing.
  5. Cluster impact ≤ 1 harness at a time. Even on different nodes, two harnesses running simultaneously raise the network noise floor and break reproducibility. Serialize.
  6. Per-run commits of both the JSONL ledger AND the v1_BASELINE.md so bisect on numbers is mechanical.
  7. No fork of the harness format. New metrics extend the per-call record; never reorder or rename existing fields. Aggregator reads tolerate-old / require-new.
  8. No destructive load tests without Sam dispatch. The harness runs ≤ 1 sustained call per second per cell by default; bursts (several cells driven in parallel) happen only when explicitly authorized.

12. What's deliberately not here (v2+ backlog)

  • Parallel-thread capacity tests. Need careful scoping per node (1050 thrashes hard with 4 parallel; M1 unified RAM contends with user OS).
  • Cross-node routing cost. Needs Atlas's brain header convention ratified.
  • GPU memory peak / CPU utilization sampling. Needs SSH-driven on-target samplers.
  • Network bytes between harness and target. tcpdump -nn host <ip> per run, easy to add when first cross-node run goes.
  • Tiny-model landscape exploration (qwen2.5:0.5b vs gemma:2b vs phi3:mini vs others on M1 / 1050 / 1070 / CPU). Sam queued this as docs/RESEARCH/SUB_HALF_SECOND_MODEL_LANDSCAPE.md, feeding Atlas's personality engine work.
  • Sustained-load endurance (1 hour at constant rate). Catches thermal throttling and Ollama queue growth.
  • Heavy-model coverage (qwen3:4b / qwen3:9b / qwen3:35b-a3b on the nodes that can run them).
  • Embedding / vector / image-encoder benchmarks. Needed for Atlas's personality engine if it adds non-LLM micro-calls.

Each is a real gap; v1 ships without them on purpose. The shape above accommodates all of them as additive extensions.


Owner: mac/benchmark-tester-ben. Created 2026-04-28.