# BENCHMARK HARNESS — Weeyuga cluster

> **Owner:** mac/benchmark-tester-ben (Ben). Ping me on the bus with
> `Ben — benchmark X` and I'll either point at an existing harness
> output or queue a measurement.
>
> **Purpose:** the reproducible measurement surface for the cluster.
> Anyone who reads `v1_BASELINE.md` (or any future baseline) needs
> to know exactly how the numbers were produced — same git SHA,
> same prompts, same env, same telemetry tagging — so cross-run
> diffs are actually comparable.
>
> **Companion docs:**
> - `docs/BENCHMARKS/v1_BASELINE.md` — the first matrix (per-node ×
>   per-engine × per-model × per-prompt) of cold/warm latency,
>   tokens/sec, p50/p95.
> - `docs/MONITORING_RUNBOOK.md` (Luka) — query side, how to filter
>   benchmark traffic out (`metadata.test:true`).
> - `coordination/HARDWARE_INVENTORY.md` — node spec ground truth.
> - `docs/architecture/elasticsearch/templates/index-weeyuga-telemetry.json` —
>   ground envelope schema (`metadata: flattened` is where my
>   tagging goes).

---

## 1. What I measure (the metric set)

Every run captures these fields. Anything missing means the run is
discarded — partial numbers are not numbers.

### Per-call (one prompt → one inference call)

| Field | Type | Source | Notes |
|---|---|---|---|
| `first_delta_ms` | int | client wall-clock between `POST` and first SSE chunk arriving | TTFT — most user-perceived |
| `total_duration_ms` | int | client wall-clock between `POST` and final SSE chunk | full request span |
| `prompt_tokens` | int | response body usage | from `/v1/chat/completions` `usage.prompt_tokens` |
| `completion_tokens` | int | response body usage | for llama.cpp this includes reasoning tokens; for Ollama (think:false bypass) it's answer-only |
| `tokens_per_sec` | float | `completion_tokens / ((total_duration_ms - first_delta_ms) / 1000)` | excludes prefill/queue time |
| `finish_reason` | enum | response body | `stop` / `length` / `error` — anything other than `stop` is a flagged run |
| `backend` | enum | response header `X-Weeyuga-Backend` if present, else inferred from URL | `ollama` / `llamacpp` / `brain-routed` |
| `cell_id` | str | harness | `::` (e.g. `mac:llamacpp:qwen3.5:0.8b`) |
| `prompt_id` | enum | harness | `P-EASY` / `P-MEDIUM` / `P-HARD` (frozen — see §3) |
| `run_idx` | int | harness | 0..N within a (cell,prompt) cell |
| `phase` | enum | harness | `cold` (first run after model-load) / `warm` (subsequent) |
| `error` | str? | harness | populated on exception; the run is recorded but excluded from p50/p95 |

### Per-batch (one cell = one cold + N warm)

| Field | Source |
|---|---|
| `cold.first_delta_ms` | the single cold run |
| `cold.total_duration_ms` | the single cold run |
| `warm.first_delta_ms.p50` / `.p95` | percentile across N warm runs |
| `warm.total_duration_ms.p50` / `.p95` | percentile across N warm runs |
| `warm.tokens_per_sec.mean` | mean across N warm runs |
| `warm.success_rate` | `(N - error_count) / N` |

### Per-run (one harness invocation = many cells × many prompts)

| Field | Source |
|---|---|
| `benchmark_run_id` | `uuid4()` generated at harness start |
| `git_sha` | `git rev-parse HEAD` |
| `git_dirty` | `True` if `git status --porcelain` non-empty |
| `harness_version` | `scripts/benchmarks/HARNESS_VERSION` constant (bump on shape change) |
| `started_at_utc` / `finished_at_utc` | wall-clock |
| `host` | `socket.gethostname()` (where the harness is driven from) |
| `load_avg_start` / `load_avg_end` | `os.getloadavg()` snapshot |
| `env_route` | `WEEYUGA_INFERENCE_ROUTE` if set |
| `env_llamacpp_url` | `WEEYUGA_QWEN35_LLAMACPP_URL` if set |
| `cells_planned` | the cell list from `cells.yaml` after target-availability filtering |

These metadata fields are written **once at the top of the JSONL
ledger** as a single `meta` record, then every subsequent line is a
per-call record.

I do **not** measure GPU memory peak, CPU utilization, or network
bytes in v1 — those need on-target instrumentation (nvidia-smi / top
sampling / pcap) which adds harness complexity and another moving
part. v2 will add per-target system-load samplers driven by SSH from
the harness host. Today these are surfaced via Luka's Cluster Health
Overview dashboard if the run window matters.

---

## 2. Run patterns

Every cell runs **all four patterns** by default; opt out per-cell via
`cells.yaml` flags.

### 2.1 Cold vs warm

- **Cold**: model is forced out of memory before the first call.
  - Ollama: `POST /api/generate {"model":"","keep_alive":0}` with
    empty prompt to trigger unload, then a 2-second pause, then the
    measured call.
  - llama.cpp: server keeps the model resident across requests by
    design — there is no per-request unload. A "cold" llama.cpp run is
    captured by sending the very first request after a fresh
    `llama-server` start; on Pavilion + Mac the server is a long-lived
    daemon, so cold-after-restart is the only true cold, and we record
    the warm-from-now-on number for these. Marked
    `cold_kind: process_warm` in the JSONL.
- **Warm**: the call is preceded by another call to the same model on
  the same engine within the last 60 s.

The harness runs **1 cold + 5 warm** per (cell, prompt). 5 is the v1
N — small enough the run finishes in < 30 min for the v1 cell matrix;
large enough to compute a meaningful p50/p95. Bump N to 20 only when
investigating a regression — it's expensive and adds nothing to
baseline shape.

### 2.2 Single-thread vs N-parallel

**v1: single-thread only.** Parallel-capacity tests are v2 and need
extra coordination because parallel inference on GTX 1050 (only 4 GB
VRAM with qwen3.5:0.8b at ~1.2 GB) thrashes hard, and Mac's M1 unified
memory shares with the OS — running 4 parallel inference calls during
Sloba's work day is exactly the kind of "you didn't tell me you were
going to do that" event that earns trust loss.

When v2 lands, parallel runs go into `v1_BASELINE_PARALLEL.md` (or
v2_BASELINE.md if I bump the baseline shape).

### 2.3 Local vs cross-node-routed

**v1: local only.** Each cell measures the engine on its own node —
the harness drives the call directly to that node's `:11434` (Ollama)
or `:11436` (llama.cpp).

**v2 will add cross-node-routed cells**: the harness drives the call
to the brain (`https://cluster.weeyuga.com`) with
`WEEYUGA_INFERENCE_ROUTE=` set, the brain forwards to the node, and we
measure routing overhead = `cross_node.duration_ms - local.duration_ms`
per (cell, prompt).
Atlas owns the routing knob; coordinate before adding cross-node cells
(the brain may need to honor a benchmark header — see §5).

---

## 3. Frozen canonical prompts

Three prompts. They never change once shipped — diffing across runs
depends on the input being byte-identical. If you need a new prompt
for an investigation, **add it as `P-NEW1`/`P-NEW2`** to
`scripts/benchmarks/prompts.yaml`; do NOT edit the existing three.
Future-Ben will thank you.

```yaml
P-EASY:
  intent: trivial — single-token response space, near-zero work
  prompt: |
    hi
  max_tokens: 64

P-MEDIUM:
  intent: bounded structured task — 4 sentences on a known topic
  prompt: |
    Explain in 4 sentences why the sky appears blue at noon.
  max_tokens: 512

P-HARD:
  intent: open-ended creative — 200-word generation
  prompt: |
    Write a 200-word story about a fisherman who discovers a coin from a sunken empire.
  max_tokens: 1024
```

Why these three:

- **P-EASY** is the trivial-bypass test. Mac's Phase 0+1 routing
  classifies sub-3-word prompts to Ollama `think:false` bypass; on
  llama.cpp the same prompt eats reasoning budget and shows the "empty
  bubble" phone bug Atlas called out in the Phase 2 broadcast. P-EASY
  is how we keep that regression visible.
- **P-MEDIUM** is Bane's Phase 2 smoke prompt verbatim. We already
  have a sanity reference: Mac M1 llama.cpp 30.5 s, Pavilion GTX 1050
  llama.cpp 37.8 s. New runs landing far from those values are a flag.
- **P-HARD** stresses the answer side — `completion_tokens` is the
  dominant axis, so this is where tokens-per-sec across hardware is
  most legible. The 200-word target is loose; `finish_reason=stop` vs
  `length` is a separate dimension we record.

The trio spans the prompt space cheaply: roughly 1 / ~50 / ~250
expected output tokens, a bit over two orders of magnitude from
smallest to largest.

---

## 4. Cell matrix

Cells are declared in `scripts/benchmarks/cells.yaml`. Each cell is a
`(node, engine, model)` triple plus availability flags.

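Because model tags such as `qwen3.5:0.8b` contain a colon themselves, a cell ID cannot be split naively on every `:`. The sketch below shows one way to recover the triple, assuming the ID shape seen in the examples in this document (node, then engine, then everything else as the model). `Cell` and `parse_cell_id` are hypothetical names for illustration, not part of the harness.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Hypothetical container for the (node, engine, model) triple."""
    node: str
    engine: str
    model: str


def parse_cell_id(cell_id: str) -> Cell:
    """Split a cell ID into node, engine, and model.

    Only the first two colons act as separators, so a model tag like
    ``qwen3.5:0.8b`` stays intact. This ID shape is an assumption taken
    from the example IDs in this document.
    """
    parts = cell_id.split(":", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed cell id: {cell_id!r}")
    return Cell(*parts)


print(parse_cell_id("mac:llamacpp:qwen3.5:0.8b"))
# Cell(node='mac', engine='llamacpp', model='qwen3.5:0.8b')
```

Limiting the split to two separators keeps the parser independent of how many colons a future model tag happens to carry.
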
**v1 cells** (the matrix v1_BASELINE.md captures):

| Cell ID | Node | Engine | Model | Endpoint | Available? |
|---|---|---|---|---|---|
| `mac:ollama:qwen3.5:0.8b` | Mac M1 | Ollama | qwen3.5:0.8b | `http://127.0.0.1:11434` | ✅ |
| `mac:ollama:qwen2.5-coder:0.5b` | Mac M1 | Ollama | qwen2.5-coder:0.5b | `http://127.0.0.1:11434` | ✅ if pulled |
| `mac:ollama:qwen2.5-coder:1.5b` | Mac M1 | Ollama | qwen2.5-coder:1.5b | `http://127.0.0.1:11434` | ✅ if pulled |
| `mac:llamacpp:qwen3.5:0.8b` | Mac M1 | llama.cpp | qwen3.5:0.8b Q4_K_M | `http://127.0.0.1:11436` | ✅ if `llama-server` running |
| `pavilion:ollama:qwen3.5:0.8b` | Pavilion | Ollama | qwen3.5:0.8b | `http://10.8.0.3:11434` | ✅ via WG |
| `pavilion:llamacpp:qwen3.5:0.8b` | Pavilion | llama.cpp | qwen3.5:0.8b Q4_K_M | `http://10.8.0.3:11436` | ✅ via WG |
| `predator:ollama:qwen3.5:0.8b` | Predator | Ollama | qwen3.5:0.8b | `http://10.8.0.7:11434` | ⏳ if pulled |
| `predator:llamacpp:qwen3.5:0.8b` | Predator | llama.cpp | (pending) | `http://10.8.0.7:11436` | ❌ pending Trinity Job B |

**Skipped in v1**:

- **cicd Ollama** — cicd is the brain host. Driving inference load on
  it directly risks affecting Sloba's mobile chat path. Add only on
  explicit Sam dispatch.
- **qwen3:4b / qwen3:9b / qwen3:35b-a3b on any node** — heavier
  models, need their own measurement pass with longer N and longer
  windows. Queued for v2.

The harness probes each cell's availability with a 1-second
`HEAD`/`GET` health check before running. Unavailable cells are
recorded as `skipped: ` in the JSONL, NOT silently dropped.

---

## 5. Telemetry tagging — `metadata.test=true` + `metadata.benchmark_run_id`

**Hard rule (per Sam's kickoff):** every benchmark call that flows
through the brain or any path that emits to `weeyuga-telemetry-*` must
be tagged so Luka's dashboards can filter benchmark traffic out of
production graphs.

### v1 convention (proposed — see §5.4 for coordination)

For events landing in `weeyuga-telemetry-*` (ground envelope):

```jsonc
{
  "...": "...",
  "metadata": {
    "test": true,
    "benchmark_run_id": "",
    "benchmark_cell_id": "mac:llamacpp:qwen3.5:0.8b",
    "benchmark_prompt_id": "P-MEDIUM",
    "benchmark_phase": "warm",
    "benchmark_run_idx": 3,
    "harness_version": "1"
  }
}
```

`metadata` is `flattened` in the index template (Nemanja
`weeyuga-mappings-common` v2.1 + `index-weeyuga-telemetry.json`), so
the keys above index without a mapping change. That's the intended
extension hatch.

For events landing in `weeyuga-logs-*` / behavioral indices, the
equivalent goes under `labels.test=true` (also `flattened`).

### 5.1 Mode A — direct-engine (default for v1)

The harness drives calls **directly to the engine** (`:11434`,
`:11436`), which does **not** emit ground envelope. No telemetry tag
is needed because no telemetry is generated. Pure empirical
measurement, zero brain side-effect.

**v1 baseline runs in Mode A only.** Cleanest, fastest, lowest
coordination cost.

### 5.2 Mode B — brain-routed (v2)

When v2 adds cross-node routing measurement, the harness drives calls
to `https://cluster.weeyuga.com` and the brain forwards. Brain emits
ground envelope on every dispatch.

**Convention proposed for Atlas + Luka:**

- Harness sends headers:
  - `X-Weeyuga-Test: true`
  - `X-Weeyuga-Benchmark-Run-Id: `
  - `X-Weeyuga-Benchmark-Cell-Id: `
  - `X-Weeyuga-Benchmark-Prompt-Id: `
- Brain copies header values into `metadata.test`,
  `metadata.benchmark_run_id`, etc. on every emitted envelope for this
  request.
- Luka's dashboards default-filter
  `metadata.test:false OR NOT metadata.test:*`. A "show benchmark
  traffic" toggle flips the filter.

**Status:** proposed by Ben 2026-04-28; awaiting Atlas + Luka
ratification before v2 lands. Until then, v1 stays Mode A.

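To make the proposed Mode B convention concrete, here is a minimal sketch of both sides of the header hand-off: what the harness would send and how the brain might copy it into `metadata.*`. Both function names are hypothetical, the brain-side mapping is only one possible reading of "copies header values", and none of this is ratified or implemented.

```python
import uuid


def benchmark_headers(run_id: str, cell_id: str, prompt_id: str) -> dict[str, str]:
    """Harness side (hypothetical helper): headers from the proposed convention."""
    return {
        "X-Weeyuga-Test": "true",
        "X-Weeyuga-Benchmark-Run-Id": run_id,
        "X-Weeyuga-Benchmark-Cell-Id": cell_id,
        "X-Weeyuga-Benchmark-Prompt-Id": prompt_id,
    }


def headers_to_metadata(headers: dict[str, str]) -> dict[str, object]:
    """Brain side (assumed behaviour): copy benchmark headers into metadata.

    Returns an empty dict for untagged traffic so production envelopes
    are left unchanged.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("x-weeyuga-test", "").lower() != "true":
        return {}
    return {
        "test": True,
        "benchmark_run_id": lowered.get("x-weeyuga-benchmark-run-id"),
        "benchmark_cell_id": lowered.get("x-weeyuga-benchmark-cell-id"),
        "benchmark_prompt_id": lowered.get("x-weeyuga-benchmark-prompt-id"),
    }


run_id = str(uuid.uuid4())
headers = benchmark_headers(run_id, "mac:llamacpp:qwen3.5:0.8b", "P-MEDIUM")
print(headers_to_metadata(headers))
```

Returning nothing for untagged requests matches the dashboard filter above, which treats a missing `metadata.test` the same as `false`.
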
### 5.3 Local ledger (always emitted, regardless of mode)

Independent of brain telemetry, the harness writes
`docs/BENCHMARKS/runs/.jsonl` on the harness host (Mac). One JSON
object per line; first line is the `meta` record (§1 per-run fields),
subsequent lines are per-call records.

The local ledger is the **canonical source** for v1_BASELINE.md and
any aggregation. Brain telemetry is a nice-to-have for
cross-correlation in Kibana but is NOT load-bearing on the baseline
numbers.

### 5.4 Coordination ratification needed

This convention is currently **unilateral** from Ben. Before Mode B
ships:

- Atlas confirms the brain copies `X-Weeyuga-Test*` headers into
  `metadata.*` on every envelope without breaking existing emit paths.
- Luka adds the default `metadata.test` filter to all prod-facing
  dashboards (Cluster Health Overview / Mobile Chat Activity / Agent
  Telemetry / Error Funnel / Cluster Connectivity) and confirms the
  toggle works.
- Nemanja ratifies that `metadata: flattened` is the right place (vs.
  extending `actor` mapping) — leaning yes per his §3.4 use of
  `flattened` for forward-compat.

A separate transcript dispatch carries this proposal. Until all three
ack, v1 baseline runs Mode A only and leaves brain telemetry
untouched.

---

## 6. Output format

### 6.1 Per-run JSONL ledger

`docs/BENCHMARKS/runs/.jsonl`

Line 1 — `meta` record (per-run fields, §1).
Lines 2..N — `call` records (per-call fields, §1) plus a
`phase: "skipped"` line for any cell that failed availability.

Example (truncated):

```jsonl
{"type":"meta","benchmark_run_id":"4e2a...","git_sha":"e2d6a6d","git_dirty":true,"harness_version":"1","started_at_utc":"2026-04-29T02:00:00Z","host":"slobodan-mac","load_avg_start":[1.2,1.5,1.4],"cells_planned":["mac:ollama:qwen3.5:0.8b","mac:llamacpp:qwen3.5:0.8b","pavilion:ollama:qwen3.5:0.8b","pavilion:llamacpp:qwen3.5:0.8b"]}
{"type":"call","cell_id":"mac:llamacpp:qwen3.5:0.8b","prompt_id":"P-EASY","phase":"cold","run_idx":0,"first_delta_ms":2810,"total_duration_ms":2840,"prompt_tokens":11,"completion_tokens":2,"tokens_per_sec":66.7,"finish_reason":"stop","backend":"llamacpp","error":null}
{"type":"call","cell_id":"mac:llamacpp:qwen3.5:0.8b","prompt_id":"P-EASY","phase":"warm","run_idx":0,"first_delta_ms":120,"total_duration_ms":150,"prompt_tokens":11,"completion_tokens":2,"tokens_per_sec":66.7,"finish_reason":"stop","backend":"llamacpp","error":null}
```

Aggregator (`scripts/benchmarks/aggregate.py`) reads the JSONL and
emits the v1_BASELINE.md table. Re-running the aggregator on the same
JSONL is deterministic and idempotent.

### 6.2 Markdown table shape (v1_BASELINE.md)

Per cell, two stacked tables: cold and warm. Per (cell, prompt), one
row with `first_delta_ms`, `total_duration_ms`, `completion_tokens`,
`tokens_per_sec`. Across-prompts summary at the end of each cell's
section. The cell matrix appears as a top-level summary table
(TTFT-warm-p50 only) above the per-cell detail.

This shape stays fixed across baselines so v1 → v2 → v3 diffs are
mechanical. Adding a new metric goes at the end of the per-cell table;
never reorder existing columns.

---

## 7. Regression thresholds

A run is flagged on the bus when, vs.
the most recent baseline:

| Metric | Threshold | Severity |
|---|---|---|
| `warm.first_delta_ms.p50` (any cell) | ≥ 30% slower | regression — bus heads-up |
| `warm.total_duration_ms.p50` (any cell) | ≥ 30% slower | regression — bus heads-up |
| `warm.success_rate` (any cell) | < 0.95 | red flag — investigate before publishing |
| `warm.tokens_per_sec.mean` (any cell) | ≥ 30% lower | regression — bus heads-up |
| any of the above | ≥ 30% **improvement** | win — bus heads-up, publish |

30% is a deliberately wide threshold for v1 because run-to-run
variance on shared hardware (Mac M1 also runs Sloba's work) can easily
be 15-20%. Tighten when N is bumped from 5 to 20.

A regression doesn't auto-block a release; it triggers the operator
question "is this a real regression or a load-day blip?" and prompts a
re-run with `N=20`.

---

## 8. Reproducibility checklist

Before a baseline run is published, every entry in this checklist must
hold. If any fails, the JSONL is still shipped, but the
v1_BASELINE.md is annotated `unreliable: ` rather than presented as
canonical.

- [ ] `git status` clean OR every dirty file documented in the run
  metadata (e.g.
  "ignored: M mobile/macos/Flutter/GeneratedPluginRegistrant.swift")
- [ ] `git rev-parse HEAD` recorded
- [ ] Harness version recorded
- [ ] Each target's engine version recorded:
  - Ollama: `curl http://:11434/api/version`
  - llama.cpp: `curl http://:11436/health` (if exposed) or record the
    b-build from the operator config
- [ ] Each target's model digest recorded:
  - Ollama: `curl http://:11434/api/tags | jq '.models[]|select(.name=="").digest'`
  - llama.cpp: GGUF SHA256 from operator config (Bane's Pavilion
    install records this in his message; Mac TBD)
- [ ] Wall-clock window logged in heads-up message on bus
- [ ] No competing benchmark runs going (only one harness across
  cluster at a time — even a different model on a different node — to
  keep the network noise floor predictable)
- [ ] Sloba's prime-time avoided OR explicit authorization on the bus
  (Sam dispatch, or "go ahead" from Sloba in chat)

The harness itself enforces a subset:

- Refuses to run if `git status` shows changes the operator hasn't
  acknowledged via `--allow-dirty`.
- Refuses to run if `--for-publication` is set and any cell health
  check fails.
- Records the start-time load average and refuses to start if
  `getloadavg()[0] > 4.0` unless `--force-load` is set (the Mac is too
  busy and numbers will be noisy).

---

## 9. Workflow — running a baseline

```bash
# 1. Check the bus for heads-up window collisions
cd /Users/slobodan/projects/WeeyugaWeb
tail -50 coordination/CLAUDE_TRANSCRIPT.md

# 2. Post heads-up
#    (write coordination/messages/Z-benchmark-tester-ben-baseline-window.md
#    + transcript entry, commit + push)

# 3. Health-check targets
python3 scripts/benchmarks/run_harness.py --probe

# 4. Smoke (1 cell, 1 prompt, 1 run) to validate end-to-end
python3 scripts/benchmarks/run_harness.py --smoke --cells mac:llamacpp:qwen3.5:0.8b

# 5. Full v1 baseline
python3 scripts/benchmarks/run_harness.py --full --cells-yaml scripts/benchmarks/cells.yaml --prompts-yaml scripts/benchmarks/prompts.yaml

# 6. Aggregate to markdown
python3 scripts/benchmarks/aggregate.py docs/BENCHMARKS/runs/.jsonl > docs/BENCHMARKS/v1_BASELINE.md

# 7. Commit + push the JSONL ledger AND the markdown together (per-run commit)
git add docs/BENCHMARKS/runs/.jsonl docs/BENCHMARKS/v1_BASELINE.md
git commit -m "benchmark: v1 baseline run "
git push

# 8. Post bus message linking the result + a 1-paragraph framing for Janie
```

Subsequent baselines follow the same flow with the harness writing a
different `.jsonl` per invocation. Old ledgers are preserved forever —
they're the audit trail for "did this number move because of a code
change or a load-day blip."

---

## 10. Coordination contract

| Who | What I owe them | What they owe me |
|---|---|---|
| **Sam** | Per-deliverable transcript entries + weekly Mon "regressions/improvements" digest | Cross-cutting decisions; spawn coordination |
| **Nemanja** | Metric set ratification; cell matrix sanity-check; field-naming convention review | Authoritative ground envelope schema; ratifying `metadata.*` extension |
| **Atlas** | Header convention proposal; Mode-B test traffic scoped + visible | Brain copies `X-Weeyuga-Test*` → `metadata.*`; informs me on emit-path changes that affect harness |
| **Luka** | Heads-up before any run (so his dashboards aren't read during noisy windows) | Default `metadata.test:false` filter on prod dashboards; "show benchmark traffic" toggle; query-side help |
| **Bane / Viktor** | Heads-up before Pavilion / Predator runs; idle-coordination on long runs | Engine-version + model-digest reads on demand; infra-stability heads-up |
| **Pablo / Filip** | Heads-up if a measurement window overlaps their device-test windows | Awareness of when the harness is generating mobile-shape traffic |
| **Janie** | Raw numbers + a 1-paragraph framing per run (what's interesting here) | Storytelling — turning numbers into Janie blog posts |
| **Sloba** | Numbers when asked; standing offer | Authorization for prime-time runs; prompt freezes (don't change P-EASY/P-MEDIUM/P-HARD without ack) |

---

## 11. Hard rules I commit to

1. **`metadata.test=true` (or Mode-A direct-engine) on every benchmark
   call.** No silent benchmark traffic in production dashboards. Ever.
2. **Reproducibility metadata is not optional.** Numbers without git
   SHA + env + load-avg + harness version are deleted, not shipped.
3. **Frozen prompts.** P-EASY / P-MEDIUM / P-HARD never change once v1
   ships. New prompts get new IDs.
4. **No prime-time runs without bus heads-up.** Pavilion runs go
   between 02:00-05:00Z by default unless authorized otherwise. Mac
   runs that take more than 30 s of cumulative load coordinate with
   whatever Sloba's doing.
5. **Cluster impact ≤ 1 harness at a time.** Even on different nodes,
   two harnesses running simultaneously raise the network noise floor
   enough to break reproducibility. Serialize.
6. **Per-run commits** of both the JSONL ledger AND the v1_BASELINE.md
   so bisect on numbers is mechanical.
7. **No fork of the harness format.** New metrics extend the per-call
   record; never reorder or rename existing fields. Aggregator reads
   tolerate-old / require-new.
8. **No destructive load tests** without Sam dispatch. The harness
   runs ≤ 1 sustained call per second per cell by default; bursts come
   from running cells in parallel, and only when explicitly
   authorized.

---

## 12. What's deliberately not here (v2+ backlog)

- **Parallel-thread capacity tests.** Need careful scoping per node
  (1050 thrashes hard with 4 parallel; M1 unified RAM contends with
  user OS).
- **Cross-node routing cost.** Needs Atlas's brain header convention
  ratified.
- **GPU memory peak / CPU utilization sampling.** Needs SSH-driven
  on-target samplers.
- **Network bytes between harness and target.** `tcpdump -nn host `
  per run, easy to add when first cross-node run goes.
- **Tiny-model landscape exploration** (qwen2.5:0.5b vs gemma:2b vs
  phi3:mini vs others on M1 / 1050 / 1070 / CPU).
  Sam queued this as `docs/RESEARCH/SUB_HALF_SECOND_MODEL_LANDSCAPE.md`,
  feeding Atlas's personality engine work.
- **Sustained-load endurance** (1 hour at constant rate). Catches
  thermal throttling and Ollama queue growth.
- **Heavy-model coverage** (qwen3:4b / qwen3:9b / qwen3:35b-a3b on the
  nodes that can run them).
- **Embedding / vector / image-encoder benchmarks.** Needed for
  Atlas's personality engine if it adds non-LLM micro-calls.

Each is a real gap; v1 ships without them on purpose. The shape above
accommodates all of them as additive extensions.

---

_Owner: mac/benchmark-tester-ben. Created 2026-04-28._