Initial population of the weeyuga-benchmarks-public archive (PRIVATE staging visibility — flips public after Miljan + Stevan security audit sign-off per Sloba's 17:34Z dispatch).

Contents:
- README.md — public-facing intro (warns staging state, schema overview, citation pattern, license split)
- LICENSE — CC-BY-4.0 default (auto-init from Gitea)
- catalogue.json — schema_version=1.0-draft (locked once Tomas ratifies); 21 benchmarks indexed, 13 complete + 8 meta-only
- methodology.md — mirror of WeeyugaWeb docs/BENCHMARKS/HARNESS.md (canonical methodology)
- runs/<id>/run.jsonl|run.log|run.md|metadata.json — packaged copies of every run in WeeyugaWeb docs/BENCHMARKS/runs/*

Run set covers:
- Mission 1 (2026-04-28/29): pavilion-weeyuga-v1 + reconstructed v3 (96 calls, 16 models routed via weeyuga :11435)
- Predator trio (2026-05-04): granite-4.1-8B + gemma-4-E4B-it + qwen3.5-9B
- Predator qwen rerun (2026-05-04): qwen3.5-9B think500/nothink + qwen3-14B feasibility
- A3B campaign (2026-05-04/05): pavilion-a3b + predator-a3b NGL matrix + ctx sweep + NGL+ctx 2D + NGL=6 deep dive
- VPS50 CPU matrix + gemma-e4b CPU lane (2026-05-04/05)

Visibility GATE: this repo stays private until Miljan G1-G4 audit and Stevan G3 credential audit both green. After sign-off, single API call flips visibility=public, anonymous read on, push-protection requires auth, issues moderate by default.

No raw IPs, no SSH user@host strings, no /Users/ paths, no whisper transcripts in any of these files. Hardware names (pavilion, predator, vps50) are intentional and fine to share.

Builder: WeeyugaWeb/scripts/benchmarks/build_catalogue.py (deterministic, idempotent, ~5s wall on 21 runs).
Publish flow: WeeyugaWeb/scripts/benchmarks/publish_bench_run.py (builds packaged dirs, regenerates catalogue, optional --push to mirror into this repo, optional --deploy stub for cicd rsync).

Owner: mac/benchmark-tester-ben (Ben).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# BENCHMARK HARNESS — Weeyuga cluster

> **Owner:** mac/benchmark-tester-ben (Ben). Ping me on the bus with
> `Ben — benchmark X` and I'll either point at an existing harness
> output or queue a measurement.
>
> **Purpose:** the reproducible measurement surface for the cluster.
> Anyone who reads `v1_BASELINE.md` (or any future baseline) needs
> to know exactly how the numbers were produced — same git SHA,
> same prompts, same env, same telemetry tagging — so cross-run
> diffs are actually comparable.
>
> **Companion docs:**
> - `docs/BENCHMARKS/v1_BASELINE.md` — the first matrix (per-node ×
>   per-engine × per-model × per-prompt) of cold/warm latency,
>   tokens/sec, p50/p95.
> - `docs/MONITORING_RUNBOOK.md` (Luka) — query side, how to filter
>   benchmark traffic out (`metadata.test:true`).
> - `coordination/HARDWARE_INVENTORY.md` — node spec ground truth.
> - `docs/architecture/elasticsearch/templates/index-weeyuga-telemetry.json` —
>   ground envelope schema (`metadata: flattened` is where my
>   tagging goes).

---

## 1. What I measure (the metric set)

Every run captures these fields. Anything missing means the run is
discarded — partial numbers are not numbers.

### Per-call (one prompt → one inference call)

| Field | Type | Source | Notes |
|---|---|---|---|
| `first_delta_ms` | int | client wall-clock between `POST` and first SSE chunk arriving | TTFT — most user-perceived |
| `total_duration_ms` | int | client wall-clock between `POST` and final SSE chunk | full request span |
| `prompt_tokens` | int | response body usage | from `/v1/chat/completions` `usage.prompt_tokens` |
| `completion_tokens` | int | response body usage | for llama.cpp this includes reasoning tokens; for Ollama (think:false bypass) it's answer-only |
| `tokens_per_sec` | float | `completion_tokens / ((total_duration_ms - first_delta_ms) / 1000)` | excludes prefill/queue time |
| `finish_reason` | enum | response body | `stop` / `length` / `error` — anything other than `stop` is a flagged run |
| `backend` | enum | response header `X-Weeyuga-Backend` if present, else inferred from URL | `ollama` / `llamacpp` / `brain-routed` |
| `cell_id` | str | harness | `<node>:<engine>:<model>` (e.g. `mac:llamacpp:qwen3.5:0.8b`) |
| `prompt_id` | enum | harness | `P-EASY` / `P-MEDIUM` / `P-HARD` (frozen — see §3) |
| `run_idx` | int | harness | `0..N-1` within a (cell, prompt, phase) group; restarts at 0 for warm, as in the §6.1 example |
| `phase` | enum | harness | `cold` (first run after model-load) / `warm` (subsequent) |
| `error` | str? | harness | populated on exception; the run is recorded but excluded from p50/p95 |

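The `tokens_per_sec` formula in the table can be sketched as a small helper. This is an illustration, not the harness's actual code: the function name is hypothetical, and returning `None` when the decode window is zero or negative is an assumption about how such calls should be handled.

```python
from typing import Optional


def tokens_per_sec(completion_tokens: int,
                   first_delta_ms: int,
                   total_duration_ms: int) -> Optional[float]:
    """Decode throughput: tokens over the time after the first SSE chunk."""
    decode_ms = total_duration_ms - first_delta_ms
    if decode_ms <= 0:
        # Assumption: no measurable decode window, so no meaningful rate.
        return None
    return completion_tokens / (decode_ms / 1000)
```

With the cold P-EASY call from the §6.1 example (2 tokens, 2810 ms to first chunk, 2840 ms total) the decode window is 30 ms, which is why very short completions produce noisy rates.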
### Per-batch (one cell = one cold + N warm)

| Field | Source |
|---|---|
| `cold.first_delta_ms` | the single cold run |
| `cold.total_duration_ms` | the single cold run |
| `warm.first_delta_ms.p50` / `.p95` | percentile across N warm runs |
| `warm.total_duration_ms.p50` / `.p95` | percentile across N warm runs |
| `warm.tokens_per_sec.mean` | mean across N warm runs |
| `warm.success_rate` | `(N - error_count) / N` |

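A minimal sketch of how the per-batch warm fields could be derived from per-call records. The percentile method is not specified in this document; nearest-rank is assumed here, and `nearest_rank` / `summarize_warm` are hypothetical names rather than functions from `aggregate.py`.

```python
import math
from statistics import mean


def nearest_rank(values, pct):
    """Nearest-rank percentile (assumed method): smallest value covering pct% of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_warm(calls):
    """Aggregate the warm per-call records of one (cell, prompt) group."""
    ok = [c for c in calls if c.get("error") is None]  # errored runs are excluded
    summary = {"success_rate": len(ok) / len(calls)}   # (N - error_count) / N
    if ok:
        for field in ("first_delta_ms", "total_duration_ms"):
            vals = [c[field] for c in ok]
            summary[field] = {"p50": nearest_rank(vals, 50),
                              "p95": nearest_rank(vals, 95)}
        summary["tokens_per_sec_mean"] = mean(c["tokens_per_sec"] for c in ok)
    return summary
```

Under nearest-rank with N = 5, p95 is simply the slowest successful warm run, which is worth keeping in mind when reading v1 p95 columns.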
### Per-run (one harness invocation = many cells × many prompts)

| Field | Source |
|---|---|
| `benchmark_run_id` | `uuid4()` generated at harness start |
| `git_sha` | `git rev-parse HEAD` |
| `git_dirty` | `True` if `git status --porcelain` non-empty |
| `harness_version` | `scripts/benchmarks/HARNESS_VERSION` constant (bump on shape change) |
| `started_at_utc` / `finished_at_utc` | wall-clock |
| `host` | `socket.gethostname()` (where the harness is driven from) |
| `load_avg_start` / `load_avg_end` | `os.getloadavg()` snapshot |
| `env_route` | `WEEYUGA_INFERENCE_ROUTE` if set |
| `env_llamacpp_url` | `WEEYUGA_QWEN35_LLAMACPP_URL` if set |
| `cells_planned` | the cell list from `cells.yaml` after target-availability filtering |

These metadata fields are written **once at the top of the JSONL ledger**
as a single `meta` record, then every subsequent line is a
per-call record.

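The `meta` record described above could be assembled roughly as follows. This is a sketch under assumptions: `build_meta` is a hypothetical name, the git values are passed in by the caller (who would obtain them from `git rev-parse HEAD` and `git status --porcelain`), and the timestamp format mirrors the §6.1 example.

```python
import os
import socket
import uuid
from datetime import datetime, timezone


def build_meta(git_sha, git_dirty, harness_version, cells_planned):
    """Build the first ledger line; end-of-run fields are filled in later."""
    return {
        "type": "meta",
        "benchmark_run_id": str(uuid.uuid4()),
        "git_sha": git_sha,
        "git_dirty": git_dirty,
        "harness_version": harness_version,
        "started_at_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "host": socket.gethostname(),
        "load_avg_start": list(os.getloadavg()),  # Unix-only snapshot: 1, 5, 15 min
        "env_route": os.environ.get("WEEYUGA_INFERENCE_ROUTE"),
        "env_llamacpp_url": os.environ.get("WEEYUGA_QWEN35_LLAMACPP_URL"),
        "cells_planned": cells_planned,
    }
```

Serializing the result with `json.dumps` gives one JSONL line; `finished_at_utc` and `load_avg_end` are only known once the run completes, so how they get recorded is left open here.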
I do **not** measure GPU memory peak, CPU utilization, or network
bytes in v1 — those need on-target instrumentation (nvidia-smi /
top sampling / pcap) which adds harness complexity and another
moving part. v2 will add per-target system-load samplers driven by
SSH from the harness host. Today these are surfaced via Luka's
Cluster Health Overview dashboard if the run window matters.

---

## 2. Run patterns
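
Every cell runs every pattern enabled for the current baseline by
default; opt out per-cell via `cells.yaml` flags. In v1 that means
cold and warm runs, single-threaded and local (§2.2 and §2.3 defer
the parallel and cross-node-routed variants to v2).
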
### 2.1 Cold vs warm

- **Cold**: model is forced out of memory before the first call.
  - Ollama: `POST /api/generate {"model":"<name>","keep_alive":0}` with empty prompt to trigger unload, then a 2-second pause, then the measured call.
  - llama.cpp: server keeps the model resident across requests by
    design — there is no per-request unload. A "cold" llama.cpp run
    is captured by sending the very first request after a fresh
    `llama-server` start. On Pavilion + Mac the server is a
    long-lived daemon, so cold-after-restart is the only true cold.
    Without a restart, the cold slot for these cells records the
    first call against the already-warm process and is marked
    `cold_kind: process_warm` in the JSONL.
- **Warm**: the call is preceded by another call to the same model
  on the same engine within the last 60 s.

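The Ollama cold-start step above (unload request, then a 2-second pause) might look like this in the harness. It is a sketch only: `build_unload_request` and `force_cold` are hypothetical names, the 10-second request timeout is an assumption, and the `send` / `sleep` parameters exist so the logic can be exercised without a live engine.

```python
import json
import time
import urllib.request


def build_unload_request(base_url, model):
    """POST /api/generate with keep_alive=0 and no prompt, asking Ollama to unload."""
    payload = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def force_cold(base_url, model, pause_s=2.0,
               send=urllib.request.urlopen, sleep=time.sleep):
    """Unload the model, then wait before the measured cold call."""
    with send(build_unload_request(base_url, model), timeout=10) as resp:
        resp.read()  # drain the response so the unload has been acknowledged
    sleep(pause_s)
```

Calling `force_cold("http://127.0.0.1:11434", "qwen3.5:0.8b")` immediately before the measured request would cover the Ollama cells; llama.cpp cells have no equivalent, as noted above.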
The harness runs **1 cold + 5 warm** per (cell, prompt). 5 is the
v1 N — small enough the run finishes in < 30 min for the v1 cell
matrix; large enough to compute a meaningful p50/p95.

Bump N to 20 only when investigating a regression — it's expensive
and adds nothing to baseline shape.

### 2.2 Single-thread vs N-parallel

**v1: single-thread only.** Parallel-capacity tests are v2 and need
extra coordination because parallel inference on GTX 1050 (only 4 GB
VRAM with qwen3.5:0.8b at ~1.2 GB) thrashes hard, and Mac's M1
unified memory shares with the OS — running 4 parallel inference
calls during Sloba's work day is exactly the kind of "you didn't
tell me you were going to do that" event that earns trust loss.

When v2 lands, parallel runs go into `v1_BASELINE_PARALLEL.md` (or
v2_BASELINE.md if I bump the baseline shape).

### 2.3 Local vs cross-node-routed

**v1: local only.** Each cell measures the engine on its own
node — the harness drives the call directly to that node's
`:11434` (Ollama) or `:11436` (llama.cpp).

**v2 will add cross-node-routed cells**: the harness drives the
call to the brain (`https://cluster.weeyuga.com`) with
`WEEYUGA_INFERENCE_ROUTE=<node>` set, the brain forwards to the
node, and we measure routing overhead =
`cross_node.duration_ms - local.duration_ms` per (cell, prompt).

Atlas owns the routing knob; coordinate before adding cross-node
cells (the brain may need to honor a benchmark header — see §5).

---

## 3. Frozen canonical prompts

Three prompts. They never change once shipped — diffing across
runs depends on the input being byte-identical.

If you need a new prompt for an investigation, **add it as
`P-NEW1`/`P-NEW2`** to `scripts/benchmarks/prompts.yaml`; do NOT
edit the existing three. Future-Ben will thank you.

```yaml
P-EASY:
  intent: trivial — single-token response space, near-zero work
  prompt: |
    hi
  max_tokens: 64

P-MEDIUM:
  intent: bounded structured task — 4 sentences on a known topic
  prompt: |
    Explain in 4 sentences why the sky appears blue at noon.
  max_tokens: 512

P-HARD:
  intent: open-ended creative — 200-word generation
  prompt: |
    Write a 200-word story about a fisherman who discovers a coin from a sunken empire.
  max_tokens: 1024
```

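One way to make the byte-identical requirement checkable is to pin a SHA-256 digest per prompt and compare on every run. This document does not describe such a guard, so treat the following as a hypothetical addition: `fingerprint` and `changed_prompts` are illustrative names, and the trailing newline on each string assumes the YAML `|` block scalar is loaded with its default chomping.

```python
import hashlib

# Prompt texts as a YAML loader would return them (assumed trailing newline).
FROZEN_PROMPTS = {
    "P-EASY": "hi\n",
    "P-MEDIUM": "Explain in 4 sentences why the sky appears blue at noon.\n",
    "P-HARD": "Write a 200-word story about a fisherman who discovers "
              "a coin from a sunken empire.\n",
}


def fingerprint(prompts):
    """Map prompt_id -> SHA-256 hex digest of the UTF-8 prompt bytes."""
    return {pid: hashlib.sha256(text.encode("utf-8")).hexdigest()
            for pid, text in prompts.items()}


def changed_prompts(pinned, current):
    """Return the pinned prompt IDs whose bytes no longer match."""
    now = fingerprint(current)
    return sorted(pid for pid, digest in pinned.items() if now.get(pid) != digest)
```

Only pinned IDs are checked, so adding `P-NEW1` for an investigation passes while any edit to the original three, even whitespace, is reported.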
Why these three:

- **P-EASY** is the trivial-bypass test. Mac's Phase 0+1 routing
  classifies sub-3-word prompts to Ollama `think:false` bypass; on
  llama.cpp the same prompt eats reasoning budget and shows the
  "empty bubble" phone bug Atlas called out in the Phase 2 broadcast.
  P-EASY is how we keep that regression visible.
- **P-MEDIUM** is Bane's Phase 2 smoke prompt verbatim. We already
  have a sanity reference: Mac M1 llama.cpp 30.5 s, Pavilion GTX 1050
  llama.cpp 37.8 s. New runs landing far from those values are a
  flag.
- **P-HARD** stresses the answer side — completion_tokens is the
  dominant axis, so this is where tokens-per-sec across hardware is
  most legible. The 200-word target is loose; finish_reason=stop
  vs length is a separate dimension we record.

The trio spans the prompt space cheaply: roughly 1 / ~50 / ~250
expected output tokens, a spread of more than two orders of
magnitude.

---

## 4. Cell matrix

Cells are declared in `scripts/benchmarks/cells.yaml`. Each cell is
a `(node, engine, model)` triple plus availability flags.

**v1 cells** (the matrix v1_BASELINE.md captures):

| Cell ID | Node | Engine | Model | Endpoint | Available? |
|---|---|---|---|---|---|
| `mac:ollama:qwen3.5:0.8b` | Mac M1 | Ollama | qwen3.5:0.8b | `http://127.0.0.1:11434` | ✅ |
| `mac:ollama:qwen2.5-coder:0.5b` | Mac M1 | Ollama | qwen2.5-coder:0.5b | `http://127.0.0.1:11434` | ✅ if pulled |
| `mac:ollama:qwen2.5-coder:1.5b` | Mac M1 | Ollama | qwen2.5-coder:1.5b | `http://127.0.0.1:11434` | ✅ if pulled |
| `mac:llamacpp:qwen3.5:0.8b` | Mac M1 | llama.cpp | qwen3.5:0.8b Q4_K_M | `http://127.0.0.1:11436` | ✅ if `llama-server` running |
| `pavilion:ollama:qwen3.5:0.8b` | Pavilion | Ollama | qwen3.5:0.8b | `http://10.8.0.3:11434` | ✅ via WG |
| `pavilion:llamacpp:qwen3.5:0.8b` | Pavilion | llama.cpp | qwen3.5:0.8b Q4_K_M | `http://10.8.0.3:11436` | ✅ via WG |
| `predator:ollama:qwen3.5:0.8b` | Predator | Ollama | qwen3.5:0.8b | `http://10.8.0.7:11434` | ⏳ if pulled |
| `predator:llamacpp:qwen3.5:0.8b` | Predator | llama.cpp | (pending) | `http://10.8.0.7:11436` | ❌ pending Trinity Job B |

**Skipped in v1**:

- **cicd Ollama** — cicd is the brain host. Driving inference
  load on it directly risks affecting Sloba's mobile chat path.
  Add only on explicit Sam dispatch.
- **qwen3:4b / qwen3:9b / qwen3:35b-a3b on any node** — heavier
  models, need their own measurement pass with longer N and longer
  windows. Queued for v2.

The harness probes each cell's availability with a 1-second
`HEAD`/`GET` health check before running. Unavailable cells are
recorded as `skipped: <reason>` in the JSONL, NOT silently dropped.

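The availability probe described above could be sketched like this. The exact probe path and the shape of the skipped record are not specified in this document, so both are assumptions: `probe_cell` is a hypothetical name, it issues a plain `GET` against the endpoint root, and the record combines the `phase: "skipped"` marker from §6.1 with a `skipped` reason string.

```python
import urllib.request


def probe_cell(cell_id, endpoint, timeout_s=1.0, fetch=urllib.request.urlopen):
    """Return None if the endpoint answers within the timeout, else a skipped record."""
    try:
        with fetch(endpoint, timeout=timeout_s):
            return None
    except OSError as exc:  # URLError, HTTPError and socket timeouts are all OSError
        return {
            "type": "call",
            "cell_id": cell_id,
            "phase": "skipped",
            "skipped": f"{type(exc).__name__}: {exc}",
        }
```

A non-`None` result is written to the ledger as-is, so an unreachable cell leaves a visible line rather than disappearing from the run.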
---

## 5. Telemetry tagging — `metadata.test=true` + `metadata.benchmark_run_id`

**Hard rule (per Sam's kickoff):** every benchmark call that flows
through the brain or any path that emits to
`weeyuga-telemetry-*` must be tagged so Luka's dashboards can
filter benchmark traffic out of production graphs.

### v1 convention (proposed — see §5.4 for coordination)

For events landing in `weeyuga-telemetry-*` (ground envelope):

```jsonc
{
  "...": "...",
  "metadata": {
    "test": true,
    "benchmark_run_id": "<uuid4>",
    "benchmark_cell_id": "mac:llamacpp:qwen3.5:0.8b",
    "benchmark_prompt_id": "P-MEDIUM",
    "benchmark_phase": "warm",
    "benchmark_run_idx": 3,
    "harness_version": "1"
  }
}
```

`metadata` is `flattened` in the index template (Nemanja
`weeyuga-mappings-common` v2.1 + `index-weeyuga-telemetry.json`),
so the keys above index without a mapping change. That's the
intended extension hatch.

For events landing in `weeyuga-logs-*` / behavioral indices, the
equivalent goes under `labels.test=true` (also `flattened`).

### 5.1 Mode A — direct-engine (default for v1)

The harness drives calls **directly to the engine** (`:11434`,
`:11436`), which does **not** emit ground envelope. No telemetry
tag is needed because no telemetry is generated. Pure empirical
measurement, zero brain side-effect.

**v1 baseline runs in Mode A only.** Cleanest, fastest, lowest
coordination cost.

### 5.2 Mode B — brain-routed (v2)

When v2 adds cross-node routing measurement, the harness drives
calls to `https://cluster.weeyuga.com` and the brain forwards.
Brain emits ground envelope on every dispatch.

**Convention proposed for Atlas + Luka:**

- Harness sends headers:
  - `X-Weeyuga-Test: true`
  - `X-Weeyuga-Benchmark-Run-Id: <uuid>`
  - `X-Weeyuga-Benchmark-Cell-Id: <id>`
  - `X-Weeyuga-Benchmark-Prompt-Id: <id>`
- Brain copies header values into `metadata.test`,
  `metadata.benchmark_run_id`, etc. on every emitted envelope for
  this request.
- Luka's dashboards default-filter
  `metadata.test:false OR NOT metadata.test:*`. A "show benchmark
  traffic" toggle flips the filter.

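Both halves of the proposed header convention fit in a few lines. This is a sketch of the proposal, not the brain's implementation: the function names are hypothetical, the header-to-key mapping follows the §5 envelope example, and parsing `X-Weeyuga-Test` into a boolean is an assumption.

```python
HEADER_TO_METADATA = {
    "X-Weeyuga-Test": "test",
    "X-Weeyuga-Benchmark-Run-Id": "benchmark_run_id",
    "X-Weeyuga-Benchmark-Cell-Id": "benchmark_cell_id",
    "X-Weeyuga-Benchmark-Prompt-Id": "benchmark_prompt_id",
}


def benchmark_headers(run_id, cell_id, prompt_id):
    """Harness side: headers attached to every Mode B request."""
    return {
        "X-Weeyuga-Test": "true",
        "X-Weeyuga-Benchmark-Run-Id": run_id,
        "X-Weeyuga-Benchmark-Cell-Id": cell_id,
        "X-Weeyuga-Benchmark-Prompt-Id": prompt_id,
    }


def headers_to_metadata(headers):
    """Brain side: copy recognised headers into the envelope's metadata dict."""
    lowered = {name.lower(): value for name, value in headers.items()}
    metadata = {}
    for header, key in HEADER_TO_METADATA.items():
        value = lowered.get(header.lower())
        if value is None:
            continue  # untagged production traffic gets no benchmark keys
        metadata[key] = (value.lower() == "true") if key == "test" else value
    return metadata
```

Because untagged requests produce an empty dict, production envelopes carry no `metadata.test` key at all, which is the case the `NOT metadata.test:*` half of the dashboard filter covers.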
**Status:** proposed by Ben 2026-04-28; awaiting Atlas + Luka
ratification before v2 lands. Until then, v1 stays Mode A.

### 5.3 Local ledger (always emitted, regardless of mode)

Independent of brain telemetry, the harness writes
`docs/BENCHMARKS/runs/<benchmark_run_id>.jsonl` on the harness
host (Mac). One JSON object per line; first line is the `meta`
record (§1 per-run fields), subsequent lines are per-call records.

The local ledger is the **canonical source** for v1_BASELINE.md
and any aggregation. Brain telemetry is a nice-to-have for
cross-correlation in Kibana but is NOT load-bearing on the
baseline numbers.

### 5.4 Coordination ratification needed

This convention is currently **unilateral** from Ben. Before Mode
B ships:

- Atlas confirms the brain copies `X-Weeyuga-Test*` headers into
  `metadata.*` on every envelope without breaking existing emit
  paths.
- Luka adds the default `metadata.test` filter to all
  prod-facing dashboards (Cluster Health Overview / Mobile Chat
  Activity / Agent Telemetry / Error Funnel / Cluster Connectivity)
  and confirms the toggle works.
- Nemanja ratifies that `metadata: flattened` is the right place
  (vs. extending `actor` mapping) — leaning yes per his §3.4 use
  of `flattened` for forward-compat.

A separate transcript dispatch carries this proposal. Until all
three ack, v1 baseline runs Mode A only and leaves brain
telemetry untouched.

---

## 6. Output format

### 6.1 Per-run JSONL ledger

`docs/BENCHMARKS/runs/<benchmark_run_id>.jsonl`

Line 1 — `meta` record (per-run fields, §1).
Lines 2..N — `call` records (per-call fields, §1) plus a
`phase: "skipped"` line for any cell that failed availability.

Example (truncated):

```jsonl
{"type":"meta","benchmark_run_id":"4e2a...","git_sha":"e2d6a6d","git_dirty":true,"harness_version":"1","started_at_utc":"2026-04-29T02:00:00Z","host":"slobodan-mac","load_avg_start":[1.2,1.5,1.4],"cells_planned":["mac:ollama:qwen3.5:0.8b","mac:llamacpp:qwen3.5:0.8b","pavilion:ollama:qwen3.5:0.8b","pavilion:llamacpp:qwen3.5:0.8b"]}
{"type":"call","cell_id":"mac:llamacpp:qwen3.5:0.8b","prompt_id":"P-EASY","phase":"cold","run_idx":0,"first_delta_ms":2810,"total_duration_ms":2840,"prompt_tokens":11,"completion_tokens":2,"tokens_per_sec":66.7,"finish_reason":"stop","backend":"llamacpp","error":null}
{"type":"call","cell_id":"mac:llamacpp:qwen3.5:0.8b","prompt_id":"P-EASY","phase":"warm","run_idx":0,"first_delta_ms":120,"total_duration_ms":150,"prompt_tokens":11,"completion_tokens":2,"tokens_per_sec":66.7,"finish_reason":"stop","backend":"llamacpp","error":null}
```

Aggregator (`scripts/benchmarks/aggregate.py`) reads the JSONL
and emits the v1_BASELINE.md table. Re-running the aggregator on
the same JSONL is deterministic and idempotent.

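Reading a ledger back follows directly from the layout above: first line `meta`, the rest per-call or skipped records. A minimal sketch, assuming skipped cells are identified by `phase == "skipped"`; `read_ledger` is a hypothetical name and not necessarily how `aggregate.py` is structured.

```python
import json


def read_ledger(lines):
    """Split JSONL ledger lines into (meta, calls, skipped)."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records or records[0].get("type") != "meta":
        raise ValueError("first ledger line must be the meta record")
    meta, rest = records[0], records[1:]
    skipped = [r for r in rest if r.get("phase") == "skipped"]
    calls = [r for r in rest if r.get("phase") != "skipped"]
    return meta, calls, skipped
```

Passing an open file object works because iterating a text file yields lines; nothing in the function depends on wall-clock time or ordering beyond the first record, which keeps re-reads deterministic.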
### 6.2 Markdown table shape (v1_BASELINE.md)

Per cell, two stacked tables: cold and warm. Per (cell, prompt),
one row with `first_delta_ms`, `total_duration_ms`, `completion_tokens`,
`tokens_per_sec`. Across-prompts summary at the end of each
cell's section. The cell matrix appears as a top-level summary
table (TTFT-warm-p50 only) above the per-cell detail.

This shape stays fixed across baselines so v1 → v2 → v3 diffs
are mechanical. Adding a new metric goes at the end of the per-cell
table; never reorder existing columns.

---

## 7. Regression thresholds

A run is flagged on the bus when, vs. the most recent baseline:

| Metric | Threshold | Severity |
|---|---|---|
| `warm.first_delta_ms.p50` (any cell) | ≥ 30% slower | regression — bus heads-up |
| `warm.total_duration_ms.p50` (any cell) | ≥ 30% slower | regression — bus heads-up |
| `warm.success_rate` (any cell) | < 0.95 | red flag — investigate before publishing |
| `warm.tokens_per_sec.mean` (any cell) | ≥ 30% lower | regression — bus heads-up |
| Any latency or throughput metric above | ≥ 30% **improvement** | win — bus heads-up, publish |

30% is a deliberately wide threshold for v1 because run-to-run
variance on shared hardware (Mac M1 also runs Sloba's work) can
easily be 15-20%. Tighten when N is bumped from 5 to 20.

A regression doesn't auto-block a release; it triggers the
operator question "is this a real regression or a load-day blip?"
and prompts a re-run with `N=20`.

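The threshold table can be read as a small classifier. The sketch below assumes the 30% figure is a relative change against the baseline value, `(current - baseline) / baseline`; the function name and the returned labels are hypothetical.

```python
LOWER_IS_BETTER = {"warm.first_delta_ms.p50", "warm.total_duration_ms.p50"}
HIGHER_IS_BETTER = {"warm.tokens_per_sec.mean"}


def classify(metric, baseline, current, threshold=0.30, min_success=0.95):
    """Return 'regression', 'improvement', 'red_flag' or 'ok' for one cell metric."""
    if metric == "warm.success_rate":
        # Absolute floor, not a comparison against the baseline.
        return "red_flag" if current < min_success else "ok"
    change = (current - baseline) / baseline
    if metric in LOWER_IS_BETTER:
        change = -change  # flip sign so positive always means "better"
    elif metric not in HIGHER_IS_BETTER:
        raise ValueError(f"unknown metric: {metric}")
    if change <= -threshold:
        return "regression"
    if change >= threshold:
        return "improvement"
    return "ok"
```

A result of `regression` is a prompt for the re-run with `N=20` described above, not an automatic block.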
---

## 8. Reproducibility checklist

Before a baseline run is published, every entry in this checklist
must hold. If any fails, the JSONL is shipped but the
v1_BASELINE.md is annotated `unreliable: <reason>` rather than
presented as canonical.

- [ ] `git status` clean OR every dirty file documented in the run
      metadata (e.g. "ignored: M mobile/macos/Flutter/GeneratedPluginRegistrant.swift")
- [ ] `git rev-parse HEAD` recorded
- [ ] Harness version recorded
- [ ] Each target's engine version recorded:
      - Ollama: `curl http://<node>:11434/api/version`
      - llama.cpp: `curl http://<node>:11436/health` (if exposed) or
        record the b-build from the operator config
- [ ] Each target's model digest recorded:
      - Ollama: `curl http://<node>:11434/api/tags | jq '.models[]|select(.name=="<name>").digest'`
      - llama.cpp: GGUF SHA256 from operator config (Bane's Pavilion
        install records this in his message; Mac TBD)
- [ ] Wall-clock window logged in heads-up message on bus
- [ ] No competing benchmark runs going (only one harness across
      cluster at a time — even a different model on a different
      node — to keep the network noise floor predictable)
- [ ] Sloba's prime-time avoided OR explicit authorization on the
      bus (Sam dispatch, or "go ahead" from Sloba in chat)

The harness itself enforces a subset:

- Refuses to run if `git status` shows changes the operator hasn't
  acknowledged via `--allow-dirty`.
- Refuses to run if `--for-publication` is set and any cell health
  check fails.
- Records the start-time load average and refuses to start if
  `getloadavg()[0] > 4.0` unless `--force-load` is set (the Mac
  is too busy and numbers will be noisy).

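The three harness-enforced refusals can be expressed as one pure pre-flight check. This is an illustrative sketch: `preflight` is a hypothetical name, and the caller is assumed to supply the `git status --porcelain` lines, `os.getloadavg()[0]`, and the list of cells that failed the health check.

```python
def preflight(dirty_files, load_1m, failed_cells,
              allow_dirty=False, force_load=False, for_publication=False,
              max_load=4.0):
    """Return the reasons the harness should refuse to start (empty list = go)."""
    problems = []
    if dirty_files and not allow_dirty:
        problems.append(f"dirty working tree ({len(dirty_files)} files); pass --allow-dirty")
    if for_publication and failed_cells:
        problems.append("health check failed for: " + ", ".join(failed_cells))
    if load_1m > max_load and not force_load:
        problems.append(f"1-min load {load_1m:.1f} > {max_load}; pass --force-load")
    return problems
```

Returning every reason at once, rather than stopping at the first, lets the operator fix or acknowledge all blockers in a single pass.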
---

## 9. Workflow — running a baseline

```bash
# 1. Check the bus for heads-up window collisions
cd /Users/slobodan/projects/WeeyugaWeb
tail -50 coordination/CLAUDE_TRANSCRIPT.md

# 2. Post heads-up
#    (write coordination/messages/<utc>Z-benchmark-tester-ben-baseline-window.md
#    + transcript entry, commit + push)

# 3. Health-check targets
python3 scripts/benchmarks/run_harness.py --probe

# 4. Smoke (1 cell, 1 prompt, 1 run) to validate end-to-end
python3 scripts/benchmarks/run_harness.py --smoke --cells mac:llamacpp:qwen3.5:0.8b

# 5. Full v1 baseline
python3 scripts/benchmarks/run_harness.py --full --cells-yaml scripts/benchmarks/cells.yaml --prompts-yaml scripts/benchmarks/prompts.yaml

# 6. Aggregate to markdown
python3 scripts/benchmarks/aggregate.py docs/BENCHMARKS/runs/<run-id>.jsonl > docs/BENCHMARKS/v1_BASELINE.md

# 7. Commit + push the JSONL ledger AND the markdown together (per-run commit)
git add docs/BENCHMARKS/runs/<run-id>.jsonl docs/BENCHMARKS/v1_BASELINE.md
git commit -m "benchmark: v1 baseline run <run-id>"
git push

# 8. Post bus message linking the result + a 1-paragraph framing for Janie
```

Subsequent baselines follow the same flow with the harness writing
a different `<run-id>.jsonl` per invocation. Old ledgers are
preserved forever — they're the audit trail for "did this number
move because of a code change or a load-day blip."

---

## 10. Coordination contract

| Who | What I owe them | What they owe me |
|---|---|---|
| **Sam** | Per-deliverable transcript entries + weekly Mon "regressions/improvements" digest | Cross-cutting decisions; spawn coordination |
| **Nemanja** | Metric set ratification; cell matrix sanity-check; field-naming convention review | Authoritative ground envelope schema; ratifying `metadata.*` extension |
| **Atlas** | Header convention proposal; Mode-B test traffic scoped + visible | Brain copies `X-Weeyuga-Test*` → `metadata.*`; informs me on emit-path changes that affect harness |
| **Luka** | Heads-up before any run (so his dashboards aren't read during noisy windows) | Default `metadata.test:false` filter on prod dashboards; "show benchmark traffic" toggle; query-side help |
| **Bane / Viktor** | Heads-up before Pavilion / Predator runs; idle-coordination on long runs | Engine-version + model-digest reads on demand; infra-stability heads-up |
| **Pablo / Filip** | Heads-up if a measurement window overlaps their device-test windows | Awareness of when the harness is generating mobile-shape traffic |
| **Janie** | Raw numbers + a 1-paragraph framing per run (what's interesting here) | Storytelling — turning numbers into Janie blog posts |
| **Sloba** | Numbers when asked; standing offer | Authorization for prime-time runs; prompt freezes (don't change P-EASY/P-MEDIUM/P-HARD without ack) |

---

## 11. Hard rules I commit to

1. **`metadata.test=true` (or Mode-A direct-engine) on every benchmark
   call.** No silent benchmark traffic in production dashboards. Ever.
2. **Reproducibility metadata is not optional.** Numbers without
   git SHA + env + load-avg + harness version are deleted, not
   shipped.
3. **Frozen prompts.** P-EASY / P-MEDIUM / P-HARD never change once
   v1 ships. New prompts get new IDs.
4. **No prime-time runs without bus heads-up.** Pavilion runs go
   between 02:00-05:00Z by default unless authorized otherwise.
   Mac runs that take more than 30 s of cumulative load coordinate
   with whatever Sloba's doing.
5. **Cluster impact ≤ 1 harness at a time.** Even on different
   nodes, two harnesses running simultaneously raise the network
   noise floor enough to break reproducibility. Serialize.
6. **Per-run commits** of both the JSONL ledger AND the
   v1_BASELINE.md so bisect on numbers is mechanical.
7. **No fork of the harness format.** New metrics extend the
   per-call record; never reorder or rename existing fields.
   Aggregator reads are tolerate-old / require-new.
8. **No destructive load tests** without Sam dispatch. The harness
   runs ≤ 1 sustained call per second per cell by default; bursts
   (running cells in parallel) happen only when explicitly
   authorized.

---

## 12. What's deliberately not here (v2+ backlog)

- **Parallel-thread capacity tests.** Need careful scoping per node
  (1050 thrashes hard with 4 parallel; M1 unified RAM contends with
  user OS).
- **Cross-node routing cost.** Needs Atlas's brain header
  convention ratified.
- **GPU memory peak / CPU utilization sampling.** Needs SSH-driven
  on-target samplers.
- **Network bytes between harness and target.** `tcpdump -nn host <ip>`
  per run, easy to add when the first cross-node run goes.
- **Tiny-model landscape exploration** (qwen2.5:0.5b vs gemma:2b
  vs phi3:mini vs others on M1 / 1050 / 1070 / CPU). Sam queued
  this as `docs/RESEARCH/SUB_HALF_SECOND_MODEL_LANDSCAPE.md`,
  feeding Atlas's personality engine work.
- **Sustained-load endurance** (1 hour at constant rate). Catches
  thermal throttling and Ollama queue growth.
- **Heavy-model coverage** (qwen3:4b / qwen3:9b / qwen3:35b-a3b on
  the nodes that can run them).
- **Embedding / vector / image-encoder benchmarks.** Needed for
  Atlas's personality engine if it adds non-LLM micro-calls.

Each is a real gap; v1 ships without them on purpose. The shape
above accommodates all of them as additive extensions.

---

_Owner: mac/benchmark-tester-ben. Created 2026-04-28._