Initial population of the weeyuga-benchmarks-public archive (PRIVATE staging visibility; flips public after Miljan + Stevan security audit sign-off per Sloba's 17:34Z dispatch).

Contents:
- README.md: public-facing intro (warns staging state, schema overview, citation pattern, license split)
- LICENSE: CC-BY-4.0 default (auto-init from Gitea)
- catalogue.json: schema_version=1.0-draft (locked once Tomas ratifies); 21 benchmarks indexed, 13 complete + 8 meta-only
- methodology.md: mirror of WeeyugaWeb docs/BENCHMARKS/HARNESS.md (canonical methodology)
- runs/<id>/run.jsonl|run.log|run.md|metadata.json: packaged copies of every run in WeeyugaWeb docs/BENCHMARKS/runs/*

Run set covers:
- Mission 1 (2026-04-28/29): pavilion-weeyuga-v1 + reconstructed v3 (96 calls, 16 models routed via weeyuga :11435)
- Predator trio (2026-05-04): granite-4.1-8B + gemma-4-E4B-it + qwen3.5-9B
- Predator qwen rerun (2026-05-04): qwen3.5-9B think500/nothink + qwen3-14B feasibility
- A3B campaign (2026-05-04/05): pavilion-a3b + predator-a3b NGL matrix + ctx sweep + NGL+ctx 2D + NGL=6 deep dive
- VPS50 CPU matrix + gemma-e4b CPU lane (2026-05-04/05)

Visibility GATE: this repo stays private until Miljan G1-G4 audit and Stevan G3 credential audit both green. After sign-off, single API call flips visibility=public, anonymous read on, push-protection requires auth, issues moderate by default.

No raw IPs, no SSH user@host strings, no /Users/ paths, no whisper transcripts in any of these files. Hardware names (pavilion, predator, vps50) are intentional and fine to share.

Builder: WeeyugaWeb/scripts/benchmarks/build_catalogue.py (deterministic, idempotent, ~5s wall on 21 runs).
Publish flow: WeeyugaWeb/scripts/benchmarks/publish_bench_run.py (builds packaged dirs, regenerates catalogue, optional --push to mirror into this repo, optional --deploy stub for cicd rsync).

Owner: mac/benchmark-tester-ben (Ben).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BENCHMARK HARNESS — Weeyuga cluster
Owner: mac/benchmark-tester-ben (Ben). Ping me on the bus with "Ben — benchmark X" and I'll either point at an existing harness output or queue a measurement.

Purpose: the reproducible measurement surface for the cluster. Anyone who reads v1_BASELINE.md (or any future baseline) needs to know exactly how the numbers were produced (same git SHA, same prompts, same env, same telemetry tagging) so cross-run diffs are actually comparable.

Companion docs:
- docs/BENCHMARKS/v1_BASELINE.md: the first matrix (per-node × per-engine × per-model × per-prompt) of cold/warm latency, tokens/sec, p50/p95.
- docs/MONITORING_RUNBOOK.md (Luka): query side, how to filter benchmark traffic out (metadata.test:true).
- coordination/HARDWARE_INVENTORY.md: node spec ground truth.
- docs/architecture/elasticsearch/templates/index-weeyuga-telemetry.json: ground envelope schema (metadata: flattened is where my tagging goes).
1. What I measure (the metric set)
Every run captures these fields. Anything missing means the run is discarded — partial numbers are not numbers.
Per-call (one prompt → one inference call)
| Field | Type | Source | Notes |
|---|---|---|---|
| first_delta_ms | int | client wall-clock between POST and first SSE chunk arriving | TTFT (most user-perceived) |
| total_duration_ms | int | client wall-clock between POST and final SSE chunk | full request span |
| prompt_tokens | int | response body usage | from /v1/chat/completions usage.prompt_tokens |
| completion_tokens | int | response body usage | for llama.cpp this includes reasoning tokens; for Ollama (think:false bypass) it's answer-only |
| tokens_per_sec | float | completion_tokens / ((total_duration_ms - first_delta_ms) / 1000) | excludes prefill/queue time |
| finish_reason | enum | response body | stop / length / error; anything other than stop is a flagged run |
| backend | enum | response header X-Weeyuga-Backend if present, else inferred from URL | ollama / llamacpp / brain-routed |
| cell_id | str | harness | <node>:<engine>:<model> (e.g. mac:llamacpp:qwen3.5:0.8b) |
| prompt_id | enum | harness | P-EASY / P-MEDIUM / P-HARD (frozen; see §3) |
| run_idx | int | harness | 0..N within a (cell,prompt) cell |
| phase | enum | harness | cold (first run after model-load) / warm (subsequent) |
| error | str? | harness | populated on exception; the run is recorded but excluded from p50/p95 |
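As a minimal sketch of the tokens_per_sec formula in the table above: the function name and the guard for a non-positive decode span are assumptions for illustration, not the harness's actual code.

```python
from typing import Optional


def tokens_per_sec(completion_tokens: int,
                   total_duration_ms: int,
                   first_delta_ms: int) -> Optional[float]:
    """completion_tokens / ((total_duration_ms - first_delta_ms) / 1000)."""
    decode_ms = total_duration_ms - first_delta_ms
    if decode_ms <= 0:
        # Assumption: a non-positive decode span yields "no rate" instead of
        # raising; the real harness may treat this case differently.
        return None
    return completion_tokens / (decode_ms / 1000)


# Values taken from the P-EASY cold record in §6.1:
# 2 completion tokens over a 30 ms decode span.
print(round(tokens_per_sec(2, 2840, 2810), 1))  # 66.7
```

Subtracting first_delta_ms is what keeps prefill and queue time out of the rate, so the figure reflects decode speed only.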
Per-batch (one cell = one cold + N warm)
| Field | Source |
|---|---|
| cold.first_delta_ms | the single cold run |
| cold.total_duration_ms | the single cold run |
| warm.first_delta_ms.p50 / .p95 | percentile across N warm runs |
| warm.total_duration_ms.p50 / .p95 | percentile across N warm runs |
| warm.tokens_per_sec.mean | mean across N warm runs |
| warm.success_rate | (N - error_count) / N |
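The per-batch aggregation could look roughly like the sketch below. The percentile method is not fixed anywhere in this document, so nearest-rank is an assumption here, as are the function names and the flat summary keys. Excluding errored runs from the mean (and not only from p50/p95) is also an assumption.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile (assumed method; the doc does not specify one)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_warm(calls):
    """Aggregate the N warm call records of one (cell, prompt) batch."""
    ok = [c for c in calls if c.get("error") is None]
    summary = {"success_rate": len(ok) / len(calls)}
    if ok:  # errored runs stay in the ledger but are left out of the stats
        for field in ("first_delta_ms", "total_duration_ms"):
            vals = [c[field] for c in ok]
            summary[field] = {"p50": percentile(vals, 50),
                              "p95": percentile(vals, 95)}
        summary["tokens_per_sec_mean"] = mean(c["tokens_per_sec"] for c in ok)
    return summary
```

With N = 5 and nearest-rank, p95 is simply the slowest warm run, which is worth keeping in mind when reading the baseline tables.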
Per-run (one harness invocation = many cells × many prompts)
| Field | Source |
|---|---|
| benchmark_run_id | uuid4() generated at harness start |
| git_sha | git rev-parse HEAD |
| git_dirty | True if git status --porcelain non-empty |
| harness_version | scripts/benchmarks/HARNESS_VERSION constant (bump on shape change) |
| started_at_utc / finished_at_utc | wall-clock |
| host | socket.gethostname() (where the harness is driven from) |
| load_avg_start / load_avg_end | os.getloadavg() snapshot |
| env_route | WEEYUGA_INFERENCE_ROUTE if set |
| env_llamacpp_url | WEEYUGA_QWEN35_LLAMACPP_URL if set |
| cells_planned | the cell list from cells.yaml after target-availability filtering |
These metadata fields are written once at the top of the JSONL
ledger as a single meta record, then every subsequent line is a
per-call record.
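A minimal sketch of how the start-of-run meta record might be assembled from the sources listed above. The helper names are hypothetical, harness_version is passed in rather than read from the HARNESS_VERSION constant, and finished_at_utc / load_avg_end are left out because they are only known at the end of a run. os.getloadavg() is Unix-only.

```python
import json
import os
import socket
import subprocess
import uuid
from datetime import datetime, timezone


def git(*args):
    """Run a git command and return stripped stdout ('' if git fails)."""
    try:
        return subprocess.run(["git", *args], capture_output=True,
                              text=True, check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return ""


def build_meta(cells_planned, harness_version="1"):
    """Collect the per-run fields that are known at harness start."""
    now = datetime.now(timezone.utc)
    return {
        "type": "meta",
        "benchmark_run_id": str(uuid.uuid4()),
        "git_sha": git("rev-parse", "HEAD"),
        "git_dirty": bool(git("status", "--porcelain")),
        "harness_version": harness_version,
        "started_at_utc": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "host": socket.gethostname(),
        "load_avg_start": list(os.getloadavg()),
        "env_route": os.environ.get("WEEYUGA_INFERENCE_ROUTE"),
        "env_llamacpp_url": os.environ.get("WEEYUGA_QWEN35_LLAMACPP_URL"),
        "cells_planned": cells_planned,
    }


# One JSON object on one line, ready to be the first line of the ledger.
print(json.dumps(build_meta(["mac:ollama:qwen3.5:0.8b"])))
```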
I do not measure GPU memory peak, CPU utilization, or network bytes in v1 — those need on-target instrumentation (nvidia-smi / top sampling / pcap) which adds harness complexity and another moving part. v2 will add per-target system-load samplers driven by SSH from the harness host. Today these are surfaced via Luka's Cluster Health Overview dashboard if the run window matters.
2. Run patterns
Every cell runs all four patterns by default; opt out per-cell via
cells.yaml flags.
2.1 Cold vs warm
- Cold: model is forced out of memory before the first call.
  - Ollama: POST /api/generate {"model":"<name>","keep_alive":0} with empty prompt to trigger unload, then a 2-second pause, then the measured call.
  - llama.cpp: server keeps the model resident across requests by design; there is no per-request unload. A "cold" llama.cpp run is captured by sending the very first request after a fresh llama-server start; on Pavilion + Mac the server is a long-lived daemon, so cold-after-restart is the only true cold, and we record the warm-from-now-on number for these. Marked cold_kind: process_warm in the JSONL.
- Warm: the call is preceded by another call to the same model on the same engine within the last 60 s.
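The Ollama unload step can be sketched with the standard library alone. The function names and the 10-second request timeout are assumptions; the request body and the 2-second pause follow the description above.

```python
import json
import time
import urllib.request


def build_unload_request(base_url: str, model: str) -> urllib.request.Request:
    """Build the POST /api/generate request that asks Ollama to unload a model."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def force_cold(base_url: str, model: str, pause_s: float = 2.0) -> None:
    """Send the unload request, then pause before the measured cold call."""
    request = build_unload_request(base_url, model)
    with urllib.request.urlopen(request, timeout=10):  # timeout is an assumption
        pass
    time.sleep(pause_s)
```

Usage would be force_cold("http://127.0.0.1:11434", "qwen3.5:0.8b") immediately before the cold measurement. There is no llama.cpp counterpart, since that server has no per-request unload.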
The harness runs 1 cold + 5 warm per (cell, prompt). 5 is the v1 N — small enough the run finishes in < 30 min for the v1 cell matrix; large enough to compute a meaningful p50/p95.
Bump N to 20 only when investigating a regression — it's expensive and adds nothing to baseline shape.
2.2 Single-thread vs N-parallel
v1: single-thread only. Parallel-capacity tests are v2 and need extra coordination for two reasons. Parallel inference on the GTX 1050 (only 4 GB VRAM, with qwen3.5:0.8b at ~1.2 GB) thrashes hard. And Mac's M1 unified memory is shared with the OS, so running 4 parallel inference calls during Sloba's work day is exactly the kind of "you didn't tell me you were going to do that" event that costs trust.
When v2 lands, parallel runs go into v1_BASELINE_PARALLEL.md (or
v2_BASELINE.md if I bump the baseline shape).
2.3 Local vs cross-node-routed
v1: local only. Each cell measures the engine on its own
node — the harness drives the call directly to that node's
:11434 (Ollama) or :11436 (llama.cpp).
v2 will add cross-node-routed cells: the harness drives the
call to the brain (https://cluster.weeyuga.com) with
WEEYUGA_INFERENCE_ROUTE=<node> set, the brain forwards to the
node, and we measure routing overhead = cross_node.duration_ms - local.duration_ms per (cell, prompt).
Atlas owns the routing knob; coordinate before adding cross-node cells (the brain may need to honor a benchmark header — see §5).
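For the v2 routing-overhead figure, a sketch of the subtraction per (cell, prompt). Which duration is compared is not pinned down above; this assumes both inputs hold the same statistic (for example warm total_duration_ms p50), and the function name is hypothetical.

```python
def routing_overhead_ms(local, cross_node):
    """Map (cell_id, prompt_id) -> cross_node.duration_ms - local.duration_ms.

    Only keys measured in both modes are compared.
    """
    return {
        key: cross_node[key] - local[key]
        for key in local.keys() & cross_node.keys()
    }


# Illustrative numbers only, not measurements.
local = {("pavilion:ollama:qwen3.5:0.8b", "P-EASY"): 150}
routed = {("pavilion:ollama:qwen3.5:0.8b", "P-EASY"): 210}
print(routing_overhead_ms(local, routed))
```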
3. Frozen canonical prompts
Three prompts. They never change once shipped — diffing across runs depends on the input being byte-identical.
If you need a new prompt for an investigation, add it as
P-NEW1/P-NEW2 to scripts/benchmarks/prompts.yaml; do NOT
edit the existing three. Future-Ben will thank you.
P-EASY:
  intent: trivial — single-token response space, near-zero work
  prompt: |
    hi
  max_tokens: 64
P-MEDIUM:
  intent: bounded structured task — 4 sentences on a known topic
  prompt: |
    Explain in 4 sentences why the sky appears blue at noon.
  max_tokens: 512
P-HARD:
  intent: open-ended creative — 200-word generation
  prompt: |
    Write a 200-word story about a fisherman who discovers a coin from a sunken empire.
  max_tokens: 1024
Why these three:
- P-EASY is the trivial-bypass test. Mac's Phase 0+1 routing classifies sub-3-word prompts to Ollama think:false bypass; on llama.cpp the same prompt eats reasoning budget and shows the "empty bubble" phone bug Atlas called out in the Phase 2 broadcast. P-EASY is how we keep that regression visible.
- P-MEDIUM is Bane's Phase 2 smoke prompt verbatim. We already have a sanity reference: Mac M1 llama.cpp 30.5 s, Pavilion GTX 1050 llama.cpp 37.8 s. New runs landing far from those values are a flag.
- P-HARD stresses the answer side: completion_tokens is the dominant axis, so this is where tokens-per-sec across hardware is most legible. The 200-word target is loose; finish_reason=stop vs length is a separate dimension we record.
The trio spans the prompt space cheaply: 1 token / ~50 token / ~250 token expected outputs, covering roughly two orders of magnitude.
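The byte-identical requirement on the frozen prompts can be checked mechanically. The sketch below is one possible guard, not part of the harness as described: the function names are hypothetical, and it assumes the YAML block scalar leaves exactly one trailing newline on each prompt.

```python
import hashlib

# The three frozen prompts as they would load from prompts.yaml.
# Assumption: the `|` block scalar keeps a single trailing newline.
FROZEN_PROMPTS = {
    "P-EASY": "hi\n",
    "P-MEDIUM": "Explain in 4 sentences why the sky appears blue at noon.\n",
    "P-HARD": ("Write a 200-word story about a fisherman who discovers "
               "a coin from a sunken empire.\n"),
}


def fingerprint(prompts):
    """SHA-256 over prompt IDs and UTF-8 prompt bytes, in sorted ID order."""
    digest = hashlib.sha256()
    for prompt_id in sorted(prompts):
        digest.update(prompt_id.encode("utf-8") + b"\0")
        digest.update(prompts[prompt_id].encode("utf-8") + b"\0")
    return digest.hexdigest()


def changed_prompts(frozen, current):
    """IDs of frozen prompts that are missing or no longer byte-identical."""
    return sorted(pid for pid, text in frozen.items()
                  if current.get(pid) != text)
```

Adding P-NEW1 to the current set leaves changed_prompts empty, while editing any of the original three (even by one space) reports its ID. Recording the fingerprint in the run metadata would let two ledgers prove they used the same inputs.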
4. Cell matrix
Cells are declared in scripts/benchmarks/cells.yaml. Each cell is
a (node, engine, model) triple plus availability flags.
v1 cells (the matrix v1_BASELINE.md captures):
| Cell ID | Node | Engine | Model | Endpoint | Available? |
|---|---|---|---|---|---|
| mac:ollama:qwen3.5:0.8b | Mac M1 | Ollama | qwen3.5:0.8b | http://127.0.0.1:11434 | ✅ |
| mac:ollama:qwen2.5-coder:0.5b | Mac M1 | Ollama | qwen2.5-coder:0.5b | http://127.0.0.1:11434 | ✅ if pulled |
| mac:ollama:qwen2.5-coder:1.5b | Mac M1 | Ollama | qwen2.5-coder:1.5b | http://127.0.0.1:11434 | ✅ if pulled |
| mac:llamacpp:qwen3.5:0.8b | Mac M1 | llama.cpp | qwen3.5:0.8b Q4_K_M | http://127.0.0.1:11436 | ✅ if llama-server running |
| pavilion:ollama:qwen3.5:0.8b | Pavilion | Ollama | qwen3.5:0.8b | http://10.8.0.3:11434 | ✅ via WG |
| pavilion:llamacpp:qwen3.5:0.8b | Pavilion | llama.cpp | qwen3.5:0.8b Q4_K_M | http://10.8.0.3:11436 | ✅ via WG |
| predator:ollama:qwen3.5:0.8b | Predator | Ollama | qwen3.5:0.8b | http://10.8.0.7:11434 | ⏳ if pulled |
| predator:llamacpp:qwen3.5:0.8b | Predator | llama.cpp | (pending) | http://10.8.0.7:11436 | ❌ pending Trinity Job B |
Skipped in v1:
- cicd Ollama — cicd is the brain host. Driving inference load on it directly risks affecting Sloba's mobile chat path. Add only on explicit Sam dispatch.
- qwen3:4b / qwen3:9b / qwen3:35b-a3b on any node — heavier models, need their own measurement pass with longer N and longer windows. Queued for v2.
The harness probes each cell's availability with a 1-second
HEAD/GET health check before running. Unavailable cells are
recorded as skipped: <reason> in the JSONL, NOT silently dropped.
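A sketch of the probe-then-record behavior described above. The function names are hypothetical, the exact shape of the skipped record is an assumption (the doc only says the reason is recorded), and treating any HTTP response, even an error status, as "engine is up" is also an assumption.

```python
import urllib.error
import urllib.request


def http_probe(endpoint: str, timeout_s: float = 1.0):
    """Return None if the endpoint answers within the timeout, else a reason."""
    try:
        with urllib.request.urlopen(endpoint, timeout=timeout_s):
            return None
    except urllib.error.HTTPError:
        return None  # assumption: any HTTP status means the engine is listening
    except (urllib.error.URLError, OSError) as exc:
        return str(exc)


def plan_cells(cells, probe=http_probe):
    """Split cells into runnable IDs and explicit 'skipped' ledger records."""
    runnable, skipped = [], []
    for cell_id, endpoint in cells.items():
        reason = probe(endpoint)
        if reason is None:
            runnable.append(cell_id)
        else:
            # Assumed record shape: written to the JSONL, never dropped.
            skipped.append({"type": "call", "cell_id": cell_id,
                            "phase": "skipped", "skipped": reason})
    return runnable, skipped
```

Passing the probe as a parameter keeps the planning logic testable without a live engine; the default does a plain GET with a 1-second timeout.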
5. Telemetry tagging — metadata.test=true + metadata.benchmark_run_id
Hard rule (per Sam's kickoff): every benchmark call that flows
through the brain or any path that emits to
weeyuga-telemetry-* must be tagged so Luka's dashboards can
filter benchmark traffic out of production graphs.
v1 convention (proposed — see §5.4 for coordination)
For events landing in weeyuga-telemetry-* (ground envelope):
{
  "...": "...",
  "metadata": {
    "test": true,
    "benchmark_run_id": "<uuid4>",
    "benchmark_cell_id": "mac:llamacpp:qwen3.5:0.8b",
    "benchmark_prompt_id": "P-MEDIUM",
    "benchmark_phase": "warm",
    "benchmark_run_idx": 3,
    "harness_version": "1"
  }
}
metadata is flattened in the index template (Nemanja
weeyuga-mappings-common v2.1 + index-weeyuga-telemetry.json),
so the keys above index without a mapping change. That's the
intended extension hatch.
For events landing in weeyuga-logs-* / behavioral indices, the
equivalent goes under labels.test=true (also flattened).
5.1 Mode A — direct-engine (default for v1)
The harness drives calls directly to the engine (:11434,
:11436), which does not emit ground envelope. No telemetry
tag is needed because no telemetry is generated. Pure empirical
measurement, zero brain side-effect.
v1 baseline runs in Mode A only. Cleanest, fastest, lowest coordination cost.
5.2 Mode B — brain-routed (v2)
When v2 adds cross-node routing measurement, the harness drives
calls to https://cluster.weeyuga.com and the brain forwards.
Brain emits ground envelope on every dispatch.
Convention proposed for Atlas + Luka:
- Harness sends headers:
  - X-Weeyuga-Test: true
  - X-Weeyuga-Benchmark-Run-Id: <uuid>
  - X-Weeyuga-Benchmark-Cell-Id: <id>
  - X-Weeyuga-Benchmark-Prompt-Id: <id>
- Brain copies header values into metadata.test, metadata.benchmark_run_id, etc. on every emitted envelope for this request.
- Luka's dashboards default-filter metadata.test:false OR NOT metadata.test:*. A "show benchmark traffic" toggle flips the filter.
Status: proposed by Ben 2026-04-28; awaiting Atlas + Luka ratification before v2 lands. Until then, v1 stays Mode A.
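To make the proposed Mode B convention concrete, here is a sketch of both ends: the headers the harness would send and the header-to-metadata copy the brain would perform. Since the convention is not yet ratified, everything here is illustrative; the function names and the case-insensitive lookup are assumptions, and the brain's real emit path is not shown in this document.

```python
# Proposed header name -> metadata.* key, per the convention above.
HEADER_TO_METADATA = {
    "X-Weeyuga-Test": "test",
    "X-Weeyuga-Benchmark-Run-Id": "benchmark_run_id",
    "X-Weeyuga-Benchmark-Cell-Id": "benchmark_cell_id",
    "X-Weeyuga-Benchmark-Prompt-Id": "benchmark_prompt_id",
}


def benchmark_headers(run_id: str, cell_id: str, prompt_id: str) -> dict:
    """Harness side: headers attached to every brain-routed benchmark call."""
    return {
        "X-Weeyuga-Test": "true",
        "X-Weeyuga-Benchmark-Run-Id": run_id,
        "X-Weeyuga-Benchmark-Cell-Id": cell_id,
        "X-Weeyuga-Benchmark-Prompt-Id": prompt_id,
    }


def headers_to_metadata(headers: dict) -> dict:
    """Brain side (hypothetical): copy header values into metadata.* keys."""
    lowered = {name.lower(): value for name, value in headers.items()}
    metadata = {}
    for header, key in HEADER_TO_METADATA.items():
        value = lowered.get(header.lower())
        if value is None:
            continue  # untagged production traffic gets no metadata.test
        metadata[key] = (value.lower() == "true") if key == "test" else value
    return metadata
```

Requests without the headers produce an empty dict, which matches the NOT metadata.test:* half of the default dashboard filter.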
5.3 Local ledger (always emitted, regardless of mode)
Independent of brain telemetry, the harness writes
docs/BENCHMARKS/runs/<benchmark_run_id>.jsonl on the harness
host (Mac). One JSON object per line; first line is the meta
record (§1 per-run fields), subsequent lines are per-call records.
The local ledger is the canonical source for v1_BASELINE.md and any aggregation. Brain telemetry is a nice-to-have for cross-correlation in Kibana but is NOT load-bearing on the baseline numbers.
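A minimal sketch of the ledger layout (meta record on line 1, one call record per following line). The function names are hypothetical and the real harness may stream records as calls complete instead of writing them in one pass.

```python
import json
from pathlib import Path


def write_ledger(path, meta, calls):
    """Write the meta record first, then one per-call record per line."""
    with Path(path).open("w", encoding="utf-8") as fh:
        fh.write(json.dumps(meta) + "\n")
        for call in calls:
            fh.write(json.dumps(call) + "\n")


def read_ledger(path):
    """Return (meta, calls); raise if line 1 is not the meta record."""
    with Path(path).open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    if not records or records[0].get("type") != "meta":
        raise ValueError("ledger must start with a meta record")
    return records[0], records[1:]
```

Rejecting a ledger without a leading meta record enforces the rule that numbers without reproducibility metadata are not usable.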
5.4 Coordination ratification needed
This convention is currently unilateral from Ben. Before Mode B ships:
- Atlas confirms the brain copies
X-Weeyuga-Test*headers intometadata.*on every envelope without breaking existing emit paths. - Luka adds the default
metadata.testfilter to all prod-facing dashboards (Cluster Health Overview / Mobile Chat Activity / Agent Telemetry / Error Funnel / Cluster Connectivity) and confirms the toggle works. - Nemanja ratifies that
metadata: flattenedis the right place (vs. extendingactormapping) — leaning yes per his §3.4 use offlattenedfor forward-compat.
A separate transcript dispatch carries this proposal. Until all three ack, v1 baseline runs Mode A only and leaves brain telemetry untouched.
6. Output format
6.1 Per-run JSONL ledger
docs/BENCHMARKS/runs/<benchmark_run_id>.jsonl
Line 1 — meta record (per-run fields, §1).
Lines 2..N — call records (per-call fields, §1) plus a
phase: "skipped" line for any cell that failed availability.
Example (truncated):
{"type":"meta","benchmark_run_id":"4e2a...","git_sha":"e2d6a6d","git_dirty":true,"harness_version":"1","started_at_utc":"2026-04-29T02:00:00Z","host":"slobodan-mac","load_avg_start":[1.2,1.5,1.4],"cells_planned":["mac:ollama:qwen3.5:0.8b","mac:llamacpp:qwen3.5:0.8b","pavilion:ollama:qwen3.5:0.8b","pavilion:llamacpp:qwen3.5:0.8b"]}
{"type":"call","cell_id":"mac:llamacpp:qwen3.5:0.8b","prompt_id":"P-EASY","phase":"cold","run_idx":0,"first_delta_ms":2810,"total_duration_ms":2840,"prompt_tokens":11,"completion_tokens":2,"tokens_per_sec":66.7,"finish_reason":"stop","backend":"llamacpp","error":null}
{"type":"call","cell_id":"mac:llamacpp:qwen3.5:0.8b","prompt_id":"P-EASY","phase":"warm","run_idx":0,"first_delta_ms":120,"total_duration_ms":150,"prompt_tokens":11,"completion_tokens":2,"tokens_per_sec":66.7,"finish_reason":"stop","backend":"llamacpp","error":null}
Aggregator (scripts/benchmarks/aggregate.py) reads the JSONL
and emits the v1_BASELINE.md table. Re-running the aggregator on
the same JSONL is deterministic and idempotent.
6.2 Markdown table shape (v1_BASELINE.md)
Per cell, two stacked tables: cold and warm. Per (cell, prompt),
one row with first_delta_ms, total_duration_ms, completion_tokens,
tokens_per_sec. Across-prompts summary at the end of each
cell's section. The cell matrix appears as a top-level summary
table (TTFT-warm-p50 only) above the per-cell detail.
This shape stays fixed across baselines so v1 → v2 → v3 diffs are mechanical. Adding a new metric goes at the end of the per-cell table; never reorder existing columns.
7. Regression thresholds
A run is flagged on the bus when, vs. the most recent baseline:
| Metric | Threshold | Severity |
|---|---|---|
| warm.first_delta_ms.p50 (any cell) | ≥ 30% slower | regression: bus heads-up |
| warm.total_duration_ms.p50 (any cell) | ≥ 30% slower | regression: bus heads-up |
| warm.success_rate (any cell) | < 0.95 | red flag: investigate before publishing |
| warm.tokens_per_sec.mean (any cell) | ≥ 30% lower | regression: bus heads-up |
| any of the above | ≥ 30% improvement | win: bus heads-up, publish |
30% is a deliberately wide threshold for v1 because run-to-run variance on shared hardware (Mac M1 also runs Sloba's work) can easily be 15-20%. Tighten when N is bumped from 5 to 20.
A regression doesn't auto-block a release; it triggers the
operator question "is this a real regression or a load-day blip?"
and prompts a re-run with N=20.
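The threshold table can be expressed as a small comparison function. This is a sketch under assumptions: the flat metric keys and the function name are hypothetical, and "30% improvement" is read as the mirror image of the regression rule (latency at or below 70% of baseline, throughput at or above 130%).

```python
# +1: higher is worse (latency). -1: lower is worse (throughput).
DIRECTION = {
    "first_delta_ms_p50": +1,
    "total_duration_ms_p50": +1,
    "tokens_per_sec_mean": -1,
}


def flag_regressions(baseline, current, threshold=0.30, min_success=0.95):
    """Compare one cell's warm summary to the baseline; return flag strings."""
    flags = []
    for metric, sign in DIRECTION.items():
        change = (current[metric] - baseline[metric]) / baseline[metric]
        worsening = sign * change
        if worsening >= threshold:
            flags.append(f"regression: {metric} {change:+.0%}")
        elif worsening <= -threshold:
            flags.append(f"win: {metric} {change:+.0%}")
    if current["success_rate"] < min_success:
        flags.append(f"red flag: success_rate {current['success_rate']:.2f}")
    return flags
```

A flag is a prompt for the operator question above, not a release gate, so the function only reports and never raises.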
8. Reproducibility checklist
Before a baseline run is published, every entry in this checklist
must hold. If any fails, the JSONL is shipped but the
v1_BASELINE.md is annotated unreliable: <reason> rather than
written to be read as canonical.
- git status clean OR every dirty file documented in the run metadata (e.g. "ignored: M mobile/macos/Flutter/GeneratedPluginRegistrant.swift")
- git rev-parse HEAD recorded
- Harness version recorded
- Each target's engine version recorded:
  - Ollama: curl http://<node>:11434/api/version
  - llama.cpp: curl http://<node>:11436/health (if exposed) or record the b-build from the operator config
- Each target's model digest recorded:
  - Ollama: curl http://<node>:11434/api/tags | jq '.models[]|select(.name=="<name>").digest'
  - llama.cpp: GGUF SHA256 from operator config (Bane's Pavilion install records this in his message; Mac TBD)
- Wall-clock window logged in heads-up message on bus
- No competing benchmark runs going (only one harness across cluster at a time, even a different model on a different node, to keep the network noise floor predictable)
- Sloba's prime-time avoided OR explicit authorization on the bus (Sam dispatch, or "go ahead" from Sloba in chat)
The harness itself enforces a subset:
- Refuses to run if git status shows changes the operator hasn't acknowledged via --allow-dirty.
- Refuses to run if --for-publication is set and any cell health check fails.
- Records the start-time load average and refuses to start if getloadavg()[0] > 4.0 unless --force-load is set (the Mac is too busy and numbers will be noisy).
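The three enforced checks can be sketched as one pure function that takes already-collected inputs, which keeps it easy to test. The function name and message wording are assumptions; the flags and the 4.0 load limit come from the list above.

```python
def preflight(dirty_files, load_avg_1m, failed_cells, *,
              allow_dirty=False, for_publication=False,
              force_load=False, max_load=4.0):
    """Return the reasons to refuse the run; an empty list means go."""
    refusals = []
    if dirty_files and not allow_dirty:
        refusals.append(f"{len(dirty_files)} dirty file(s); pass --allow-dirty")
    if for_publication and failed_cells:
        refusals.append(f"health check failed for: {', '.join(failed_cells)}")
    if load_avg_1m > max_load and not force_load:
        refusals.append(f"load average {load_avg_1m:.1f} > {max_load}; "
                        "pass --force-load")
    return refusals


# Example: a clean tree on a quiet host with all cells healthy.
print(preflight([], 1.2, []))  # []
```

In a real invocation the inputs would come from git status --porcelain, os.getloadavg()[0], and the cell health probes.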
9. Workflow — running a baseline
# 1. Check the bus for heads-up window collisions
cd /Users/slobodan/projects/WeeyugaWeb
tail -50 coordination/CLAUDE_TRANSCRIPT.md
# 2. Post heads-up
# (write coordination/messages/<utc>Z-benchmark-tester-ben-baseline-window.md
# + transcript entry, commit + push)
# 3. Health-check targets
python3 scripts/benchmarks/run_harness.py --probe
# 4. Smoke (1 cell, 1 prompt, 1 run) to validate end-to-end
python3 scripts/benchmarks/run_harness.py --smoke --cells mac:llamacpp:qwen3.5:0.8b
# 5. Full v1 baseline
python3 scripts/benchmarks/run_harness.py --full --cells-yaml scripts/benchmarks/cells.yaml --prompts-yaml scripts/benchmarks/prompts.yaml
# 6. Aggregate to markdown
python3 scripts/benchmarks/aggregate.py docs/BENCHMARKS/runs/<run-id>.jsonl > docs/BENCHMARKS/v1_BASELINE.md
# 7. Commit + push the JSONL ledger AND the markdown together (per-run commit)
git add docs/BENCHMARKS/runs/<run-id>.jsonl docs/BENCHMARKS/v1_BASELINE.md
git commit -m "benchmark: v1 baseline run <run-id>"
git push
# 8. Post bus message linking the result + a 1-paragraph framing for Janie
Subsequent baselines follow the same flow with the harness writing
a different <run-id>.jsonl per invocation. Old ledgers are
preserved forever — they're the audit trail for "did this number
move because of a code change or a load-day blip."
10. Coordination contract
| Who | What I owe them | What they owe me |
|---|---|---|
| Sam | Per-deliverable transcript entries + weekly Mon "regressions/improvements" digest | Cross-cutting decisions; spawn coordination |
| Nemanja | Metric set ratification; cell matrix sanity-check; field-naming convention review | Authoritative ground envelope schema; ratifying metadata.* extension |
| Atlas | Header convention proposal; Mode-B test traffic scoped + visible | Brain copies X-Weeyuga-Test* → metadata.*; informs me on emit-path changes that affect harness |
| Luka | Heads-up before any run (so his dashboards aren't read during noisy windows) | Default metadata.test:false filter on prod dashboards; "show benchmark traffic" toggle; query-side help |
| Bane / Viktor | Heads-up before Pavilion / Predator runs; idle-coordination on long runs | Engine-version + model-digest reads on demand; infra-stability heads-up |
| Pablo / Filip | Heads-up if a measurement window overlaps their device-test windows | Awareness of when the harness is generating mobile-shape traffic |
| Janie | Raw numbers + a 1-paragraph framing per run (what's interesting here) | Storytelling — turning numbers into Janie blog posts |
| Sloba | Numbers when asked; standing offer | Authorization for prime-time runs; prompt freezes (don't change P-EASY/P-MEDIUM/P-HARD without ack) |
11. Hard rules I commit to
- metadata.test=true (or Mode-A direct-engine) on every benchmark call. No silent benchmark traffic in production dashboards. Ever.
- Reproducibility metadata is not optional. Numbers without git SHA + env + load-avg + harness version are deleted, not shipped.
- Frozen prompts. P-EASY / P-MEDIUM / P-HARD never change once v1 ships. New prompts get new IDs.
- No prime-time runs without bus heads-up. Pavilion runs go between 02:00-05:00Z by default unless authorized otherwise. Mac runs that take more than 30 s of cumulative load coordinate with whatever Sloba's doing.
- Cluster impact ≤ 1 harness at a time. Even running on different nodes, two harnesses running simultaneously add network noise floor that breaks reproducibility. Serialize.
- Per-run commits of both the JSONL ledger AND the v1_BASELINE.md so bisect on numbers is mechanical.
- No fork of the harness format. New metrics extend the per-call record; never reorder or rename existing fields. Aggregator reads tolerate-old / require-new.
- No destructive load tests without Sam dispatch. The harness runs ≤ 1 sustained call per second per cell by default; bursts come from bursting cells in parallel only when explicitly authorized.
12. What's deliberately not here (v2+ backlog)
- Parallel-thread capacity tests. Need careful scoping per node (1050 thrashes hard with 4 parallel; M1 unified RAM contends with user OS).
- Cross-node routing cost. Needs Atlas's brain header convention ratified.
- GPU memory peak / CPU utilization sampling. Needs SSH-driven on-target samplers.
- Network bytes between harness and target. tcpdump -nn host <ip> per run, easy to add when first cross-node run goes.
- Tiny-model landscape exploration (qwen2.5:0.5b vs gemma:2b vs phi3:mini vs others on M1 / 1050 / 1070 / CPU). Sam queued this as docs/RESEARCH/SUB_HALF_SECOND_MODEL_LANDSCAPE.md, feeding Atlas's personality engine work.
- Sustained-load endurance (1 hour at constant rate). Catches thermal throttling and Ollama queue growth.
- Heavy-model coverage (qwen3:4b / qwen3:9b / qwen3:35b-a3b on the nodes that can run them).
- Embedding / vector / image-encoder benchmarks. Needed for Atlas's personality engine if it adds non-LLM micro-calls.
Each is a real gap; v1 ships without them on purpose. The shape above accommodates all of them as additive extensions.
Owner: mac/benchmark-tester-ben. Created 2026-04-28.