feat: harness + agent runbook — flip repo from archive-only to

crowdsourced runner Sloba's chat directive 2026-05-06: "this project is preparation for going public ... ship the harness along so others can join in." The repo's original purpose (Ben's catalogue + 21 reference run ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second purpose: a portable harness + agent runbook so a friend's coding agent can clone, read CLAUDE.md, run the same suite on the friend's hardware, and submit results back as a PR. What landed: CLAUDE.md + AGENTS.md (byte-identical, ~520 lines) Full agent runbook: hardware probe, runtime + model selection, canonical knob reference (Sloba's Pavilion methodology values), hardware-adaptation decision rules, run-instructions, output-schema templates for hardware.json + metadata.json + run.md, PR submission flow (fork → branch → push → PR; nothing auto-merges), privacy guardrails, methodology lineage. Per Sloba's Q3 directive: the runbook explicitly tells the friend's agent to ADAPT to hardware reality and document deviations rather than blindly run defaults. CONTRIBUTING.md (~110 lines) Human-readable companion for the friend (not the agent). What you need, how it works, what we ask, what maintainers commit to, license, code-of-conduct short version. harness/ ├── README.md Technical readme for the harness folder ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from │ WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py │ v3 with the cluster-internal IP defaults │ (10.8.0.x) replaced by 127.0.0.1:11434, the │ cluster /v1/cluster/* endpoints removed, the │ canonical-suite paths under ~/Documents/MyServers │ replaced by harness/suites/ paths, the git-sha │ enforcement on WeeyugaWeb dropped, and the │ output written under submissions/<handle>/<tag>/ │ instead of docs/BENCHMARKS/runs/. Supports all │ six suite phases via --phases, plus 'all'. ├── prompts.py Verbatim copy of the canonical 3 frozen prompts │ (P-EASY/P-MEDIUM/P-HARD) from │ WeeyugaWeb/scripts/benchmarks/prompts.py. ├── requirements.txt Empty by intent (stdlib-only); placeholder for │ pip-tools / agent auto-install patterns. ├── .gitignore __pycache__/ etc. └── suites/ Six bundled JSON suites copied verbatim from Sloba's MyServers/instances/vps-81-17-99-14/telemetry/: small_model_eval_questions.json, python_task_suite_questions.json, parallel_qwen_same_model_20q_suite.json, parallel_qwen_mixed_model_20q_suite.json, python_context_edge_append_questions.json, python_context_edge_suite_only.json. submissions/ README.md Folder convention + naming + reviewability rules EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/ Synthetic-but-shape-complete contribution template: manifest.json, hardware.json, run.jsonl (5 example lines), metadata.json, run.md (with privacy attestation, methodology deviations, reproducibility command). Marked as synthetic at the top so future analysis doesn't accidentally cite it. LICENSE-MIT MIT for harness/*.py and future helper code. Existing LICENSE (CC-BY-4.0) covers data files. README.md (modified) Updated to reflect dual purpose. Layout diagram updated. Maintainer credits: Ben for catalogue/methodology + Bane for harness. Contributor quick-start added. Status table extended. Privacy posture: - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames / paths / tokens. Two prompts contain project names ("MyBoard" auth debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03); flagged in chat for Sloba's review. Otherwise clean. - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal IPs leaked). - manifest.json captures host_hostname_short via socket.gethostname() .split('.')[0] — agent should review before PR if hostname is sensitive. - CLAUDE.md §8 spells out the privacy-grep before push. Verification: - py_compile run_benchmark.py: OK - --help renders cleanly - All 6 suite JSON files: valid - All 4 example JSON files: valid - Example run.jsonl (5 lines): valid This commit lands on branch feature/runner-and-agent-instructions. NOT pushed to main; staying on the feature branch until Sloba reviews on Gitea and merges. Bus dispatch to Ben + Sam announcing the architectural pivot lives in the WeeyugaWeb coordination repo.
2026-05-06 11:07:55 +02:00
parent ddc9626136
commit 97a9245d9e
22 changed files with 4400 additions and 47 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,436 @@
+# CLAUDE.md — runbook for the friend's coding agent
+
+> **You are the coding agent (Claude Code, Codex, Aider, Cursor, …) of
+> someone who agreed to run the Weeyuga benchmark suite on their hardware.
+> This file is your briefing. Read it completely before doing anything.**
+>
+> If your runtime auto-loads `AGENTS.md` instead of `CLAUDE.md`, that file
+> is byte-identical — same content, different filename for tools that
+> prefer one over the other.
+
+---
+
+## TL;DR (60 seconds)
+
+You are going to:
+
+1. **Probe** the friend's hardware (OS, CPU, RAM, GPU + VRAM).
+2. **Pick** an OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX) and a model that fits.
+3. **Adapt** the bundled runner to the friend's reality (this is the part you can't skip — see §3).
+4. **Run** `harness/run_benchmark.py` against the runtime; it writes a JSONL ledger of every call.
+5. **Document** the run in `submissions/<friend-handle>/<device-tag>/run-<uuid>/`.
+6. **Submit** a PR back to this repo. Sloba reviews and merges manually — nothing auto-merges.
+
+You have **read-only access** to this repo. To submit, fork on Gitea, push to your fork, open a PR.
+
+The harness is a **template**, not a one-click button. The friend's hardware will not match Sloba's; you are expected to adapt parameters, research best-known values online when you hit unfamiliar constraints, and **write down what you changed and why** in the manifest. Documented deviation is fine. Silent deviation breaks comparability.
+
+---
+
+## 0. Read this completely before doing anything
+
+The rest of this file is structured as the order you'll work in. Reading the whole thing first gives you the shape; then the friend can ask "go" and you execute without circling back.
+
+If you hit something genuinely ambiguous, ASK THE FRIEND. Don't guess at hardware-specific values; either verify with measurement or research them from the project / model authors' recommended-settings docs.
+
+---
+
+## 1. What you are running, exactly
+
+**Inputs (frozen across runs):**
+- `harness/prompts.py` — three frozen prompts (P-EASY, P-MEDIUM, P-HARD). Never modified.
+- `harness/suites/*.json` — six benchmark suites, all run sequentially per model:
+  | phase key | suite file | what it tests |
+  |---|---|---|
+  | `5q` | `small_model_eval_questions.json` | 5 short-answer formatting + correctness questions |
+  | `20q` | `python_task_suite_questions.json` | 20 realistic Python task prompts |
+  | `parallel_same` | `parallel_qwen_same_model_20q_suite.json` | parallel-lane stress with one model |
+  | `parallel_mixed` | `parallel_qwen_mixed_model_20q_suite.json` | parallel-lane stress with multiple models |
+  | `edge_append` | `python_context_edge_append_questions.json` | long-context append behavior |
+  | `edge_suite` | `python_context_edge_suite_only.json` | long-context whole-suite reasoning |
+
+**Driver:** `harness/run_benchmark.py` — one process, sequential calls to your local OpenAI-compatible `/v1/chat/completions` endpoint, one JSONL line per call.
+
+**Output:** `submissions/<handle>/<device-tag>/run-<uuid>/` containing:
+- `run.jsonl` — every call recorded
+- `manifest.json` — written automatically by the runner
+- `hardware.json` — **you fill this** from the hardware probe (§2)
+- `metadata.json` — computed aggregates (you generate, see §6)
+- `run.md` — human-readable summary (you write, see §6)
+
+**Run order:** ALL six suites run in sequence per model, per the canonical Pavilion methodology Sloba uses. Don't pick-and-choose unless the friend is explicitly time-constrained — partial runs are still useful but they're documented as "partial" in the manifest, and they show up as `archive-only` in Sloba's catalogue rather than full-grade runs.
+
+---
+
+## 2. Hardware probe — do this first, write `hardware.json` from the result
+
+Before anything else, gather the friend's hardware truth. Pick the platform-appropriate commands; don't run all of them, just the ones that work on the friend's OS.
+
+**macOS:**
+```bash
+system_profiler SPHardwareDataType SPDisplaysDataType
+sysctl -n machdep.cpu.brand_string
+sysctl -n hw.memsize
+sw_vers
+uname -a
+```
+
+**Linux:**
+```bash
+lscpu
+cat /proc/meminfo | head -3
+nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv  # if NVIDIA
+lspci | grep -iE "vga|3d|display"
+uname -a
+cat /etc/os-release
+```
+
+**Windows (PowerShell):**
+```powershell
+Get-CimInstance Win32_ComputerSystem | Select Manufacturer, Model, TotalPhysicalMemory
+Get-CimInstance Win32_Processor | Select Name, NumberOfCores, MaxClockSpeed
+Get-CimInstance Win32_VideoController | Select Name, AdapterRAM, DriverVersion
+$PSVersionTable.OS
+```
+
+Write the canonical findings to `hardware.json`. Schema (every field present; `null` if not applicable):
+
+```json
+{
+  "schema_version": "hardware-1.0",
+  "device_tag": "mac-m1-8gb",
+  "manufacturer_model": "Apple MacBook Air (Mac14,2)",
+  "os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
+  "cpu": {"name": "Apple M1", "cores": 8, "threads": 8, "max_ghz": 3.2,
+          "arch": "arm64", "isa": ["NEON"]},
+  "memory_gb_total": 8,
+  "memory_gb_available_at_run_start": 4.2,
+  "gpu": [
+    {"name": "Apple M1 GPU", "kind": "integrated", "vram_gb": null,
+     "driver": "Metal/macOS-14", "compute_cap": null}
+  ],
+  "storage": {"kind": "ssd", "free_gb_at_run_start": 220},
+  "thermal_or_power_notes": "default OS thermal mgmt; on AC power",
+  "network_used_for_model_fetch": "wifi-100mbps",
+  "container_or_vm": null
+}
+```
+
+Honest mode flags to mention in `thermal_or_power_notes`:
+- "battery-only, low-power-mode active" → results may be artificially slow
+- "thermal throttling observed mid-run" → tag any affected calls in `run.md` caveats
+- "GPU shared with display compositor" → expect 5-15% throughput hit vs headless
+
+---
+
+## 3. Adapt to hardware reality — this is the part you cannot skip
+
+The harness uses Sloba's canonical knobs as defaults. They are **not** guaranteed to be optimal for the friend's hardware. Your job:
+
+### 3a. Canonical knobs (Sloba's reference values)
+
+```python
+CANONICAL_OPTIONS = {
+    "temperature": 0.1,    # near-deterministic; comparable across runs
+    "num_ctx": 4096,       # context window
+    "num_predict": 2048,   # max generated tokens per call
+}
+```
+
+Plus runtime-level (Ollama-specific but apply equivalently to llama.cpp / vLLM):
+- `KEEP_ALIVE` — how long the loaded model stays warm. Sloba uses **2400h** on cluster nodes (~100 days = effectively pinned). On a friend's laptop, **5m** is gentler if RAM is tight.
+- `NUM_PARALLEL` — concurrent slots. Sloba uses **3** on Pavilion. **1** is fine on tight RAM.
+- `MAX_LOADED_MODELS` — how many models held in VRAM. Sloba uses **3** on a 12 GB GPU; default to **1** on anything ≤ 8 GB.
+- For llama.cpp: `--n-gpu-layers` (NGL) — number of model layers offloaded to GPU. **Critical** on borderline VRAM. NGL=99 is full offload; NGL=0 is CPU-only. Sloba's Predator (GTX 1060 6 GB) runs Qwen3.5:9B at NGL=6 because higher offloads OOM with the KV cache.
+
+### 3b. Decision rules
+
+| Friend's hardware | Likely runtime | Likely model size | Likely NGL | Likely NUM_PARALLEL |
+|---|---|---|---|---|
+| Apple Silicon (M1/M2/M3, ≥8 GB unified) | Ollama OR llama.cpp w/ Metal OR MLX | 0.5B – 4B | n/a (Metal handles offload) | 1–2 |
+| Apple Silicon (M-Pro/M-Max, ≥16 GB) | same, MLX preferred for 8B+ | 4B – 14B | n/a | 2–3 |
+| NVIDIA GPU 6 GB VRAM | llama.cpp + CUDA | 0.5B – 4B (or 8B at NGL ~10–20) | tuned per model | 1 |
+| NVIDIA GPU 8–12 GB VRAM | llama.cpp + CUDA, or vLLM | 4B – 14B | high (60–99) | 1–2 |
+| NVIDIA GPU 24+ GB VRAM | vLLM or llama.cpp | up to 32B | 99 (full) | 4+ |
+| AMD GPU | llama.cpp + ROCm | conservative one tier below NVIDIA-equivalent | tuned | 1 |
+| CPU only | llama.cpp + CPU | 0.5B – 2B (Q4_K_M) | 0 | 1 |
+
+These are starting points. **Don't trust them blindly.** For any model + hardware combination you're uncertain about:
+
+1. Check the model's HuggingFace card for "recommended quantization / hardware" notes.
+2. Check the runtime's GitHub for known issues with this model family.
+3. Look up llama.cpp issues for "VRAM OOM <model>" — community usually finds the NGL sweet spot.
+4. If still uncertain, run a dry probe: `python3 harness/run_benchmark.py --probe --target-url ... --cell-id-prefix ... --models <model>` and observe RSS / VRAM / tokens-per-sec.
+
+### 3c. Document every deviation in `manifest.json.canonical_options_overrides`
+
+The runner records overrides automatically when you pass `--temperature` / `--num-ctx` / `--num-predict`. For runtime-level deviations (NGL, NUM_PARALLEL, KEEP_ALIVE), add them to `hardware.json.thermal_or_power_notes` or to `run.md` § Methodology Deviations. **Untracked deviations are the worst kind — silently make a run uncomparable.** Honest-and-deviated > silent-and-clean.
+
+---
+
+## 4. Pick a runtime and a model
+
+Sloba's instruction: **use any model**. The harness doesn't ship a fixed model list — the friend's agent picks based on hardware. Suggestions, in increasing size:
+
+| Model | Size | When |
+|---|---|---|
+| `qwen2.5-coder:0.5b` | ~400 MB | minimum-viable code benchmarks; runs anywhere |
+| `qwen3.5:0.8b` | ~600 MB | Sloba's reference smallest; matches his catalogue runs |
+| `qwen2.5-coder:1.5b` | ~1.1 GB | code-focused mid-tier |
+| `qwen3.5:2b` | ~1.5 GB | conversational mid-tier |
+| `qwen3.5:4b` | ~3 GB | flagship mid-tier; common comparison point |
+| `qwen3.5:8b-q4km` | ~5 GB | mid-tier flagship |
+| `qwen3.5:9b-q4km` | ~5.4 GB | Sloba's Predator flagship; 6 GB VRAM borderline (run with reduced NGL) |
+| `qwen3.5:14b-q4km` | ~9 GB | needs ≥10 GB VRAM or Apple Silicon ≥16 GB unified |
+| `gemma-4:e4b-it-q4km` | ~3 GB | non-Qwen comparison |
+| `granite-4.1:8b-q4km` | ~5 GB | non-Qwen comparison |
+
+Models are pulled from:
+- **Ollama Hub:** `ollama pull qwen3.5:0.8b`, etc.
+- **HuggingFace + llama.cpp:** download GGUF directly via `wget`/`hf-download`, then point `llama-server` at it.
+
+Run more than one model in the same run if you can — comparability. The harness loops models inside one run; cell_ids encode the (node, engine, model) tuple.
+
+---
+
+## 5. Run the benchmark
+
+### 5a. Smoke first (30 seconds)
+
+```bash
+python3 harness/run_benchmark.py --smoke \
+    --target-url http://127.0.0.1:11434 \
+    --models qwen3.5:0.8b \
+    --cell-id-prefix mac-m1:ollama \
+    --submitter-handle <friend-gitea-handle> \
+    --device-tag <short-device-tag>
+```
+
+If smoke 200s back, you have a working runtime. Run the real thing.
+
+### 5b. Full run
+
+```bash
+python3 harness/run_benchmark.py \
+    --target-url http://127.0.0.1:11434 \
+    --models qwen3.5:0.8b,qwen3.5:4b \
+    --cell-id-prefix mac-m1:ollama \
+    --phases hello,5q,20q \
+    --submitter-handle alice \
+    --device-tag mac-m1-8gb
+```
+
+For the canonical full sweep across all six suites:
+```bash
+python3 harness/run_benchmark.py --phases all \
+    --target-url http://127.0.0.1:11434 \
+    --models qwen3.5:0.8b \
+    --cell-id-prefix mac-m1:ollama \
+    --submitter-handle alice --device-tag mac-m1-8gb
+```
+
+Expect minutes per cell. The 20Q + edge suites are the long ones (~10–40 minutes per model on a small box). If the friend is time-bounded, drop edge_* and parallel_* — but record what you skipped.
+
+### 5c. Resume on interrupt
+
+If interrupted, the JSONL ledger is preserved (every line is fsync'd). To resume the same `run-id`:
+```bash
+python3 harness/run_benchmark.py --run-id <previous-uuid> ...
+```
+This appends to a new ledger; you'll need to merge them by hand (or just submit them as two separate runs sharing the same `device-tag`).
+
+---
+
+## 6. Generate `metadata.json` and `run.md`
+
+### 6a. `metadata.json` — computed aggregates per cell
+
+Schema (one row per (cell_id, phase) pair):
+```json
+{
+  "schema_version": "metadata-1.0",
+  "run_id": "<uuid>",
+  "submitter_handle": "alice",
+  "device_tag": "mac-m1-8gb",
+  "cells": [
+    {
+      "cell_id": "mac-m1:ollama:qwen3.5:0.8b",
+      "phase": "20q",
+      "n_calls": 20,
+      "n_errors": 0,
+      "duration_ms_p50": 9600,
+      "duration_ms_p95": 24000,
+      "duration_ms_mean": 11200,
+      "tokens_per_sec_p50": 16.4,
+      "tokens_per_sec_p95": 22.1,
+      "tokens_per_sec_mean": 17.0,
+      "tokens_per_sec_max": 24.8,
+      "completion_tokens_total": 18234,
+      "format_ok_rate": 0.85,
+      "marker_hit_rate_mean": 0.72
+    }
+  ]
+}
+```
+
+You can compute this in-line (small script) or use a quick Python REPL pass over `run.jsonl`. The catalogue builder on Sloba's side will recompute it anyway, but having it in the PR makes review fast.
+
+### 6b. `run.md` — human-readable summary
+
+Template (fill in every section honestly):
+
+```markdown
+# <device-tag> — <model-set> — <YYYY-MM-DD>
+
+**Run ID:** `<uuid>`
+**Submitter:** <handle>
+**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
+**Runtime:** Ollama 0.5.x (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
+**Models:** qwen3.5:0.8b, qwen3.5:4b
+**Phases run:** hello, 5q, 20q
+**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, 4 GB free at run start was insufficient for parallel suites.
+
+## Headline numbers
+
+| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
+|---|---|---|---|---|---|
+| mac-m1:ollama:qwen3.5:0.8b 20q | 20 | 17.0 | 16.4 | 9.6 s | 85% |
+| mac-m1:ollama:qwen3.5:4b 20q | 20 | 5.8 | 5.5 | 28.2 s | 70% |
+
+## Methodology
+
+Followed the canonical Pavilion methodology with these deviations:
+
+- **NUM_PARALLEL=1** instead of canonical 3 — 8 GB unified RAM doesn't fit two warm copies of qwen3.5:4b.
+- **KEEP_ALIVE=5m** instead of 2400h — laptop, not server.
+- **edge_* and parallel_* phases skipped** — friend's time budget.
+
+## Caveats
+
+- Run started at 18% battery; one call (20q-q14, model qwen3.5:4b) coincided with macOS Spotlight indexing; flagged in run.jsonl with run_idx=14 — that data point is high-variance.
+- Network was on hotel wifi; model pull took ~6 minutes for qwen3.5:4b. Did not affect benchmark timing (model warm before any timed call).
+
+## Reproducibility
+
+```
+python3 harness/run_benchmark.py \
+    --target-url http://127.0.0.1:11434 \
+    --models qwen3.5:0.8b,qwen3.5:4b \
+    --cell-id-prefix mac-m1:ollama \
+    --phases hello,5q,20q \
+    --submitter-handle alice \
+    --device-tag mac-m1-8gb
+```
+```
+
+---
+
+## 7. Submit the PR
+
+1. **Fork** `https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public` to the friend's Gitea account (Gitea web UI → "Fork").
+2. **Add the friend's fork as a remote on the local clone:**
+    ```bash
+    git remote add fork ssh://gitea@git.weeyuga.com/<friend-handle>/weeyuga-benchmarks-public.git
+    ```
+3. **Create a topic branch** off `main`:
+    ```bash
+    git checkout -b submission/<handle>-<device-tag>-<short-date>
+    ```
+4. **Stage only the new files under `submissions/<handle>/<device-tag>/run-<uuid>/`.** NEVER modify anything outside that directory in this PR.
+    ```bash
+    git add submissions/<handle>/<device-tag>/run-<uuid>/
+    git status   # confirm: only files under your run-<uuid>/ are staged
+    ```
+5. **Commit** with a descriptive message:
+    ```
+    submit: alice / mac-m1-8gb / 2026-05-12 — qwen3.5 0.8b+4b, hello+5q+20q
+
+    First contribution from a friend's hardware. M1 8 GB unified, Ollama 0.5.x.
+    Skipped edge_* + parallel_* due to RAM. Headline: qwen3.5:0.8b ~17 tok/s,
+    qwen3.5:4b ~5.8 tok/s on 20Q.
+    ```
+6. **Push to fork:**
+    ```bash
+    git push fork submission/<handle>-<device-tag>-<short-date>
+    ```
+7. **Open a PR on Gitea** with target = `slobodanmargetic988/weeyuga-benchmarks-public:main`. PR description should include:
+    - One-paragraph what-and-why
+    - Link to the friend's `run.md`
+    - Any methodology deviations the reviewer should know
+    - Privacy attestation: "I have reviewed run.jsonl and confirmed no PII / SSH keys / API tokens / personal home paths leaked"
+
+Sloba reviews and merges. **Nothing auto-merges.** A typical review surfaces 0–2 follow-ups; address and force-push to the same branch.
+
+---
+
+## 8. Privacy guardrails — DO NOT submit any of these
+
+- API keys (OpenAI, Anthropic, HuggingFace tokens, etc.)
+- SSH private keys, `.ssh/` paths
+- Personal home directory paths (`/Users/alice/secrets/...`)
+- Real names if the friend prefers a handle
+- Internal corporate IPs, hostnames, or SSO endpoints
+- Bearer tokens in error messages (some runtimes echo headers in 4xx errors)
+
+Before pushing, **scan the run.jsonl** for these patterns:
+```bash
+grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" submissions/<handle>/<device-tag>/run-<uuid>/*.{jsonl,md,json}
+```
+
+If anything matches, redact it from `response_preview` (the JSONL stores only the first 240 chars of each response, so leaks are rare — but please scan).
+
+---
+
+## 9. What if you get stuck
+
+- **`/v1/models` returns empty:** the runtime isn't OpenAI-compat or no models are loaded. For Ollama: `ollama list`. For llama.cpp: it doesn't list models on `/v1/models` historically; pass `--models <name> --target-url http://127.0.0.1:11436` and it'll work anyway.
+- **Every call returns 500 / timeout:** runtime is up but model isn't loading. Check VRAM with `nvidia-smi` or memory pressure with `vm_stat` / `free`. Smaller model. Or smaller `num_ctx`.
+- **Tokens/sec absurdly low (<1 tok/s on hardware that should manage):** thermal throttling, swap thrashing, or wrong-quantization. Check `free -h` mid-run; if swap is being used, model is too big for RAM.
+- **One question keeps getting `format_ok=false`:** model can't follow that instruction shape. NORMAL. Don't shorten the prompt or reword. Document in run.md and move on.
+- **Ambiguous hardware setup (eGPU? VM? Container?):** ASK the friend. Container/VM resource caps make benchmarks misleading.
+
+---
+
+## 10. The methodology lineage
+
+This harness mirrors `WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py` v3 — Sloba's canonical Pavilion methodology established 2026-04-11. The 21 reference runs in `runs/` (in this repo) are the comparable baseline. Sloba's locked catalogue lives at `catalogue.json` (this repo). When your run is merged, it'll be added to the catalogue under your `device-tag` and become a new comparison point.
+
+The methodology and harness will evolve. Current canonical version: `HARNESS_VERSION = "public-1"`. Future versions will be additive — older ledgers stay valid forever.
+
+---
+
+## 11. Coordinate-while-running checklist
+
+Before you start:
+- [ ] Read this whole file
+- [ ] Read `methodology.md` for the metric definitions (TTFT, p50/p95, format_ok, etc.)
+- [ ] Verify the friend has ≥3 GB free disk for model files
+- [ ] Verify network is OK for model pull (the GGUFs are 0.5–10 GB)
+
+While running:
+- [ ] Smoke first
+- [ ] Full run
+- [ ] Watch for thermal throttling on laptops / phones / mini-PCs
+- [ ] Don't open Chrome / Slack / Zoom mid-run if you can avoid it (VRAM pressure)
+
+After running:
+- [ ] Generate `metadata.json` aggregates
+- [ ] Write `run.md` honestly — including caveats
+- [ ] Privacy-scan `run.jsonl`
+- [ ] Fork → branch → push → PR
+
+---
+
+## Questions / blockers
+
+If you hit something this runbook doesn't cover, the friend can email Sloba (slobodan@weeyuga.com) or open an issue on this repo. Don't burn an hour in a corner — ask. The whole point of crowdsourcing is the variance you'll see; that's data, not a problem.
+
+Welcome aboard. 🦇
+
+— The Weeyuga team
+
+---
+
+> **Maintainer note:** if you edit this file, edit `AGENTS.md` to match
+> (Codex loads `AGENTS.md`, Claude Code loads `CLAUDE.md`; identical
+> content prevents two-tier rules).