feat: harness + agent runbook; flip repo from archive-only to crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for going
public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run ledgers,
shipped 2026-05-05) stays intact. This commit ADDS a second purpose: a
portable harness + agent runbook so a friend's coding agent can clone,
read CLAUDE.md, run the same suite on the friend's hardware, and submit
results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the runbook
  explicitly tells the friend's agent to ADAPT to hardware reality and
  document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to, license,
  code-of-conduct short version.

harness/
├── README.md          Technical readme for the harness folder
├── run_benchmark.py   ~520 LOC runner. Stdlib-only. Adapted from
│                      WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│                      v3 with the cluster-internal IP defaults
│                      (10.8.0.x) replaced by 127.0.0.1:11434, the
│                      cluster /v1/cluster/* endpoints removed, the
│                      canonical-suite paths under ~/Documents/MyServers
│                      replaced by harness/suites/ paths, the git-sha
│                      enforcement on WeeyugaWeb dropped, and the
│                      output written under submissions/<handle>/<tag>/
│                      instead of docs/BENCHMARKS/runs/. Supports all
│                      six suite phases via --phases, plus 'all'.
├── prompts.py         Verbatim copy of the canonical 3 frozen prompts
│                      (P-EASY/P-MEDIUM/P-HARD) from
│                      WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt   Empty by intent (stdlib-only); placeholder for
│                      pip-tools / agent auto-install patterns.
├── .gitignore         __pycache__/ etc.
└── suites/            Six bundled JSON suites copied verbatim from
                       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
                       small_model_eval_questions.json,
                       python_task_suite_questions.json,
                       parallel_qwen_same_model_20q_suite.json,
                       parallel_qwen_mixed_model_20q_suite.json,
                       python_context_edge_append_questions.json,
                       python_context_edge_suite_only.json.

submissions/
  README.md
    Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
    Synthetic-but-shape-complete contribution template: manifest.json,
    hardware.json, run.jsonl (5 example lines), metadata.json, run.md
    (with privacy attestation, methodology deviations, reproducibility
    command). Marked as synthetic at the top so future analysis doesn't
    accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated. Maintainer
  credits: Ben for catalogue/methodology + Bane for harness. Contributor
  quick-start added. Status table extended.

Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
  paths / tokens. Two prompts contain project names ("MyBoard" auth
  debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03); flagged
  in chat for Sloba's review. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
  IPs leaked).
- manifest.json captures host_hostname_short via
  socket.gethostname().split('.')[0]; agent should review before PR if
  hostname is sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews on
Gitea and merges. Bus dispatch to Ben + Sam announcing the architectural
pivot lives in the WeeyugaWeb coordination repo.
2026-05-06 11:07:55 +02:00
parent ddc9626136
commit 97a9245d9e
22 changed files with 4400 additions and 47 deletions
--- a/submissions/EXAMPLE/mac-m1-8gb/run-00000000-0000-0000-0000-000000000000/run.md
+++ b/submissions/EXAMPLE/mac-m1-8gb/run-00000000-0000-0000-0000-000000000000/run.md
@@ -0,0 +1,92 @@
+# EXAMPLE — mac-m1-8gb — qwen3.5:0.8b — 2026-05-12
+
+> **This is a synthetic example so contributors can see the shape of a
+> submission end-to-end. The numbers are plausible but not from a real run.
+> Don't cite this directory in analysis. Don't copy-paste these numbers.
+> Real submissions live alongside this folder under `submissions/<handle>/`.**
+
+**Run ID:** `00000000-0000-0000-0000-000000000000`
+**Submitter:** EXAMPLE (synthetic)
+**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
+**Runtime:** Ollama 0.5.13 (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
+**Models:** qwen3.5:0.8b
+**Phases run:** hello, 5q, 20q
+**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite. The parallel suites need ≥2 warm copies of the model, which did not fit in 8 GB unified RAM; the edge suites were skipped for time (they would have added ~30 min).
+
+## Headline numbers
+
+| Cell | Phase | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
+|---|---|---|---|---|---|---|
+| mac-m1:ollama:qwen3.5:0.8b | hello | 1 | 22.7 | 22.7 | 1.8 s | n/a |
+| mac-m1:ollama:qwen3.5:0.8b | 5q | 5 | 21.4 | 22.3 | 4.2 s | 80% |
+| mac-m1:ollama:qwen3.5:0.8b | 20q | 20 | 17.0 | 20.9 | 9.6 s | 70% |
+
+## What I observed (qualitative)
+
+- **Hello-call cold-start was fast** — 1.8 s including initial model load.
+  Ollama reports the 0.8B GGUF as ~600 MB; on Apple Silicon unified memory
+  this loads in well under 2 s.
+- **5Q tasks were mostly handled**: four of the five formats (bash, python,
+  shell, four-numbered-steps, json) parsed correctly. The exception was
+  Q3 ("shell_lines"), where the model started with a `1.` numbered list
+  instead of a raw shell command.
+- **20Q tasks bifurcated** — the simple ones (Q01-Q08) ran at full
+  ~20 tok/s with high format-correctness; the longer ones (Q09+ with
+  multi-paragraph context) saw throughput drop to ~12-15 tok/s, with
+  format_ok dropping to ~60%. p95 duration of 41 s was Q14 (the MyBoard
+  triage prompt — long context, mixed format).
+- **No errors, no timeouts.** Cleanest run was on AC power; the laptop
+  fan never spun up.
+
+## Methodology
+
+Followed the canonical Pavilion methodology with these deviations:
+
+- **NUM_PARALLEL=1** (Ollama's `OLLAMA_NUM_PARALLEL`) instead of canonical 3:
+  laptop, not server; one slot is enough for sequential per-model-block execution.
+- **KEEP_ALIVE=5m** (`OLLAMA_KEEP_ALIVE`) instead of canonical 2400h: laptop, no need to pin.
+- **Phases `parallel_same`, `parallel_mixed`, `edge_append`, `edge_suite`
+  skipped** — see top of file. Run not eligible for `flagship` grade,
+  intended as `standard`.
+
+## Caveats
+
+- 8 GB unified RAM is below the comfort floor for parallel suites with this
+  model; results above are NOT a refutation of the canonical parallel
+  numbers — they're from a different shape of run.
+- macOS Spotlight indexing was disabled before the run started. If you
+  rerun without disabling, expect ~5-10% additional variance from
+  background I/O.
+- `format_ok` rate of 70% on 20Q (14 of 20 calls) is consistent with Sloba's
+  flagship 20Q numbers for qwen3.5:0.8b on Pavilion (~74-78% in the v1
+  baseline): at n=20 a single call moves the rate by 5 points.
+
+## Reproducibility
+
+```
+ollama serve  # in a separate terminal; must be running before the pull
+ollama pull qwen3.5:0.8b
+
+python3 harness/run_benchmark.py \
+    --target-url http://127.0.0.1:11434 \
+    --models qwen3.5:0.8b \
+    --cell-id-prefix mac-m1:ollama \
+    --phases hello,5q,20q \
+    --submitter-handle EXAMPLE \
+    --device-tag mac-m1-8gb
+```
+
+Took ~16 minutes wall-clock on this hardware.
+
+## Privacy attestation
+
+I scanned every file in this run directory (`run.jsonl` included) for
+personal paths, API tokens, SSH keys, and home-directory leakage:
+```
+grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" \
+    submissions/EXAMPLE/mac-m1-8gb/run-00000000-0000-0000-0000-000000000000/*
+```
+No matches apart from this file's own grep line and the SSH-troubleshooting
+prompt in 5Q (Q3), which is intentional curriculum. Safe to ship.
+
+— EXAMPLE (synthetic; not a real contributor)
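
The privacy grep in the attestation above could also be run as a small stdlib-only Python check, which may suit contributors on systems without `grep`. This is a minimal sketch and not part of this commit: the regular expression is copied from the `grep -nE` pattern in run.md, while the function names `scan_text` and `scan_run_dir` and the sample text are hypothetical and exist only for illustration.

```python
import re
from pathlib import Path

# Same alternation as the grep -nE pattern in run.md's privacy attestation.
PRIVACY_PATTERN = re.compile(
    r"Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519"
)


def scan_text(name, text):
    """Return (name, line_number, line) for every line matching the pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if PRIVACY_PATTERN.search(line):
            hits.append((name, lineno, line.strip()))
    return hits


def scan_run_dir(run_dir):
    """Scan every regular file directly inside run_dir.

    Roughly equivalent to the shell glob `*` in the grep command, except
    that dotfiles are included here as well.
    """
    hits = []
    for path in sorted(Path(run_dir).iterdir()):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            hits.extend(scan_text(path.name, text))
    return hits


# Hypothetical sample standing in for a run.jsonl: lines 2 and 3 should be flagged.
sample = (
    "line one\n"
    '{"prompt": "cat /home/alice/.ssh/id_rsa"}\n'
    "Authorization: Bearer abc\n"
)
hits = scan_text("run.jsonl", sample)
for name, lineno, line in hits:
    print(f"{name}:{lineno}:{line}")
```

To check a real submission, call `scan_run_dir` on the run directory and review every hit by hand before pushing. As with the grep, a hit is only a prompt for review, since curriculum prompts and the attestation's own command line will match the pattern too.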