feat: harness + agent runbook — flip repo from archive-only to crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run ledgers,
shipped 2026-05-05) stays intact. This commit ADDS a second purpose: a
portable harness + agent runbook so a friend's coding agent can clone,
read CLAUDE.md, run the same suite on the friend's hardware, and submit
results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to, license,
  code-of-conduct short version.

harness/
├── README.md         Technical readme for the harness folder
├── run_benchmark.py  ~520 LOC runner. Stdlib-only. Adapted from
│                     WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│                     v3 with the cluster-internal IP defaults
│                     (10.8.0.x) replaced by 127.0.0.1:11434, the
│                     cluster /v1/cluster/* endpoints removed, the
│                     canonical-suite paths under ~/Documents/MyServers
│                     replaced by harness/suites/ paths, the git-sha
│                     enforcement on WeeyugaWeb dropped, and the
│                     output written under submissions/<handle>/<tag>/
│                     instead of docs/BENCHMARKS/runs/. Supports all
│                     six suite phases via --phases, plus 'all'.
├── prompts.py        Verbatim copy of the canonical 3 frozen prompts
│                     (P-EASY/P-MEDIUM/P-HARD) from
│                     WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt  Empty by intent (stdlib-only); placeholder for
│                     pip-tools / agent auto-install patterns.
├── .gitignore        __pycache__/ etc.
└── suites/           Six bundled JSON suites copied verbatim from
                      Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
                      small_model_eval_questions.json,
                      python_task_suite_questions.json,
                      parallel_qwen_same_model_20q_suite.json,
                      parallel_qwen_mixed_model_20q_suite.json,
                      python_context_edge_append_questions.json,
                      python_context_edge_suite_only.json.

submissions/
  README.md
    Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
    Synthetic-but-shape-complete contribution template: manifest.json,
    hardware.json, run.jsonl (5 example lines), metadata.json, run.md
    (with privacy attestation, methodology deviations, reproducibility
    command). Marked as synthetic at the top so future analysis doesn't
    accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated. Maintainer
  credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
  paths / tokens. Two prompts contain project names ("MyBoard" auth
  debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03); flagged
  in chat for Sloba's review. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
  IPs leaked).
- manifest.json captures host_hostname_short via
  socket.gethostname().split('.')[0] — agent should review before PR if
  hostname is sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
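
The verification and privacy items in the commit message can be illustrated with a minimal stdlib-only sketch. It is not the script used for this commit: the helper names `validate_json_file`, `validate_jsonl_file`, and `short_hostname` are hypothetical, and only the `socket.gethostname().split('.')[0]` expression comes from the message itself.

```python
import json
import socket
from pathlib import Path


def validate_json_file(path):
    """Return True if the file parses as one JSON document, else False."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
        return True
    except (OSError, ValueError):  # JSONDecodeError subclasses ValueError
        return False


def validate_jsonl_file(path):
    """Return the count of JSON lines; raise ValueError at the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines (an assumption of this sketch)
            try:
                json.loads(line)
            except ValueError as exc:
                raise ValueError(f"{path}:{lineno}: {exc}") from exc
            count += 1
    return count


def short_hostname():
    """Hostname up to the first dot, as the commit message describes."""
    return socket.gethostname().split(".")[0]
```

A contributor could run the two validators over `harness/suites/*.json` and the example `run.jsonl`, and print `short_hostname()` to decide whether the captured value is safe to publish.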
2026-05-06 11:07:55 +02:00
parent ddc9626136
commit 97a9245d9e
22 changed files with 4400 additions and 47 deletions
--- a/README.md
+++ b/README.md
@@ -1,54 +1,112 @@
 # weeyuga-benchmarks-public

-> **Status: PRIVATE STAGING** — this repo is not yet public. Flips to anonymous-read after [Miljan + Stevan's pre-launch security audit](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination/messages) signs off. If you're reading this and you're not on the Weeyuga team, you got here too early.
+> **Status: PRIVATE STAGING** — this repo is not yet anonymous-readable.
+> Flips to public after the pre-launch security audit signs off.
+> If you got here too early, please hold; you'll be invited soon.

-Canonical raw-data archive for **[benchmarks.weeyuga.com](https://benchmarks.weeyuga.com)** — every benchmark run we publish on the site is mirrored here as raw JSONL + log + human summary so anyone can clone, re-analyse, or cite.
+Open benchmarks for local LLMs — same prompts, same suites, run on
+whatever hardware you've got, results compose into one ladder.
+
+This repo is two things in one:
+
+1. **Canonical archive** — every benchmark Sloba's team publishes on
+   [benchmarks.weeyuga.com](https://benchmarks.weeyuga.com) lives here as
+   raw JSONL + computed metadata + human summary, so anyone can clone,
+   re-analyse, or cite. This is the original purpose; see `runs/`.
+2. **Crowdsourced runner** — a portable harness + agent runbook so a
+   friend's coding agent (Claude Code, Codex, Aider, …) can clone this
+   repo, read `CLAUDE.md` / `AGENTS.md`, run the same suite on the
+   friend's hardware, and submit the result back as a PR. This is the
+   newer purpose; see `harness/` + `submissions/`.
+
+Both purposes share one schema, one prompt set, one methodology.
+
+## Quick start — for friends contributing a benchmark
+
+```bash
+git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
+cd weeyuga-benchmarks-public
+# Hand this file path to your coding agent and say:
+#   "Read CLAUDE.md, then run the Weeyuga benchmark on my hardware."
+```
+
+Then read `CONTRIBUTING.md` for the human-side flow (what to expect,
+how reviews work, license, code of conduct).

 ## Layout

 ```
 .
-├── README.md                       — this file
-├── LICENSE                         — CC-BY-4.0 (data) + MIT (helper code)
-├── catalogue.json                  — index of every published benchmark (mirror of the site catalogue)
-├── methodology.md                  — how we benchmark + fairness rules + reproducibility notes
-└── runs/
-    └── <run-id>/
-        ├── run.jsonl               — canonical raw event stream (one JSON object per line)
-        ├── run.log                 — tee'd stdout/stderr from the harness (when captured)
-        ├── run.md                  — human-readable summary (when synthesis exists)
-        └── metadata.json           — computed snapshot: meta record + per-cell aggregates + status
+├── README.md                — this file
+├── CLAUDE.md                — full agent runbook (read this if you're an LLM-driven agent)
+├── AGENTS.md                — byte-identical to CLAUDE.md (Codex / other tools that prefer this name)
+├── CONTRIBUTING.md          — human-readable contribution guide
+├── LICENSE                  — CC-BY-4.0 for data
+├── LICENSE-MIT              — MIT for harness/runner code
+├── catalogue.json           — index of every published benchmark (canonical archive)
+├── methodology.md           — how we benchmark + fairness rules + reproducibility notes
+├── harness/                 — portable runner + suites + prompts (the crowdsourced piece)
+│   ├── README.md
+│   ├── run_benchmark.py
+│   ├── prompts.py
+│   ├── requirements.txt
+│   └── suites/              — six bundled JSON suites (5Q, 20Q, parallel × 2, edge × 2)
+├── runs/                    — canonical archive: 21 reference runs from Sloba's cluster
+│   └── <run-id>/
+│       ├── run.jsonl
+│       ├── run.log          — when captured
+│       ├── run.md           — when synthesis exists
+│       └── metadata.json
+└── submissions/             — community contributions land here
+    ├── README.md
+    ├── EXAMPLE/             — one fully-filled-out template you can read
+    │   └── mac-m1-8gb/
+    │       └── run-<uuid>/
+    │           ├── manifest.json
+    │           ├── hardware.json
+    │           ├── run.jsonl
+    │           ├── metadata.json
+    │           └── run.md
+    └── <handle>/            — your contributions
+        └── <device-tag>/
+            └── run-<uuid>/...
 ```

 ## Run-ID format

-Every run gets a UUID v4 `<run-id>` assigned at harness startup. Run IDs are stable across re-runs of synthesis — the same run-id always points to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and computed metadata (`metadata.json`) can be regenerated from the canonical jsonl at any time.
+Every run gets a UUID v4 `<run-id>` assigned at harness startup. Run IDs
+are stable across re-runs of synthesis — the same run-id always points
+to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and
+computed metadata (`metadata.json`) can be regenerated from the
+canonical jsonl at any time.

 ## Schema

-The `catalogue.json` index follows `schema_version = "1.0-draft"` (or later — check the value at the top of the file). Per-benchmark entries include:
+The `catalogue.json` index follows `schema_version = "1.0-draft"` (or
+later — check the value at the top of the file). Per-benchmark entries
+include:

 - `id` — run-id
 - `title`, `headline`, `date`
-- `hardware` (pavilion / predator / mac / vps50 / runpod)
+- `hardware` (pavilion / predator / mac / vps50 / runpod / community-`<device-tag>`)
 - `engine` (llamacpp / ollama / vllm / mlx / cpu)
-- `harness` (which harness produced this — see `methodology.md` for the matrix)
+- `harness` (which harness produced this — see `methodology.md`)
 - `model_family`, `model_sizes`
-- `cells[]` — per-(machine × engine × model) summary: n_calls, n_errors, duration_ms (mean + p50), tokens_per_sec (mean + max)
-- `synthesis_doc` — filename of the synthesis prose for this run, if one exists
-- `tags`, `status`, `visibility`
+- `cells[]` — per-(machine × engine × model) summary
+- `synthesis_doc` — filename of the synthesis prose, if one exists
+- `tags`, `status`, `visibility`, `site_grade`

 Per-run `metadata.json` adds `cells_full[]` with the full call list inline.

-## How to consume
+## How to consume the archive

-### Just download a single run
+### Single run

 ```bash
-curl -O https://benchmarks.weeyuga.com/data/runs/<run-id>/run.jsonl
+curl -O https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public/raw/branch/main/runs/<run-id>/run.jsonl
 ```

-### Clone the whole archive
+### Whole archive

 ```bash
 git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
@@ -56,48 +114,57 @@ git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.

 ### Re-build catalogue from raw

-The canonical builder lives in [WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py) and runs against `runs/*.jsonl`. If you want to regenerate the catalogue from your own clone of this repo:
+The canonical builder lives in
+[WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py)
+and runs against `runs/*/run.jsonl`.

-```bash
-git clone https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb.git
-cd WeeyugaWeb
-python3 scripts/benchmarks/build_catalogue.py
-```
+## How to contribute a benchmark
+
+See `CLAUDE.md` (or `AGENTS.md` — same content) for the full agent
+runbook, and `CONTRIBUTING.md` for the human-side flow. Short version:
+
+1. Clone repo
+2. Hand `CLAUDE.md` to your coding agent
+3. Agent probes hardware, picks a model, runs benchmark, writes results
+4. You review what the agent produced
+5. You fork → push → open PR
+6. Maintainers review and merge
+
+Read access is open. **Write access is via PR only — nothing auto-merges.**

 ## Citation

-If you use this data, please cite as:
-
 ```
-Margetić, S. & contributors. (2026). Weeyuga cluster benchmarks (raw data archive).
+Margetić, S. & contributors. (2026). Weeyuga local-LLM benchmarks.
 https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public
 ```

-(A more formal citation form will land here once Mila weighs in on academic-attribution conventions.)
-
 ## License

-- **Data** (`runs/`, `catalogue.json`, `methodology.md`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE)
-- **Helper code** (any future scripts inside this repo): [MIT](LICENSE-MIT) (separate file added if/when code lands here)
+- **Data** (`runs/`, `submissions/`, `catalogue.json`, `methodology.md`,
+  `harness/suites/`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE).
+- **Helper code** (`harness/*.py`, future scripts): [MIT](LICENSE-MIT).

-You are free to share, re-host, re-analyse, and remix the data with attribution.
+You're free to share, re-host, re-analyse, and remix the data with attribution.

-## What's in here vs what's NOT
+## Reporting issues

-This repo contains **bench-run output only**. No source code. No infrastructure config. No application internals. Reproducing a run requires the [WeeyugaWeb](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb) main repo (also Gitea-hosted; visibility separate).
-
-## Reporting an issue with the data
-
-If you spot a bench number that looks wrong, a methodology gap, or a privacy slip in published metadata: open an issue on this repo, or email the Weeyuga team. We'd rather know.
+If you spot a bench number that looks wrong, a methodology gap, or a
+privacy slip in published metadata: open an issue on this repo, or
+email the team at `slobodan@weeyuga.com`. We'd rather know.

 ## Status

 | What | State |
 |---|---|
 | Repo created | 2026-05-05 |
-| First 21 runs landed | 2026-05-05 |
-| Miljan + Stevan security audit | scheduled |
+| Canonical archive landed (21 runs) | 2026-05-05 |
+| Harness + agent runbook landed | 2026-05-06 |
+| Pre-launch security audit | scheduled |
 | Visibility flipped to public | pending audit sign-off |
-| Site `benchmarks.weeyuga.com` live | pending Bane DNS + nginx + Tomas site |
+| First friend's submission merged | pending |

-Owner: `mac/benchmark-tester-ben` (Ben). For coordination, see the [WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).
+Maintainers: `mac/benchmark-tester-ben` (catalogue + methodology),
+`mac/devops-bane` (harness + runner + this README change).
+For coordination, see the
+[WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).
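
The README changes above describe a `submissions/<handle>/<device-tag>/run-<uuid>/` layout and a UUID v4 run-id assigned at harness startup. A minimal sketch of how such a path might be built follows; `new_run_id` and `submission_dir` are hypothetical names for illustration and are not taken from `run_benchmark.py`.

```python
import uuid
from pathlib import Path


def new_run_id():
    """Generate a fresh UUID v4 string, as the README says the harness does."""
    return str(uuid.uuid4())


def submission_dir(repo_root, handle, device_tag, run_id):
    """Build submissions/<handle>/<device-tag>/run-<uuid> under repo_root.

    uuid.UUID() raises ValueError when run_id is not a well-formed UUID.
    The version is deliberately not checked, because the synthetic EXAMPLE
    run uses an all-zero id that is not a v4 UUID.
    """
    parsed = uuid.UUID(run_id)
    return Path(repo_root) / "submissions" / handle / device_tag / f"run-{parsed}"
```

For instance, `submission_dir(".", "alice", "mac-m1-8gb", new_run_id())` yields a directory that matches the layout diagram; `alice` is a placeholder handle.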