feat: harness + agent runbook — flip repo from archive-only to crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
├── README.md         Technical readme for the harness folder
├── run_benchmark.py  ~520 LOC runner. Stdlib-only. Adapted from
│                     WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│                     v3 with the cluster-internal IP defaults
│                     (10.8.0.x) replaced by 127.0.0.1:11434, the
│                     cluster /v1/cluster/* endpoints removed, the
│                     canonical-suite paths under ~/Documents/MyServers
│                     replaced by harness/suites/ paths, the git-sha
│                     enforcement on WeeyugaWeb dropped, and the
│                     output written under submissions/<handle>/<tag>/
│                     instead of docs/BENCHMARKS/runs/. Supports all
│                     six suite phases via --phases, plus 'all'.
├── prompts.py        Verbatim copy of the canonical 3 frozen prompts
│                     (P-EASY/P-MEDIUM/P-HARD) from
│                     WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt  Empty by intent (stdlib-only); placeholder for
│                     pip-tools / agent auto-install patterns.
├── .gitignore        __pycache__/ etc.
└── suites/           Six bundled JSON suites copied verbatim from
                      Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
                      small_model_eval_questions.json,
                      python_task_suite_questions.json,
                      parallel_qwen_same_model_20q_suite.json,
                      parallel_qwen_mixed_model_20q_suite.json,
                      python_context_edge_append_questions.json,
                      python_context_edge_suite_only.json.

submissions/
  README.md
    Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
    Synthetic-but-shape-complete contribution template:
    manifest.json, hardware.json, run.jsonl (5 example lines),
    metadata.json, run.md (with privacy attestation, methodology
    deviations, reproducibility command). Marked as synthetic at
    the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
  paths / tokens. Two prompts were flagged in chat for Sloba's review:
  20Q-Q14 names a project ("MyBoard" auth debugging) and 5Q-Q03 is
  generic SSH troubleshooting. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
  IPs leaked).
- manifest.json captures host_hostname_short via
  socket.gethostname().split('.')[0]; the agent should review it before
  opening the PR if the hostname is sensitive.
- CLAUDE.md §8 spells out the privacy-grep to run before pushing.

Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
@@ -0,0 +1,29 @@
{
  "schema_version": "hardware-1.0",
  "device_tag": "mac-m1-8gb",
  "manufacturer_model": "Apple MacBook Air (MacBookAir10,1) — example, not a real submission",
  "os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
  "cpu": {
    "name": "Apple M1",
    "cores": 8,
    "threads": 8,
    "max_ghz": 3.2,
    "arch": "arm64",
    "isa": ["NEON"]
  },
  "memory_gb_total": 8,
  "memory_gb_available_at_run_start": 4.2,
  "gpu": [
    {
      "name": "Apple M1 GPU",
      "kind": "integrated",
      "vram_gb": null,
      "driver": "Metal/macOS-14",
      "compute_cap": null
    }
  ],
  "storage": {"kind": "ssd", "free_gb_at_run_start": 220},
  "thermal_or_power_notes": "default OS thermal mgmt; on AC power throughout the run; no swap pressure observed",
  "network_used_for_model_fetch": "wifi-100mbps (only used for `ollama pull` before benchmark; not on the timing path)",
  "container_or_vm": null
}
@@ -0,0 +1,23 @@
{
  "schema_version": "manifest-1.0",
  "run_id": "00000000-0000-0000-0000-000000000000",
  "harness_version": "public-1",
  "submitter_handle": "EXAMPLE",
  "device_tag": "mac-m1-8gb",
  "cell_id_prefix": "mac-m1:ollama",
  "target_url": "http://127.0.0.1:11434",
  "phases_run": ["hello", "5q", "20q"],
  "models_run": ["qwen3.5:0.8b"],
  "canonical_options": {
    "temperature": 0.1,
    "num_ctx": 4096,
    "num_predict": 2048
  },
  "canonical_options_overrides": {},
  "timeout_seconds": 360,
  "started_at_utc": "2026-05-12T14:32:11Z",
  "host_hostname_short": "alices-mbp",
  "platform_system": "Darwin",
  "platform_release": "23.5.0",
  "python_version": "3.12.4"
}
@@ -0,0 +1,58 @@
{
  "schema_version": "metadata-1.0",
  "run_id": "00000000-0000-0000-0000-000000000000",
  "submitter_handle": "EXAMPLE",
  "device_tag": "mac-m1-8gb",
  "computed_at_utc": "2026-05-12T14:48:30Z",
  "computed_by": "agent (Claude Code 4.6) — see run.md §Methodology",
  "cells": [
    {
      "cell_id": "mac-m1:ollama:qwen3.5:0.8b",
      "phase": "hello",
      "n_calls": 1,
      "n_errors": 0,
      "duration_ms_p50": 1847,
      "duration_ms_p95": 1847,
      "duration_ms_mean": 1847,
      "tokens_per_sec_p50": 22.74,
      "tokens_per_sec_p95": 22.74,
      "tokens_per_sec_mean": 22.74,
      "tokens_per_sec_max": 22.74,
      "completion_tokens_total": 42,
      "format_ok_rate": null,
      "marker_hit_rate_mean": null
    },
    {
      "cell_id": "mac-m1:ollama:qwen3.5:0.8b",
      "phase": "5q",
      "n_calls": 5,
      "n_errors": 0,
      "duration_ms_p50": 4210,
      "duration_ms_p95": 7800,
      "duration_ms_mean": 4900,
      "tokens_per_sec_p50": 22.3,
      "tokens_per_sec_p95": 18.7,
      "tokens_per_sec_mean": 21.4,
      "tokens_per_sec_max": 23.1,
      "completion_tokens_total": 487,
      "format_ok_rate": 0.8,
      "marker_hit_rate_mean": 0.92
    },
    {
      "cell_id": "mac-m1:ollama:qwen3.5:0.8b",
      "phase": "20q",
      "n_calls": 20,
      "n_errors": 0,
      "duration_ms_p50": 9612,
      "duration_ms_p95": 41200,
      "duration_ms_mean": 12180,
      "tokens_per_sec_p50": 20.9,
      "tokens_per_sec_p95": 7.4,
      "tokens_per_sec_mean": 17.0,
      "tokens_per_sec_max": 24.8,
      "completion_tokens_total": 4280,
      "format_ok_rate": 0.7,
      "marker_hit_rate_mean": 0.78
    }
  ]
}
@@ -0,0 +1,6 @@
{"type":"meta","benchmark_run_id":"00000000-0000-0000-0000-000000000000","harness_version":"public-1","started_at_utc":"2026-05-12T14:32:11Z","host_hostname_short":"alices-mbp","load_avg_start":[1.2,1.4,1.6],"target_url":"http://127.0.0.1:11434","cell_id_prefix":"mac-m1:ollama","submitter_handle":"EXAMPLE","device_tag":"mac-m1-8gb","execution_shape":"per-model-block","phases_planned":["hello","5q","20q"],"models_planned":["qwen3.5:0.8b"],"canonical_options":{"temperature":0.1,"num_ctx":4096,"num_predict":2048},"canonical_options_effective":{"temperature":0.1,"num_ctx":4096,"num_predict":2048},"timeout_seconds":360,"platform_system":"Darwin","platform_release":"23.5.0","python_version":"3.12.4"}
{"type":"call","ts_utc":"2026-05-12T14:32:13Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"hello","question_id":"hello_check","run_idx":0,"duration_seconds":1.847,"prompt_tokens":17,"completion_tokens":42,"tokens_per_second":22.74,"finish_reason":"stop","status_code":200,"response_chars":167,"response_preview":"Hello! Of course, I'd be happy to help. What can I assist you with today? Whether it's a question, a task, or just a chat, I'm here to help.","required_markers":[],"markers_hit":[],"marker_hit_rate":null,"format_rule":"","format_ok":null,"usable_answer":true,"error":null}
{"type":"call","ts_utc":"2026-05-12T14:32:21Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"5q","question_id":"Q1","run_idx":0,"duration_seconds":4.21,"prompt_tokens":58,"completion_tokens":94,"tokens_per_second":22.32,"finish_reason":"stop","status_code":200,"response_chars":312,"response_preview":"#!/usr/bin/env bash\nset -euo pipefail\n\nif [[ ! -d \"$1\" ]]; then\n echo \"err: $1 not a directory\" >&2\n exit 1\nfi\n\nfor f in \"$1\"/*.log; do\n [[ -e \"$f\" ]] || continue\n gzip -k \"$f\"\ndone","required_markers":["gzip","#!/usr/bin/env bash"],"markers_hit":["gzip","#!/usr/bin/env bash"],"marker_hit_rate":1.0,"format_rule":"bash_code","format_ok":true,"usable_answer":true,"error":null}
{"type":"call","ts_utc":"2026-05-12T14:33:48Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"20q","question_id":"Q01","run_idx":0,"duration_seconds":9.612,"prompt_tokens":142,"completion_tokens":201,"tokens_per_second":20.91,"finish_reason":"stop","status_code":200,"response_chars":784,"response_preview":"def is_valid_ipv4(addr: str) -> bool:\n parts = addr.split('.')\n if len(parts) != 4:\n return False\n for p in parts:\n if not p.isdigit():\n return False\n n = int(p)","required_markers":["is_valid_ipv4","def test_"],"markers_hit":["is_valid_ipv4","def test_"],"marker_hit_rate":1.0,"format_rule":"python_code","format_ok":true,"usable_answer":true,"error":null}
{"type":"call","ts_utc":"2026-05-12T14:36:42Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"20q","question_id":"Q14","run_idx":13,"duration_seconds":42.118,"prompt_tokens":189,"completion_tokens":312,"tokens_per_second":7.41,"finish_reason":"stop","status_code":200,"response_chars":1240,"response_preview":"To debug this MyBoard auth issue, the triage should focus on…","required_markers":["/auth/login","/auth/me","myboard:post-login-redirect","tenant-missing"],"markers_hit":["/auth/login","/auth/me","tenant-missing"],"marker_hit_rate":0.75,"format_rule":"json_dict","format_ok":false,"usable_answer":true,"error":null}
{"type":"footer","ts_utc":"2026-05-12T14:48:03Z","finished_at_utc":"2026-05-12T14:48:03Z","load_avg_end":[1.6,1.5,1.6]}
@@ -0,0 +1,92 @@
# EXAMPLE — mac-m1-8gb — qwen3.5:0.8b — 2026-05-12

> **This is a synthetic example so contributors can see the shape of a
> submission end-to-end. The numbers are plausible but not from a real run.
> Don't cite this directory in analysis. Don't copy-paste these numbers.
> Real submissions live alongside this folder under `submissions/<handle>/`.**

**Run ID:** `00000000-0000-0000-0000-000000000000`
**Submitter:** EXAMPLE (synthetic)
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.13 (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, parallel suites need ≥2 warm copies of the model and 8 GB unified didn't fit; edge suites time-budget skipped (would have been ~30 min more)

## Headline numbers

| Cell | Phase | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b | hello | 1 | 22.7 | 22.7 | 1.8 s | n/a |
| mac-m1:ollama:qwen3.5:0.8b | 5q | 5 | 21.4 | 22.3 | 4.2 s | 80% |
| mac-m1:ollama:qwen3.5:0.8b | 20q | 20 | 17.0 | 20.9 | 9.6 s | 70% |

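A minimal sketch of how figures like those in the table above could be derived from the `type=call` records in `run.jsonl`. The field names (`cell_id`, `phase`, `duration_seconds`, `tokens_per_second`, `completion_tokens`, `error`) are taken from the example ledger; the function name `aggregate_calls` is hypothetical, and this is not the catalogue builder's actual implementation. Percentiles other than the median are left out because the source does not say which percentile method is used.

```python
import json
import statistics
from collections import defaultdict


def aggregate_calls(jsonl_lines):
    """Group `type=call` records by (cell_id, phase) and summarise each group."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") != "call":
            continue  # meta, footer and skipped records carry no timings
        groups[(record["cell_id"], record["phase"])].append(record)

    cells = []
    for (cell_id, phase), calls in sorted(groups.items()):
        rates = [c["tokens_per_second"] for c in calls
                 if c.get("tokens_per_second") is not None]
        durations_ms = [c["duration_seconds"] * 1000 for c in calls]
        cells.append({
            "cell_id": cell_id,
            "phase": phase,
            "n_calls": len(calls),
            "n_errors": sum(1 for c in calls if c.get("error")),
            "duration_ms_p50": round(statistics.median(durations_ms)),
            "tokens_per_sec_mean": round(statistics.fmean(rates), 2) if rates else None,
            "completion_tokens_total": sum(c.get("completion_tokens") or 0 for c in calls),
        })
    return cells
```

Calling `aggregate_calls(open("run.jsonl", encoding="utf-8"))` on a real ledger would yield one dictionary per `(cell_id, phase)` pair, which can then be compared against the values in `metadata.json`.
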
## What I observed (qualitative)

- **Hello-call cold-start was fast** — 1.8 s including initial model load.
  Ollama reports the 0.8B GGUF as ~600 MB; on Apple Silicon unified memory
  this loads in well under 2 s.
- **5Q tasks were mostly handled cleanly** — four of the five formats (bash,
  python, shell, four-numbered-steps, json) parsed correctly. The exception
  was Q3 ("shell_lines"), where the model started with a `1.` numbered list
  instead of a raw shell command.
- **20Q tasks bifurcated** — the simple ones (Q01-Q08) ran at full
  ~20 tok/s with high format-correctness; the longer ones (Q09+ with
  multi-paragraph context) saw throughput drop to ~12-15 tok/s, with
  format_ok dropping to ~60%. The p95 duration of ~41 s comes from Q14 (the
  MyBoard triage prompt: long context, mixed format).
- **No errors, no timeouts.** The run stayed on AC power throughout; the
  laptop fan never spun up.

## Methodology

Followed the canonical Pavilion methodology with these deviations:

- **NUM_PARALLEL=1** instead of canonical 3 — laptop, not server; one slot
  is enough for sequential per-model-block execution.
- **KEEP_ALIVE=5m** instead of canonical 2400h — laptop, no need to pin.
- **Phases `parallel_same`, `parallel_mixed`, `edge_append`, `edge_suite`
  skipped** — see top of file. Run not eligible for `flagship` grade,
  intended as `standard`.

## Caveats

- 8 GB unified RAM is below the comfort floor for parallel suites with this
  model; results above are NOT a refutation of the canonical parallel
  numbers — they're from a different shape of run.
- macOS Spotlight indexing was disabled before the run started. If you
  rerun without disabling, expect ~5-10% additional variance from
  background I/O.
- `format_ok` rate of 70% on 20Q is consistent with Sloba's flagship 20Q
  numbers for qwen3.5:0.8b on Pavilion (~74-78% in the v1 baseline) within
  measurement noise.

## Reproducibility

```
ollama pull qwen3.5:0.8b
ollama serve   # in a separate terminal

python3 harness/run_benchmark.py \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b \
  --cell-id-prefix mac-m1:ollama \
  --phases hello,5q,20q \
  --submitter-handle EXAMPLE \
  --device-tag mac-m1-8gb
```

Took ~16 minutes wall-clock on this hardware.

## Privacy attestation

I scanned every file in the run directory, including `run.jsonl`, for
personal paths, API tokens, SSH keys, and home-directory leakage:
```
grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" \
  submissions/EXAMPLE/mac-m1-8gb/run-00000000-0000-0000-0000-000000000000/*
```
No matches outside the SSH-troubleshooting prompt in 5Q (Q3), which is
intentional curriculum. Safe to ship.

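For agents that prefer to stay inside Python, the grep above can be approximated with the standard library. This is only a sketch under assumptions: the function names `scan_text` and `scan_run_dir` are hypothetical, the pattern is copied from the grep command, and nothing here is part of `run_benchmark.py`.

```python
import re
from pathlib import Path

# Same alternation as the grep command in this section.
SENSITIVE = re.compile(
    r"Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519"
)


def scan_text(text, label="<text>"):
    """Return (label, line_number, line) for each line matching SENSITIVE."""
    return [
        (label, number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if SENSITIVE.search(line)
    ]


def scan_run_dir(run_dir):
    """Scan every regular file directly inside run_dir."""
    hits = []
    for path in sorted(Path(run_dir).iterdir()):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            hits.extend(scan_text(text, str(path)))
    return hits
```

An empty list from `scan_run_dir(...)` corresponds to grep printing nothing. Any hit still needs a human or agent decision about whether it is intentional prompt content.
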
— EXAMPLE (synthetic; not a real contributor)
78
submissions/README.md
Normal file
@@ -0,0 +1,78 @@
# `submissions/`

Friends' benchmark contributions land here, one directory per submitter,
one subdirectory per device, one sub-subdirectory per run.

## Layout

```
submissions/
├── README.md                              — this file
├── EXAMPLE/                               — template; see below
│   └── mac-m1-8gb/
│       └── run-00000000-...-000000000000/
│           ├── manifest.json
│           ├── hardware.json
│           ├── run.jsonl
│           ├── metadata.json
│           └── run.md
├── alice/                                 — first real friend's contributions
│   └── mac-m1-8gb/
│       └── run-<uuid>/...
└── bob/                                   — etc.
    └── rtx-4090-pc/
        └── run-<uuid>/...
```

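As an illustration of the layout above, the following sketch walks a `submissions/` tree and reports run directories that lack any of the five files shown under `EXAMPLE/`. The helper name `find_incomplete_runs` is hypothetical and the check is an assumption about what a reviewer might want; the repository does not necessarily ship such a tool.

```python
from pathlib import Path

# File names as they appear in the layout tree.
EXPECTED_FILES = {
    "manifest.json",
    "hardware.json",
    "run.jsonl",
    "metadata.json",
    "run.md",
}


def find_incomplete_runs(submissions_root):
    """Map each incomplete run directory (relative path) to its missing files."""
    root = Path(submissions_root)
    problems = {}
    # Layout: <handle>/<device-tag>/run-<uuid>/
    for run_dir in sorted(root.glob("*/*/run-*")):
        if not run_dir.is_dir():
            continue
        present = {p.name for p in run_dir.iterdir() if p.is_file()}
        missing = EXPECTED_FILES - present
        if missing:
            problems[run_dir.relative_to(root).as_posix()] = sorted(missing)
    return problems
```

An empty dictionary means every run directory found has all five files.
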
## Per-submission contents

Five files inside each `run-<uuid>/`:

- **`manifest.json`** — automatic; `run_benchmark.py` writes it at run start. Contains submitter handle, device tag, target URL, model list, phase plan, canonical-options overrides, host hostname (short), platform, started-at timestamp.
- **`hardware.json`** — agent fills from a hardware probe (see `CLAUDE.md` §2). Schema version `hardware-1.0`.
- **`run.jsonl`** — automatic; the canonical event ledger. Line 1 is `type=meta`; subsequent lines are `type=call` or `type=skipped`; final line is `type=footer`.
- **`metadata.json`** — agent fills with computed aggregates per `(cell_id, phase)` cell. Schema version `metadata-1.0`. The catalogue builder will recompute on Sloba's side; having it in the PR makes review fast.
- **`run.md`** — agent fills using the `CLAUDE.md` §6b template. Honest narrative — methodology deviations, caveats, headline numbers.

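The `run.jsonl` shape described above (a `meta` record first, `call` or `skipped` records in the middle, a `footer` record last) can be checked mechanically before opening a PR. The sketch below is an assumption about how such a check might look; `check_ledger` is a hypothetical name, and it only inspects the `type` field of each record.

```python
import json


def check_ledger(lines):
    """Return a list of structural problems; an empty list means the shape fits."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return ["ledger is empty"]
    problems = []
    if records[0].get("type") != "meta":
        problems.append("record 1 is not type=meta")
    if records[-1].get("type") != "footer":
        problems.append("final record is not type=footer")
    # Everything between the first and last record must be a call or a skip.
    for number, record in enumerate(records[1:-1], start=2):
        if record.get("type") not in ("call", "skipped"):
            problems.append(
                f"record {number} has unexpected type {record.get('type')!r}"
            )
    return problems
```

Record numbers count non-blank lines only, so they match file line numbers as long as the ledger contains no blank lines.
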
## Why per-submitter folders?

- **Attribution** — your handle lives next to your data
- **Reviewability** — a PR adds files only under `submissions/<your-handle>/...`; reviewer can see the whole contribution at a glance
- **No collisions** — two friends submitting from "macbook-pro" don't overwrite each other
- **History stays clean** — re-runs go into new `run-<uuid>/` subdirs, not on top of the old one

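The reviewability rule above lends itself to a quick self-check before pushing. The sketch below assumes the list of changed paths has already been collected (for instance from `git diff --name-only` against the base branch, whose name is an assumption here); `pr_paths_ok` is a hypothetical helper, not something the harness provides.

```python
from pathlib import PurePosixPath


def pr_paths_ok(changed_paths, handle):
    """True when every changed path sits below submissions/<handle>/."""
    prefix = ("submissions", handle)

    def inside(path):
        parts = PurePosixPath(path).parts
        # Must have the prefix plus at least one further component.
        return parts[:2] == prefix and len(parts) > 2

    return bool(changed_paths) and all(inside(p) for p in changed_paths)
```

For example, a PR from `alice` that also touches `harness/run_benchmark.py` would make this return `False`, which signals that the harness change belongs in a separate PR.
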
## Naming conventions

- **`<submitter-handle>`** — your Gitea username, or any other handle you'd like to be credited as. Lowercase; ASCII letters / digits / hyphens only.
- **`<device-tag>`** — short descriptor of the hardware. Pattern: `<chip-or-platform>-<key-spec>`. Examples:
  - `mac-m1-8gb`, `mac-m2-pro-16gb`, `mac-m3-max-64gb`
  - `rtx-4090-pc`, `rtx-3060-laptop`, `gtx-1060-6gb`
  - `ryzen-7950x-cpu`, `intel-i9-13900k-cpu`
  - `pixel-8-pro`, `samsung-s24-ultra` (yes, phones — if you've got termux working)
  - `runpod-h100-pcie`, `runpod-rtx-a6000`
- **`run-<uuid>`** — `run-` prefix + a UUID v4 from `run_benchmark.py`. Don't shorten.

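The handle and run-directory rules above are strict enough to check in code. The following is a sketch under assumptions: `is_valid_handle` and `is_valid_run_dir` are hypothetical names, and the device tag is left unchecked because the text gives a pattern and examples but no explicit character set.

```python
import re
import uuid

# Lowercase ASCII letters, digits and hyphens only.
HANDLE_RE = re.compile(r"[a-z0-9-]+")


def is_valid_handle(handle):
    """Check a submitter handle against the stated character rules."""
    return HANDLE_RE.fullmatch(handle) is not None


def is_valid_run_dir(name):
    """Check for the `run-` prefix followed by a full, unshortened UUID v4."""
    if not name.startswith("run-"):
        return False
    candidate = name[len("run-"):]
    try:
        parsed = uuid.UUID(candidate)
    except ValueError:
        return False
    # Require the canonical lowercase hyphenated form and version 4.
    return str(parsed) == candidate and parsed.version == 4
```

Note that the `EXAMPLE` template would fail both checks on purpose: its handle is uppercase and its all-zero run ID is not a version 4 UUID, which helps keep it distinguishable from real submissions.
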
## What the EXAMPLE folder is for

A complete-but-tiny submission you can read end-to-end to understand the
shapes. **Don't modify the EXAMPLE folder in a benchmark-submission PR**; if
you spot a bug in the example, that's a separate PR with the title
`fix: submissions/EXAMPLE/...`.

## When a submission is merged

Sloba reviews and merges manually. After merge:

1. The catalogue builder on Sloba's side picks up your run, computes a
   `cell_id` from your `device-tag` + model, and assigns it a `site_grade`
   (flagship / standard / archive-only based on the criteria in
   `methodology.md`).
2. Janie (the benchmarks blogger) may write a `janie_blurb_md` for it.
3. It appears on `benchmarks.weeyuga.com` (when the site is live).
4. Your `device-tag` becomes a permanent comparison axis on the catalogue.

## What if I want to delete a submission later?

Open an issue and we'll honor the request promptly. We'll keep the run
directory but mark it `visibility: redacted` in the catalogue overlay so
the data still validates historical analysis claims while disappearing
from the browse surface.