crowdsourced runner
Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."
The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.
What landed:
CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
Full agent runbook: hardware probe, runtime + model selection,
canonical knob reference (Sloba's Pavilion methodology values),
hardware-adaptation decision rules, run-instructions, output-schema
templates for hardware.json + metadata.json + run.md, PR submission
flow (fork → branch → push → PR; nothing auto-merges), privacy
guardrails, methodology lineage. Per Sloba's Q3 directive: the
runbook explicitly tells the friend's agent to ADAPT to hardware
reality and document deviations rather than blindly run defaults.
CONTRIBUTING.md (~110 lines)
Human-readable companion for the friend (not the agent). What you
need, how it works, what we ask, what maintainers commit to,
license, code-of-conduct short version.
harness/
├── README.md Technical readme for the harness folder
├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
│ WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│ v3 with the cluster-internal IP defaults
│ (10.8.0.x) replaced by 127.0.0.1:11434, the
│ cluster /v1/cluster/* endpoints removed, the
│ canonical-suite paths under ~/Documents/MyServers
│ replaced by harness/suites/ paths, the git-sha
│ enforcement on WeeyugaWeb dropped, and the
│ output written under submissions/<handle>/<tag>/
│ instead of docs/BENCHMARKS/runs/. Supports all
│ six suite phases via --phases, plus 'all'.
├── prompts.py Verbatim copy of the canonical 3 frozen prompts
│ (P-EASY/P-MEDIUM/P-HARD) from
│ WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt Empty by intent (stdlib-only); placeholder for
│ pip-tools / agent auto-install patterns.
├── .gitignore __pycache__/ etc.
└── suites/ Six bundled JSON suites copied verbatim from
Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
small_model_eval_questions.json, python_task_suite_questions.json,
parallel_qwen_same_model_20q_suite.json,
parallel_qwen_mixed_model_20q_suite.json,
python_context_edge_append_questions.json,
python_context_edge_suite_only.json.
submissions/
README.md Folder convention + naming + reviewability rules
EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
Synthetic-but-shape-complete contribution template:
manifest.json, hardware.json, run.jsonl (5 example lines),
metadata.json, run.md (with privacy attestation, methodology
deviations, reproducibility command). Marked as synthetic at
the top so future analysis doesn't accidentally cite it.
LICENSE-MIT
MIT for harness/*.py and future helper code. Existing LICENSE
(CC-BY-4.0) covers data files.
README.md (modified)
Updated to reflect dual purpose. Layout diagram updated.
Maintainer credits: Ben for catalogue/methodology + Bane for harness.
Contributor quick-start added. Status table extended.
Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
paths / tokens. Two prompts contain project names ("MyBoard" auth
debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
flagged in chat for Sloba's review. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
IPs leaked).
- manifest.json captures host_hostname_short via socket.gethostname()
.split('.')[0] — agent should review before PR if hostname is
sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.
Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid
This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
152 lines
5.6 KiB
Markdown
152 lines
5.6 KiB
Markdown
# `harness/` — runner + suites + prompts
|
|
|
|
Self-contained, dependency-free Python 3 benchmark runner. Drives any
|
|
OpenAI-compatible `/v1/chat/completions` endpoint with the canonical
|
|
Weeyuga prompt set; emits a JSONL event ledger.
|
|
|
|
## Files
|
|
|
|
```
|
|
harness/
|
|
├── README.md — this file
|
|
├── run_benchmark.py — the runner (one Python 3 process, stdlib only)
|
|
├── prompts.py — 3 frozen reference prompts (P-EASY/P-MEDIUM/P-HARD)
|
|
├── requirements.txt — empty by intent (stdlib only); listed for tooling
|
|
└── suites/
|
|
├── small_model_eval_questions.json — 5Q (5 short tasks, format-checked)
|
|
├── python_task_suite_questions.json — 20Q (20 realistic Python prompts)
|
|
├── parallel_qwen_same_model_20q_suite.json — same-model parallel-lane stress
|
|
├── parallel_qwen_mixed_model_20q_suite.json — mixed-model parallel-lane stress
|
|
├── python_context_edge_append_questions.json — long-context append behavior
|
|
└── python_context_edge_suite_only.json — long-context whole-suite reasoning
|
|
```
|
|
|
|
## Quick reference
|
|
|
|
```bash
|
|
# Smoke (one hello call, end-to-end runtime check)
|
|
python3 harness/run_benchmark.py --smoke \
|
|
--target-url http://127.0.0.1:11434 \
|
|
--models qwen3.5:0.8b \
|
|
--cell-id-prefix mac:ollama \
|
|
--submitter-handle alice --device-tag mac-m1-8gb
|
|
|
|
# Default phases (hello + 5q + 20q)
|
|
python3 harness/run_benchmark.py \
|
|
--target-url http://127.0.0.1:11434 \
|
|
--models qwen3.5:0.8b \
|
|
--cell-id-prefix mac:ollama \
|
|
--submitter-handle alice --device-tag mac-m1-8gb
|
|
|
|
# Full sweep (all six suites)
|
|
python3 harness/run_benchmark.py --phases all \
|
|
--target-url http://127.0.0.1:11434 \
|
|
--models qwen3.5:0.8b \
|
|
--cell-id-prefix mac:ollama \
|
|
--submitter-handle alice --device-tag mac-m1-8gb
|
|
|
|
# Probe (list models + one hello, no ledger written)
|
|
python3 harness/run_benchmark.py --probe \
|
|
--target-url http://127.0.0.1:11434 \
|
|
--cell-id-prefix mac:ollama
|
|
```
|
|
|
|
## Output layout
|
|
|
|
The runner writes to `submissions/<submitter-handle>/<device-tag>/run-<uuid>/`:
|
|
|
|
```
|
|
submissions/alice/mac-m1-8gb/run-<uuid>/
|
|
├── run.jsonl — event ledger; one JSON object per line
|
|
├── manifest.json — automatic; written at run start
|
|
├── hardware.json — agent fills from hardware probe (see CLAUDE.md §2)
|
|
├── metadata.json — agent fills from aggregates (see CLAUDE.md §6)
|
|
└── run.md — agent fills from template (see CLAUDE.md §6)
|
|
```
|
|
|
|
## Knobs
|
|
|
|
CLI flags (see `--help`):
|
|
- `--target-url` — OpenAI-compat base URL (default `http://127.0.0.1:11434`)
|
|
- `--models` — comma-separated, or `auto` for `/v1/models` discovery
|
|
- `--cell-id-prefix` — `<node-tag>:<engine>` for the JSONL `cell_id` field
|
|
- `--phases` — subset of `hello, frozen, 5q, 20q, parallel_same, parallel_mixed, edge_append, edge_suite`, or `all`
|
|
- `--timeout` — per-call wall-clock cap (default 360 s)
|
|
- `--temperature` / `--num-ctx` / `--num-predict` — override canonical knobs
|
|
- `--probe` / `--smoke` — health-check shortcuts
|
|
- `--run-id` / `--out-dir` — resume / custom output
|
|
|
|
Canonical defaults (in code):
|
|
|
|
```python
|
|
CANONICAL_OPTIONS = {
|
|
"temperature": 0.1,
|
|
"num_ctx": 4096,
|
|
"num_predict": 2048,
|
|
}
|
|
```
|
|
|
|
Any deviation is recorded automatically in `manifest.json.canonical_options_overrides`.
|
|
|
|
## Dependencies
|
|
|
|
**None beyond Python 3 stdlib.** `urllib.request` does the HTTP, `json` does
|
|
serde, `uuid` makes the run-id. The empty `requirements.txt` exists so tools
|
|
like `pip-tools` and reproducibility scripts have a hook; if a future version
|
|
adds dependencies they'll land there with pinned versions.
|
|
|
|
Tested against Python 3.10, 3.11, 3.12. Earlier 3.x may work but isn't tested.
|
|
|
|
## Suite shapes
|
|
|
|
All six `suites/*.json` follow the same top-level shape:
|
|
|
|
```json
|
|
{
|
|
"suite_name": "...",
|
|
"version": "1",
|
|
"purpose": "...",
|
|
"models": ["..."], // advisory; runner uses --models flag
|
|
"questions": [
|
|
{
|
|
"id": "Q01",
|
|
"prompt": "...",
|
|
"required_markers": ["..."], // optional; lower-cased substring matches
|
|
"format_rule": "..." // optional; one of: bash_code, python_code, shell_lines, four_numbered_steps, five_bullets, json_dict, pytest_code
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
`required_markers` and `format_rule` are heuristic — they exist to flag
|
|
"obviously wrong shape" answers without claiming semantic correctness. Don't
|
|
treat them as ground truth; treat them as a sanity check.
|
|
|
|
The parallel and edge suites add more top-level fields (`run_mode`, `lanes`,
|
|
`question_assignment`, etc.) for advisory context; the runner reads only
|
|
`questions[]` from any suite.
|
|
|
|
## Adding a new suite
|
|
|
|
For now: don't. The six suites are stable and adding more in this branch
|
|
breaks comparability with the existing 21 reference runs. If you want a new
|
|
suite, open an issue on this repo proposing it; we'll discuss whether it
|
|
warrants a `HARNESS_VERSION = public-2` bump (suites would still need to be
|
|
backwards-compatible — adding new phase keys is fine, redefining existing
|
|
ones is not).
|
|
|
|
## Why no `pip install -e` / no Python package?
|
|
|
|
This is a **scripts directory**, not a library. The runner is one file. The
|
|
suites are data files. Friends running this from a fresh clone shouldn't have
|
|
to deal with packaging, virtualenvs (beyond what their agent recommends), or
|
|
upgrade flows. If/when this grows past one runner, we'll split it.
|
|
|
|
## License
|
|
|
|
`prompts.py`, `run_benchmark.py`, and any future `harness/*.py` code: MIT
|
|
(see [LICENSE-MIT](../LICENSE-MIT)).
|
|
|
|
`suites/*.json`: CC-BY-4.0 (see [LICENSE](../LICENSE)) — same as the bench
|
|
data they test against.
|