crowdsourced runner
Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."
The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.
What landed:
CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
Full agent runbook: hardware probe, runtime + model selection,
canonical knob reference (Sloba's Pavilion methodology values),
hardware-adaptation decision rules, run-instructions, output-schema
templates for hardware.json + metadata.json + run.md, PR submission
flow (fork → branch → push → PR; nothing auto-merges), privacy
guardrails, methodology lineage. Per Sloba's Q3 directive: the
runbook explicitly tells the friend's agent to ADAPT to hardware
reality and document deviations rather than blindly run defaults.
CONTRIBUTING.md (~110 lines)
Human-readable companion for the friend (not the agent). What you
need, how it works, what we ask, what maintainers commit to,
license, code-of-conduct short version.
harness/
├── README.md          Technical readme for the harness folder
├── run_benchmark.py   ~520 LOC runner. Stdlib-only. Adapted from
│                      WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│                      v3 with the cluster-internal IP defaults
│                      (10.8.0.x) replaced by 127.0.0.1:11434, the
│                      cluster /v1/cluster/* endpoints removed, the
│                      canonical-suite paths under ~/Documents/MyServers
│                      replaced by harness/suites/ paths, the git-sha
│                      enforcement on WeeyugaWeb dropped, and the
│                      output written under submissions/<handle>/<tag>/
│                      instead of docs/BENCHMARKS/runs/. Supports all
│                      six suite phases via --phases, plus 'all'.
├── prompts.py         Verbatim copy of the canonical 3 frozen prompts
│                      (P-EASY/P-MEDIUM/P-HARD) from
│                      WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt   Empty by intent (stdlib-only); placeholder for
│                      pip-tools / agent auto-install patterns.
├── .gitignore         __pycache__/ etc.
└── suites/            Six bundled JSON suites copied verbatim from
                       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
                       small_model_eval_questions.json,
                       python_task_suite_questions.json,
                       parallel_qwen_same_model_20q_suite.json,
                       parallel_qwen_mixed_model_20q_suite.json,
                       python_context_edge_append_questions.json,
                       python_context_edge_suite_only.json.
submissions/
README.md Folder convention + naming + reviewability rules
EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
Synthetic-but-shape-complete contribution template:
manifest.json, hardware.json, run.jsonl (5 example lines),
metadata.json, run.md (with privacy attestation, methodology
deviations, reproducibility command). Marked as synthetic at
the top so future analysis doesn't accidentally cite it.
LICENSE-MIT
MIT for harness/*.py and future helper code. Existing LICENSE
(CC-BY-4.0) covers data files.
README.md (modified)
Updated to reflect dual purpose. Layout diagram updated.
Maintainer credits: Ben for catalogue/methodology + Bane for harness.
Contributor quick-start added. Status table extended.
Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
paths / tokens. Two prompts contain project names ("MyBoard" auth
debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
flagged in chat for Sloba's review. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
IPs leaked).
- manifest.json captures host_hostname_short via
socket.gethostname().split('.')[0]; the agent should review it before
the PR if the hostname is sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.
Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid
This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
79 lines
3.9 KiB
Markdown
# `submissions/`

Friends' benchmark contributions land here, one directory per submitter,
one subdirectory per device, one sub-subdirectory per run.

## Layout

```
submissions/
├── README.md                      — this file
├── EXAMPLE/                       — template; see below
│   └── mac-m1-8gb/
│       └── run-00000000-...-000000000000/
│           ├── manifest.json
│           ├── hardware.json
│           ├── run.jsonl
│           ├── metadata.json
│           └── run.md
├── alice/                         — first real friend's contributions
│   └── mac-m1-8gb/
│       └── run-<uuid>/...
└── bob/                           — etc.
    └── rtx-4090-pc/
        └── run-<uuid>/...
```

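The three-level layout makes it straightforward to enumerate runs with a glob. The sketch below is illustrative only: `iter_runs` is a hypothetical helper, not part of the harness, and it assumes that anything matching `<submitter>/<device-tag>/run-*` is a run directory and that `EXAMPLE` should be skipped because it is a template.

```python
from pathlib import Path


def iter_runs(root):
    """Yield (submitter, device_tag, run_dir) for each run directory under root.

    Hypothetical helper: assumes the submissions/<submitter>/<device-tag>/run-*/
    layout shown above.
    """
    root = Path(root)
    for run_dir in sorted(root.glob("*/*/run-*")):
        if not run_dir.is_dir():
            continue
        submitter, device_tag = run_dir.parts[-3], run_dir.parts[-2]
        if submitter == "EXAMPLE":
            # The EXAMPLE folder is a template, not a real contribution.
            continue
        yield submitter, device_tag, run_dir
```

Calling `list(iter_runs("submissions"))` from the repo root would list every contributed run as a `(submitter, device_tag, path)` tuple.
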
## Per-submission contents

Five files inside each `run-<uuid>/`:

- **`manifest.json`** — automatic; `run_benchmark.py` writes it at run start. Contains submitter handle, device tag, target URL, model list, phase plan, canonical-options overrides, host hostname (short), platform, started-at timestamp.
- **`hardware.json`** — agent fills from a hardware probe (see `CLAUDE.md` §2). Schema version `hardware-1.0`.
- **`run.jsonl`** — automatic; the canonical event ledger. Line 1 is `type=meta`; subsequent lines are `type=call` or `type=skipped`; final line is `type=footer`.
- **`metadata.json`** — agent fills with computed aggregates per `(cell_id, phase)` cell. Schema version `metadata-1.0`. The catalogue builder will recompute on Sloba's side; having it in the PR makes review fast.
- **`run.md`** — agent fills using the `CLAUDE.md` §6b template. Honest narrative — methodology deviations, caveats, headline numbers.

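Before opening a PR, it can help to sanity-check a run directory against the list above. The following is a minimal sketch, not the maintainers' validator: `missing_files` and `ledger_problems` are hypothetical helpers, and the code assumes each `run.jsonl` line is a JSON object whose record type lives in a top-level `"type"` key.

```python
import json
from pathlib import Path

# The five files listed above.
EXPECTED_FILES = ("manifest.json", "hardware.json", "run.jsonl", "metadata.json", "run.md")


def missing_files(run_dir):
    """Return the expected file names that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]


def ledger_problems(lines):
    """Check the meta / call-or-skipped / footer ordering of a run.jsonl ledger.

    Assumption: every non-blank line is a JSON object with a "type" key.
    """
    records = [json.loads(line) for line in lines if line.strip()]
    if len(records) < 2:
        return ["ledger needs at least a meta record and a footer record"]
    problems = []
    if records[0].get("type") != "meta":
        problems.append("record 1 is not type=meta")
    if records[-1].get("type") != "footer":
        problems.append("final record is not type=footer")
    for number, record in enumerate(records[1:-1], start=2):
        if record.get("type") not in ("call", "skipped"):
            problems.append(f"record {number} has unexpected type {record.get('type')!r}")
    return problems
```

A typical use would be `ledger_problems(Path(run_dir, "run.jsonl").read_text().splitlines())`; an empty list means the ordering looks right, though it says nothing about the contents of each record.
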
## Why per-submitter folders?

- **Attribution** — your handle lives next to your data
- **Reviewability** — a PR adds files only under `submissions/<your-handle>/...`; reviewer can see the whole contribution at a glance
- **No collisions** — two friends submitting from "macbook-pro" don't overwrite each other
- **History stays clean** — re-runs go into new `run-<uuid>/` subdirs, not on top of the old one

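The reviewability rule lends itself to a mechanical check on a PR's changed-file list. Here is one possible sketch; `paths_outside_handle` is a hypothetical name, and the input is assumed to be repo-relative paths with forward slashes (the form `git diff --name-only` prints).

```python
from pathlib import PurePosixPath


def paths_outside_handle(changed_paths, handle):
    """Return the changed paths that are not under submissions/<handle>/.

    Hypothetical helper: expects repo-relative, forward-slash paths.
    """
    allowed = PurePosixPath("submissions") / handle
    return [p for p in changed_paths if allowed not in PurePosixPath(p).parents]
```

An empty result means the PR touches only the submitter's own folder. Comparing whole path components (rather than string prefixes) keeps `submissions/alice-2/...` from passing as `alice`.
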
## Naming conventions

- **`<submitter-handle>`** — your Gitea username, or any other handle you'd like to be credited as. Lowercase; ASCII letters / digits / hyphens only.
- **`<device-tag>`** — short descriptor of the hardware. Pattern: `<chip-or-platform>-<key-spec>`. Examples:
  - `mac-m1-8gb`, `mac-m2-pro-16gb`, `mac-m3-max-64gb`
  - `rtx-4090-pc`, `rtx-3060-laptop`, `gtx-1060-6gb`
  - `ryzen-7950x-cpu`, `intel-i9-13900k-cpu`
  - `pixel-8-pro`, `samsung-s24-ultra` (yes, phones — if you've got termux working)
  - `runpod-h100-pcie`, `runpod-rtx-a6000`
- **`run-<uuid>`** — `run-` prefix + a UUID v4 from `run_benchmark.py`. Don't shorten.

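The handle and run-directory rules above can be checked with the standard library. This is a sketch under stated assumptions rather than the harness's own validation: `is_valid_handle` and `is_valid_run_dir` are hypothetical names, and the run check assumes the UUID is written in the usual hyphenated 36-character form.

```python
import re
import uuid

# Lowercase ASCII letters, digits, and hyphens only.
HANDLE_RE = re.compile(r"[a-z0-9-]+")


def is_valid_handle(handle):
    """True if the handle uses only the allowed characters (and is non-empty)."""
    return HANDLE_RE.fullmatch(handle) is not None


def is_valid_run_dir(name):
    """True if name is 'run-' followed by a full, hyphenated UUID v4."""
    prefix = "run-"
    if not name.startswith(prefix):
        return False
    text = name[len(prefix):]
    try:
        parsed = uuid.UUID(text)
    except ValueError:
        return False
    # uuid.UUID also accepts unhyphenated and braced forms; comparing against
    # the canonical string rejects those, along with shortened IDs.
    return parsed.version == 4 and str(parsed) == text.lower()
```

Device tags are left unchecked here because the pattern `<chip-or-platform>-<key-spec>` is a convention for humans rather than a strict grammar.
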
## What the EXAMPLE folder is for

A complete-but-tiny submission you can read end-to-end to understand the
shapes. **Don't modify the EXAMPLE folder in a benchmark-submission PR**; if
you spot a bug in the example, that's a separate PR with the title
`fix: submissions/EXAMPLE/...`.

## When a submission is merged

Sloba reviews and merges manually. After merge:
1. The catalogue builder on Sloba's side picks up your run, computes a
   `cell_id` from your `device-tag` + model, and assigns it a `site_grade`
   (flagship / standard / archive-only based on the criteria in
   `methodology.md`).
2. Janie (the benchmarks blogger) may write a `janie_blurb_md` for it.
3. It appears on `benchmarks.weeyuga.com` (when the site is live).
4. Your `device-tag` becomes a permanent comparison axis on the catalogue.

## What if I want to delete a submission later?

Open an issue and we'll honor the request promptly. We'll keep the run
directory but mark it `visibility: redacted` in the catalogue overlay, so
the data still validates historical analysis claims while disappearing
from the browse surface.