feat: harness + agent runbook — flip repo from archive-only to crowdsourced runner
Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."
The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.
What landed:
CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
Full agent runbook: hardware probe, runtime + model selection,
canonical knob reference (Sloba's Pavilion methodology values),
hardware-adaptation decision rules, run-instructions, output-schema
templates for hardware.json + metadata.json + run.md, PR submission
flow (fork → branch → push → PR; nothing auto-merges), privacy
guardrails, methodology lineage. Per Sloba's Q3 directive: the
runbook explicitly tells the friend's agent to ADAPT to hardware
reality and document deviations rather than blindly run defaults.
CONTRIBUTING.md (~110 lines)
Human-readable companion for the friend (not the agent). What you
need, how it works, what we ask, what maintainers commit to,
license, code-of-conduct short version.
harness/
├── README.md          Technical readme for the harness folder
├── run_benchmark.py   ~520 LOC runner. Stdlib-only. Adapted from
│                      WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│                      v3 with the cluster-internal IP defaults
│                      (10.8.0.x) replaced by 127.0.0.1:11434, the
│                      cluster /v1/cluster/* endpoints removed, the
│                      canonical-suite paths under ~/Documents/MyServers
│                      replaced by harness/suites/ paths, the git-sha
│                      enforcement on WeeyugaWeb dropped, and the
│                      output written under submissions/<handle>/<tag>/
│                      instead of docs/BENCHMARKS/runs/. Supports all
│                      six suite phases via --phases, plus 'all'.
├── prompts.py         Verbatim copy of the canonical 3 frozen prompts
│                      (P-EASY/P-MEDIUM/P-HARD) from
│                      WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt   Empty by intent (stdlib-only); placeholder for
│                      pip-tools / agent auto-install patterns.
├── .gitignore         __pycache__/ etc.
└── suites/            Six bundled JSON suites copied verbatim from
                       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
                         small_model_eval_questions.json,
                         python_task_suite_questions.json,
                         parallel_qwen_same_model_20q_suite.json,
                         parallel_qwen_mixed_model_20q_suite.json,
                         python_context_edge_append_questions.json,
                         python_context_edge_suite_only.json.
submissions/
README.md Folder convention + naming + reviewability rules
EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
Synthetic-but-shape-complete contribution template:
manifest.json, hardware.json, run.jsonl (5 example lines),
metadata.json, run.md (with privacy attestation, methodology
deviations, reproducibility command). Marked as synthetic at
the top so future analysis doesn't accidentally cite it.
LICENSE-MIT
MIT for harness/*.py and future helper code. Existing LICENSE
(CC-BY-4.0) covers data files.
README.md (modified)
Updated to reflect dual purpose. Layout diagram updated.
Maintainer credits: Ben for catalogue/methodology + Bane for harness.
Contributor quick-start added. Status table extended.
Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
paths / tokens. Two prompts were flagged in chat for Sloba's review:
20Q-Q14 mentions a project name ("MyBoard" auth debugging) and 5Q-Q03
is generic SSH troubleshooting. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
IPs leaked).
- manifest.json captures host_hostname_short via socket.gethostname()
.split('.')[0] — agent should review before PR if hostname is
sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.
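As an illustration of the privacy posture above, here is a minimal sketch of a pre-push scan. It is not the grep that CLAUDE.md §8 prescribes; `find_privacy_hits` and the home-path pattern are hypothetical names and assumptions made for this example. Only the `socket.gethostname().split('.')[0]` expression comes from the notes above.

```python
import ipaddress
import re
import socket

# Candidate IPv4 literals; each one is validated with ipaddress below.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Home-directory prefixes that would reveal a local username (assumed pattern).
HOME_PATH_RE = re.compile(r"(?:/Users|/home)/[A-Za-z0-9._-]+")


def short_hostname() -> str:
    """Short hostname, using the expression quoted in the commit message."""
    return socket.gethostname().split(".")[0]


def find_privacy_hits(text: str) -> list[str]:
    """Return private, non-loopback IPv4 addresses and home paths found in text."""
    hits = []
    for candidate in IPV4_RE.findall(text):
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # looks like an address but is not one, e.g. 999.1.1.1
        if addr.is_private and not addr.is_loopback:
            hits.append(candidate)
    hits.extend(HOME_PATH_RE.findall(text))
    return hits
```

Loopback addresses such as the default 127.0.0.1:11434 target are deliberately ignored, so a clean submission returns an empty list. Review the value of `short_hostname()` by eye, since a scan cannot tell whether a hostname is sensitive.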
Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid
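The JSON and JSONL checks listed above could be reproduced with something like the following sketch. The function names are hypothetical, and treating blank lines as acceptable and requiring every JSONL line to be a JSON object are assumptions of this example, not rules taken from the harness.

```python
import json
from pathlib import Path


def invalid_json_files(paths: list[Path]) -> list[Path]:
    """Return the paths that cannot be read or parsed as JSON."""
    bad = []
    for path in paths:
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            bad.append(path)
    return bad


def invalid_jsonl_lines(path: Path) -> list[int]:
    """Return 1-based line numbers that are not a single JSON object."""
    bad = []
    with path.open(encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are tolerated (assumption)
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            if not isinstance(record, dict):
                bad.append(lineno)
    return bad
```

Both functions return an empty list when everything parses, which makes them easy to call from a pre-PR check.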
This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
161
README.md
@@ -1,54 +1,112 @@
# weeyuga-benchmarks-public

> **Status: PRIVATE STAGING** — this repo is not yet public. Flips to anonymous-read after [Miljan + Stevan's pre-launch security audit](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination/messages) signs off. If you're reading this and you're not on the Weeyuga team, you got here too early.
> **Status: PRIVATE STAGING** — this repo is not yet anonymous-readable.
> Flips to public after the pre-launch security audit signs off.
> If you got here too early, please hold; you'll be invited soon.

Canonical raw-data archive for **[benchmarks.weeyuga.com](https://benchmarks.weeyuga.com)** — every benchmark run we publish on the site is mirrored here as raw JSONL + log + human summary so anyone can clone, re-analyse, or cite.
Open benchmarks for local LLMs — same prompts, same suites, run on
whatever hardware you've got, results compose into one ladder.

This repo is two things in one:

1. **Canonical archive** — every benchmark Sloba's team publishes on
[benchmarks.weeyuga.com](https://benchmarks.weeyuga.com) lives here as
raw JSONL + computed metadata + human summary, so anyone can clone,
re-analyse, or cite. This is the original purpose; see `runs/`.
2. **Crowdsourced runner** — a portable harness + agent runbook so a
friend's coding agent (Claude Code, Codex, Aider, …) can clone this
repo, read `CLAUDE.md` / `AGENTS.md`, run the same suite on the
friend's hardware, and submit the result back as a PR. This is the
newer purpose; see `harness/` + `submissions/`.

Both purposes share one schema, one prompt set, one methodology.

## Quick start — for friends contributing a benchmark

```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
cd weeyuga-benchmarks-public
# Hand this file path to your coding agent and say:
# "Read CLAUDE.md, then run the Weeyuga benchmark on my hardware."
```

Then read `CONTRIBUTING.md` for the human-side flow (what to expect,
how reviews work, license, code of conduct).

## Layout

```
.
├── README.md — this file
├── LICENSE — CC-BY-4.0 (data) + MIT (helper code)
├── catalogue.json — index of every published benchmark (mirror of the site catalogue)
├── methodology.md — how we benchmark + fairness rules + reproducibility notes
└── runs/
└── <run-id>/
├── run.jsonl — canonical raw event stream (one JSON object per line)
├── run.log — tee'd stdout/stderr from the harness (when captured)
├── run.md — human-readable summary (when synthesis exists)
└── metadata.json — computed snapshot: meta record + per-cell aggregates + status
├── README.md — this file
├── CLAUDE.md — full agent runbook (read this if you're an LLM-driven agent)
├── AGENTS.md — byte-identical to CLAUDE.md (Codex / other tools that prefer this name)
├── CONTRIBUTING.md — human-readable contribution guide
├── LICENSE — CC-BY-4.0 for data
├── LICENSE-MIT — MIT for harness/runner code
├── catalogue.json — index of every published benchmark (canonical archive)
├── methodology.md — how we benchmark + fairness rules + reproducibility notes
├── harness/ — portable runner + suites + prompts (the crowdsourced piece)
│ ├── README.md
│ ├── run_benchmark.py
│ ├── prompts.py
│ ├── requirements.txt
│ └── suites/ — six bundled JSON suites (5Q, 20Q, parallel × 2, edge × 2)
├── runs/ — canonical archive: 21 reference runs from Sloba's cluster
│ └── <run-id>/
│ ├── run.jsonl
│ ├── run.log — when captured
│ ├── run.md — when synthesis exists
│ └── metadata.json
└── submissions/ — community contributions land here
├── README.md
├── EXAMPLE/ — one fully-filled-out template you can read
│ └── mac-m1-8gb/
│ └── run-<uuid>/
│ ├── manifest.json
│ ├── hardware.json
│ ├── run.jsonl
│ ├── metadata.json
│ └── run.md
└── <handle>/ — your contributions
└── <device-tag>/
└── run-<uuid>/...
```

## Run-ID format

Every run gets a UUID v4 `<run-id>` assigned at harness startup. Run IDs are stable across re-runs of synthesis — the same run-id always points to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and computed metadata (`metadata.json`) can be regenerated from the canonical jsonl at any time.
Every run gets a UUID v4 `<run-id>` assigned at harness startup. Run IDs
are stable across re-runs of synthesis — the same run-id always points
to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and
computed metadata (`metadata.json`) can be regenerated from the
canonical jsonl at any time.
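
A minimal sketch of the run-ID convention, assuming the standard-library `uuid` module is used to generate the ID. The helper names are hypothetical and `run_benchmark.py` may do this differently; the directory shape follows the `submissions/<handle>/<device-tag>/run-<uuid>/` layout shown earlier.

```python
import uuid
from pathlib import PurePosixPath


def new_run_id() -> str:
    """Generate a random UUID v4 string to use as a run-id."""
    return str(uuid.uuid4())


def submission_dir(handle: str, device_tag: str, run_id: str) -> PurePosixPath:
    """Build submissions/<handle>/<device-tag>/run-<uuid> for a contribution."""
    uuid.UUID(run_id)  # raises ValueError if run_id is not a UUID string
    return PurePosixPath("submissions") / handle / device_tag / f"run-{run_id}"
```

Generating the ID once at startup and reusing it for every output file is what keeps `run.md` and `metadata.json` tied to the same `run.jsonl`.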

## Schema

The `catalogue.json` index follows `schema_version = "1.0-draft"` (or later — check the value at the top of the file). Per-benchmark entries include:
The `catalogue.json` index follows `schema_version = "1.0-draft"` (or
later — check the value at the top of the file). Per-benchmark entries
include:

- `id` — run-id
- `title`, `headline`, `date`
- `hardware` (pavilion / predator / mac / vps50 / runpod)
- `hardware` (pavilion / predator / mac / vps50 / runpod / community-`<device-tag>`)
- `engine` (llamacpp / ollama / vllm / mlx / cpu)
- `harness` (which harness produced this — see `methodology.md` for the matrix)
- `harness` (which harness produced this — see `methodology.md`)
- `model_family`, `model_sizes`
- `cells[]` — per-(machine × engine × model) summary: n_calls, n_errors, duration_ms (mean + p50), tokens_per_sec (mean + max)
- `synthesis_doc` — filename of the synthesis prose for this run, if one exists
- `tags`, `status`, `visibility`
- `cells[]` — per-(machine × engine × model) summary
- `synthesis_doc` — filename of the synthesis prose, if one exists
- `tags`, `status`, `visibility`, `site_grade`

Per-run `metadata.json` adds `cells_full[]` with the full call list inline.
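
To make the per-cell summary concrete, here is a hedged sketch of how such aggregates might be derived from a call list. The per-call field names (`duration_ms`, `tokens_per_sec`, `error`), the nested output shape, and the choice to leave failed calls out of the timing figures are all assumptions for illustration. The canonical logic lives in `build_catalogue.py` and may differ.

```python
from statistics import mean, median


def summarise_cell(calls: list[dict]) -> dict:
    """Aggregate one (machine x engine x model) cell from its call records."""
    # Assumption: a call counts as failed when it carries a truthy "error" field.
    ok = [call for call in calls if not call.get("error")]
    durations = [call["duration_ms"] for call in ok]
    rates = [call["tokens_per_sec"] for call in ok]
    return {
        "n_calls": len(calls),
        "n_errors": len(calls) - len(ok),
        # p50 is the median of the successful calls' durations.
        "duration_ms": {"mean": mean(durations), "p50": median(durations)} if durations else None,
        "tokens_per_sec": {"mean": mean(rates), "max": max(rates)} if rates else None,
    }
```

A cell with no successful calls reports `None` for both timing fields rather than raising an error.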

## How to consume
## How to consume the archive

### Just download a single run
### Single run

```bash
curl -O https://benchmarks.weeyuga.com/data/runs/<run-id>/run.jsonl
curl -O https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public/raw/branch/main/runs/<run-id>/run.jsonl
```

### Clone the whole archive
### Whole archive

```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
@@ -56,48 +114,57 @@ git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.

### Re-build catalogue from raw

The canonical builder lives in [WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py) and runs against `runs/*.jsonl`. If you want to regenerate the catalogue from your own clone of this repo:
The canonical builder lives in
[WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py)
and runs against `runs/*/run.jsonl`.

```bash
git clone https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb.git
cd WeeyugaWeb
python3 scripts/benchmarks/build_catalogue.py
```
## How to contribute a benchmark

See `CLAUDE.md` (or `AGENTS.md` — same content) for the full agent
runbook, and `CONTRIBUTING.md` for the human-side flow. Short version:

1. Clone repo
2. Hand `CLAUDE.md` to your coding agent
3. Agent probes hardware, picks a model, runs benchmark, writes results
4. You review what the agent produced
5. You fork → push → open PR
6. Maintainers review and merge

Read access is open. **Write access is via PR only — nothing auto-merges.**

## Citation

If you use this data, please cite as:

```
Margetić, S. & contributors. (2026). Weeyuga cluster benchmarks (raw data archive).
Margetić, S. & contributors. (2026). Weeyuga local-LLM benchmarks.
https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public
```

(A more formal citation form will land here once Mila weighs in on academic-attribution conventions.)

## License

- **Data** (`runs/`, `catalogue.json`, `methodology.md`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE)
- **Helper code** (any future scripts inside this repo): [MIT](LICENSE-MIT) (separate file added if/when code lands here)
- **Data** (`runs/`, `submissions/`, `catalogue.json`, `methodology.md`,
`harness/suites/`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE).
- **Helper code** (`harness/*.py`, future scripts): [MIT](LICENSE-MIT).

You are free to share, re-host, re-analyse, and remix the data with attribution.
You're free to share, re-host, re-analyse, and remix the data with attribution.

## What's in here vs what's NOT
## Reporting issues

This repo contains **bench-run output only**. No source code. No infrastructure config. No application internals. Reproducing a run requires the [WeeyugaWeb](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb) main repo (also Gitea-hosted; visibility separate).

## Reporting an issue with the data

If you spot a bench number that looks wrong, a methodology gap, or a privacy slip in published metadata: open an issue on this repo, or email the Weeyuga team. We'd rather know.
If you spot a bench number that looks wrong, a methodology gap, or a
privacy slip in published metadata: open an issue on this repo, or
email the team at `slobodan@weeyuga.com`. We'd rather know.

## Status

| What | State |
|---|---|
| Repo created | 2026-05-05 |
| First 21 runs landed | 2026-05-05 |
| Miljan + Stevan security audit | scheduled |
| Canonical archive landed (21 runs) | 2026-05-05 |
| Harness + agent runbook landed | 2026-05-06 |
| Pre-launch security audit | scheduled |
| Visibility flipped to public | pending audit sign-off |
| Site `benchmarks.weeyuga.com` live | pending Bane DNS + nginx + Tomas site |
| First friend's submission merged | pending |

Owner: `mac/benchmark-tester-ben` (Ben). For coordination, see the [WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).
Maintainers: `mac/benchmark-tester-ben` (catalogue + methodology),
`mac/devops-bane` (harness + runner + this README change).
For coordination, see the
[WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).