crowdsourced runner
Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."
The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.
What landed:
CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
Full agent runbook: hardware probe, runtime + model selection,
canonical knob reference (Sloba's Pavilion methodology values),
hardware-adaptation decision rules, run-instructions, output-schema
templates for hardware.json + metadata.json + run.md, PR submission
flow (fork → branch → push → PR; nothing auto-merges), privacy
guardrails, methodology lineage. Per Sloba's Q3 directive: the
runbook explicitly tells the friend's agent to ADAPT to hardware
reality and document deviations rather than blindly run defaults.
CONTRIBUTING.md (~110 lines)
Human-readable companion for the friend (not the agent). What you
need, how it works, what we ask, what maintainers commit to,
license, code-of-conduct short version.
harness/
├── README.md          Technical readme for the harness folder
├── run_benchmark.py   ~520 LOC runner. Stdlib-only. Adapted from
│                      WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│                      v3 with the cluster-internal IP defaults
│                      (10.8.0.x) replaced by 127.0.0.1:11434, the
│                      cluster /v1/cluster/* endpoints removed, the
│                      canonical-suite paths under ~/Documents/MyServers
│                      replaced by harness/suites/ paths, the git-sha
│                      enforcement on WeeyugaWeb dropped, and the
│                      output written under submissions/<handle>/<tag>/
│                      instead of docs/BENCHMARKS/runs/. Supports all
│                      six suite phases via --phases, plus 'all'.
├── prompts.py         Verbatim copy of the canonical 3 frozen prompts
│                      (P-EASY/P-MEDIUM/P-HARD) from
│                      WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt   Empty by intent (stdlib-only); placeholder for
│                      pip-tools / agent auto-install patterns.
├── .gitignore         __pycache__/ etc.
└── suites/            Six bundled JSON suites copied verbatim from
                       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
                       small_model_eval_questions.json,
                       python_task_suite_questions.json,
                       parallel_qwen_same_model_20q_suite.json,
                       parallel_qwen_mixed_model_20q_suite.json,
                       python_context_edge_append_questions.json,
                       python_context_edge_suite_only.json.
submissions/
README.md Folder convention + naming + reviewability rules
EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
Synthetic-but-shape-complete contribution template:
manifest.json, hardware.json, run.jsonl (5 example lines),
metadata.json, run.md (with privacy attestation, methodology
deviations, reproducibility command). Marked as synthetic at
the top so future analysis doesn't accidentally cite it.
LICENSE-MIT
MIT for harness/*.py and future helper code. Existing LICENSE
(CC-BY-4.0) covers data files.
README.md (modified)
Updated to reflect dual purpose. Layout diagram updated.
Maintainer credits: Ben for catalogue/methodology + Bane for harness.
Contributor quick-start added. Status table extended.
Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
paths / tokens. Two prompts contain project names ("MyBoard" auth
debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
flagged in chat for Sloba's review. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
IPs leaked).
- manifest.json captures host_hostname_short via socket.gethostname()
.split('.')[0] — agent should review before PR if hostname is
sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.
Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid
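The JSON and JSONL checks listed above could be reproduced with a short stdlib-only script along these lines. This is a minimal sketch, not the script that was actually used for this commit; the function names `is_valid_json_file` and `invalid_jsonl_lines` are hypothetical.

```python
import json
from pathlib import Path


def is_valid_json_file(path):
    """Return True if the file at `path` parses as a single JSON document."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return False
    return True


def invalid_jsonl_lines(path):
    """Return the 1-based numbers of non-blank lines that are not valid JSON."""
    bad = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are skipped, not reported
            try:
                json.loads(line)
            except ValueError:
                bad.append(number)
    return bad
```

An empty list from `invalid_jsonl_lines` means every non-blank line parsed; it says nothing about whether the records follow the expected schema.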
This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
# Contributing benchmarks from your hardware

Welcome, and thank you for adding a data point. This file is the
human-readable companion to `CLAUDE.md` / `AGENTS.md`. The agent files have
the full mechanical detail; this one has the humans-in-the-loop story.

## What you're contributing

You're running the same Weeyuga benchmark suite that powers
[benchmarks.weeyuga.com](https://benchmarks.weeyuga.com), on your hardware,
and submitting the raw output as a PR. Your numbers join Sloba's
cluster numbers as a comparison point. More devices = more honest
ladder.

## What you need

- A device with ≥3 GB free disk space (model files are 0.5–10 GB depending on size)
- An OpenAI-compatible LLM runtime: Ollama (easiest), llama.cpp, vLLM, or MLX
- A coding agent (Claude Code, Codex, Aider, Cursor, etc.) to read the runbook and adapt the harness to your hardware
- A Gitea account on `git.weeyuga.com` (free; you'll need it to fork + PR)
- Roughly 1–4 hours wall-clock, mostly idle while the benchmark runs

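Two of the requirements above can be checked before you start. The sketch below is an illustration under stated assumptions, not part of the shipped harness: the function names are hypothetical, the 3 GB threshold comes from the list above, and `127.0.0.1:11434` is the runner's default target as described in the commit notes. An open port only shows that something is listening there, not that it speaks the OpenAI-compatible API.

```python
import shutil
import socket

MIN_FREE_GB = 3            # threshold from the requirements list
DEFAULT_HOST = "127.0.0.1"  # assumed default runtime address
DEFAULT_PORT = 11434        # assumed default runtime port


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path=".", min_gb=MIN_FREE_GB):
    """True if at least `min_gb` gigabytes are free at `path`."""
    return free_disk_gb(path) >= min_gb


def runtime_port_open(host=DEFAULT_HOST, port=DEFAULT_PORT, timeout=2.0):
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `has_enough_disk()` returns `False`, picking a smaller model or freeing space is cheaper than discovering the problem mid-download.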
## How it works

1. **You clone** this repo:
   ```bash
   git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
   cd weeyuga-benchmarks-public
   ```
2. **You point your agent at it** with one sentence:
   > "Read `CLAUDE.md` (or `AGENTS.md`), then run the Weeyuga benchmark on
   > my hardware. Pick a model that fits, document everything, and prepare
   > a PR. I'll review before pushing."
3. **The agent does the work**: it probes your hardware, picks a model size,
   adapts the runner's parameters, runs the suite, and generates a JSONL ledger
   plus a human-readable summary.
4. **You review** what the agent produced, especially `run.md` for honesty
   and `run.jsonl` for any accidentally leaked secrets.
5. **You fork → push → open PR** (the agent can do this too; you click
   the button to create the pull request on Gitea afterwards).
6. **Sloba reviews the PR** and merges. He may have a question or two; the
   agent can address them on the same branch.

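To make step 3 more concrete, a hardware probe might look roughly like the sketch below. It is only an illustration: the authoritative `hardware.json` template lives in `CLAUDE.md`, and every key name here except `host_hostname_short` (which the commit notes describe for `manifest.json`) is a hypothetical placeholder.

```python
import json
import os
import platform
import socket


def probe_hardware():
    """Collect a few basic, stdlib-only facts about the current machine."""
    return {
        "os": platform.system(),               # e.g. "Linux", "Darwin", "Windows"
        "os_release": platform.release(),
        "machine": platform.machine(),          # e.g. "x86_64", "arm64"
        "cpu_count": os.cpu_count(),            # logical cores; may be None
        "python_version": platform.python_version(),
        # Short hostname only; review it before a PR if it is sensitive.
        "host_hostname_short": socket.gethostname().split(".")[0],
    }


def to_json(info):
    """Serialise the probe result in a stable, reviewable form."""
    return json.dumps(info, indent=2, sort_keys=True)
```

Printing `to_json(probe_hardware())` before the run gives you something to compare against whatever the agent writes into the submission.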
## Why crowdsource benchmarks?

Hardware variance is huge. Sloba's published numbers come from a Mac M1, a
laptop with a GTX 1060, and three VPSes; that's a thin slice of the world.
Your RTX 4090, your Snapdragon X Elite, your Ryzen 7950X3D, your old Xeon
all have stories worth recording.

Same prompts. Same suites. Different hardware. The numbers compose.

## What we ask

- Run the canonical suite as-is (5q + 20q minimum; the rest if you have time)
- **Document deviations honestly.** If you had to skip parallel suites because
  of RAM, say so. If you tweaked NGL (the number of GPU-offloaded layers)
  because the default OOM'd, say so. The point is comparable runs, not
  perfect runs.
- **Privacy-scan before pushing.** `run.jsonl` stores response previews. If
  the model echoed your home directory or an API key from your shell history,
  redact it before opening the PR.
- **One PR per device per session.** Don't bundle "my laptop AND my desktop
  AND my friend's PC"; separate PRs are easier to review.

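The privacy scan asked for above can be partly automated. The following is a rough sketch under loud assumptions: the regular expressions are illustrative heuristics, not the checks that `CLAUDE.md` prescribes, and `scan_lines` / `scan_file` are hypothetical helper names. A clean result does not replace reading the file yourself.

```python
import re

# Heuristic patterns only (assumptions for illustration).
SECRET_PATTERNS = {
    # Strings shaped like "sk-" followed by a long token.
    "api_key_like": re.compile(r"\bsk-[A-Za-z0-9]{20,}"),
    # Private-range IPv4 addresses (10.x.x.x and 192.168.x.x).
    "private_ipv4": re.compile(
        r"\b10\.\d{1,3}\.\d{1,3}\.\d{1,3}\b|\b192\.168\.\d{1,3}\.\d{1,3}\b"
    ),
}


def scan_lines(lines, literal_needles=()):
    """Return (line_number, label) pairs for every suspicious match.

    `literal_needles` are exact strings to look for, such as your home
    directory path or your hostname.
    """
    findings = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
        for needle in literal_needles:
            if needle and needle in line:
                findings.append((number, f"literal:{needle}"))
    return findings


def scan_file(path, literal_needles=()):
    """Run scan_lines over a text file such as run.jsonl."""
    with open(path, encoding="utf-8") as handle:
        return scan_lines(handle, literal_needles)
```

A typical call would be `scan_file("run.jsonl", [str(pathlib.Path.home())])`, with the path adjusted to wherever your submission folder lives.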
## What the maintainers (Sloba + team) commit to

- We respond to PRs within ~3 days
- We don't merge without reading; if your `run.md` has clear caveats we'll usually merge
- We credit you by handle in `catalogue.json` if/when your run becomes a flagship
- We never expose anything from your `run.md` or `manifest.json` beyond what
  you submitted; if you used a pseudonym, that's the name that ships
- If we ask for a re-run with different parameters, that's a separate dispatch; we don't silently reinterpret your run

## License of your contribution

By PR-ing into this repo, you license the data you submit under
[CC-BY-4.0](LICENSE) and any harness/runner code you submit under
[MIT](LICENSE-MIT). Attribution stays with you (your handle becomes part of
the run record).

## What this repo is NOT

- A leaderboard with prizes
- A way to "win" against other devices (the point is honest measurement, not bragging rights)
- A vehicle for marketing claims (vendor PR runs need a separate flow we
  haven't designed yet; please don't astroturf the catalogue)

## Found a bug or methodology gap?

Open an issue. We'd rather hear about a flawed prompt or a misleading metric
than ship more data using it.

## Code of conduct, the short version

Be kind, be honest about your data, don't try to game the catalogue, and
don't dox other contributors. Sloba reserves the right to remove submissions
that violate the spirit of the thing, but we'll say why.

The Weeyuga team