weeyuga-benchmarks-public/README.md
Slobodan Margetic 97a9245d9e feat: harness + agent runbook — flip repo from archive-only to
crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
  ├── README.md        Technical readme for the harness folder
  ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
  │                    WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
  │                    v3 with the cluster-internal IP defaults
  │                    (10.8.0.x) replaced by 127.0.0.1:11434, the
  │                    cluster /v1/cluster/* endpoints removed, the
  │                    canonical-suite paths under ~/Documents/MyServers
  │                    replaced by harness/suites/ paths, the git-sha
  │                    enforcement on WeeyugaWeb dropped, and the
  │                    output written under submissions/<handle>/<tag>/
  │                    instead of docs/BENCHMARKS/runs/. Supports all
  │                    six suite phases via --phases, plus 'all'.
  ├── prompts.py       Verbatim copy of the canonical 3 frozen prompts
  │                    (P-EASY/P-MEDIUM/P-HARD) from
  │                    WeeyugaWeb/scripts/benchmarks/prompts.py.
  ├── requirements.txt Empty by intent (stdlib-only); placeholder for
  │                    pip-tools / agent auto-install patterns.
  ├── .gitignore       __pycache__/ etc.
  └── suites/          Six bundled JSON suites copied verbatim from
       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
       small_model_eval_questions.json, python_task_suite_questions.json,
       parallel_qwen_same_model_20q_suite.json,
       parallel_qwen_mixed_model_20q_suite.json,
       python_context_edge_append_questions.json,
       python_context_edge_suite_only.json.

submissions/
  README.md            Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
       Synthetic-but-shape-complete contribution template:
       manifest.json, hardware.json, run.jsonl (5 example lines),
       metadata.json, run.md (with privacy attestation, methodology
       deviations, reproducibility command). Marked as synthetic at
       the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
  - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
    paths / tokens. Two prompts contain project names ("MyBoard" auth
    debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
    flagged in chat for Sloba's review. Otherwise clean.
  - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
    IPs leaked).
  - manifest.json captures host_hostname_short via socket.gethostname()
    .split('.')[0] — agent should review before PR if hostname is
    sensitive.
  - CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
  - py_compile run_benchmark.py: OK
  - --help renders cleanly
  - All 6 suite JSON files: valid
  - All 4 example JSON files: valid
  - Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
2026-05-06 19:05:22 +02:00

# weeyuga-benchmarks-public
> **Status: PRIVATE STAGING** — this repo is not yet anonymous-readable.
> Flips to public after the pre-launch security audit signs off.
> If you got here too early, please hold; you'll be invited soon.

Open benchmarks for local LLMs: same prompts, same suites, run on
whatever hardware you've got, results compose into one ladder.
This repo is two things in one:
1. **Canonical archive** — every benchmark Sloba's team publishes on
[benchmarks.weeyuga.com](https://benchmarks.weeyuga.com) lives here as
raw JSONL + computed metadata + human summary, so anyone can clone,
re-analyse, or cite. This is the original purpose; see `runs/`.
2. **Crowdsourced runner** — a portable harness + agent runbook so a
friend's coding agent (Claude Code, Codex, Aider, …) can clone this
repo, read `CLAUDE.md` / `AGENTS.md`, run the same suite on the
friend's hardware, and submit the result back as a PR. This is the
newer purpose; see `harness/` + `submissions/`.

Both purposes share one schema, one prompt set, one methodology.
## Quick start — for friends contributing a benchmark
```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
cd weeyuga-benchmarks-public
# Hand this file path to your coding agent and say:
# "Read CLAUDE.md, then run the Weeyuga benchmark on my hardware."
```
Then read `CONTRIBUTING.md` for the human-side flow (what to expect,
how reviews work, license, code of conduct).
## Layout
```
.
├── README.md — this file
├── CLAUDE.md — full agent runbook (read this if you're an LLM-driven agent)
├── AGENTS.md — byte-identical to CLAUDE.md (Codex / other tools that prefer this name)
├── CONTRIBUTING.md — human-readable contribution guide
├── LICENSE — CC-BY-4.0 for data
├── LICENSE-MIT — MIT for harness/runner code
├── catalogue.json — index of every published benchmark (canonical archive)
├── methodology.md — how we benchmark + fairness rules + reproducibility notes
├── harness/ — portable runner + suites + prompts (the crowdsourced piece)
│   ├── README.md
│   ├── run_benchmark.py
│   ├── prompts.py
│   ├── requirements.txt
│   └── suites/ — six bundled JSON suites (5Q, 20Q, parallel × 2, edge × 2)
├── runs/ — canonical archive: 21 reference runs from Sloba's cluster
│   └── <run-id>/
│       ├── run.jsonl
│       ├── run.log — when captured
│       ├── run.md — when synthesis exists
│       └── metadata.json
└── submissions/ — community contributions land here
    ├── README.md
    ├── EXAMPLE/ — one fully-filled-out template you can read
    │   └── mac-m1-8gb/
    │       └── run-<uuid>/
    │           ├── manifest.json
    │           ├── hardware.json
    │           ├── run.jsonl
    │           ├── metadata.json
    │           └── run.md
    └── <handle>/ — your contributions
        └── <device-tag>/
            └── run-<uuid>/...
```
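Before opening a PR it can help to confirm that a run folder has the same shape as the `EXAMPLE/` template. The sketch below is not part of the harness: `missing_files` is a hypothetical helper, and the file list is simply copied from the layout above, on the assumption that every submission carries all five files.

```python
from pathlib import Path

# File names copied from the EXAMPLE submission in the layout above.
# Assumption: every community submission is expected to carry all five.
EXPECTED_FILES = (
    "manifest.json",
    "hardware.json",
    "run.jsonl",
    "metadata.json",
    "run.md",
)


def missing_files(run_dir):
    """Return the expected file names that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
```

Calling `missing_files(...)` on your own `run-<uuid>/` folder returns an empty list when the folder is shape-complete.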
## Run-ID format
Every run gets a UUID v4 `<run-id>` assigned at harness startup. Run IDs
are stable across re-runs of synthesis — the same run-id always points
to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and
computed metadata (`metadata.json`) can be regenerated from the
canonical jsonl at any time.
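
As an illustration of the ID scheme, the sketch below generates a UUID v4 with the standard library and builds the matching submission folder path. The helper names `new_run_dir` and `is_uuid4` are hypothetical, and it is an assumption that the `run-<uuid>` folder name reuses the run-id verbatim; `run_benchmark.py` may do this differently.

```python
import uuid
from pathlib import Path


def new_run_dir(handle, device_tag, root="submissions"):
    """Return (run_id, path) for <root>/<handle>/<device-tag>/run-<uuid>/."""
    run_id = str(uuid.uuid4())  # random UUID, version 4
    return run_id, Path(root) / handle / device_tag / f"run-{run_id}"


def is_uuid4(text):
    """True if text is a canonically formatted version-4 UUID."""
    try:
        parsed = uuid.UUID(text)
    except ValueError:
        return False
    # uuid.UUID also accepts braces and missing hyphens; require canonical form.
    return parsed.version == 4 and str(parsed) == text.lower()
```

A reviewer could use a check like `is_uuid4` to spot hand-edited or placeholder run IDs in a PR.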
## Schema
The `catalogue.json` index follows `schema_version = "1.0-draft"` (or
later — check the value at the top of the file). Per-benchmark entries
include:
- `id` — run-id
- `title`, `headline`, `date`
- `hardware` (pavilion / predator / mac / vps50 / runpod / community-`<device-tag>`)
- `engine` (llamacpp / ollama / vllm / mlx / cpu)
- `harness` (which harness produced this — see `methodology.md`)
- `model_family`, `model_sizes`
- `cells[]` — per-(machine × engine × model) summary
- `synthesis_doc` — filename of the synthesis prose, if one exists
- `tags`, `status`, `visibility`, `site_grade`

Per-run `metadata.json` adds `cells_full[]` with the full call list inline.
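
A minimal sketch of a field check for one catalogue entry follows. It works on a single entry dict so that it makes no claim about how `catalogue.json` nests its entries. The helper name `missing_fields` is hypothetical, and treating every listed field except `synthesis_doc` as required is an assumption; the schema itself is still marked as a draft.

```python
# Field names copied from the list above. synthesis_doc is left out because
# it is only present when a synthesis document exists.
# Assumption: all remaining fields are mandatory in every entry.
REQUIRED_FIELDS = (
    "id", "title", "headline", "date",
    "hardware", "engine", "harness",
    "model_family", "model_sizes", "cells",
    "tags", "status", "visibility", "site_grade",
)


def missing_fields(entry):
    """Return the listed field names that a catalogue entry dict lacks."""
    return [name for name in REQUIRED_FIELDS if name not in entry]
```

An empty result means the entry carries every field named in this section; it says nothing about whether the values themselves are valid.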
## How to consume the archive
### Single run
```bash
curl -O https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public/raw/branch/main/runs/<run-id>/run.jsonl
```
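Once downloaded, `run.jsonl` can be read with the standard library alone. This sketch assumes the usual JSON Lines convention of one JSON value per line; `read_events` is a hypothetical helper, and it makes no assumption about which keys each event carries.

```python
import json


def read_events(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    events = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a broken file is easy to fix.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
    return events
```

For example, `len(read_events("run.jsonl"))` gives the number of events in the downloaded run.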
### Whole archive
```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
```
### Re-build catalogue from raw
The canonical builder lives in
[WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py)
and runs against `runs/*/run.jsonl`.
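
The sketch below is not the canonical builder. It only illustrates the `runs/*/run.jsonl` traversal such a builder has to start from, counting the non-blank lines in each run's event stream. The function name `count_events_per_run` is hypothetical.

```python
from pathlib import Path


def count_events_per_run(runs_root="runs"):
    """Map each run folder name to the number of non-blank lines in its run.jsonl."""
    counts = {}
    for jsonl_path in sorted(Path(runs_root).glob("*/run.jsonl")):
        with jsonl_path.open(encoding="utf-8") as handle:
            counts[jsonl_path.parent.name] = sum(1 for line in handle if line.strip())
    return counts
```

Run from the repository root, `count_events_per_run()` gives a quick sanity check that every folder under `runs/` still has a readable event stream.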
## How to contribute a benchmark
See `CLAUDE.md` (or `AGENTS.md` — same content) for the full agent
runbook, and `CONTRIBUTING.md` for the human-side flow. Short version:
1. Clone repo
2. Hand `CLAUDE.md` to your coding agent
3. Agent probes hardware, picks a model, runs benchmark, writes results
4. You review what the agent produced
5. You fork → push → open PR
6. Maintainers review and merge

Read access is open. **Write access is via PR only; nothing auto-merges.**
## Citation
```
Margetić, S. & contributors. (2026). Weeyuga local-LLM benchmarks.
https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public
```
## License
- **Data** (`runs/`, `submissions/`, `catalogue.json`, `methodology.md`,
`harness/suites/`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE).
- **Helper code** (`harness/*.py`, future scripts): [MIT](LICENSE-MIT).

You're free to share, re-host, re-analyse, and remix the data with attribution.
## Reporting issues
If you spot a bench number that looks wrong, a methodology gap, or a
privacy slip in published metadata: open an issue on this repo, or
email the team at `slobodan@weeyuga.com`. We'd rather know.
## Status
| What | State |
|---|---|
| Repo created | 2026-05-05 |
| Canonical archive landed (21 runs) | 2026-05-05 |
| Harness + agent runbook landed | 2026-05-06 |
| Pre-launch security audit | scheduled |
| Visibility flipped to public | pending audit sign-off |
| First friend's submission merged | pending |

Maintainers: `mac/benchmark-tester-ben` (catalogue + methodology),
`mac/devops-bane` (harness + runner + this README change).
For coordination, see the
[WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).