# weeyuga-benchmarks-public

> **Status: PRIVATE STAGING.** This repo is not yet anonymous-readable.
> It flips to public after the pre-launch security audit signs off.
> If you got here too early, please hold; you'll be invited soon.

Open benchmarks for local LLMs: same prompts, same suites, run on whatever hardware you've got, results compose into one ladder.

This repo is two things in one:

1. **Canonical archive**: every benchmark Sloba's team publishes on [benchmarks.weeyuga.com](https://benchmarks.weeyuga.com) lives here as raw JSONL + computed metadata + human summary, so anyone can clone, re-analyse, or cite. This is the original purpose; see `runs/`.
2. **Crowdsourced runner**: a portable harness + agent runbook so a friend's coding agent (Claude Code, Codex, Aider, …) can clone this repo, read `CLAUDE.md` / `AGENTS.md`, run the same suite on the friend's hardware, and submit the result back as a PR. This is the newer purpose; see `harness/` + `submissions/`.

Both purposes share one schema, one prompt set, one methodology.

## Quick start for friends contributing a benchmark

```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
cd weeyuga-benchmarks-public
# Hand this file path to your coding agent and say:
# "Read CLAUDE.md, then run the Weeyuga benchmark on my hardware."
```

Then read `CONTRIBUTING.md` for the human-side flow (what to expect, how reviews work, license, code of conduct).

## Layout

```
.
├── README.md           - this file
├── CLAUDE.md           - full agent runbook (read this if you're an LLM-driven agent)
├── AGENTS.md           - byte-identical to CLAUDE.md (Codex / other tools that prefer this name)
├── CONTRIBUTING.md     - human-readable contribution guide
├── LICENSE             - CC-BY-4.0 for data
├── LICENSE-MIT         - MIT for harness/runner code
├── catalogue.json      - index of every published benchmark (canonical archive)
├── methodology.md      - how we benchmark + fairness rules + reproducibility notes
├── harness/            - portable runner + suites + prompts (the crowdsourced piece)
│   ├── README.md
│   ├── run_benchmark.py
│   ├── prompts.py
│   ├── requirements.txt
│   └── suites/         - six bundled JSON suites (5Q, 20Q, parallel × 2, edge × 2)
├── runs/               - canonical archive: 21 reference runs from Sloba's cluster
│   └── /
│       ├── run.jsonl
│       ├── run.log     - when captured
│       ├── run.md      - when synthesis exists
│       └── metadata.json
└── submissions/        - community contributions land here
    ├── README.md
    ├── EXAMPLE/        - one fully-filled-out template you can read
    │   └── mac-m1-8gb/
    │       └── run-/
    │           ├── manifest.json
    │           ├── hardware.json
    │           ├── run.jsonl
    │           ├── metadata.json
    │           └── run.md
    └── /               - your contributions
        └── /
            └── run-/...
```

## Run-ID format

Every run gets a UUID v4 run-id assigned at harness startup. Run IDs are stable across re-runs of synthesis: the same run-id always points to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and computed metadata (`metadata.json`) can be regenerated from the canonical JSONL at any time.

## Schema

The `catalogue.json` index follows `schema_version = "1.0-draft"` (or later; check the value at the top of the file).
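
Because the schema version can move past `1.0-draft`, a consumer may want to check it before parsing entries. The following is a minimal Python sketch, assuming `catalogue.json` is a JSON object with a top-level `schema_version` key. The helper names `read_schema_version` and `is_supported`, and the "accept any 1.x" policy, are hypothetical illustrations and not part of the harness.

```python
import json
from pathlib import Path


def read_schema_version(catalogue_path):
    """Return the schema_version string from a catalogue file, or None if absent."""
    with Path(catalogue_path).open(encoding="utf-8") as fh:
        catalogue = json.load(fh)
    # Assumption: the catalogue is a JSON object with a top-level
    # "schema_version" key, as the sentence above suggests.
    if not isinstance(catalogue, dict):
        return None
    return catalogue.get("schema_version")


def is_supported(version, known_prefixes=("1.",)):
    """Hypothetical policy: accept any version string in the 1.x line."""
    return isinstance(version, str) and version.startswith(known_prefixes)
```

Calling `read_schema_version("catalogue.json")` from the repo root and passing the result to `is_supported` gives a cheap guard before any entry parsing; adjust `known_prefixes` to whatever versions your tooling actually understands.
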
Per-benchmark entries include:

- `id`: run-id
- `title`, `headline`, `date`
- `hardware` (pavilion / predator / mac / vps50 / runpod / community-``)
- `engine` (llamacpp / ollama / vllm / mlx / cpu)
- `harness` (which harness produced this; see `methodology.md`)
- `model_family`, `model_sizes`
- `cells[]`: per-(machine × engine × model) summary
- `synthesis_doc`: filename of the synthesis prose, if one exists
- `tags`, `status`, `visibility`, `site_grade`

Per-run `metadata.json` adds `cells_full[]` with the full call list inline.

## How to consume the archive

### Single run

```bash
curl -O https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public/raw/branch/main/runs//run.jsonl
```

### Whole archive

```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
```

### Re-build catalogue from raw

The canonical builder lives in [WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py) and runs against `runs/*/run.jsonl`.

## How to contribute a benchmark

See `CLAUDE.md` (or `AGENTS.md`, which has the same content) for the full agent runbook, and `CONTRIBUTING.md` for the human-side flow.

Short version:

1. Clone the repo.
2. Hand `CLAUDE.md` to your coding agent.
3. The agent probes hardware, picks a model, runs the benchmark, and writes results.
4. You review what the agent produced.
5. You fork → push → open a PR.
6. Maintainers review and merge.

Read access is open. **Write access is via PR only; nothing auto-merges.**

## Citation

```
Margetić, S. & contributors. (2026). Weeyuga local-LLM benchmarks.
https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public
```

## License

- **Data** (`runs/`, `submissions/`, `catalogue.json`, `methodology.md`, `harness/suites/`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE).
- **Helper code** (`harness/*.py`, future scripts): [MIT](LICENSE-MIT).

You're free to share, re-host, re-analyse, and remix the data with attribution.

## Reporting issues

If you spot a benchmark number that looks wrong, a methodology gap, or a privacy slip in published metadata: open an issue on this repo, or email the team at `slobodan@weeyuga.com`. We'd rather know.

## Status

| What | State |
|---|---|
| Repo created | 2026-05-05 |
| Canonical archive landed (21 runs) | 2026-05-05 |
| Harness + agent runbook landed | 2026-05-06 |
| Pre-launch security audit | scheduled |
| Visibility flipped to public | pending audit sign-off |
| First friend's submission merged | pending |

Maintainers: `mac/benchmark-tester-ben` (catalogue + methodology), `mac/devops-bane` (harness + runner + this README change).

For coordination, see the [WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).