# weeyuga-benchmarks-public

> **Status: PRIVATE STAGING** — this repo is not yet anonymous-readable.
> Flips to public after the pre-launch security audit signs off.
> If you got here too early, please hold; you'll be invited soon.

Open benchmarks for local LLMs: same prompts, same suites, run on
whatever hardware you've got, with results composing into one ladder.

This repo is two things in one:

1. **Canonical archive** — every benchmark Sloba's team publishes on
   [benchmarks.weeyuga.com](https://benchmarks.weeyuga.com) lives here as
   raw JSONL + computed metadata + human summary, so anyone can clone,
   re-analyse, or cite. This is the original purpose; see `runs/`.
2. **Crowdsourced runner** — a portable harness + agent runbook so a
   friend's coding agent (Claude Code, Codex, Aider, …) can clone this
   repo, read `CLAUDE.md` / `AGENTS.md`, run the same suite on the
   friend's hardware, and submit the result back as a PR. This is the
   newer purpose; see `harness/` + `submissions/`.

Both purposes share one schema, one prompt set, one methodology.

## Quick start — for friends contributing a benchmark

```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
cd weeyuga-benchmarks-public
# Point your coding agent at this directory and say:
#   "Read CLAUDE.md, then run the Weeyuga benchmark on my hardware."
```

Then read `CONTRIBUTING.md` for the human-side flow (what to expect,
how reviews work, license, code of conduct).

## Layout

```
.
├── README.md                — this file
├── CLAUDE.md                — full agent runbook (read this if you're an LLM-driven agent)
├── AGENTS.md                — byte-identical to CLAUDE.md (Codex / other tools that prefer this name)
├── CONTRIBUTING.md          — human-readable contribution guide
├── LICENSE                  — CC-BY-4.0 for data
├── LICENSE-MIT              — MIT for harness/runner code
├── catalogue.json           — index of every published benchmark (canonical archive)
├── methodology.md           — how we benchmark + fairness rules + reproducibility notes
├── harness/                 — portable runner + suites + prompts (the crowdsourced piece)
│   ├── README.md
│   ├── run_benchmark.py
│   ├── prompts.py
│   ├── requirements.txt
│   └── suites/              — six bundled JSON suites (5Q, 20Q, parallel × 2, edge × 2)
├── runs/                    — canonical archive: 21 reference runs from Sloba's cluster
│   └── <run-id>/
│       ├── run.jsonl
│       ├── run.log          — when captured
│       ├── run.md           — when synthesis exists
│       └── metadata.json
└── submissions/             — community contributions land here
    ├── README.md
    ├── EXAMPLE/             — one fully-filled-out template you can read
    │   └── mac-m1-8gb/
    │       └── run-<uuid>/
    │           ├── manifest.json
    │           ├── hardware.json
    │           ├── run.jsonl
    │           ├── metadata.json
    │           └── run.md
    └── <handle>/            — your contributions
        └── <device-tag>/
            └── run-<uuid>/...
```

## Run-ID format

Every run gets a UUID v4 `<run-id>` assigned at harness startup. Run IDs
are stable across re-runs of synthesis — the same run-id always points
to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and
computed metadata (`metadata.json`) can be regenerated from the
canonical jsonl at any time.
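
As a rough illustration of the ID format (not the harness's actual code), a UUID v4 can be generated and checked with Python's standard `uuid` module. The helper names `new_run_id` and `is_uuid4` are hypothetical and exist only for this sketch.

```python
import uuid


def new_run_id():
    """Return a fresh random (version 4) UUID as a lowercase string."""
    return str(uuid.uuid4())


def is_uuid4(value):
    """True if value is a canonical, hyphenated UUID with version 4."""
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # uuid.UUID also accepts braces and un-hyphenated hex, so compare
    # against the canonical string form to reject those variants.
    return parsed.version == 4 and str(parsed) == value.lower()


if __name__ == "__main__":
    run_id = new_run_id()
    print(run_id, is_uuid4(run_id))
```

A check like this can be handy when scripting over `runs/`, for example to skip directories that are not run folders.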

## Schema

The `catalogue.json` index declares a `schema_version` at the top of the
file: `"1.0-draft"` or later, so check the value before parsing.
Per-benchmark entries include:

- `id` — run-id
- `title`, `headline`, `date`
- `hardware` (pavilion / predator / mac / vps50 / runpod / community-`<device-tag>`)
- `engine` (llamacpp / ollama / vllm / mlx / cpu)
- `harness` (which harness produced this — see `methodology.md`)
- `model_family`, `model_sizes`
- `cells[]` — per-(machine × engine × model) summary
- `synthesis_doc` — filename of the synthesis prose, if one exists
- `tags`, `status`, `visibility`, `site_grade`

Per-run `metadata.json` adds `cells_full[]` with the full call list inline.
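
If you consume entries programmatically, a small sanity check along these lines may help. It is a minimal sketch: the field names are copied from the list above, but treating every one of them as a required key is an assumption (`synthesis_doc`, for instance, may legitimately be absent), and `missing_fields` is a hypothetical helper, not part of the repo.

```python
# Field names as listed in this README. Whether each key is mandatory
# in catalogue.json is an assumption made for this sketch.
EXPECTED_FIELDS = [
    "id", "title", "headline", "date", "hardware", "engine", "harness",
    "model_family", "model_sizes", "cells", "synthesis_doc",
    "tags", "status", "visibility", "site_grade",
]


def missing_fields(entry):
    """Return the expected field names that are not keys of entry."""
    return [name for name in EXPECTED_FIELDS if name not in entry]


if __name__ == "__main__":
    # Made-up placeholder entry, not real benchmark data.
    entry = {"id": "hypothetical-run-id", "title": "demo"}
    print(missing_fields(entry))
```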

## How to consume the archive

### Single run

```bash
curl -O https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public/raw/branch/main/runs/<run-id>/run.jsonl
```
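
Once downloaded, the file can be read one JSON value per line. The sketch below assumes nothing about the event fields themselves (see `methodology.md` for those); `read_jsonl` is a hypothetical helper name.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-blank line of the file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between events
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line was malformed instead of failing opaquely.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
```

For example, `events = list(read_jsonl("run.jsonl"))` loads the whole stream into memory; iterating directly keeps memory use flat for large runs.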

### Whole archive

```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
```

### Re-build catalogue from raw

The canonical builder lives in
[WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py)
and runs against `runs/*/run.jsonl`.
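
This is not the canonical builder, but if you only want to see which runs a `runs/*/run.jsonl` glob would pick up in your clone, a sketch like the following lists each run directory together with the layout files it contains. `list_runs` is a hypothetical name used for illustration.

```python
from pathlib import Path

# File names taken from the Layout section of this README.
KNOWN_FILES = ("run.jsonl", "run.log", "run.md", "metadata.json")


def list_runs(repo_root="."):
    """Map each run directory name under runs/ to the known files it holds."""
    runs = {}
    for jsonl in sorted(Path(repo_root).glob("runs/*/run.jsonl")):
        run_dir = jsonl.parent
        runs[run_dir.name] = {
            name for name in KNOWN_FILES if (run_dir / name).is_file()
        }
    return runs


if __name__ == "__main__":
    for run_id, files in list_runs().items():
        print(run_id, sorted(files))
```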

## How to contribute a benchmark

See `CLAUDE.md` (or `AGENTS.md` — same content) for the full agent
runbook, and `CONTRIBUTING.md` for the human-side flow. Short version:

1. Clone repo
2. Hand `CLAUDE.md` to your coding agent
3. Agent probes hardware, picks a model, runs benchmark, writes results
4. You review what the agent produced
5. You fork → push → open PR
6. Maintainers review and merge

Once the repo is public, read access is open to everyone. **Write access is via PR only; nothing auto-merges.**
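
Before opening the PR (step 4 above), you might sanity-check the submission folder yourself. The sketch below assumes a submission should mirror the `submissions/EXAMPLE/` layout, both in path shape and in file set. That assumption is not confirmed by the runbook, and `check_submission` is a hypothetical helper; `CLAUDE.md` remains the authority.

```python
import uuid
from pathlib import Path, PurePosixPath

# Files present in submissions/EXAMPLE/; assumed (not confirmed) to be
# the set every submission should contain.
EXAMPLE_FILES = (
    "manifest.json", "hardware.json", "run.jsonl", "metadata.json", "run.md",
)


def check_submission(repo_root, rel_path):
    """Return a list of problems found; an empty list means none found."""
    problems = []
    parts = PurePosixPath(rel_path).parts
    if len(parts) != 4 or parts[0] != "submissions":
        problems.append("expected submissions/<handle>/<device-tag>/run-<uuid>")
    else:
        leaf = parts[3]
        try:
            if not leaf.startswith("run-"):
                raise ValueError(leaf)
            uuid.UUID(leaf[len("run-"):])  # raises ValueError if malformed
        except ValueError:
            problems.append(f"{leaf!r} is not of the form run-<uuid>")
    run_dir = Path(repo_root).joinpath(*parts)
    for name in EXAMPLE_FILES:
        if not (run_dir / name).is_file():
            problems.append(f"missing {name}")
    return problems
```

Calling `check_submission(".", "submissions/<handle>/<device-tag>/run-<uuid>")` from the repo root, with your own values filled in, returns the list of problems to fix before pushing.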

## Citation

```
Margetić, S. & contributors. (2026). Weeyuga local-LLM benchmarks.
https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public
```

## License

- **Data** (`runs/`, `submissions/`, `catalogue.json`, `methodology.md`,
  `harness/suites/`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE).
- **Helper code** (`harness/*.py`, future scripts): [MIT](LICENSE-MIT).

You're free to share, re-host, re-analyse, and remix the data with attribution.

## Reporting issues

If you spot a benchmark number that looks wrong, a methodology gap, or a
privacy slip in published metadata, open an issue on this repo or
email the team at `slobodan@weeyuga.com`. We'd rather know.

## Status

| What | State |
|---|---|
| Repo created | 2026-05-05 |
| Canonical archive landed (21 runs) | 2026-05-05 |
| Harness + agent runbook landed | 2026-05-06 |
| Pre-launch security audit | scheduled |
| Visibility flipped to public | pending audit sign-off |
| First friend's submission merged | pending |

Maintainers: `mac/benchmark-tester-ben` (catalogue + methodology),
`mac/devops-bane` (harness + runner + this README).
For coordination, see the
[WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).