weeyuga-benchmarks-public/README.md
slobodanmargetic988 98c3d4fa5b chore: flip repo to public — drop PRIVATE STAGING banner
Sloba 2026-05-06T16:55Z chat: "lets flip the switch i want the repo public".

Pre-launch security audit signed off:
  - G2-P0-1 (/Users/slobodan/ paths)        cleared end-to-end
  - G2-P0-2 (10.8.0.x WG mesh IPs)          cleared end-to-end
  - G2-P0-3 (Slobodans-MacBook-Air mDNS)    cleared end-to-end
  - G2-P0-4 (MyBoard / TruthGraph names)    cleared end-to-end

All four verified clean on the live https://benchmarks.weeyuga.com/
deploy at SHA 0ba4451 by Miljan's pen-test re-grep on
/index.html, /catalog.html, /benchmarks/09d8fbde.html (0 hits per
pattern per page). Full live security probe earlier the same day
also cleared: path traversal blocked, methods 405-locked, autoindex
OFF, hidden files (.git/.DS_Store/.env/.htaccess) all 404,
_template.html 404 via vhost SPA-fallback fix, all 6 security
headers + HSTS holding.

Updated:
  - README banner: "PRIVATE STAGING — pre-launch audit pending" →
    "PUBLIC — anonymous-readable since 2026-05-06"
  - Status table: "Pre-launch security audit | scheduled" →
    "cleared 2026-05-06 (G2 P0-1/2/3/4 all verified clean on live
    benchmarks.weeyuga.com at SHA 0ba4451)"
  - Status table: "Visibility flipped to public | pending audit
    sign-off" → "2026-05-06 ✓"

This commit lands the README copy update inside the repo. The
Gitea-side visibility flip (Settings → Visibility → Make Public)
is a UI click Sloba does himself; no Gitea API token available
locally to drive it from this session.
2026-05-06 19:06:16 +02:00

# weeyuga-benchmarks-public
> **Status: PUBLIC** — anonymous-readable since 2026-05-06. Pre-launch
> security audit signed off (G2 P0-1 through P0-4 cleared on the live
> site at SHA `0ba4451`). Welcome — see `CLAUDE.md` / `AGENTS.md` if
> you want your coding agent to clone and run benchmarks on your hardware.

Open benchmarks for local LLMs: same prompts, same suites, run on
whatever hardware you've got, with results composing into one ladder.

This repo is two things in one:

1. **Canonical archive**: every benchmark Sloba's team publishes on
   [benchmarks.weeyuga.com](https://benchmarks.weeyuga.com) lives here as
   raw JSONL + computed metadata + human summary, so anyone can clone,
   re-analyse, or cite. This is the original purpose; see `runs/`.
2. **Crowdsourced runner**: a portable harness + agent runbook so a
   friend's coding agent (Claude Code, Codex, Aider, …) can clone this
   repo, read `CLAUDE.md` / `AGENTS.md`, run the same suite on the
   friend's hardware, and submit the result back as a PR. This is the
   newer purpose; see `harness/` + `submissions/`.

Both purposes share one schema, one prompt set, one methodology.

## Quick start — for friends contributing a benchmark
```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
cd weeyuga-benchmarks-public
# Point your coding agent at CLAUDE.md in this directory and say:
# "Read CLAUDE.md, then run the Weeyuga benchmark on my hardware."
```
Then read `CONTRIBUTING.md` for the human-side flow (what to expect,
how reviews work, license, code of conduct).
## Layout
```
.
├── README.md              — this file
├── CLAUDE.md              — full agent runbook (read this if you're an LLM-driven agent)
├── AGENTS.md              — byte-identical to CLAUDE.md (Codex / other tools that prefer this name)
├── CONTRIBUTING.md        — human-readable contribution guide
├── LICENSE                — CC-BY-4.0 for data
├── LICENSE-MIT            — MIT for harness/runner code
├── catalogue.json         — index of every published benchmark (canonical archive)
├── methodology.md         — how we benchmark + fairness rules + reproducibility notes
├── harness/               — portable runner + suites + prompts (the crowdsourced piece)
│   ├── README.md
│   ├── run_benchmark.py
│   ├── prompts.py
│   ├── requirements.txt
│   └── suites/            — six bundled JSON suites (5Q, 20Q, parallel × 2, edge × 2)
├── runs/                  — canonical archive: 21 reference runs from Sloba's cluster
│   └── <run-id>/
│       ├── run.jsonl
│       ├── run.log        — when captured
│       ├── run.md         — when synthesis exists
│       └── metadata.json
└── submissions/           — community contributions land here
    ├── README.md
    ├── EXAMPLE/           — one fully-filled-out template you can read
    │   └── mac-m1-8gb/
    │       └── run-<uuid>/
    │           ├── manifest.json
    │           ├── hardware.json
    │           ├── run.jsonl
    │           ├── metadata.json
    │           └── run.md
    └── <handle>/          — your contributions
        └── <device-tag>/
            └── run-<uuid>/...
```
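
Before opening a PR it can help to confirm that a submission directory matches the `EXAMPLE/` tree above. The sketch below is not part of the harness; the helper name `missing_submission_files` is hypothetical, and treating all five files as required is an assumption based only on the example layout.

```python
from pathlib import Path

# File names copied from the EXAMPLE tree in this README. Treating all five
# as required is an assumption, not a documented rule.
EXPECTED_FILES = (
    "manifest.json",
    "hardware.json",
    "run.jsonl",
    "metadata.json",
    "run.md",
)


def missing_submission_files(run_dir: Path) -> list[str]:
    """Return the expected file names that are absent from run_dir."""
    return [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
```

Calling `missing_submission_files(Path("submissions/<handle>/<device-tag>/run-<uuid>"))` with your own path returns an empty list when every expected file is present.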
## Run-ID format
Every run gets a UUID v4 `<run-id>` assigned at harness startup. Run IDs
are stable across re-runs of synthesis — the same run-id always points
to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and
computed metadata (`metadata.json`) can be regenerated from the
canonical jsonl at any time.
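
As an illustration of the run-id format, the sketch below generates and validates a version 4 UUID with Python's standard `uuid` module. Whether `run_benchmark.py` generates IDs this way is an assumption; the function names `new_run_id` and `is_uuid4` are hypothetical.

```python
import uuid


def new_run_id() -> str:
    """Generate a random (version 4) UUID as a lowercase, hyphenated string."""
    return str(uuid.uuid4())


def is_uuid4(value: str) -> bool:
    """Return True if value parses as a UUID whose version field is 4."""
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        # Raised for strings that are not well-formed UUIDs.
        return False
    return parsed.version == 4
```

A check like `is_uuid4` is useful when walking `runs/` or `submissions/` to spot directories whose names are not valid run IDs.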
## Schema
The `catalogue.json` index follows `schema_version = "1.0-draft"` (or
later — check the value at the top of the file). Per-benchmark entries
include:
- `id` — run-id
- `title`, `headline`, `date`
- `hardware` (pavilion / predator / mac / vps50 / runpod / community-`<device-tag>`)
- `engine` (llamacpp / ollama / vllm / mlx / cpu)
- `harness` (which harness produced this — see `methodology.md`)
- `model_family`, `model_sizes`
- `cells[]` — per-(machine × engine × model) summary
- `synthesis_doc` — filename of the synthesis prose, if one exists
- `tags`, `status`, `visibility`, `site_grade`

Per-run `metadata.json` adds `cells_full[]` with the full call list inline.
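
The field list above can be turned into a quick sanity check for a single catalogue entry. This is a minimal sketch, not the project's validator: the field names come from the list above, `synthesis_doc` is treated as optional because it is described as present only "if one exists", and treating every other field as required is an assumption. How entries are nested inside `catalogue.json` is not specified here, so the function takes one entry as a plain dict.

```python
# Field names copied from the Schema list in this README. Treating them all
# as required (except synthesis_doc) is an assumption.
REQUIRED_FIELDS = (
    "id", "title", "headline", "date",
    "hardware", "engine", "harness",
    "model_family", "model_sizes",
    "cells", "tags", "status", "visibility", "site_grade",
)


def missing_fields(entry: dict) -> list[str]:
    """Return the required field names that are absent from one entry."""
    return [name for name in REQUIRED_FIELDS if name not in entry]
```

An entry that carries every listed field yields an empty list; anything else returns the names to investigate.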
## How to consume the archive
### Single run
```bash
curl -O https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public/raw/branch/main/runs/<run-id>/run.jsonl
```
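
Once downloaded, `run.jsonl` can be read one event per line. The sketch below makes no assumption about the event fields, since they are not described in this README; it only parses each non-empty line as JSON. The function name `read_events` is hypothetical.

```python
import json
from pathlib import Path


def read_events(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between events
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the file and line so a damaged download is easy to spot.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
```

For example, `sum(1 for _ in read_events("run.jsonl"))` counts the events in a downloaded run without loading the whole file into memory.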
### Whole archive
```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
```
### Re-build catalogue from raw
The canonical builder lives in
[WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py)
and runs against `runs/*/run.jsonl`.
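
The sketch below is not the canonical builder. It only shows how the `runs/*/run.jsonl` inputs can be enumerated from a local clone, and which of the other per-run files from the layout are present. The function name `list_runs` and the result keys are hypothetical.

```python
from pathlib import Path


def list_runs(repo_root):
    """Map each run directory name under runs/ to the per-run files it has."""
    runs = {}
    for jsonl in sorted(Path(repo_root).glob("runs/*/run.jsonl")):
        run_dir = jsonl.parent
        runs[run_dir.name] = {
            # run.log and run.md are optional according to the layout above.
            "has_log": (run_dir / "run.log").is_file(),
            "has_synthesis": (run_dir / "run.md").is_file(),
            "has_metadata": (run_dir / "metadata.json").is_file(),
        }
    return runs
```

Directories under `runs/` that lack a `run.jsonl` are skipped, since the raw event stream is the file everything else is regenerated from.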
## How to contribute a benchmark
See `CLAUDE.md` (or `AGENTS.md` — same content) for the full agent
runbook, and `CONTRIBUTING.md` for the human-side flow. Short version:
1. Clone the repo
2. Hand `CLAUDE.md` to your coding agent
3. Agent probes hardware, picks a model, runs the benchmark, writes results
4. You review what the agent produced
5. You fork → push → open PR
6. Maintainers review and merge

Read access is open. **Write access is via PR only; nothing auto-merges.**
## Citation
```
Margetić, S. & contributors. (2026). Weeyuga local-LLM benchmarks.
https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public
```
## License
- **Data** (`runs/`, `submissions/`, `catalogue.json`, `methodology.md`,
  `harness/suites/`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE).
- **Helper code** (`harness/*.py`, future scripts): [MIT](LICENSE-MIT).

You're free to share, re-host, re-analyse, and remix the data with attribution.
## Reporting issues
If you spot a bench number that looks wrong, a methodology gap, or a
privacy slip in published metadata: open an issue on this repo, or
email the team at `slobodan@weeyuga.com`. We'd rather know.
## Status
| What | State |
|---|---|
| Repo created | 2026-05-05 |
| Canonical archive landed (21 runs) | 2026-05-05 |
| Harness + agent runbook landed | 2026-05-06 |
| Pre-launch security audit | **cleared 2026-05-06** (G2 P0-1/2/3/4 all verified clean on live `benchmarks.weeyuga.com` at SHA `0ba4451`) |
| Visibility flipped to public | **2026-05-06** ✓ |
| First friend's submission merged | pending |

Maintainers: `mac/benchmark-tester-ben` (catalogue + methodology),
`mac/devops-bane` (harness + runner + this README change).

For coordination, see the
[WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).