Sloba 2026-05-06T16:55Z chat: "lets flip the switch i want the repo public".

Pre-launch security audit signed off:
- G2-P0-1 (/Users/slobodan/ paths) cleared end-to-end
- G2-P0-2 (10.8.0.x WG mesh IPs) cleared end-to-end
- G2-P0-3 (Slobodans-MacBook-Air mDNS) cleared end-to-end
- G2-P0-4 (MyBoard / TruthGraph names) cleared end-to-end

All four verified clean on the live https://benchmarks.weeyuga.com/ deploy at SHA 0ba4451 by Miljan's pen-test re-grep on /index.html, /catalog.html, /benchmarks/09d8fbde.html (0 hits per pattern per page).

Full live security probe earlier the same day also cleared: path traversal blocked, methods 405-locked, autoindex OFF, hidden files (.git/.DS_Store/.env/.htaccess) all 404, _template.html 404 via vhost SPA-fallback fix, all 6 security headers + HSTS holding.

Updated:
- README banner: "PRIVATE STAGING — pre-launch audit pending" → "PUBLIC — anonymous-readable since 2026-05-06"
- Status table: "Pre-launch security audit | scheduled" → "cleared 2026-05-06 (G2 P0-1/2/3/4 all verified clean on live benchmarks.weeyuga.com at SHA 0ba4451)"
- Status table: "Visibility flipped to public | pending audit sign-off" → "2026-05-06 ✓"

This commit lands the README copy update inside the repo. The Gitea-side visibility flip (Settings → Visibility → Make Public) is a UI click Sloba does himself; no Gitea API token available locally to drive it from this session.
172 lines
6.7 KiB
Markdown
# weeyuga-benchmarks-public

> **Status: PUBLIC** — anonymous-readable since 2026-05-06. Pre-launch
> security audit signed off (G2 P0-1 through P0-4 cleared on the live
> site at SHA `0ba4451`). Welcome — see `CLAUDE.md` / `AGENTS.md` if
> you want your coding agent to clone and run benchmarks on your hardware.

Open benchmarks for local LLMs — same prompts, same suites, run on
whatever hardware you've got, results compose into one ladder.

This repo is two things in one:

1. **Canonical archive** — every benchmark Sloba's team publishes on
   [benchmarks.weeyuga.com](https://benchmarks.weeyuga.com) lives here as
   raw JSONL + computed metadata + human summary, so anyone can clone,
   re-analyse, or cite. This is the original purpose; see `runs/`.
2. **Crowdsourced runner** — a portable harness + agent runbook so a
   friend's coding agent (Claude Code, Codex, Aider, …) can clone this
   repo, read `CLAUDE.md` / `AGENTS.md`, run the same suite on the
   friend's hardware, and submit the result back as a PR. This is the
   newer purpose; see `harness/` + `submissions/`.

Both purposes share one schema, one prompt set, one methodology.
## Quick start — for friends contributing a benchmark

```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
cd weeyuga-benchmarks-public
# Hand this file path to your coding agent and say:
# "Read CLAUDE.md, then run the Weeyuga benchmark on my hardware."
```

Then read `CONTRIBUTING.md` for the human-side flow (what to expect,
how reviews work, license, code of conduct).
## Layout

```
.
├── README.md           — this file
├── CLAUDE.md           — full agent runbook (read this if you're an LLM-driven agent)
├── AGENTS.md           — byte-identical to CLAUDE.md (Codex / other tools that prefer this name)
├── CONTRIBUTING.md     — human-readable contribution guide
├── LICENSE             — CC-BY-4.0 for data
├── LICENSE-MIT         — MIT for harness/runner code
├── catalogue.json      — index of every published benchmark (canonical archive)
├── methodology.md      — how we benchmark + fairness rules + reproducibility notes
├── harness/            — portable runner + suites + prompts (the crowdsourced piece)
│   ├── README.md
│   ├── run_benchmark.py
│   ├── prompts.py
│   ├── requirements.txt
│   └── suites/         — six bundled JSON suites (5Q, 20Q, parallel × 2, edge × 2)
├── runs/               — canonical archive: 21 reference runs from Sloba's cluster
│   └── <run-id>/
│       ├── run.jsonl
│       ├── run.log     — when captured
│       ├── run.md      — when synthesis exists
│       └── metadata.json
└── submissions/        — community contributions land here
    ├── README.md
    ├── EXAMPLE/        — one fully-filled-out template you can read
    │   └── mac-m1-8gb/
    │       └── run-<uuid>/
    │           ├── manifest.json
    │           ├── hardware.json
    │           ├── run.jsonl
    │           ├── metadata.json
    │           └── run.md
    └── <handle>/       — your contributions
        └── <device-tag>/
            └── run-<uuid>/...
```
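The `submissions/` layout above can be checked mechanically before opening a PR. The following is a minimal sketch, not part of the harness: the helper names `submission_dir` and `missing_files` are hypothetical, and it assumes every submission carries the same five files that the `EXAMPLE/` folder shows.

```python
from pathlib import Path

# Assumption: the five files listed under EXAMPLE/ are expected in every
# submission. The README shows them only for the example folder.
SUBMISSION_FILES = (
    "manifest.json",
    "hardware.json",
    "run.jsonl",
    "metadata.json",
    "run.md",
)


def submission_dir(repo_root, handle, device_tag, run_id):
    """Build submissions/<handle>/<device-tag>/run-<uuid>/ under repo_root."""
    return Path(repo_root) / "submissions" / handle / device_tag / f"run-{run_id}"


def missing_files(run_dir):
    """Return the expected file names that are absent from run_dir."""
    return [name for name in SUBMISSION_FILES if not (Path(run_dir) / name).is_file()]
```

An empty list from `missing_files(...)` means the folder has the same shape as the example. It says nothing about whether the file contents are valid.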
## Run-ID format

Every run gets a UUID v4 `<run-id>` assigned at harness startup. Run IDs
are stable across re-runs of synthesis — the same run-id always points
to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and
computed metadata (`metadata.json`) can be regenerated from the
canonical JSONL at any time.
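As an illustration of the ID scheme, a UUID v4 can be generated and validated with Python's standard `uuid` module. This is a sketch under assumptions: the function names `new_run_id` and `is_run_id` are hypothetical and are not taken from `run_benchmark.py`, and the check assumes IDs are stored in the canonical hyphenated form.

```python
import uuid


def new_run_id():
    """Generate a fresh random (version 4) UUID in canonical hyphenated form."""
    return str(uuid.uuid4())


def is_run_id(text):
    """Return True if text is a canonical, hyphenated UUID v4 string."""
    try:
        parsed = uuid.UUID(text)
    except ValueError:
        return False
    # Reject other UUID versions and non-canonical spellings (braces, urn:, no hyphens).
    return parsed.version == 4 and str(parsed) == text.lower()
```

Generating the ID once at startup and never deriving it from the results is what lets `run.md` and `metadata.json` be rebuilt without the ID changing.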
## Schema

The `catalogue.json` index follows `schema_version = "1.0-draft"` (or
later — check the value at the top of the file). Per-benchmark entries
include:

- `id` — run-id
- `title`, `headline`, `date`
- `hardware` (pavilion / predator / mac / vps50 / runpod / community-`<device-tag>`)
- `engine` (llamacpp / ollama / vllm / mlx / cpu)
- `harness` (which harness produced this — see `methodology.md`)
- `model_family`, `model_sizes`
- `cells[]` — per-(machine × engine × model) summary
- `synthesis_doc` — filename of the synthesis prose, if one exists
- `tags`, `status`, `visibility`, `site_grade`

Per-run `metadata.json` adds `cells_full[]` with the full call list inline.
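The field list above can be turned into a quick shape check for a single catalogue entry. This is a sketch only: the README does not say which fields are mandatory, so treating everything except `synthesis_doc` as expected is an assumption, `cells[]` is assumed to be stored under the key `cells`, and every value in `SAMPLE_ENTRY` is invented for illustration.

```python
# Field names come from the list above; which ones are required is an assumption.
EXPECTED_FIELDS = (
    "id", "title", "headline", "date", "hardware", "engine", "harness",
    "model_family", "model_sizes", "cells", "synthesis_doc",
    "tags", "status", "visibility", "site_grade",
)
OPTIONAL_FIELDS = {"synthesis_doc"}  # "if one exists"

# Hypothetical entry: every value here is a placeholder, not real data.
SAMPLE_ENTRY = {
    "id": "00000000-0000-4000-8000-000000000000",
    "title": "Example run",
    "headline": "Placeholder headline",
    "date": "2026-05-06",
    "hardware": "mac",
    "engine": "mlx",
    "harness": "example-harness",
    "model_family": "example-family",
    "model_sizes": ["7b"],
    "cells": [],
    "tags": [],
    "status": "example",
    "visibility": "public",
    "site_grade": "example",
}


def missing_fields(entry):
    """List expected, non-optional field names that the entry lacks."""
    return [
        name for name in EXPECTED_FIELDS
        if name not in OPTIONAL_FIELDS and name not in entry
    ]
```

For example, `missing_fields(SAMPLE_ENTRY)` returns an empty list, while an entry without `engine` would report `["engine"]`.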
## How to consume the archive

### Single run

```bash
curl -O https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public/raw/branch/main/runs/<run-id>/run.jsonl
```
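Once downloaded, `run.jsonl` can be read one line at a time. The sketch below assumes only the JSON Lines convention (one JSON object per non-blank line); the helper name `load_events` is hypothetical, and the fields inside each event are not documented in this README, so the code does not rely on any of them.

```python
import json
from pathlib import Path


def load_events(path):
    """Parse a JSONL file into a list of objects, skipping blank lines."""
    events = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a truncated download is easy to spot.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return events
```

Calling `len(load_events("run.jsonl"))` gives the number of events in the stream, which is a cheap sanity check after a download.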
### Whole archive

```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
```
### Re-build catalogue from raw

The canonical builder lives in
[WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py)
and runs against `runs/*/run.jsonl`.
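The sketch below is not the canonical builder and does not reproduce its logic. It only illustrates the discovery step implied by the `runs/*/run.jsonl` pattern, using a hypothetical helper `discover_runs` that maps each run-id (the directory name) to its raw event file.

```python
from pathlib import Path


def discover_runs(repo_root):
    """Map each run-id under runs/ to the path of its run.jsonl."""
    runs_dir = Path(repo_root) / "runs"
    # Directories without a run.jsonl are simply not matched by the glob.
    return {path.parent.name: path for path in sorted(runs_dir.glob("*/run.jsonl"))}
```

Run from a clone of this repo, `discover_runs(".")` returns one entry per run directory that contains a `run.jsonl`.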
## How to contribute a benchmark

See `CLAUDE.md` (or `AGENTS.md` — same content) for the full agent
runbook, and `CONTRIBUTING.md` for the human-side flow. Short version:

1. Clone the repo
2. Hand `CLAUDE.md` to your coding agent
3. The agent probes your hardware, picks a model, runs the benchmark, and writes the results
4. You review what the agent produced
5. You fork → push → open a PR
6. Maintainers review and merge

Read access is open. **Write access is via PR only — nothing auto-merges.**
## Citation

```
Margetić, S. & contributors. (2026). Weeyuga local-LLM benchmarks.
https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public
```
## License

- **Data** (`runs/`, `submissions/`, `catalogue.json`, `methodology.md`,
  `harness/suites/`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE).
- **Helper code** (`harness/*.py`, future scripts): [MIT](LICENSE-MIT).

You're free to share, re-host, re-analyse, and remix the data with attribution.
## Reporting issues

If you spot a benchmark number that looks wrong, a methodology gap, or a
privacy slip in published metadata: open an issue on this repo, or
email the team at `slobodan@weeyuga.com`. We'd rather know.
## Status

| What | State |
|---|---|
| Repo created | 2026-05-05 |
| Canonical archive landed (21 runs) | 2026-05-05 |
| Harness + agent runbook landed | 2026-05-06 |
| Pre-launch security audit | **cleared 2026-05-06** (G2 P0-1/2/3/4 all verified clean on live `benchmarks.weeyuga.com` at SHA `0ba4451`) |
| Visibility flipped to public | **2026-05-06** ✓ |
| First friend's submission merged | pending |

Maintainers: `mac/benchmark-tester-ben` (catalogue + methodology),
`mac/devops-bane` (harness + runner + this README change).
For coordination, see the
[WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).