weeyuga-benchmarks-public/README.md
slobodanmargetic988 a18db6a3da B3 staging seed — 21 runs + catalogue v1.0-draft + methodology + README
Initial population of the weeyuga-benchmarks-public archive (PRIVATE
staging visibility — flips public after Miljan + Stevan security audit
sign-off per Sloba's 17:34Z dispatch).

Contents:
- README.md       — public-facing intro (warns staging state, schema overview, citation pattern, license split)
- LICENSE         — CC-BY-4.0 default (auto-init from Gitea)
- catalogue.json  — schema_version=1.0-draft (locked once Tomas ratifies); 21 benchmarks indexed, 13 complete + 8 meta-only
- methodology.md  — mirror of WeeyugaWeb docs/BENCHMARKS/HARNESS.md (canonical methodology)
- runs/<id>/run.jsonl|run.log|run.md|metadata.json — packaged copies of every run in WeeyugaWeb docs/BENCHMARKS/runs/*

Run set covers:
- Mission 1 (2026-04-28/29): pavilion-weeyuga-v1 + reconstructed v3 (96 calls, 16 models routed via weeyuga :11435)
- Predator trio (2026-05-04): granite-4.1-8B + gemma-4-E4B-it + qwen3.5-9B
- Predator qwen rerun (2026-05-04): qwen3.5-9B think500/nothink + qwen3-14B feasibility
- A3B campaign (2026-05-04/05): pavilion-a3b + predator-a3b NGL matrix + ctx sweep + NGL+ctx 2D + NGL=6 deep dive
- VPS50 CPU matrix + gemma-e4b CPU lane (2026-05-04/05)

Visibility GATE: this repo stays private until Miljan G1-G4 audit and
Stevan G3 credential audit both green. After sign-off, single API call
flips visibility=public, anonymous read on, push-protection requires
auth, issues moderate by default.

No raw IPs, no SSH user@host strings, no /Users/ paths, no whisper
transcripts in any of these files. Hardware names (pavilion, predator,
vps50) are intentional and fine to share.

Builder: WeeyugaWeb/scripts/benchmarks/build_catalogue.py (deterministic,
idempotent, ~5s wall on 21 runs).
Publish flow: WeeyugaWeb/scripts/benchmarks/publish_bench_run.py
(builds packaged dirs, regenerates catalogue, optional --push to mirror
into this repo, optional --deploy stub for cicd rsync).

Owner: mac/benchmark-tester-ben (Ben).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 19:46:01 +02:00

# weeyuga-benchmarks-public
> **Status: PRIVATE STAGING.** This repo is not yet public. It flips to anonymous-read after [Miljan + Stevan's pre-launch security audit](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination/messages) signs off. If you're reading this and you're not on the Weeyuga team, you got here too early.

Canonical raw-data archive for **[benchmarks.weeyuga.com](https://benchmarks.weeyuga.com)**: every benchmark run we publish on the site is mirrored here as raw JSONL + log + human summary so anyone can clone, re-analyse, or cite.
## Layout
```
.
├── README.md        — this file
├── LICENSE          — CC-BY-4.0 (data) + MIT (helper code)
├── catalogue.json   — index of every published benchmark (mirror of the site catalogue)
├── methodology.md   — how we benchmark + fairness rules + reproducibility notes
└── runs/
    └── <run-id>/
        ├── run.jsonl      — canonical raw event stream (one JSON object per line)
        ├── run.log        — tee'd stdout/stderr from the harness (when captured)
        ├── run.md         — human-readable summary (when synthesis exists)
        └── metadata.json  — computed snapshot: meta record + per-cell aggregates + status
```
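To see which of these files each run actually ships, a clone can be inventoried with a few lines of Python. This is a minimal sketch, not part of the archive's tooling: the function names `inventory` and `runs_missing` are hypothetical, and it assumes only the directory layout shown above.

```python
from pathlib import Path

# File names taken from the layout above. Which of them are optional is not
# assumed here: the sketch only reports what is present.
EXPECTED_FILES = ("run.jsonl", "run.log", "run.md", "metadata.json")


def inventory(archive_root):
    """Map each run directory name under runs/ to the set of expected files it contains."""
    runs_dir = Path(archive_root) / "runs"
    if not runs_dir.is_dir():
        return {}
    return {
        run_dir.name: {name for name in EXPECTED_FILES if (run_dir / name).is_file()}
        for run_dir in sorted(runs_dir.iterdir())
        if run_dir.is_dir()
    }


def runs_missing(archive_root, filename):
    """Return the sorted run-ids whose directory lacks the given file."""
    return sorted(
        run_id
        for run_id, present in inventory(archive_root).items()
        if filename not in present
    )


if __name__ == "__main__":
    # Run from the root of a clone of this archive.
    for run_id, present in inventory(".").items():
        print(run_id, sorted(present))
```

For example, `runs_missing(".", "run.md")` would list the runs that have no synthesis summary yet.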
## Run-ID format
Each run is assigned a UUID v4 `<run-id>` at harness startup. Run IDs are stable across re-runs of synthesis: the same run-id always points to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and computed metadata (`metadata.json`) can be regenerated from the canonical `run.jsonl` at any time.
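
If you script against the archive, it can help to check that a directory name really is a run-id before treating it as one. The sketch below uses only the standard-library `uuid` module; the helper name `is_uuid4` is hypothetical, and it assumes run-ids are written in the canonical hyphenated form.

```python
import uuid


def is_uuid4(run_id: str) -> bool:
    """Return True if run_id is a canonical, hyphenated UUID version 4 string."""
    try:
        parsed = uuid.UUID(run_id)
    except ValueError:
        return False
    # uuid.UUID also accepts braces, "urn:uuid:" prefixes and un-hyphenated hex,
    # so compare against the canonical rendering to reject those forms.
    return parsed.version == 4 and str(parsed) == run_id.lower()
```

A directory under `runs/` whose name fails this check is either not a run or does not follow the convention described above.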
## Schema
The `catalogue.json` index follows `schema_version = "1.0-draft"` (or later — check the value at the top of the file). Per-benchmark entries include:
- `id` — run-id
- `title`, `headline`, `date`
- `hardware` (pavilion / predator / mac / vps50 / runpod)
- `engine` (llamacpp / ollama / vllm / mlx / cpu)
- `harness` (which harness produced this — see `methodology.md` for the matrix)
- `model_family`, `model_sizes`
- `cells[]` — per-(machine × engine × model) summary: n_calls, n_errors, duration_ms (mean + p50), tokens_per_sec (mean + max)
- `synthesis_doc` — filename of the synthesis prose for this run, if one exists
- `tags`, `status`, `visibility`

Per-run `metadata.json` adds `cells_full[]` with the full call list inline.
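
As an illustration of working with the per-cell summaries, the sketch below computes an overall error rate for one catalogue entry. It assumes `n_calls` and `n_errors` are plain integers directly on each element of `cells[]`; the schema is still a draft, so check the real file before relying on that shape. The function name `error_rate` and the sample entry are hypothetical, and the numbers are placeholders, not benchmark results.

```python
import json


def error_rate(entry):
    """Fraction of failed calls across all cells of one entry, or None if there were no calls."""
    cells = entry.get("cells", [])
    calls = sum(cell.get("n_calls", 0) for cell in cells)
    errors = sum(cell.get("n_errors", 0) for cell in cells)
    return errors / calls if calls else None


# Placeholder entry with made-up values, only to show the assumed shape.
sample = json.loads("""
{
  "id": "00000000-0000-4000-8000-000000000000",
  "hardware": "pavilion",
  "engine": "llamacpp",
  "cells": [
    {"n_calls": 10, "n_errors": 1},
    {"n_calls": 10, "n_errors": 0}
  ]
}
""")

if __name__ == "__main__":
    print(error_rate(sample))
```

The same pattern extends to the other aggregates (`duration_ms`, `tokens_per_sec`) once their exact nesting is confirmed against a real `catalogue.json`.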
## How to consume
### Just download a single run
```bash
curl -O https://benchmarks.weeyuga.com/data/runs/<run-id>/run.jsonl
```
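Since `run.jsonl` holds one JSON object per line, a downloaded file can be read with the standard library alone. This is a minimal sketch: the helper name `iter_events` is hypothetical, and it makes no assumption about which fields each event carries.

```python
import json


def iter_events(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a truncated download is easy to spot.
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc.msg})") from exc
```

Typical use would be `events = list(iter_events("run.jsonl"))`, after which the events can be filtered or aggregated as needed.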
### Clone the whole archive
```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
```
### Re-build catalogue from raw
The canonical builder lives in [WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py) and runs against `runs/*.jsonl`. This repo ships no code, so regenerating the catalogue means cloning WeeyugaWeb and running the builder there:
```bash
git clone https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb.git
cd WeeyugaWeb
python3 scripts/benchmarks/build_catalogue.py
```
## Citation
If you use this data, please cite as:
```
Margetić, S. & contributors. (2026). Weeyuga cluster benchmarks (raw data archive).
https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public
```
(A more formal citation form will land here once Mila weighs in on academic-attribution conventions.)
## License
- **Data** (`runs/`, `catalogue.json`, `methodology.md`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE)
- **Helper code** (any future scripts inside this repo): [MIT](LICENSE-MIT) (separate file added if/when code lands here)

You are free to share, re-host, re-analyse, and remix the data with attribution.
## What's in here vs what's NOT
This repo contains **bench-run output only**. No source code. No infrastructure config. No application internals. Reproducing a run requires the [WeeyugaWeb](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb) main repo (also Gitea-hosted; visibility separate).
## Reporting an issue with the data
If you spot a bench number that looks wrong, a methodology gap, or a privacy slip in published metadata: open an issue on this repo, or email the Weeyuga team. We'd rather know.
## Status
| What | State |
|---|---|
| Repo created | 2026-05-05 |
| First 21 runs landed | 2026-05-05 |
| Miljan + Stevan security audit | scheduled |
| Visibility flipped to public | pending audit sign-off |
| Site `benchmarks.weeyuga.com` live | pending Bane DNS + nginx + Tomas site |

Owner: `mac/benchmark-tester-ben` (Ben). For coordination, see the [WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).