slobodanmargetic988/weeyuga-benchmarks-public

Slobodan Margetic 97a9245d9e feat: harness + agent runbook — flip repo from archive-only to crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
  ├── README.md        Technical readme for the harness folder
  ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
  │                    WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
  │                    v3 with the cluster-internal IP defaults
  │                    (10.8.0.x) replaced by 127.0.0.1:11434, the
  │                    cluster /v1/cluster/* endpoints removed, the
  │                    canonical-suite paths under ~/Documents/MyServers
  │                    replaced by harness/suites/ paths, the git-sha
  │                    enforcement on WeeyugaWeb dropped, and the
  │                    output written under submissions/<handle>/<tag>/
  │                    instead of docs/BENCHMARKS/runs/. Supports all
  │                    six suite phases via --phases, plus 'all'.
  ├── prompts.py       Verbatim copy of the canonical 3 frozen prompts
  │                    (P-EASY/P-MEDIUM/P-HARD) from
  │                    WeeyugaWeb/scripts/benchmarks/prompts.py.
  ├── requirements.txt Empty by intent (stdlib-only); placeholder for
  │                    pip-tools / agent auto-install patterns.
  ├── .gitignore       __pycache__/ etc.
  └── suites/          Six bundled JSON suites copied verbatim from
       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
       small_model_eval_questions.json, python_task_suite_questions.json,
       parallel_qwen_same_model_20q_suite.json,
       parallel_qwen_mixed_model_20q_suite.json,
       python_context_edge_append_questions.json,
       python_context_edge_suite_only.json.

submissions/
  README.md            Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
       Synthetic-but-shape-complete contribution template:
       manifest.json, hardware.json, run.jsonl (5 example lines),
       metadata.json, run.md (with privacy attestation, methodology
       deviations, reproducibility command). Marked as synthetic at
       the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
  - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
    paths / tokens. Two prompts contain project names ("MyBoard" auth
    debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
    flagged in chat for Sloba's review. Otherwise clean.
  - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
    IPs leaked).
  - manifest.json captures host_hostname_short via socket.gethostname()
    .split('.')[0] — agent should review before PR if hostname is
    sensitive.
  - CLAUDE.md §8 spells out the privacy-grep before push.
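
A minimal sketch of what a pre-push privacy scan of this kind could look like. The pattern set, the `scan_text` helper, and the choice of what counts as sensitive are assumptions for illustration only; the actual procedure is whatever CLAUDE.md §8 specifies. `short_hostname` reuses the expression quoted above for host_hostname_short.

```python
import re
import socket

# Hypothetical patterns; the real checklist lives in CLAUDE.md §8.
PATTERNS = {
    # RFC 1918 private ranges, which would include cluster IPs like 10.8.0.x.
    "private_ipv4": re.compile(
        r"\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}"
        r"|192\.168\.\d{1,3}\.\d{1,3}"
        r"|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3})\b"
    ),
    # Absolute paths that reveal a local user name.
    "home_path": re.compile(r"(?:/Users/|/home/)[^\s\"']+"),
}


def scan_text(text):
    """Return (line_number, pattern_name, match) for every suspicious hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((lineno, name, match.group(0)))
    return hits


def short_hostname():
    """Hostname up to the first dot, as quoted for host_hostname_short."""
    return socket.gethostname().split(".")[0]
```

Loopback addresses such as 127.0.0.1 are deliberately not flagged, since that is the harness default.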

Verification:
  - py_compile run_benchmark.py: OK
  - --help renders cleanly
  - All 6 suite JSON files: valid
  - All 4 example JSON files: valid
  - Example run.jsonl (5 lines): valid
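
For readers who want to repeat the JSON and JSONL validity checks locally, a stdlib-only sketch follows. The helper names are hypothetical and this is not necessarily how the checks above were run.

```python
import json
from pathlib import Path


def json_file_is_valid(path):
    """True if the file parses as a single JSON document."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
        return True
    except (OSError, ValueError):  # JSONDecodeError subclasses ValueError
        return False


def jsonl_invalid_lines(path):
    """Return 1-based numbers of non-blank lines that are not valid JSON."""
    bad = []
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                json.loads(line)
            except ValueError:
                bad.append(lineno)
    return bad
```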

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.

2026-05-06 19:05:22 +02:00

README.md

`submissions/`

Friends' benchmark contributions land here, one directory per submitter, one subdirectory per device, one sub-subdirectory per run.

Layout

submissions/
├── README.md                            — this file
├── EXAMPLE/                             — template; see below
│   └── mac-m1-8gb/
│       └── run-00000000-...-000000000000/
│           ├── manifest.json
│           ├── hardware.json
│           ├── run.jsonl
│           ├── metadata.json
│           └── run.md
├── alice/                               — first real friend's contributions
│   └── mac-m1-8gb/
│       └── run-<uuid>/...
└── bob/                                 — etc.
    └── rtx-4090-pc/
        └── run-<uuid>/...

Per-submission contents

Five files inside each run-<uuid>/:

- manifest.json: automatic; run_benchmark.py writes it at run start. Contains submitter handle, device tag, target URL, model list, phase plan, canonical-options overrides, host hostname (short), platform, started-at timestamp.
- hardware.json: agent fills from a hardware probe (see CLAUDE.md §2). Schema version hardware-1.0.
- run.jsonl: automatic; the canonical event ledger. Line 1 is type=meta; subsequent lines are type=call or type=skipped; the final line is type=footer.
- metadata.json: agent fills with computed aggregates per (cell_id, phase) cell. Schema version metadata-1.0. The catalogue builder recomputes these on Sloba's side; including them in the PR makes review fast.
- run.md: agent fills using the CLAUDE.md §6b template. An honest narrative: methodology deviations, caveats, headline numbers.
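
The run.jsonl ordering rule (meta first, then call or skipped, footer last) can be checked before opening a PR. The sketch below assumes each ledger line is a JSON object with a "type" key; the `check_ledger_shape` helper is hypothetical and not part of the harness.

```python
import json


def check_ledger_shape(lines):
    """Return a list of problems found in the ledger's type sequence.

    Blank lines are ignored, so reported line numbers count non-blank
    lines only.
    """
    events = [json.loads(line) for line in lines if line.strip()]
    if len(events) < 2:
        return ["ledger needs at least a meta line and a footer line"]
    problems = []
    if events[0].get("type") != "meta":
        problems.append("line 1 is not type=meta")
    if events[-1].get("type") != "footer":
        problems.append("final line is not type=footer")
    # Everything between the first and last line must be a call or a skip.
    for index, event in enumerate(events[1:-1], start=2):
        if event.get("type") not in ("call", "skipped"):
            problems.append(
                f"line {index} has unexpected type {event.get('type')!r}"
            )
    return problems
```

An empty return list means the type sequence matches the rule; it says nothing about the other fields on each line.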

Why per-submitter folders?

- Attribution: your handle lives next to your data.
- Reviewability: a PR adds files only under submissions/<your-handle>/..., so the reviewer can see the whole contribution at a glance.
- No collisions: two friends submitting from "macbook-pro" don't overwrite each other.
- History stays clean: re-runs go into new run-<uuid>/ subdirs, not on top of the old one.
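
The reviewability rule lends itself to a mechanical check. The sketch below takes a list of changed paths (for example, the output of `git diff --name-only`) and reports anything outside the submitter's folder; the `paths_outside_handle` helper is hypothetical and no such check is known to exist in the repo.

```python
from pathlib import PurePosixPath


def paths_outside_handle(changed_paths, handle):
    """Return the changed paths that do not live under submissions/<handle>/."""
    outside = []
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        in_scope = (
            len(parts) >= 3               # submissions/<handle>/<something>
            and parts[0] == "submissions"
            and parts[1] == handle
            and ".." not in parts         # no climbing out via parent refs
        )
        if not in_scope:
            outside.append(raw)
    return outside
```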

Naming conventions

- <submitter-handle>: your Gitea username, or any other handle you'd like to be credited as. Lowercase; ASCII letters / digits / hyphens only.
- <device-tag>: short descriptor of the hardware. Pattern: <chip-or-platform>-<key-spec>. Examples:
  - mac-m1-8gb, mac-m2-pro-16gb, mac-m3-max-64gb
  - rtx-4090-pc, rtx-3060-laptop, gtx-1060-6gb
  - ryzen-7950x-cpu, intel-i9-13900k-cpu
  - pixel-8-pro, samsung-s24-ultra (yes, phones, if you've got termux working)
  - runpod-h100-pcie, runpod-rtx-a6000
- run-<uuid>: run- prefix + a UUID v4 from run_benchmark.py. Don't shorten.
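
A rough way to check a handle and a run directory name against these conventions, using only the standard library. The helper names and the exact regular expression are assumptions drawn from the wording above, not code from the harness.

```python
import re
import uuid

# Lowercase ASCII letters, digits, and hyphens only.
HANDLE_RE = re.compile(r"[a-z0-9-]+")


def is_valid_handle(handle):
    """True if the whole handle uses only the allowed characters."""
    return HANDLE_RE.fullmatch(handle) is not None


def is_valid_run_dir(name):
    """True for 'run-' followed by a full, canonical UUID v4 string."""
    if not name.startswith("run-"):
        return False
    candidate = name[len("run-"):]
    try:
        parsed = uuid.UUID(candidate)
    except ValueError:
        return False  # shortened or otherwise malformed
    # Require version 4 and the lowercase hyphenated form.
    return parsed.version == 4 and str(parsed) == candidate
```

The all-zero placeholder used by the EXAMPLE folder is not a version-4 UUID, so this check rejects it by design; real runs get their UUID from run_benchmark.py.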

What the EXAMPLE folder is for

A complete-but-tiny submission you can read end-to-end to understand the shapes. Don't modify the EXAMPLE folder in a benchmark-submission PR; if you spot a bug in the example, that's a separate PR with the title fix: submissions/EXAMPLE/....

When a submission is merged

Sloba reviews and merges manually. After merge:

- The catalogue builder on Sloba's side picks up your run, computes a cell_id from your device-tag + model, and assigns it a site_grade (flagship / standard / archive-only, based on the criteria in methodology.md).
- Janie (the benchmarks blogger) may write a janie_blurb_md for it.
- It appears on benchmarks.weeyuga.com (when the site is live).
- Your device-tag becomes a permanent comparison axis on the catalogue.

What if I want to delete a submission later?

Open an issue and we'll honor the request promptly. We keep the run directory but mark it visibility: redacted in the catalogue overlay, so the data can still back historical analysis claims while no longer appearing on the browse surface.