slobodanmargetic988/weeyuga-benchmarks-public

Slobodan Margetic 97a9245d9e feat: harness + agent runbook - flip repo from archive-only to crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
  ├── README.md        Technical readme for the harness folder
  ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
  │                    WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
  │                    v3 with the cluster-internal IP defaults
  │                    (10.8.0.x) replaced by 127.0.0.1:11434, the
  │                    cluster /v1/cluster/* endpoints removed, the
  │                    canonical-suite paths under ~/Documents/MyServers
  │                    replaced by harness/suites/ paths, the git-sha
  │                    enforcement on WeeyugaWeb dropped, and the
  │                    output written under submissions/<handle>/<tag>/
  │                    instead of docs/BENCHMARKS/runs/. Supports all
  │                    six suite phases via --phases, plus 'all'.
  ├── prompts.py       Verbatim copy of the canonical 3 frozen prompts
  │                    (P-EASY/P-MEDIUM/P-HARD) from
  │                    WeeyugaWeb/scripts/benchmarks/prompts.py.
  ├── requirements.txt Empty by intent (stdlib-only); placeholder for
  │                    pip-tools / agent auto-install patterns.
  ├── .gitignore       __pycache__/ etc.
  └── suites/          Six bundled JSON suites copied verbatim from
       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
       small_model_eval_questions.json, python_task_suite_questions.json,
       parallel_qwen_same_model_20q_suite.json,
       parallel_qwen_mixed_model_20q_suite.json,
       python_context_edge_append_questions.json,
       python_context_edge_suite_only.json.

submissions/
  README.md            Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
       Synthetic-but-shape-complete contribution template:
       manifest.json, hardware.json, run.jsonl (5 example lines),
       metadata.json, run.md (with privacy attestation, methodology
       deviations, reproducibility command). Marked as synthetic at
       the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
  - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
    paths / tokens. Two prompts contain project names ("MyBoard" auth
    debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
    flagged in chat for Sloba's review. Otherwise clean.
  - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
    IPs leaked).
  - manifest.json captures host_hostname_short via socket.gethostname()
    .split('.')[0] — agent should review before PR if hostname is
    sensitive.
  - CLAUDE.md §8 spells out the privacy-grep before push.
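
The hostname capture described above can be sketched as follows. The `override` parameter is a hypothetical addition for contributors who prefer an alias; it is not confirmed to exist in run_benchmark.py.

```python
import socket


def short_hostname(override=None):
    """Return the first DNS label of the hostname, or a caller-supplied alias.

    `override` is a hypothetical convenience, not part of the shipped harness.
    """
    if override:
        return override
    # Same expression the commit message quotes for host_hostname_short.
    return socket.gethostname().split(".")[0]


print(short_hostname())
```

Printing the value before opening a PR is a cheap way to decide whether the hostname is something you are comfortable publishing.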

Verification:
  - py_compile run_benchmark.py: OK
  - --help renders cleanly
  - All 6 suite JSON files: valid
  - All 4 example JSON files: valid
  - Example run.jsonl (5 lines): valid
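
The JSON and JSONL checks listed above can be reproduced with a stdlib-only sketch along these lines. The helper names are hypothetical; the exact commands used to verify this commit are not recorded in the message.

```python
import json
from pathlib import Path


def validate_json_file(path):
    """Return True if the file parses as a single JSON document."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
        return True
    except (OSError, ValueError):  # JSONDecodeError is a ValueError
        return False


def validate_jsonl_file(path):
    """Return the 1-based numbers of lines that are not valid JSON."""
    bad = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except ValueError:
                bad.append(lineno)
    return bad
```

An empty list from `validate_jsonl_file` means every non-blank line parsed cleanly.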

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.

2026-05-06 19:05:22 +02:00


Contributing benchmarks from your hardware

Welcome — and thank you for adding a data point. This file is the human-readable companion to CLAUDE.md / AGENTS.md. The agent files have the full mechanical detail; this one has the humans-in-the-loop story.

What you're contributing

You're running the same Weeyuga benchmark suite that powers benchmarks.weeyuga.com, on your hardware, and submitting the raw output as a PR. Your numbers join Sloba's cluster numbers as a comparison point. More devices = more honest ladder.

What you need

A device with ≥3 GB free disk space (model files range from 0.5 to 10 GB depending on the model you pick)
An OpenAI-compatible LLM runtime — Ollama (easiest), llama.cpp, vLLM, or MLX
A coding agent — Claude Code, Codex, Aider, Cursor, etc. — to read the runbook and adapt the harness to your hardware
A Gitea account on git.weeyuga.com (free; you'll need it to fork + PR)
Maybe 1–4 hours wall-clock, mostly idle while the benchmark runs
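
A quick preflight for the first two requirements might look like the sketch below. It is not part of the harness. The function names are hypothetical, and the `/v1/models` path is an assumption based on the usual OpenAI-compatible API surface; adjust it if your runtime differs.

```python
import http.client
import shutil
import urllib.request

MIN_FREE_GB = 3  # threshold taken from the requirements list above


def free_disk_gb(path="."):
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def runtime_reachable(base_url="http://127.0.0.1:11434", timeout=3.0):
    """Return True if an HTTP server answers 200 at base_url + /v1/models.

    The endpoint path is an assumption, not something the runbook confirms.
    """
    try:
        with urllib.request.urlopen(base_url + "/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        return False


if free_disk_gb() < MIN_FREE_GB:
    print("Less than 3 GB free; pick a smaller model or free up space.")
```

Calling `runtime_reachable()` before a long run avoids discovering a stopped runtime an hour in.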

How it works

You clone this repo:

git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
cd weeyuga-benchmarks-public

You point your agent at it with one sentence:

"Read CLAUDE.md (or AGENTS.md), then run the Weeyuga benchmark on my hardware. Pick a model that fits, document everything, and prepare a PR. I'll review before pushing."
The agent does the work — probes your hardware, picks a model size, adapts the runner's parameters, runs the suite, generates a JSONL ledger plus a human-readable summary.
You review what the agent produced — especially run.md for honesty and run.jsonl for any accidentally-leaked secrets.
You fork → push → open a PR (the agent can do this too; the merge button on Gitea belongs to the maintainers, not to you or the agent).
Sloba reviews the PR and merges. He may have a question or two; the agent can address them on the same branch.
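
To give a feel for the hardware-probe step, here is a minimal stdlib-only sketch. The key names are hypothetical; the real hardware.json schema lives in CLAUDE.md and is not reproduced in this file.

```python
import json
import os
import platform


def probe_hardware():
    """Collect coarse hardware facts using only the standard library.

    Key names are illustrative and do not follow the official schema.
    """
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "arch": platform.machine(),
        "cpu_count": os.cpu_count(),
        "python": platform.python_version(),
    }


print(json.dumps(probe_hardware(), indent=2))
```

RAM and GPU details need platform-specific tooling, which is one reason the runbook leaves the probe to an agent that can adapt to your machine.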

Why crowdsource benchmarks?

Hardware variance is huge. Sloba's published numbers come from a Mac M1, a laptop with a GTX 1060, and three VPSes — that's a thin slice of the world. Your RTX 4090, your Snapdragon X Elite, your Ryzen 7950X3D, your old Xeon all have stories worth recording.

Same prompts. Same suites. Different hardware. The numbers compose.

What we ask

Run the canonical suite as-is (5q + 20q minimum; the rest if you have time)
Document deviations honestly. If you had to skip parallel suites because of RAM, say so. If you tweaked NGL (the number of model layers offloaded to the GPU) because the default OOM'd, say so. The point is comparable runs, not perfect runs.
Privacy-scan before pushing. run.jsonl stores response previews — if the model echoed your home directory or an API key from your shell history, redact before PR.
One PR per device per session. Don't bundle "my laptop AND my desktop AND my friend's PC" — separate PRs are easier to review.
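
A privacy scan of run.jsonl can start from something like the sketch below. The patterns are illustrative assumptions, not the checklist from CLAUDE.md §8, and they will produce both false positives (version strings that look like IPs) and false negatives, so treat the output as a prompt for manual review.

```python
import re

# Illustrative patterns only; tune them to your own environment.
PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "home_path": re.compile(r"(?:/home|/Users)/[\w.-]+"),
    "api_key": re.compile(r"\b(?:sk|ghp|xox[abp])[-_][A-Za-z0-9_-]{16,}\b"),
}


def scan_text(text):
    """Return (line_number, pattern_name, match) for each suspicious hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            for m in pattern.finditer(line):
                hits.append((lineno, name, m.group(0)))
    return hits
```

Usage: read the ledger with `open("run.jsonl", encoding="utf-8").read()`, pass it to `scan_text`, and redact or explain every hit before pushing.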

What the maintainers (Sloba + team) commit to

We respond to PRs within ~3 days
We don't merge without reading; if your run.md has clear caveats we'll usually merge
We credit you by handle in catalogue.json if/when your run becomes a flagship
We never expose anything from your run.md or manifest.json beyond what you submitted; if you used a pseudonym, that's the name that ships
If we ask for a re-run with different parameters, that's a separate dispatch — we don't silently reinterpret your run

License of your contribution

By submitting a PR to this repo, you license the data you contribute under CC-BY-4.0 and any changes to the harness/runner code under MIT. Attribution stays with you (your handle becomes part of the run record).

What this repo is NOT

A leaderboard with prizes
A way to "win" against other devices (the point is honest measurement, not bragging rights)
A vehicle for marketing claims (vendor PR runs need a separate flow we haven't designed yet — please don't astroturf the catalogue)

Found a bug or methodology gap?

Open an issue. We'd rather hear about a flawed prompt or a misleading metric than ship more data using it.

Code of conduct, the short version

Be kind, be honest about your data, don't try to game the catalogue, and don't dox other contributors. Sloba reserves the right to remove submissions that violate the spirit of the thing, but we'll say why.

— The Weeyuga team
