crowdsourced runner
Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."
The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.
What landed:
CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
Full agent runbook: hardware probe, runtime + model selection,
canonical knob reference (Sloba's Pavilion methodology values),
hardware-adaptation decision rules, run-instructions, output-schema
templates for hardware.json + metadata.json + run.md, PR submission
flow (fork → branch → push → PR; nothing auto-merges), privacy
guardrails, methodology lineage. Per Sloba's Q3 directive: the
runbook explicitly tells the friend's agent to ADAPT to hardware
reality and document deviations rather than blindly run defaults.
CONTRIBUTING.md (~110 lines)
Human-readable companion for the friend (not the agent). What you
need, how it works, what we ask, what maintainers commit to,
license, code-of-conduct short version.
harness/
├── README.md Technical readme for the harness folder
├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
│ WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│ v3 with the cluster-internal IP defaults
│ (10.8.0.x) replaced by 127.0.0.1:11434, the
│ cluster /v1/cluster/* endpoints removed, the
│ canonical-suite paths under ~/Documents/MyServers
│ replaced by harness/suites/ paths, the git-sha
│ enforcement on WeeyugaWeb dropped, and the
│ output written under submissions/<handle>/<tag>/
│ instead of docs/BENCHMARKS/runs/. Supports all
│ six suite phases via --phases, plus 'all'.
├── prompts.py Verbatim copy of the canonical 3 frozen prompts
│ (P-EASY/P-MEDIUM/P-HARD) from
│ WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt Empty by intent (stdlib-only); placeholder for
│ pip-tools / agent auto-install patterns.
├── .gitignore __pycache__/ etc.
└── suites/ Six bundled JSON suites copied verbatim from
Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
small_model_eval_questions.json, python_task_suite_questions.json,
parallel_qwen_same_model_20q_suite.json,
parallel_qwen_mixed_model_20q_suite.json,
python_context_edge_append_questions.json,
python_context_edge_suite_only.json.
submissions/
README.md Folder convention + naming + reviewability rules
EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
Synthetic-but-shape-complete contribution template:
manifest.json, hardware.json, run.jsonl (5 example lines),
metadata.json, run.md (with privacy attestation, methodology
deviations, reproducibility command). Marked as synthetic at
the top so future analysis doesn't accidentally cite it.
LICENSE-MIT
MIT for harness/*.py and future helper code. Existing LICENSE
(CC-BY-4.0) covers data files.
README.md (modified)
Updated to reflect dual purpose. Layout diagram updated.
Maintainer credits: Ben for catalogue/methodology + Bane for harness.
Contributor quick-start added. Status table extended.
Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
paths / tokens. Two prompts contain project names ("MyBoard" auth
debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
flagged in chat for Sloba's review. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
IPs leaked).
- manifest.json captures host_hostname_short via socket.gethostname()
.split('.')[0] — agent should review before PR if hostname is
sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.
Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid
This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
CLAUDE.md — runbook for the friend's coding agent
You are the coding agent (Claude Code, Codex, Aider, Cursor, …) of someone who agreed to run the Weeyuga benchmark suite on their hardware. This file is your briefing. Read it completely before doing anything.
If your runtime auto-loads AGENTS.md instead of CLAUDE.md, that file is byte-identical: same content, different filename for tools that prefer one over the other.
TL;DR (60 seconds)
You are going to:
- Probe the friend's hardware (OS, CPU, RAM, GPU + VRAM).
- Pick an OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX) and a model that fits.
- Adapt the bundled runner to the friend's reality (this is the part you can't skip — see §3).
- Run harness/run_benchmark.py against the runtime; it writes a JSONL ledger of every call.
- Document the run in submissions/<friend-handle>/<device-tag>/run-<uuid>/.
- Submit a PR back to this repo. Sloba reviews and merges manually; nothing auto-merges.
You have read-only access to this repo. To submit, fork on Gitea, push to your fork, open a PR.
The harness is a template, not a one-click button. The friend's hardware will not match Sloba's; you are expected to adapt parameters, research best-known values online when you hit unfamiliar constraints, and write down what you changed and why in the manifest. Documented deviation is fine. Silent deviation breaks comparability.
0. Read this completely before doing anything
The rest of this file is structured as the order you'll work in. Reading the whole thing first gives you the shape; then the friend can ask "go" and you execute without circling back.
If you hit something genuinely ambiguous, ASK THE FRIEND. Don't guess at hardware-specific values; either verify with measurement or research them from the project / model authors' recommended-settings docs.
1. What you are running, exactly
Inputs (frozen across runs):
- harness/prompts.py: three frozen prompts (P-EASY, P-MEDIUM, P-HARD). Never modified.
- harness/suites/*.json: six benchmark suites, all run sequentially per model:

| phase key | suite file | what it tests |
|---|---|---|
| 5q | small_model_eval_questions.json | 5 short-answer formatting + correctness questions |
| 20q | python_task_suite_questions.json | 20 realistic Python task prompts |
| parallel_same | parallel_qwen_same_model_20q_suite.json | parallel-lane stress with one model |
| parallel_mixed | parallel_qwen_mixed_model_20q_suite.json | parallel-lane stress with multiple models |
| edge_append | python_context_edge_append_questions.json | long-context append behavior |
| edge_suite | python_context_edge_suite_only.json | long-context whole-suite reasoning |
Driver: harness/run_benchmark.py — one process, sequential calls to your local OpenAI-compatible /v1/chat/completions endpoint, one JSONL line per call.
Output: submissions/<handle>/<device-tag>/run-<uuid>/ containing:
- run.jsonl: every call recorded
- manifest.json: written automatically by the runner
- hardware.json: you fill this from the hardware probe (§2)
- metadata.json: computed aggregates (you generate, see §6)
- run.md: human-readable summary (you write, see §6)
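Before opening a PR it can help to confirm that all five files are present. The following is a minimal sketch, not part of the shipped harness; the helper name missing_submission_files is hypothetical, and the file names are taken from the list above.

```python
from pathlib import Path

# File names as listed in this runbook for a run-<uuid>/ directory.
EXPECTED_FILES = (
    "run.jsonl",
    "manifest.json",
    "hardware.json",
    "metadata.json",
    "run.md",
)


def missing_submission_files(run_dir):
    """Return the expected file names that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
```

An empty list means the directory has the expected shape; it says nothing about whether the contents are valid.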
Run order: ALL six suites run in sequence per model, per the canonical Pavilion methodology Sloba uses. Don't pick-and-choose unless the friend is explicitly time-constrained — partial runs are still useful but they're documented as "partial" in the manifest, and they show up as archive-only in Sloba's catalogue rather than full-grade runs.
2. Hardware probe — do this first, write hardware.json from the result
Before anything else, gather the friend's hardware truth. Pick the platform-appropriate commands; don't run all of them, just the ones that work on the friend's OS.
macOS:
system_profiler SPHardwareDataType SPDisplaysDataType
sysctl -n machdep.cpu.brand_string
sysctl -n hw.memsize
sw_vers
uname -a
Linux:
lscpu
cat /proc/meminfo | head -3
nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv # if NVIDIA
lspci | grep -iE "vga|3d|display"
uname -a
cat /etc/os-release
Windows (PowerShell):
Get-CimInstance Win32_ComputerSystem | Select Manufacturer, Model, TotalPhysicalMemory
Get-CimInstance Win32_Processor | Select Name, NumberOfCores, MaxClockSpeed
Get-CimInstance Win32_VideoController | Select Name, AdapterRAM, DriverVersion
$PSVersionTable.OS
Write the canonical findings to hardware.json. Schema (every field present; null if not applicable):
{
"schema_version": "hardware-1.0",
"device_tag": "mac-m1-8gb",
"manufacturer_model": "Apple MacBook Air (Mac14,2)",
"os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
"cpu": {"name": "Apple M1", "cores": 8, "threads": 8, "max_ghz": 3.2,
"arch": "arm64", "isa": ["NEON"]},
"memory_gb_total": 8,
"memory_gb_available_at_run_start": 4.2,
"gpu": [
{"name": "Apple M1 GPU", "kind": "integrated", "vram_gb": null,
"driver": "Metal/macOS-14", "compute_cap": null}
],
"storage": {"kind": "ssd", "free_gb_at_run_start": 220},
"thermal_or_power_notes": "default OS thermal mgmt; on AC power",
"network_used_for_model_fetch": "wifi-100mbps",
"container_or_vm": null
}
Honest mode flags to mention in thermal_or_power_notes:
- "battery-only, low-power-mode active" → results may be artificially slow
- "thermal throttling observed mid-run" → tag any affected calls in run.md caveats
- "GPU shared with display compositor" → expect 5-15% throughput hit vs headless
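As a starting point for hardware.json, the fields the Python standard library can see can be pre-filled and the rest left as null for you to complete from the probe commands. This is a sketch under assumptions: hardware_skeleton is a hypothetical helper, not something the harness ships, and the standard library cannot report RAM, GPU, or VRAM portably, so those stay empty here.

```python
import os
import platform
import shutil


def hardware_skeleton(device_tag):
    """Pre-fill what the stdlib can see; None marks fields to fill by hand."""
    free_gb = round(shutil.disk_usage(".").free / 1e9, 1)
    return {
        "schema_version": "hardware-1.0",
        "device_tag": device_tag,
        "manufacturer_model": None,  # from system_profiler / Win32_ComputerSystem
        "os": {
            "name": platform.system(),
            "version": None,  # from sw_vers / /etc/os-release
            "kernel": platform.release(),
        },
        "cpu": {
            "name": platform.processor() or None,
            "cores": None,  # physical cores: from lscpu / sysctl
            "threads": os.cpu_count(),
            "max_ghz": None,
            "arch": platform.machine(),
            "isa": [],
        },
        "memory_gb_total": None,  # from hw.memsize / /proc/meminfo
        "memory_gb_available_at_run_start": None,
        "gpu": [],  # one entry per GPU, filled from the probe output
        "storage": {"kind": None, "free_gb_at_run_start": free_gb},
        "thermal_or_power_notes": None,
        "network_used_for_model_fetch": None,
        "container_or_vm": None,
    }
```

Dumping the result with json.dumps(hardware_skeleton("mac-m1-8gb"), indent=2) gives a file with every schema key present, which you then edit with the measured values.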
3. Adapt to hardware reality — this is the part you cannot skip
The harness uses Sloba's canonical knobs as defaults. They are not guaranteed to be optimal for the friend's hardware. Your job:
3a. Canonical knobs (Sloba's reference values)
CANONICAL_OPTIONS = {
"temperature": 0.1, # near-deterministic; comparable across runs
"num_ctx": 4096, # context window
"num_predict": 2048, # max generated tokens per call
}
Plus runtime-level (Ollama-specific but apply equivalently to llama.cpp / vLLM):
- KEEP_ALIVE: how long the loaded model stays warm. Sloba uses 2400h on cluster nodes (~100 days = effectively pinned). On a friend's laptop, 5m is gentler if RAM is tight.
- NUM_PARALLEL: concurrent slots. Sloba uses 3 on Pavilion. 1 is fine on tight RAM.
- MAX_LOADED_MODELS: how many models held in VRAM. Sloba uses 3 on a 12 GB GPU; default to 1 on anything ≤ 8 GB.
- For llama.cpp: --n-gpu-layers (NGL), the number of model layers offloaded to GPU. Critical on borderline VRAM. NGL=99 is full offload; NGL=0 is CPU-only. Sloba's Predator (GTX 1060 6 GB) runs Qwen3.5:9B at NGL=6 because higher offloads OOM with the KV cache.
3b. Decision rules
| Friend's hardware | Likely runtime | Likely model size | Likely NGL | Likely NUM_PARALLEL |
|---|---|---|---|---|
| Apple Silicon (M1/M2/M3, ≥8 GB unified) | Ollama OR llama.cpp w/ Metal OR MLX | 0.5B – 4B | n/a (Metal handles offload) | 1–2 |
| Apple Silicon (M-Pro/M-Max, ≥16 GB) | same, MLX preferred for 8B+ | 4B – 14B | n/a | 2–3 |
| NVIDIA GPU 6 GB VRAM | llama.cpp + CUDA | 0.5B – 4B (or 8B at NGL ~10–20) | tuned per model | 1 |
| NVIDIA GPU 8–12 GB VRAM | llama.cpp + CUDA, or vLLM | 4B – 14B | high (60–99) | 1–2 |
| NVIDIA GPU 24+ GB VRAM | vLLM or llama.cpp | up to 32B | 99 (full) | 4+ |
| AMD GPU | llama.cpp + ROCm | conservative one tier below NVIDIA-equivalent | tuned | 1 |
| CPU only | llama.cpp + CPU | 0.5B – 2B (Q4_K_M) | 0 | 1 |
These are starting points. Don't trust them blindly. For any model + hardware combination you're uncertain about:
- Check the model's HuggingFace card for "recommended quantization / hardware" notes.
- Check the runtime's GitHub for known issues with this model family.
- Look up llama.cpp issues for "VRAM OOM " — community usually finds the NGL sweet spot.
- If still uncertain, run a dry probe:
python3 harness/run_benchmark.py --probe --target-url ... --cell-id-prefix ... --models <model>
and observe RSS / VRAM / tokens-per-sec.
3c. Document every deviation in manifest.json.canonical_options_overrides
The runner records overrides automatically when you pass --temperature / --num-ctx / --num-predict. For runtime-level deviations (NGL, NUM_PARALLEL, KEEP_ALIVE), add them to hardware.json.thermal_or_power_notes or to run.md § Methodology Deviations. Untracked deviations are the worst kind: they silently make a run incomparable. Honest-and-deviated > silent-and-clean.
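To see at a glance which request-level knobs you changed, the options you actually used can be diffed against the canonical values from §3a. This is an illustrative sketch only: options_overrides is a hypothetical helper, and the shape it returns is an assumption, not the format the runner writes into manifest.json.

```python
# Canonical values copied from §3a of this runbook.
CANONICAL_OPTIONS = {
    "temperature": 0.1,
    "num_ctx": 4096,
    "num_predict": 2048,
}


def options_overrides(actual, canonical=CANONICAL_OPTIONS):
    """Return only the options whose values differ from the canonical ones."""
    return {
        key: {"canonical": canonical.get(key), "actual": value}
        for key, value in actual.items()
        if canonical.get(key) != value
    }
```

For example, a run with num_ctx lowered to 2048 yields a single entry for num_ctx, which is the kind of line worth repeating under Methodology Deviations in run.md.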
4. Pick a runtime and a model
Sloba's instruction: use any model. The harness doesn't ship a fixed model list — the friend's agent picks based on hardware. Suggestions, in increasing size:
| Model | Size | When |
|---|---|---|
| qwen2.5-coder:0.5b | ~400 MB | minimum-viable code benchmarks; runs anywhere |
| qwen3.5:0.8b | ~600 MB | Sloba's reference smallest; matches his catalogue runs |
| qwen2.5-coder:1.5b | ~1.1 GB | code-focused mid-tier |
| qwen3.5:2b | ~1.5 GB | conversational mid-tier |
| qwen3.5:4b | ~3 GB | flagship mid-tier; common comparison point |
| qwen3.5:8b-q4km | ~5 GB | mid-tier flagship |
| qwen3.5:9b-q4km | ~5.4 GB | Sloba's Predator flagship; 6 GB VRAM borderline (run with reduced NGL) |
| qwen3.5:14b-q4km | ~9 GB | needs ≥10 GB VRAM or Apple Silicon ≥16 GB unified |
| gemma-4:e4b-it-q4km | ~3 GB | non-Qwen comparison |
| granite-4.1:8b-q4km | ~5 GB | non-Qwen comparison |
Models are pulled from:
- Ollama Hub: ollama pull qwen3.5:0.8b, etc.
- HuggingFace + llama.cpp: download GGUF directly via wget / hf-download, then point llama-server at it.
Run more than one model in the same run if you can — comparability. The harness loops models inside one run; cell_ids encode the (node, engine, model) tuple.
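Because model names themselves contain colons (qwen3.5:0.8b), splitting a cell_id naively on every colon gives the wrong pieces. A minimal sketch of a safe split follows, assuming the --cell-id-prefix is always node:engine with no extra colons, as in the examples in this runbook; split_cell_id is a hypothetical helper.

```python
def split_cell_id(cell_id):
    """Split 'node:engine:model' into its three parts.

    Only the first two colons are treated as separators, so a model
    tag such as 'qwen3.5:0.8b' stays in one piece.
    """
    node, engine, model = cell_id.split(":", 2)
    return node, engine, model
```

This raises ValueError if the cell_id has fewer than three parts, which is a useful early signal that the prefix was mistyped.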
5. Run the benchmark
5a. Smoke first (30 seconds)
python3 harness/run_benchmark.py --smoke \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac-m1:ollama \
--submitter-handle <friend-gitea-handle> \
--device-tag <short-device-tag>
If the smoke call comes back with HTTP 200, you have a working runtime. Run the real thing.
5b. Full run
python3 harness/run_benchmark.py \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b,qwen3.5:4b \
--cell-id-prefix mac-m1:ollama \
--phases hello,5q,20q \
--submitter-handle alice \
--device-tag mac-m1-8gb
For the canonical full sweep across all six suites:
python3 harness/run_benchmark.py --phases all \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac-m1:ollama \
--submitter-handle alice --device-tag mac-m1-8gb
Expect minutes per cell. The 20Q + edge suites are the long ones (~10–40 minutes per model on a small box). If the friend is time-bounded, drop edge_* and parallel_* — but record what you skipped.
5c. Resume on interrupt
If interrupted, the JSONL ledger is preserved (every line is fsync'd). To resume the same run-id:
python3 harness/run_benchmark.py --run-id <previous-uuid> ...
This appends to a new ledger; you'll need to merge them by hand (or just submit them as two separate runs sharing the same device-tag).
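If you choose to merge by hand, one cautious approach is to concatenate the ledgers in order while checking that every line is still valid JSON. The sketch below assumes nothing about the ledger fields; merge_ledgers is a hypothetical helper, and it only drops blank lines and byte-for-byte repeated lines. It does not try to detect a call that was re-run after the resume.

```python
import json


def merge_ledgers(paths, out_path):
    """Concatenate JSONL ledgers in the given order into out_path.

    Blank lines and exactly repeated lines are skipped. A line that is
    not valid JSON raises ValueError so a truncated ledger is noticed.
    Returns the number of lines written.
    """
    seen = set()
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line or line in seen:
                        continue
                    json.loads(line)  # validation only; the text is kept as-is
                    seen.add(line)
                    out.write(line + "\n")
                    kept += 1
    return kept
```

Whichever route you take, say in run.md that the run was interrupted and how the ledgers were combined.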
6. Generate metadata.json and run.md
6a. metadata.json — computed aggregates per cell
Schema (one row per (cell_id, phase) pair):
{
"schema_version": "metadata-1.0",
"run_id": "<uuid>",
"submitter_handle": "alice",
"device_tag": "mac-m1-8gb",
"cells": [
{
"cell_id": "mac-m1:ollama:qwen3.5:0.8b",
"phase": "20q",
"n_calls": 20,
"n_errors": 0,
"duration_ms_p50": 9600,
"duration_ms_p95": 24000,
"duration_ms_mean": 11200,
"tokens_per_sec_p50": 16.4,
"tokens_per_sec_p95": 22.1,
"tokens_per_sec_mean": 17.0,
"tokens_per_sec_max": 24.8,
"completion_tokens_total": 18234,
"format_ok_rate": 0.85,
"marker_hit_rate_mean": 0.72
}
]
}
You can compute this in-line (small script) or use a quick Python REPL pass over run.jsonl. The catalogue builder on Sloba's side will recompute it anyway, but having it in the PR makes review fast.
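A small script for that pass could look like the sketch below. It rests on assumptions you must check against your own run.jsonl: the per-call field names (cell_id, phase, duration_ms, tokens_per_sec, completion_tokens, error, format_ok) are guesses, the percentile uses the nearest-rank method, and rates are computed over successful calls only. methodology.md is the authority on the real definitions. It also covers only part of the schema above.

```python
import json
import math
from collections import defaultdict
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(records):
    """Summarise ledger records per (cell_id, phase) pair."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["cell_id"], rec["phase"])].append(rec)
    cells = []
    for (cell_id, phase), recs in sorted(groups.items()):
        ok = [r for r in recs if not r.get("error")]  # assumed error flag
        row = {"cell_id": cell_id, "phase": phase,
               "n_calls": len(recs), "n_errors": len(recs) - len(ok)}
        if ok:
            durations = [r["duration_ms"] for r in ok]
            rates = [r["tokens_per_sec"] for r in ok]
            row.update({
                "duration_ms_p50": percentile(durations, 50),
                "duration_ms_p95": percentile(durations, 95),
                "duration_ms_mean": mean(durations),
                "tokens_per_sec_p50": percentile(rates, 50),
                "tokens_per_sec_mean": mean(rates),
                "tokens_per_sec_max": max(rates),
                "completion_tokens_total": sum(r["completion_tokens"] for r in ok),
                "format_ok_rate": sum(bool(r.get("format_ok")) for r in ok) / len(ok),
            })
        cells.append(row)
    return cells


def load_ledger(path):
    """Read a JSONL ledger into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Calling aggregate(load_ledger("run.jsonl")) gives the list to place under "cells"; the remaining top-level keys come from the manifest.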
6b. run.md — human-readable summary
Template (fill in every section honestly):
# <device-tag> — <model-set> — <YYYY-MM-DD>
**Run ID:** `<uuid>`
**Submitter:** <handle>
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.x (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b, qwen3.5:4b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, 4 GB free at run start was insufficient for parallel suites.
## Headline numbers
| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b 20q | 20 | 17.0 | 16.4 | 9.6 s | 85% |
| mac-m1:ollama:qwen3.5:4b 20q | 20 | 5.8 | 5.5 | 28.2 s | 70% |
## Methodology
Followed the canonical Pavilion methodology with these deviations:
- **NUM_PARALLEL=1** instead of canonical 3 — 8 GB unified RAM doesn't fit two warm copies of qwen3.5:4b.
- **KEEP_ALIVE=5m** instead of 2400h — laptop, not server.
- **edge_* and parallel_* phases skipped** — friend's time budget.
## Caveats
- Run started at 18% battery; one call (20q-q14, model qwen3.5:4b) coincided with macOS Spotlight indexing; flagged in run.jsonl with run_idx=14 — that data point is high-variance.
- Network was on hotel wifi; model pull took ~6 minutes for qwen3.5:4b. Did not affect benchmark timing (model warm before any timed call).
## Reproducibility
python3 harness/run_benchmark.py \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b,qwen3.5:4b \
  --cell-id-prefix mac-m1:ollama \
  --phases hello,5q,20q \
  --submitter-handle alice \
  --device-tag mac-m1-8gb
7. Submit the PR
- Fork https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public to the friend's Gitea account (Gitea web UI → "Fork").
- Add the friend's fork as a remote on the local clone:
git remote add fork ssh://gitea@git.weeyuga.com/<friend-handle>/weeyuga-benchmarks-public.git
- Create a topic branch off main:
git checkout -b submission/<handle>-<device-tag>-<short-date>
- Stage only the new files under submissions/<handle>/<device-tag>/run-<uuid>/. NEVER modify anything outside that directory in this PR.
git add submissions/<handle>/<device-tag>/run-<uuid>/
git status # confirm: only files under your run-<uuid>/ are staged
- Commit with a descriptive message:
submit: alice / mac-m1-8gb / 2026-05-12 — qwen3.5 0.8b+4b, hello+5q+20q

First contribution from a friend's hardware. M1 8 GB unified, Ollama 0.5.x.
Skipped edge_* + parallel_* due to RAM. Headline: qwen3.5:0.8b ~17 tok/s,
qwen3.5:4b ~5.8 tok/s on 20Q.
- Push to fork:
git push fork submission/<handle>-<device-tag>-<short-date>
- Open a PR on Gitea with target = slobodanmargetic988/weeyuga-benchmarks-public:main. PR description should include:
  - One-paragraph what-and-why
  - Link to the friend's run.md
  - Any methodology deviations the reviewer should know
  - Privacy attestation: "I have reviewed run.jsonl and confirmed no PII / SSH keys / API tokens / personal home paths leaked"
Sloba reviews and merges. Nothing auto-merges. A typical review surfaces 0–2 follow-ups; address and force-push to the same branch.
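The "stage only your run directory" rule can be checked mechanically before committing. The sketch below is an optional aid, not part of the harness: staged_paths and outside_run_dir are hypothetical helper names, and staged_paths assumes git is on PATH and that you call it from inside the clone.

```python
import subprocess


def staged_paths():
    """Return the paths currently staged in the git index."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def outside_run_dir(paths, run_dir):
    """Return the paths that do not live under run_dir."""
    prefix = run_dir.rstrip("/") + "/"
    return [p for p in paths if not p.startswith(prefix)]
```

If outside_run_dir(staged_paths(), "submissions/<handle>/<device-tag>/run-<uuid>") returns anything, unstage those files before committing.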
8. Privacy guardrails — DO NOT submit any of these
- API keys (OpenAI, Anthropic, HuggingFace tokens, etc.)
- SSH private keys, .ssh/ paths
- Personal home directory paths (/Users/alice/secrets/...)
- Real names if the friend prefers a handle
- Internal corporate IPs, hostnames, or SSO endpoints
- Bearer tokens in error messages (some runtimes echo headers in 4xx errors)
Before pushing, scan the run.jsonl for these patterns:
grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" submissions/<handle>/<device-tag>/run-<uuid>/*.{jsonl,md,json}
If anything matches, redact it from response_preview (the JSONL stores only the first 240 chars of each response, so leaks are rare — but please scan).
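On systems without grep (for example a stock Windows shell), the same scan can be done in Python. This is a sketch that reuses the pattern list from the grep command above; scan_text and scan_run_dir are hypothetical helper names. As with the grep, expect some false positives: the sk- pattern also matches ordinary words such as "task-".

```python
import re
from pathlib import Path

# Same alternation as the grep command in this section.
PATTERN = re.compile(
    r"Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519"
)


def scan_text(text):
    """Return (line_number, line) pairs for lines matching any pattern."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if PATTERN.search(line)
    ]


def scan_run_dir(run_dir):
    """Scan .jsonl, .md and .json files directly inside run_dir.

    Returns a dict mapping file name to its list of matching lines;
    files without matches are left out.
    """
    hits = {}
    for path in sorted(Path(run_dir).iterdir()):
        if path.suffix in {".jsonl", ".md", ".json"}:
            found = scan_text(path.read_text(encoding="utf-8", errors="replace"))
            if found:
                hits[path.name] = found
    return hits
```

An empty dict from scan_run_dir means none of the listed patterns appeared; it is not proof that the files are free of personal data, so a quick read-through is still worthwhile.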
9. What if you get stuck
- /v1/models returns empty: the runtime isn't OpenAI-compat or no models are loaded. For Ollama: ollama list. For llama.cpp: it doesn't list models on /v1/models historically; pass --models <name> --target-url http://127.0.0.1:11436 and it'll work anyway.
- Every call returns 500 / timeout: runtime is up but model isn't loading. Check VRAM with nvidia-smi or memory pressure with vm_stat / free. Smaller model. Or smaller num_ctx.
- Tokens/sec absurdly low (<1 tok/s on hardware that should manage): thermal throttling, swap thrashing, or wrong-quantization. Check free -h mid-run; if swap is being used, model is too big for RAM.
- One question keeps getting format_ok=false: model can't follow that instruction shape. NORMAL. Don't shorten the prompt or reword. Document in run.md and move on.
- Ambiguous hardware setup (eGPU? VM? Container?): ASK the friend. Container/VM resource caps make benchmarks misleading.
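For the first item in the list above, a quick stdlib check of what the runtime reports on /v1/models can be sketched as follows. It assumes the endpoint returns the usual OpenAI-style body (an object with a "data" list of entries carrying an "id"); model_ids and fetch_model_ids are hypothetical helpers, not harness functions.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def fetch_model_ids(target_url, timeout=5):
    """GET <target_url>/v1/models and return the ids it lists."""
    url = target_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Calling fetch_model_ids("http://127.0.0.1:11434") against a running runtime shows which names are safe to pass to --models; an empty list points to the first troubleshooting item, and a connection error means the runtime is not listening on that URL at all.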
10. The methodology lineage
This harness mirrors WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py v3 — Sloba's canonical Pavilion methodology established 2026-04-11. The 21 reference runs in runs/ (in this repo) are the comparable baseline. Sloba's locked catalogue lives at catalogue.json (this repo). When your run is merged, it'll be added to the catalogue under your device-tag and become a new comparison point.
The methodology and harness will evolve. Current canonical version: HARNESS_VERSION = "public-1". Future versions will be additive — older ledgers stay valid forever.
11. Coordinate-while-running checklist
Before you start:
- Read this whole file
- Read methodology.md for the metric definitions (TTFT, p50/p95, format_ok, etc.)
- Verify the friend has ≥3 GB free disk for model files
- Verify network is OK for model pull (the GGUFs are 0.5–10 GB)
While running:
- Smoke first
- Full run
- Watch for thermal throttling on laptops / phones / mini-PCs
- Don't open Chrome / Slack / Zoom mid-run if you can avoid it (VRAM pressure)
After running:
- Generate metadata.json aggregates
- Write run.md honestly, including caveats
- Privacy-scan run.jsonl
- Fork → branch → push → PR
Questions / blockers
If you hit something this runbook doesn't cover, the friend can email Sloba (slobodan@weeyuga.com) or open an issue on this repo. Don't burn an hour in a corner — ask. The whole point of crowdsourcing is the variance you'll see; that's data, not a problem.
Welcome aboard. 🦇
— The Weeyuga team
Maintainer note: if you edit this file, edit AGENTS.md to match (Codex loads AGENTS.md, Claude Code loads CLAUDE.md; identical content prevents two-tier rules).
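One way to keep that promise honest is a byte-level comparison before committing. This is a minimal sketch using only the standard library; runbooks_match is a hypothetical helper and is not wired into any hook in this repo.

```python
import filecmp


def runbooks_match(a="CLAUDE.md", b="AGENTS.md"):
    """Return True when the two files have identical bytes."""
    # shallow=False forces a content comparison instead of trusting
    # file size and modification time alone.
    return filecmp.cmp(a, b, shallow=False)
```

Run from the repo root, runbooks_match() returning False means one of the two files was edited without the other.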