slobodanmargetic988/weeyuga-benchmarks-public

Files

Slobodan Margetic 97a9245d9e feat: harness + agent runbook — flip repo from archive-only to

crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
  ├── README.md        Technical readme for the harness folder
  ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
  │                    WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
  │                    v3 with the cluster-internal IP defaults
  │                    (10.8.0.x) replaced by 127.0.0.1:11434, the
  │                    cluster /v1/cluster/* endpoints removed, the
  │                    canonical-suite paths under ~/Documents/MyServers
  │                    replaced by harness/suites/ paths, the git-sha
  │                    enforcement on WeeyugaWeb dropped, and the
  │                    output written under submissions/<handle>/<tag>/
  │                    instead of docs/BENCHMARKS/runs/. Supports all
  │                    six suite phases via --phases, plus 'all'.
  ├── prompts.py       Verbatim copy of the canonical 3 frozen prompts
  │                    (P-EASY/P-MEDIUM/P-HARD) from
  │                    WeeyugaWeb/scripts/benchmarks/prompts.py.
  ├── requirements.txt Empty by intent (stdlib-only); placeholder for
  │                    pip-tools / agent auto-install patterns.
  ├── .gitignore       __pycache__/ etc.
  └── suites/          Six bundled JSON suites copied verbatim from
       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
       small_model_eval_questions.json, python_task_suite_questions.json,
       parallel_qwen_same_model_20q_suite.json,
       parallel_qwen_mixed_model_20q_suite.json,
       python_context_edge_append_questions.json,
       python_context_edge_suite_only.json.

submissions/
  README.md            Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
       Synthetic-but-shape-complete contribution template:
       manifest.json, hardware.json, run.jsonl (5 example lines),
       metadata.json, run.md (with privacy attestation, methodology
       deviations, reproducibility command). Marked as synthetic at
       the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
  - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
    paths / tokens. Two prompts contain project names ("MyBoard" auth
    debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
    flagged in chat for Sloba's review. Otherwise clean.
  - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
    IPs leaked).
  - manifest.json captures host_hostname_short via socket.gethostname()
    .split('.')[0] — agent should review before PR if hostname is
    sensitive.
  - CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
  - py_compile run_benchmark.py: OK
  - --help renders cleanly
  - All 6 suite JSON files: valid
  - All 4 example JSON files: valid
  - Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.

2026-05-06 19:05:22 +02:00

19 KiB

Raw Permalink Blame History

CLAUDE.md — runbook for the friend's coding agent

You are the coding agent (Claude Code, Codex, Aider, Cursor, …) of someone who agreed to run the Weeyuga benchmark suite on their hardware. This file is your briefing. Read it completely before doing anything.

If your runtime auto-loads AGENTS.md instead of CLAUDE.md, that file is byte-identical — same content, different filename for tools that prefer one over the other.

TL;DR (60 seconds)

You are going to:

Probe the friend's hardware (OS, CPU, RAM, GPU + VRAM).
Pick an OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX) and a model that fits.
Adapt the bundled runner to the friend's reality (this is the part you can't skip — see §3).
Run harness/run_benchmark.py against the runtime; it writes a JSONL ledger of every call.
Document the run in submissions/<friend-handle>/<device-tag>/run-<uuid>/.
Submit a PR back to this repo. Sloba reviews and merges manually — nothing auto-merges.

You have read-only access to this repo. To submit, fork on Gitea, push to your fork, open a PR.

The harness is a template, not a one-click button. The friend's hardware will not match Sloba's; you are expected to adapt parameters, research best-known values online when you hit unfamiliar constraints, and write down what you changed and why in the manifest. Documented deviation is fine. Silent deviation breaks comparability.

0. Read this completely before doing anything

The rest of this file is structured as the order you'll work in. Reading the whole thing first gives you the shape; then the friend can ask "go" and you execute without circling back.

If you hit something genuinely ambiguous, ASK THE FRIEND. Don't guess at hardware-specific values; either verify with measurement or research them from the project / model authors' recommended-settings docs.

1. What you are running, exactly

Inputs (frozen across runs):

harness/prompts.py — three frozen prompts (P-EASY, P-MEDIUM, P-HARD). Never modified.

harness/suites/*.json — six benchmark suites, all run sequentially per model:

phase key	suite file	what it tests
`5q`	`small_model_eval_questions.json`	5 short-answer formatting + correctness questions
`20q`	`python_task_suite_questions.json`	20 realistic Python task prompts
`parallel_same`	`parallel_qwen_same_model_20q_suite.json`	parallel-lane stress with one model
`parallel_mixed`	`parallel_qwen_mixed_model_20q_suite.json`	parallel-lane stress with multiple models
`edge_append`	`python_context_edge_append_questions.json`	long-context append behavior
`edge_suite`	`python_context_edge_suite_only.json`	long-context whole-suite reasoning

Driver: harness/run_benchmark.py — one process, sequential calls to your local OpenAI-compatible /v1/chat/completions endpoint, one JSONL line per call.

Output: submissions/<handle>/<device-tag>/run-<uuid>/ containing:

run.jsonl — every call recorded
manifest.json — written automatically by the runner
hardware.json — you fill this from the hardware probe (§2)
metadata.json — computed aggregates (you generate, see §6)
run.md — human-readable summary (you write, see §6)

Run order: ALL six suites run in sequence per model, per the canonical Pavilion methodology Sloba uses. Don't pick-and-choose unless the friend is explicitly time-constrained — partial runs are still useful but they're documented as "partial" in the manifest, and they show up as archive-only in Sloba's catalogue rather than full-grade runs.

2. Hardware probe — do this first, write `hardware.json` from the result

Before anything else, gather the friend's hardware truth. Pick the platform-appropriate commands; don't run all of them, just the ones that work on the friend's OS.

macOS:

system_profiler SPHardwareDataType SPDisplaysDataType
sysctl -n machdep.cpu.brand_string
sysctl -n hw.memsize
sw_vers
uname -a

Linux:

lscpu
cat /proc/meminfo | head -3
nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv  # if NVIDIA
lspci | grep -iE "vga|3d|display"
uname -a
cat /etc/os-release

Windows (PowerShell):

Get-CimInstance Win32_ComputerSystem | Select Manufacturer, Model, TotalPhysicalMemory
Get-CimInstance Win32_Processor | Select Name, NumberOfCores, MaxClockSpeed
Get-CimInstance Win32_VideoController | Select Name, AdapterRAM, DriverVersion
$PSVersionTable.OS

Write the canonical findings to hardware.json. Schema (every field present; null if not applicable):

{
  "schema_version": "hardware-1.0",
  "device_tag": "mac-m1-8gb",
  "manufacturer_model": "Apple MacBook Air (Mac14,2)",
  "os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
  "cpu": {"name": "Apple M1", "cores": 8, "threads": 8, "max_ghz": 3.2,
          "arch": "arm64", "isa": ["NEON"]},
  "memory_gb_total": 8,
  "memory_gb_available_at_run_start": 4.2,
  "gpu": [
    {"name": "Apple M1 GPU", "kind": "integrated", "vram_gb": null,
     "driver": "Metal/macOS-14", "compute_cap": null}
  ],
  "storage": {"kind": "ssd", "free_gb_at_run_start": 220},
  "thermal_or_power_notes": "default OS thermal mgmt; on AC power",
  "network_used_for_model_fetch": "wifi-100mbps",
  "container_or_vm": null
}

Honest mode flags to mention in thermal_or_power_notes:

"battery-only, low-power-mode active" → results may be artificially slow
"thermal throttling observed mid-run" → tag any affected calls in run.md caveats
"GPU shared with display compositor" → expect 5-15% throughput hit vs headless

3. Adapt to hardware reality — this is the part you cannot skip

The harness uses Sloba's canonical knobs as defaults. They are not guaranteed to be optimal for the friend's hardware. Your job:

3a. Canonical knobs (Sloba's reference values)

CANONICAL_OPTIONS = {
    "temperature": 0.1,    # near-deterministic; comparable across runs
    "num_ctx": 4096,       # context window
    "num_predict": 2048,   # max generated tokens per call
}

Plus runtime-level (Ollama-specific but apply equivalently to llama.cpp / vLLM):

KEEP_ALIVE — how long the loaded model stays warm. Sloba uses 2400h on cluster nodes (~100 days = effectively pinned). On a friend's laptop, 5m is gentler if RAM is tight.
NUM_PARALLEL — concurrent slots. Sloba uses 3 on Pavilion. 1 is fine on tight RAM.
MAX_LOADED_MODELS — how many models held in VRAM. Sloba uses 3 on a 12 GB GPU; default to 1 on anything ≤ 8 GB.
For llama.cpp: --n-gpu-layers (NGL) — number of model layers offloaded to GPU. Critical on borderline VRAM. NGL=99 is full offload; NGL=0 is CPU-only. Sloba's Predator (GTX 1060 6 GB) runs Qwen3.5:9B at NGL=6 because higher offloads OOM with the KV cache.

3b. Decision rules

Friend's hardware	Likely runtime	Likely model size	Likely NGL	Likely NUM_PARALLEL
Apple Silicon (M1/M2/M3, ≥8 GB unified)	Ollama OR llama.cpp w/ Metal OR MLX	0.5B – 4B	n/a (Metal handles offload)	1–2
Apple Silicon (M-Pro/M-Max, ≥16 GB)	same, MLX preferred for 8B+	4B – 14B	n/a	2–3
NVIDIA GPU 6 GB VRAM	llama.cpp + CUDA	0.5B – 4B (or 8B at NGL ~10–20)	tuned per model	1
NVIDIA GPU 8–12 GB VRAM	llama.cpp + CUDA, or vLLM	4B – 14B	high (60–99)	1–2
NVIDIA GPU 24+ GB VRAM	vLLM or llama.cpp	up to 32B	99 (full)	4+
AMD GPU	llama.cpp + ROCm	conservative one tier below NVIDIA-equivalent	tuned	1
CPU only	llama.cpp + CPU	0.5B – 2B (Q4_K_M)	0	1

These are starting points. Don't trust them blindly. For any model + hardware combination you're uncertain about:

Check the model's HuggingFace card for "recommended quantization / hardware" notes.
Check the runtime's GitHub for known issues with this model family.
Look up llama.cpp issues for "VRAM OOM " — community usually finds the NGL sweet spot.
If still uncertain, run a dry probe: python3 harness/run_benchmark.py --probe --target-url ... --cell-id-prefix ... --models <model> and observe RSS / VRAM / tokens-per-sec.

3c. Document every deviation in `manifest.json.canonical_options_overrides`

The runner records overrides automatically when you pass --temperature / --num-ctx / --num-predict. For runtime-level deviations (NGL, NUM_PARALLEL, KEEP_ALIVE), add them to hardware.json.thermal_or_power_notes or to run.md § Methodology Deviations. Untracked deviations are the worst kind — silently make a run uncomparable. Honest-and-deviated > silent-and-clean.

4. Pick a runtime and a model

Sloba's instruction: use any model. The harness doesn't ship a fixed model list — the friend's agent picks based on hardware. Suggestions, in increasing size:

Model	Size	When
`qwen2.5-coder:0.5b`	~400 MB	minimum-viable code benchmarks; runs anywhere
`qwen3.5:0.8b`	~600 MB	Sloba's reference smallest; matches his catalogue runs
`qwen2.5-coder:1.5b`	~1.1 GB	code-focused mid-tier
`qwen3.5:2b`	~1.5 GB	conversational mid-tier
`qwen3.5:4b`	~3 GB	flagship mid-tier; common comparison point
`qwen3.5:8b-q4km`	~5 GB	mid-tier flagship
`qwen3.5:9b-q4km`	~5.4 GB	Sloba's Predator flagship; 6 GB VRAM borderline (run with reduced NGL)
`qwen3.5:14b-q4km`	~9 GB	needs ≥10 GB VRAM or Apple Silicon ≥16 GB unified
`gemma-4:e4b-it-q4km`	~3 GB	non-Qwen comparison
`granite-4.1:8b-q4km`	~5 GB	non-Qwen comparison

Models are pulled from:

Ollama Hub: ollama pull qwen3.5:0.8b, etc.
HuggingFace + llama.cpp: download GGUF directly via wget/hf-download, then point llama-server at it.

Run more than one model in the same run if you can — comparability. The harness loops models inside one run; cell_ids encode the (node, engine, model) tuple.

5. Run the benchmark

5a. Smoke first (30 seconds)

python3 harness/run_benchmark.py --smoke \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac-m1:ollama \
    --submitter-handle <friend-gitea-handle> \
    --device-tag <short-device-tag>

If smoke 200s back, you have a working runtime. Run the real thing.

5b. Full run

python3 harness/run_benchmark.py \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b,qwen3.5:4b \
    --cell-id-prefix mac-m1:ollama \
    --phases hello,5q,20q \
    --submitter-handle alice \
    --device-tag mac-m1-8gb

For the canonical full sweep across all six suites:

python3 harness/run_benchmark.py --phases all \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac-m1:ollama \
    --submitter-handle alice --device-tag mac-m1-8gb

Expect minutes per cell. The 20Q + edge suites are the long ones (~10–40 minutes per model on a small box). If the friend is time-bounded, drop edge_* and parallel_* — but record what you skipped.

5c. Resume on interrupt

If interrupted, the JSONL ledger is preserved (every line is fsync'd). To resume the same run-id:

python3 harness/run_benchmark.py --run-id <previous-uuid> ...

This appends to a new ledger; you'll need to merge them by hand (or just submit them as two separate runs sharing the same device-tag).

6. Generate `metadata.json` and `run.md`

6a. `metadata.json` — computed aggregates per cell

Schema (one row per (cell_id, phase) pair):

{
  "schema_version": "metadata-1.0",
  "run_id": "<uuid>",
  "submitter_handle": "alice",
  "device_tag": "mac-m1-8gb",
  "cells": [
    {
      "cell_id": "mac-m1:ollama:qwen3.5:0.8b",
      "phase": "20q",
      "n_calls": 20,
      "n_errors": 0,
      "duration_ms_p50": 9600,
      "duration_ms_p95": 24000,
      "duration_ms_mean": 11200,
      "tokens_per_sec_p50": 16.4,
      "tokens_per_sec_p95": 22.1,
      "tokens_per_sec_mean": 17.0,
      "tokens_per_sec_max": 24.8,
      "completion_tokens_total": 18234,
      "format_ok_rate": 0.85,
      "marker_hit_rate_mean": 0.72
    }
  ]
}

You can compute this in-line (small script) or use a quick Python REPL pass over run.jsonl. The catalogue builder on Sloba's side will recompute it anyway, but having it in the PR makes review fast.

6b. `run.md` — human-readable summary

Template (fill in every section honestly):

# <device-tag> — <model-set> — <YYYY-MM-DD>

**Run ID:** `<uuid>`
**Submitter:** <handle>
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.x (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b, qwen3.5:4b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, 4 GB free at run start was insufficient for parallel suites.

## Headline numbers

| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b 20q | 20 | 17.0 | 16.4 | 9.6 s | 85% |
| mac-m1:ollama:qwen3.5:4b 20q | 20 | 5.8 | 5.5 | 28.2 s | 70% |

## Methodology

Followed the canonical Pavilion methodology with these deviations:

- **NUM_PARALLEL=1** instead of canonical 3 — 8 GB unified RAM doesn't fit two warm copies of qwen3.5:4b.
- **KEEP_ALIVE=5m** instead of 2400h — laptop, not server.
- **edge_* and parallel_* phases skipped** — friend's time budget.

## Caveats

- Run started at 18% battery; one call (20q-q14, model qwen3.5:4b) coincided with macOS Spotlight indexing; flagged in run.jsonl with run_idx=14 — that data point is high-variance.
- Network was on hotel wifi; model pull took ~6 minutes for qwen3.5:4b. Did not affect benchmark timing (model warm before any timed call).

## Reproducibility

python3 harness/run_benchmark.py
--target-url http://127.0.0.1:11434
--models qwen3.5:0.8b,qwen3.5:4b
--cell-id-prefix mac-m1:ollama
--phases hello,5q,20q
--submitter-handle alice
--device-tag mac-m1-8gb

7. Submit the PR

Fork https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public to the friend's Gitea account (Gitea web UI → "Fork").

Add the friend's fork as a remote on the local clone:

git remote add fork ssh://gitea@git.weeyuga.com/<friend-handle>/weeyuga-benchmarks-public.git

Create a topic branch off main:

git checkout -b submission/<handle>-<device-tag>-<short-date>

Stage only the new files under submissions/<handle>/<device-tag>/run-<uuid>/. NEVER modify anything outside that directory in this PR.
```
git add submissions/<handle>/<device-tag>/run-<uuid>/
git status   # confirm: only files under your run-<uuid>/ are staged
```

Commit with a descriptive message:

submit: alice / mac-m1-8gb / 2026-05-12 — qwen3.5 0.8b+4b, hello+5q+20q

First contribution from a friend's hardware. M1 8 GB unified, Ollama 0.5.x.
Skipped edge_* + parallel_* due to RAM. Headline: qwen3.5:0.8b ~17 tok/s,
qwen3.5:4b ~5.8 tok/s on 20Q.

Push to fork:

git push fork submission/<handle>-<device-tag>-<short-date>

Open a PR on Gitea with target = slobodanmargetic988/weeyuga-benchmarks-public:main. PR description should include:
- One-paragraph what-and-why
- Link to the friend's run.md
- Any methodology deviations the reviewer should know
- Privacy attestation: "I have reviewed run.jsonl and confirmed no PII / SSH keys / API tokens / personal home paths leaked"

Sloba reviews and merges. Nothing auto-merges. A typical review surfaces 0–2 follow-ups; address and force-push to the same branch.

8. Privacy guardrails — DO NOT submit any of these

API keys (OpenAI, Anthropic, HuggingFace tokens, etc.)
SSH private keys, .ssh/ paths
Personal home directory paths (/Users/alice/secrets/...)
Real names if the friend prefers a handle
Internal corporate IPs, hostnames, or SSO endpoints
Bearer tokens in error messages (some runtimes echo headers in 4xx errors)

Before pushing, scan the run.jsonl for these patterns:

grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" submissions/<handle>/<device-tag>/run-<uuid>/*.{jsonl,md,json}

If anything matches, redact it from response_preview (the JSONL stores only the first 240 chars of each response, so leaks are rare — but please scan).

9. What if you get stuck

/v1/models returns empty: the runtime isn't OpenAI-compat or no models are loaded. For Ollama: ollama list. For llama.cpp: it doesn't list models on /v1/models historically; pass --models <name> --target-url http://127.0.0.1:11436 and it'll work anyway.
Every call returns 500 / timeout: runtime is up but model isn't loading. Check VRAM with nvidia-smi or memory pressure with vm_stat / free. Smaller model. Or smaller num_ctx.
Tokens/sec absurdly low (<1 tok/s on hardware that should manage): thermal throttling, swap thrashing, or wrong-quantization. Check free -h mid-run; if swap is being used, model is too big for RAM.
One question keeps getting format_ok=false: model can't follow that instruction shape. NORMAL. Don't shorten the prompt or reword. Document in run.md and move on.
Ambiguous hardware setup (eGPU? VM? Container?): ASK the friend. Container/VM resource caps make benchmarks misleading.

10. The methodology lineage

This harness mirrors WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py v3 — Sloba's canonical Pavilion methodology established 2026-04-11. The 21 reference runs in runs/ (in this repo) are the comparable baseline. Sloba's locked catalogue lives at catalogue.json (this repo). When your run is merged, it'll be added to the catalogue under your device-tag and become a new comparison point.

The methodology and harness will evolve. Current canonical version: HARNESS_VERSION = "public-1". Future versions will be additive — older ledgers stay valid forever.

11. Coordinate-while-running checklist

Before you start:

Read this whole file
Read methodology.md for the metric definitions (TTFT, p50/p95, format_ok, etc.)
Verify the friend has ≥3 GB free disk for model files
Verify network is OK for model pull (the GGUFs are 0.5–10 GB)

While running:

Smoke first
Full run
Watch for thermal throttling on laptops / phones / mini-PCs
Don't open Chrome / Slack / Zoom mid-run if you can avoid it (VRAM pressure)

After running:

Generate metadata.json aggregates
Write run.md honestly — including caveats
Privacy-scan run.jsonl
Fork → branch → push → PR

Questions / blockers

If you hit something this runbook doesn't cover, the friend can email Sloba (slobodan@weeyuga.com) or open an issue on this repo. Don't burn an hour in a corner — ask. The whole point of crowdsourcing is the variance you'll see; that's data, not a problem.

Welcome aboard. 🦇

— The Weeyuga team

Maintainer note: if you edit this file, edit AGENTS.md to match (Codex loads AGENTS.md, Claude Code loads CLAUDE.md; identical content prevents two-tier rules).

19 KiB Raw Permalink Blame History Unescape Escape