Files
Slobodan Margetic 97a9245d9e feat: harness + agent runbook — flip repo from archive-only to
crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
  ├── README.md        Technical readme for the harness folder
  ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
  │                    WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
  │                    v3 with the cluster-internal IP defaults
  │                    (10.8.0.x) replaced by 127.0.0.1:11434, the
  │                    cluster /v1/cluster/* endpoints removed, the
  │                    canonical-suite paths under ~/Documents/MyServers
  │                    replaced by harness/suites/ paths, the git-sha
  │                    enforcement on WeeyugaWeb dropped, and the
  │                    output written under submissions/<handle>/<tag>/
  │                    instead of docs/BENCHMARKS/runs/. Supports all
  │                    six suite phases via --phases, plus 'all'.
  ├── prompts.py       Verbatim copy of the canonical 3 frozen prompts
  │                    (P-EASY/P-MEDIUM/P-HARD) from
  │                    WeeyugaWeb/scripts/benchmarks/prompts.py.
  ├── requirements.txt Empty by intent (stdlib-only); placeholder for
  │                    pip-tools / agent auto-install patterns.
  ├── .gitignore       __pycache__/ etc.
  └── suites/          Six bundled JSON suites copied verbatim from
       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
       small_model_eval_questions.json, python_task_suite_questions.json,
       parallel_qwen_same_model_20q_suite.json,
       parallel_qwen_mixed_model_20q_suite.json,
       python_context_edge_append_questions.json,
       python_context_edge_suite_only.json.

submissions/
  README.md            Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
       Synthetic-but-shape-complete contribution template:
       manifest.json, hardware.json, run.jsonl (5 example lines),
       metadata.json, run.md (with privacy attestation, methodology
       deviations, reproducibility command). Marked as synthetic at
       the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
  - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
    paths / tokens. Two prompts contain project names ("MyBoard" auth
    debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
    flagged in chat for Sloba's review. Otherwise clean.
  - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
    IPs leaked).
  - manifest.json captures host_hostname_short via socket.gethostname()
    .split('.')[0] — agent should review before PR if hostname is
    sensitive.
  - CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
  - py_compile run_benchmark.py: OK
  - --help renders cleanly
  - All 6 suite JSON files: valid
  - All 4 example JSON files: valid
  - Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
2026-05-06 19:05:22 +02:00

19 KiB
Raw Permalink Blame History

CLAUDE.md — runbook for the friend's coding agent

You are the coding agent (Claude Code, Codex, Aider, Cursor, …) of someone who agreed to run the Weeyuga benchmark suite on their hardware. This file is your briefing. Read it completely before doing anything.

If your runtime auto-loads AGENTS.md instead of CLAUDE.md, that file is byte-identical — same content, different filename for tools that prefer one over the other.


TL;DR (60 seconds)

You are going to:

  1. Probe the friend's hardware (OS, CPU, RAM, GPU + VRAM).
  2. Pick an OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX) and a model that fits.
  3. Adapt the bundled runner to the friend's reality (this is the part you can't skip — see §3).
  4. Run harness/run_benchmark.py against the runtime; it writes a JSONL ledger of every call.
  5. Document the run in submissions/<friend-handle>/<device-tag>/run-<uuid>/.
  6. Submit a PR back to this repo. Sloba reviews and merges manually — nothing auto-merges.

You have read-only access to this repo. To submit, fork on Gitea, push to your fork, open a PR.

The harness is a template, not a one-click button. The friend's hardware will not match Sloba's; you are expected to adapt parameters, research best-known values online when you hit unfamiliar constraints, and write down what you changed and why in the manifest. Documented deviation is fine. Silent deviation breaks comparability.


0. Read this completely before doing anything

The rest of this file is structured as the order you'll work in. Reading the whole thing first gives you the shape; then the friend can ask "go" and you execute without circling back.

If you hit something genuinely ambiguous, ASK THE FRIEND. Don't guess at hardware-specific values; either verify with measurement or research them from the project / model authors' recommended-settings docs.


1. What you are running, exactly

Inputs (frozen across runs):

  • harness/prompts.py — three frozen prompts (P-EASY, P-MEDIUM, P-HARD). Never modified.
  • harness/suites/*.json — six benchmark suites, all run sequentially per model:
    phase key suite file what it tests
    5q small_model_eval_questions.json 5 short-answer formatting + correctness questions
    20q python_task_suite_questions.json 20 realistic Python task prompts
    parallel_same parallel_qwen_same_model_20q_suite.json parallel-lane stress with one model
    parallel_mixed parallel_qwen_mixed_model_20q_suite.json parallel-lane stress with multiple models
    edge_append python_context_edge_append_questions.json long-context append behavior
    edge_suite python_context_edge_suite_only.json long-context whole-suite reasoning

Driver: harness/run_benchmark.py — one process, sequential calls to your local OpenAI-compatible /v1/chat/completions endpoint, one JSONL line per call.

Output: submissions/<handle>/<device-tag>/run-<uuid>/ containing:

  • run.jsonl — every call recorded
  • manifest.json — written automatically by the runner
  • hardware.jsonyou fill this from the hardware probe (§2)
  • metadata.json — computed aggregates (you generate, see §6)
  • run.md — human-readable summary (you write, see §6)

Run order: ALL six suites run in sequence per model, per the canonical Pavilion methodology Sloba uses. Don't pick-and-choose unless the friend is explicitly time-constrained — partial runs are still useful but they're documented as "partial" in the manifest, and they show up as archive-only in Sloba's catalogue rather than full-grade runs.


2. Hardware probe — do this first, write hardware.json from the result

Before anything else, gather the friend's hardware truth. Pick the platform-appropriate commands; don't run all of them, just the ones that work on the friend's OS.

macOS:

system_profiler SPHardwareDataType SPDisplaysDataType
sysctl -n machdep.cpu.brand_string
sysctl -n hw.memsize
sw_vers
uname -a

Linux:

lscpu
cat /proc/meminfo | head -3
nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv  # if NVIDIA
lspci | grep -iE "vga|3d|display"
uname -a
cat /etc/os-release

Windows (PowerShell):

Get-CimInstance Win32_ComputerSystem | Select Manufacturer, Model, TotalPhysicalMemory
Get-CimInstance Win32_Processor | Select Name, NumberOfCores, MaxClockSpeed
Get-CimInstance Win32_VideoController | Select Name, AdapterRAM, DriverVersion
$PSVersionTable.OS

Write the canonical findings to hardware.json. Schema (every field present; null if not applicable):

{
  "schema_version": "hardware-1.0",
  "device_tag": "mac-m1-8gb",
  "manufacturer_model": "Apple MacBook Air (Mac14,2)",
  "os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
  "cpu": {"name": "Apple M1", "cores": 8, "threads": 8, "max_ghz": 3.2,
          "arch": "arm64", "isa": ["NEON"]},
  "memory_gb_total": 8,
  "memory_gb_available_at_run_start": 4.2,
  "gpu": [
    {"name": "Apple M1 GPU", "kind": "integrated", "vram_gb": null,
     "driver": "Metal/macOS-14", "compute_cap": null}
  ],
  "storage": {"kind": "ssd", "free_gb_at_run_start": 220},
  "thermal_or_power_notes": "default OS thermal mgmt; on AC power",
  "network_used_for_model_fetch": "wifi-100mbps",
  "container_or_vm": null
}

Honest mode flags to mention in thermal_or_power_notes:

  • "battery-only, low-power-mode active" → results may be artificially slow
  • "thermal throttling observed mid-run" → tag any affected calls in run.md caveats
  • "GPU shared with display compositor" → expect 5-15% throughput hit vs headless

3. Adapt to hardware reality — this is the part you cannot skip

The harness uses Sloba's canonical knobs as defaults. They are not guaranteed to be optimal for the friend's hardware. Your job:

3a. Canonical knobs (Sloba's reference values)

CANONICAL_OPTIONS = {
    "temperature": 0.1,    # near-deterministic; comparable across runs
    "num_ctx": 4096,       # context window
    "num_predict": 2048,   # max generated tokens per call
}

Plus runtime-level (Ollama-specific but apply equivalently to llama.cpp / vLLM):

  • KEEP_ALIVE — how long the loaded model stays warm. Sloba uses 2400h on cluster nodes (~100 days = effectively pinned). On a friend's laptop, 5m is gentler if RAM is tight.
  • NUM_PARALLEL — concurrent slots. Sloba uses 3 on Pavilion. 1 is fine on tight RAM.
  • MAX_LOADED_MODELS — how many models held in VRAM. Sloba uses 3 on a 12 GB GPU; default to 1 on anything ≤ 8 GB.
  • For llama.cpp: --n-gpu-layers (NGL) — number of model layers offloaded to GPU. Critical on borderline VRAM. NGL=99 is full offload; NGL=0 is CPU-only. Sloba's Predator (GTX 1060 6 GB) runs Qwen3.5:9B at NGL=6 because higher offloads OOM with the KV cache.

3b. Decision rules

Friend's hardware Likely runtime Likely model size Likely NGL Likely NUM_PARALLEL
Apple Silicon (M1/M2/M3, ≥8 GB unified) Ollama OR llama.cpp w/ Metal OR MLX 0.5B 4B n/a (Metal handles offload) 12
Apple Silicon (M-Pro/M-Max, ≥16 GB) same, MLX preferred for 8B+ 4B 14B n/a 23
NVIDIA GPU 6 GB VRAM llama.cpp + CUDA 0.5B 4B (or 8B at NGL ~1020) tuned per model 1
NVIDIA GPU 812 GB VRAM llama.cpp + CUDA, or vLLM 4B 14B high (6099) 12
NVIDIA GPU 24+ GB VRAM vLLM or llama.cpp up to 32B 99 (full) 4+
AMD GPU llama.cpp + ROCm conservative one tier below NVIDIA-equivalent tuned 1
CPU only llama.cpp + CPU 0.5B 2B (Q4_K_M) 0 1

These are starting points. Don't trust them blindly. For any model + hardware combination you're uncertain about:

  1. Check the model's HuggingFace card for "recommended quantization / hardware" notes.
  2. Check the runtime's GitHub for known issues with this model family.
  3. Look up llama.cpp issues for "VRAM OOM " — community usually finds the NGL sweet spot.
  4. If still uncertain, run a dry probe: python3 harness/run_benchmark.py --probe --target-url ... --cell-id-prefix ... --models <model> and observe RSS / VRAM / tokens-per-sec.

3c. Document every deviation in manifest.json.canonical_options_overrides

The runner records overrides automatically when you pass --temperature / --num-ctx / --num-predict. For runtime-level deviations (NGL, NUM_PARALLEL, KEEP_ALIVE), add them to hardware.json.thermal_or_power_notes or to run.md § Methodology Deviations. Untracked deviations are the worst kind — silently make a run uncomparable. Honest-and-deviated > silent-and-clean.


4. Pick a runtime and a model

Sloba's instruction: use any model. The harness doesn't ship a fixed model list — the friend's agent picks based on hardware. Suggestions, in increasing size:

Model Size When
qwen2.5-coder:0.5b ~400 MB minimum-viable code benchmarks; runs anywhere
qwen3.5:0.8b ~600 MB Sloba's reference smallest; matches his catalogue runs
qwen2.5-coder:1.5b ~1.1 GB code-focused mid-tier
qwen3.5:2b ~1.5 GB conversational mid-tier
qwen3.5:4b ~3 GB flagship mid-tier; common comparison point
qwen3.5:8b-q4km ~5 GB mid-tier flagship
qwen3.5:9b-q4km ~5.4 GB Sloba's Predator flagship; 6 GB VRAM borderline (run with reduced NGL)
qwen3.5:14b-q4km ~9 GB needs ≥10 GB VRAM or Apple Silicon ≥16 GB unified
gemma-4:e4b-it-q4km ~3 GB non-Qwen comparison
granite-4.1:8b-q4km ~5 GB non-Qwen comparison

Models are pulled from:

  • Ollama Hub: ollama pull qwen3.5:0.8b, etc.
  • HuggingFace + llama.cpp: download GGUF directly via wget/hf-download, then point llama-server at it.

Run more than one model in the same run if you can — comparability. The harness loops models inside one run; cell_ids encode the (node, engine, model) tuple.


5. Run the benchmark

5a. Smoke first (30 seconds)

python3 harness/run_benchmark.py --smoke \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac-m1:ollama \
    --submitter-handle <friend-gitea-handle> \
    --device-tag <short-device-tag>

If smoke 200s back, you have a working runtime. Run the real thing.

5b. Full run

python3 harness/run_benchmark.py \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b,qwen3.5:4b \
    --cell-id-prefix mac-m1:ollama \
    --phases hello,5q,20q \
    --submitter-handle alice \
    --device-tag mac-m1-8gb

For the canonical full sweep across all six suites:

python3 harness/run_benchmark.py --phases all \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac-m1:ollama \
    --submitter-handle alice --device-tag mac-m1-8gb

Expect minutes per cell. The 20Q + edge suites are the long ones (~1040 minutes per model on a small box). If the friend is time-bounded, drop edge_* and parallel_* — but record what you skipped.

5c. Resume on interrupt

If interrupted, the JSONL ledger is preserved (every line is fsync'd). To resume the same run-id:

python3 harness/run_benchmark.py --run-id <previous-uuid> ...

This appends to a new ledger; you'll need to merge them by hand (or just submit them as two separate runs sharing the same device-tag).


6. Generate metadata.json and run.md

6a. metadata.json — computed aggregates per cell

Schema (one row per (cell_id, phase) pair):

{
  "schema_version": "metadata-1.0",
  "run_id": "<uuid>",
  "submitter_handle": "alice",
  "device_tag": "mac-m1-8gb",
  "cells": [
    {
      "cell_id": "mac-m1:ollama:qwen3.5:0.8b",
      "phase": "20q",
      "n_calls": 20,
      "n_errors": 0,
      "duration_ms_p50": 9600,
      "duration_ms_p95": 24000,
      "duration_ms_mean": 11200,
      "tokens_per_sec_p50": 16.4,
      "tokens_per_sec_p95": 22.1,
      "tokens_per_sec_mean": 17.0,
      "tokens_per_sec_max": 24.8,
      "completion_tokens_total": 18234,
      "format_ok_rate": 0.85,
      "marker_hit_rate_mean": 0.72
    }
  ]
}

You can compute this in-line (small script) or use a quick Python REPL pass over run.jsonl. The catalogue builder on Sloba's side will recompute it anyway, but having it in the PR makes review fast.

6b. run.md — human-readable summary

Template (fill in every section honestly):

# <device-tag> — <model-set> — <YYYY-MM-DD>

**Run ID:** `<uuid>`
**Submitter:** <handle>
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.x (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b, qwen3.5:4b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, 4 GB free at run start was insufficient for parallel suites.

## Headline numbers

| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b 20q | 20 | 17.0 | 16.4 | 9.6 s | 85% |
| mac-m1:ollama:qwen3.5:4b 20q | 20 | 5.8 | 5.5 | 28.2 s | 70% |

## Methodology

Followed the canonical Pavilion methodology with these deviations:

- **NUM_PARALLEL=1** instead of canonical 3 — 8 GB unified RAM doesn't fit two warm copies of qwen3.5:4b.
- **KEEP_ALIVE=5m** instead of 2400h — laptop, not server.
- **edge_* and parallel_* phases skipped** — friend's time budget.

## Caveats

- Run started at 18% battery; one call (20q-q14, model qwen3.5:4b) coincided with macOS Spotlight indexing; flagged in run.jsonl with run_idx=14 — that data point is high-variance.
- Network was on hotel wifi; model pull took ~6 minutes for qwen3.5:4b. Did not affect benchmark timing (model warm before any timed call).

## Reproducibility

python3 harness/run_benchmark.py
--target-url http://127.0.0.1:11434
--models qwen3.5:0.8b,qwen3.5:4b
--cell-id-prefix mac-m1:ollama
--phases hello,5q,20q
--submitter-handle alice
--device-tag mac-m1-8gb


7. Submit the PR

  1. Fork https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public to the friend's Gitea account (Gitea web UI → "Fork").
  2. Add the friend's fork as a remote on the local clone:
    git remote add fork ssh://gitea@git.weeyuga.com/<friend-handle>/weeyuga-benchmarks-public.git
    
  3. Create a topic branch off main:
    git checkout -b submission/<handle>-<device-tag>-<short-date>
    
  4. Stage only the new files under submissions/<handle>/<device-tag>/run-<uuid>/. NEVER modify anything outside that directory in this PR.
    git add submissions/<handle>/<device-tag>/run-<uuid>/
    git status   # confirm: only files under your run-<uuid>/ are staged
    
  5. Commit with a descriptive message:
    submit: alice / mac-m1-8gb / 2026-05-12 — qwen3.5 0.8b+4b, hello+5q+20q
    
    First contribution from a friend's hardware. M1 8 GB unified, Ollama 0.5.x.
    Skipped edge_* + parallel_* due to RAM. Headline: qwen3.5:0.8b ~17 tok/s,
    qwen3.5:4b ~5.8 tok/s on 20Q.
    
  6. Push to fork:
    git push fork submission/<handle>-<device-tag>-<short-date>
    
  7. Open a PR on Gitea with target = slobodanmargetic988/weeyuga-benchmarks-public:main. PR description should include:
    • One-paragraph what-and-why
    • Link to the friend's run.md
    • Any methodology deviations the reviewer should know
    • Privacy attestation: "I have reviewed run.jsonl and confirmed no PII / SSH keys / API tokens / personal home paths leaked"

Sloba reviews and merges. Nothing auto-merges. A typical review surfaces 02 follow-ups; address and force-push to the same branch.


8. Privacy guardrails — DO NOT submit any of these

  • API keys (OpenAI, Anthropic, HuggingFace tokens, etc.)
  • SSH private keys, .ssh/ paths
  • Personal home directory paths (/Users/alice/secrets/...)
  • Real names if the friend prefers a handle
  • Internal corporate IPs, hostnames, or SSO endpoints
  • Bearer tokens in error messages (some runtimes echo headers in 4xx errors)

Before pushing, scan the run.jsonl for these patterns:

grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" submissions/<handle>/<device-tag>/run-<uuid>/*.{jsonl,md,json}

If anything matches, redact it from response_preview (the JSONL stores only the first 240 chars of each response, so leaks are rare — but please scan).


9. What if you get stuck

  • /v1/models returns empty: the runtime isn't OpenAI-compat or no models are loaded. For Ollama: ollama list. For llama.cpp: it doesn't list models on /v1/models historically; pass --models <name> --target-url http://127.0.0.1:11436 and it'll work anyway.
  • Every call returns 500 / timeout: runtime is up but model isn't loading. Check VRAM with nvidia-smi or memory pressure with vm_stat / free. Smaller model. Or smaller num_ctx.
  • Tokens/sec absurdly low (<1 tok/s on hardware that should manage): thermal throttling, swap thrashing, or wrong-quantization. Check free -h mid-run; if swap is being used, model is too big for RAM.
  • One question keeps getting format_ok=false: model can't follow that instruction shape. NORMAL. Don't shorten the prompt or reword. Document in run.md and move on.
  • Ambiguous hardware setup (eGPU? VM? Container?): ASK the friend. Container/VM resource caps make benchmarks misleading.

10. The methodology lineage

This harness mirrors WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py v3 — Sloba's canonical Pavilion methodology established 2026-04-11. The 21 reference runs in runs/ (in this repo) are the comparable baseline. Sloba's locked catalogue lives at catalogue.json (this repo). When your run is merged, it'll be added to the catalogue under your device-tag and become a new comparison point.

The methodology and harness will evolve. Current canonical version: HARNESS_VERSION = "public-1". Future versions will be additive — older ledgers stay valid forever.


11. Coordinate-while-running checklist

Before you start:

  • Read this whole file
  • Read methodology.md for the metric definitions (TTFT, p50/p95, format_ok, etc.)
  • Verify the friend has ≥3 GB free disk for model files
  • Verify network is OK for model pull (the GGUFs are 0.510 GB)

While running:

  • Smoke first
  • Full run
  • Watch for thermal throttling on laptops / phones / mini-PCs
  • Don't open Chrome / Slack / Zoom mid-run if you can avoid it (VRAM pressure)

After running:

  • Generate metadata.json aggregates
  • Write run.md honestly — including caveats
  • Privacy-scan run.jsonl
  • Fork → branch → push → PR

Questions / blockers

If you hit something this runbook doesn't cover, the friend can email Sloba (slobodan@weeyuga.com) or open an issue on this repo. Don't burn an hour in a corner — ask. The whole point of crowdsourcing is the variance you'll see; that's data, not a problem.

Welcome aboard. 🦇

— The Weeyuga team


Maintainer note: if you edit this file, edit AGENTS.md to match (Codex loads AGENTS.md, Claude Code loads CLAUDE.md; identical content prevents two-tier rules).