# CLAUDE.md: runbook for the friend's coding agent

> **You are the coding agent (Claude Code, Codex, Aider, Cursor, …) of
> someone who agreed to run the Weeyuga benchmark suite on their hardware.
> This file is your briefing. Read it completely before doing anything.**
>
> If your runtime auto-loads `AGENTS.md` instead of `CLAUDE.md`, that file
> is byte-identical: same content, different filename for tools that
> prefer one over the other.

---

## TL;DR (60 seconds)

You are going to:

1. **Probe** the friend's hardware (OS, CPU, RAM, GPU + VRAM).
2. **Pick** an OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX) and a model that fits.
3. **Adapt** the bundled runner to the friend's reality (this is the part you can't skip; see §3).
4. **Run** `harness/run_benchmark.py` against the runtime; it writes a JSONL ledger of every call.
5. **Document** the run in `submissions///run-/`.
6. **Submit** a PR back to this repo. Sloba reviews and merges manually; nothing auto-merges.

You have **read-only access** to this repo. To submit, fork on Gitea, push to your fork, open a PR.

The harness is a **template**, not a one-click button. The friend's hardware will not match Sloba's; you are expected to adapt parameters, research best-known values online when you hit unfamiliar constraints, and **write down what you changed and why** in the manifest. Documented deviation is fine. Silent deviation breaks comparability.

---

## 0. Read this completely before doing anything

The rest of this file is structured as the order you'll work in. Reading the whole thing first gives you the shape; then the friend can ask "go" and you execute without circling back.

If you hit something genuinely ambiguous, ASK THE FRIEND. Don't guess at hardware-specific values; either verify with measurement or research them from the project / model authors' recommended-settings docs.

---

## 1. What you are running, exactly

**Inputs (frozen across runs):**

- `harness/prompts.py`: three frozen prompts (P-EASY, P-MEDIUM, P-HARD). Never modified.
- `harness/suites/*.json`: six benchmark suites, all run sequentially per model:

| phase key | suite file | what it tests |
|---|---|---|
| `5q` | `small_model_eval_questions.json` | 5 short-answer formatting + correctness questions |
| `20q` | `python_task_suite_questions.json` | 20 realistic Python task prompts |
| `parallel_same` | `parallel_qwen_same_model_20q_suite.json` | parallel-lane stress with one model |
| `parallel_mixed` | `parallel_qwen_mixed_model_20q_suite.json` | parallel-lane stress with multiple models |
| `edge_append` | `python_context_edge_append_questions.json` | long-context append behavior |
| `edge_suite` | `python_context_edge_suite_only.json` | long-context whole-suite reasoning |

**Driver:** `harness/run_benchmark.py`: one process, sequential calls to your local OpenAI-compatible `/v1/chat/completions` endpoint, one JSONL line per call.

**Output:** `submissions///run-/` containing:

- `run.jsonl`: every call recorded
- `manifest.json`: written automatically by the runner
- `hardware.json`: **you fill this** from the hardware probe (§2)
- `metadata.json`: computed aggregates (you generate, see §6)
- `run.md`: human-readable summary (you write, see §6)

**Run order:** ALL six suites run in sequence per model, per the canonical Pavilion methodology Sloba uses. Don't pick-and-choose unless the friend is explicitly time-constrained. Partial runs are still useful, but they're documented as "partial" in the manifest, and they show up as `archive-only` in Sloba's catalogue rather than full-grade runs.

---

## 2. Hardware probe: do this first, write `hardware.json` from the result

Before anything else, gather the friend's hardware truth. Pick the platform-appropriate commands; don't run all of them, just the ones that work on the friend's OS.
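The OS-independent parts of the probe can also be collected from Python's standard library before running the platform commands. The following is a minimal sketch, assuming the `hardware.json` field names given in the schema later in this section; the function name `probe_hardware_skeleton` is hypothetical and not part of the harness. Anything the standard library cannot report reliably (RAM, GPU, physical cores) is left as `None` to be filled in from the command output.

```python
import json
import os
import platform
import shutil


def probe_hardware_skeleton(device_tag: str) -> dict:
    """Return a partial hardware.json dict using only the standard library.

    Hypothetical helper: fields set to None must be filled in by hand
    from the platform-specific commands.
    """
    free_bytes = shutil.disk_usage(".").free
    return {
        "schema_version": "hardware-1.0",
        "device_tag": device_tag,
        "manufacturer_model": None,
        "os": {
            "name": platform.system(),     # "Darwin" on macOS; rename to match the schema
            "version": None,               # fill from sw_vers / os-release
            "kernel": platform.release(),
        },
        "cpu": {
            "name": platform.processor() or None,
            "cores": None,                 # physical cores: fill from lscpu / system_profiler
            "threads": os.cpu_count(),     # logical CPUs visible to the OS
            "max_ghz": None,
            "arch": platform.machine(),
            "isa": None,
        },
        "memory_gb_total": None,
        "memory_gb_available_at_run_start": None,
        "gpu": [],
        "storage": {
            "kind": None,
            "free_gb_at_run_start": round(free_bytes / 1e9, 1),
        },
        "thermal_or_power_notes": None,
        "network_used_for_model_fetch": None,
        "container_or_vm": None,
    }


if __name__ == "__main__":
    print(json.dumps(probe_hardware_skeleton("mac-m1-8gb"), indent=2))
```

Starting from a skeleton with every key present makes it harder to forget a field; the remaining `None` values show exactly what still needs a measured answer.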
**macOS:**

```bash
system_profiler SPHardwareDataType SPDisplaysDataType
sysctl -n machdep.cpu.brand_string
sysctl -n hw.memsize
sw_vers
uname -a
```

**Linux:**

```bash
lscpu
cat /proc/meminfo | head -3
nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv  # if NVIDIA
lspci | grep -iE "vga|3d|display"
uname -a
cat /etc/os-release
```

**Windows (PowerShell):**

```powershell
Get-CimInstance Win32_ComputerSystem | Select Manufacturer, Model, TotalPhysicalMemory
Get-CimInstance Win32_Processor | Select Name, NumberOfCores, MaxClockSpeed
Get-CimInstance Win32_VideoController | Select Name, AdapterRAM, DriverVersion
$PSVersionTable.OS
```

Write the canonical findings to `hardware.json`. Schema (every field present; `null` if not applicable):

```json
{
  "schema_version": "hardware-1.0",
  "device_tag": "mac-m1-8gb",
  "manufacturer_model": "Apple MacBook Air (MacBookAir10,1)",
  "os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
  "cpu": {"name": "Apple M1", "cores": 8, "threads": 8, "max_ghz": 3.2, "arch": "arm64", "isa": ["NEON"]},
  "memory_gb_total": 8,
  "memory_gb_available_at_run_start": 4.2,
  "gpu": [
    {"name": "Apple M1 GPU", "kind": "integrated", "vram_gb": null, "driver": "Metal/macOS-14", "compute_cap": null}
  ],
  "storage": {"kind": "ssd", "free_gb_at_run_start": 220},
  "thermal_or_power_notes": "default OS thermal mgmt; on AC power",
  "network_used_for_model_fetch": "wifi-100mbps",
  "container_or_vm": null
}
```

Honest mode flags to mention in `thermal_or_power_notes`:

- "battery-only, low-power-mode active" → results may be artificially slow
- "thermal throttling observed mid-run" → tag any affected calls in `run.md` caveats
- "GPU shared with display compositor" → expect 5-15% throughput hit vs headless

---

## 3. Adapt to hardware reality: this is the part you cannot skip

The harness uses Sloba's canonical knobs as defaults. They are **not** guaranteed to be optimal for the friend's hardware. Your job:

### 3a. Canonical knobs (Sloba's reference values)

```python
CANONICAL_OPTIONS = {
    "temperature": 0.1,   # near-deterministic; comparable across runs
    "num_ctx": 4096,      # context window
    "num_predict": 2048,  # max generated tokens per call
}
```

Plus runtime-level knobs (Ollama-specific names, but they apply equivalently to llama.cpp / vLLM):

- `KEEP_ALIVE`: how long the loaded model stays warm. Sloba uses **2400h** on cluster nodes (100 days, effectively pinned). On a friend's laptop, **5m** is gentler if RAM is tight.
- `NUM_PARALLEL`: concurrent slots. Sloba uses **3** on Pavilion. **1** is fine on tight RAM.
- `MAX_LOADED_MODELS`: how many models are held in VRAM. Sloba uses **3** on a 12 GB GPU; default to **1** on anything ≤ 8 GB.
- For llama.cpp: `--n-gpu-layers` (NGL), the number of model layers offloaded to GPU. **Critical** on borderline VRAM. NGL=99 is full offload; NGL=0 is CPU-only. Sloba's Predator (GTX 1060 6 GB) runs Qwen3.5:9B at NGL=6 because higher offloads OOM with the KV cache.

### 3b. Decision rules

| Friend's hardware | Likely runtime | Likely model size | Likely NGL | Likely NUM_PARALLEL |
|---|---|---|---|---|
| Apple Silicon (M1/M2/M3, ≥8 GB unified) | Ollama OR llama.cpp w/ Metal OR MLX | 0.5B – 4B | n/a (Metal handles offload) | 1–2 |
| Apple Silicon (M-Pro/M-Max, ≥16 GB) | same, MLX preferred for 8B+ | 4B – 14B | n/a | 2–3 |
| NVIDIA GPU 6 GB VRAM | llama.cpp + CUDA | 0.5B – 4B (or 8B at NGL ~10–20) | tuned per model | 1 |
| NVIDIA GPU 8–12 GB VRAM | llama.cpp + CUDA, or vLLM | 4B – 14B | high (60–99) | 1–2 |
| NVIDIA GPU 24+ GB VRAM | vLLM or llama.cpp | up to 32B | 99 (full) | 4+ |
| AMD GPU | llama.cpp + ROCm | conservative, one tier below NVIDIA-equivalent | tuned | 1 |
| CPU only | llama.cpp + CPU | 0.5B – 2B (Q4_K_M) | 0 | 1 |

These are starting points. **Don't trust them blindly.** For any model + hardware combination you're uncertain about:

1. Check the model's HuggingFace card for "recommended quantization / hardware" notes.
2. Check the runtime's GitHub for known issues with this model family.
3. Look up llama.cpp issues for "VRAM OOM "; the community usually finds the NGL sweet spot.
4. If still uncertain, run a dry probe: `python3 harness/run_benchmark.py --probe --target-url ... --cell-id-prefix ... --models ` and observe RSS / VRAM / tokens-per-sec.

### 3c. Document every deviation in `manifest.json.canonical_options_overrides`

The runner records overrides automatically when you pass `--temperature` / `--num-ctx` / `--num-predict`. For runtime-level deviations (NGL, NUM_PARALLEL, KEEP_ALIVE), add them to `hardware.json.thermal_or_power_notes` or to `run.md` § Methodology Deviations.

**Untracked deviations are the worst kind: they silently make a run uncomparable.** Honest-and-deviated > silent-and-clean.

---

## 4. Pick a runtime and a model

Sloba's instruction: **use any model**. The harness doesn't ship a fixed model list; you, the friend's agent, pick based on hardware. Suggestions, in increasing size:

| Model | Size | When |
|---|---|---|
| `qwen2.5-coder:0.5b` | ~400 MB | minimum-viable code benchmarks; runs anywhere |
| `qwen3.5:0.8b` | ~600 MB | Sloba's reference smallest; matches his catalogue runs |
| `qwen2.5-coder:1.5b` | ~1.1 GB | code-focused mid-tier |
| `qwen3.5:2b` | ~1.5 GB | conversational mid-tier |
| `qwen3.5:4b` | ~3 GB | flagship mid-tier; common comparison point |
| `qwen3.5:8b-q4km` | ~5 GB | mid-tier flagship |
| `qwen3.5:9b-q4km` | ~5.4 GB | Sloba's Predator flagship; 6 GB VRAM borderline (run with reduced NGL) |
| `qwen3.5:14b-q4km` | ~9 GB | needs ≥10 GB VRAM or Apple Silicon ≥16 GB unified |
| `gemma-4:e4b-it-q4km` | ~3 GB | non-Qwen comparison |
| `granite-4.1:8b-q4km` | ~5 GB | non-Qwen comparison |

Models are pulled from:

- **Ollama Hub:** `ollama pull qwen3.5:0.8b`, etc.
- **HuggingFace + llama.cpp:** download GGUF directly via `wget`/`hf-download`, then point `llama-server` at it.

Run more than one model in the same run if you can, for comparability.
The harness loops models inside one run; cell_ids encode the (node, engine, model) tuple.

---

## 5. Run the benchmark

### 5a. Smoke first (30 seconds)

```bash
python3 harness/run_benchmark.py --smoke \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b \
  --cell-id-prefix mac-m1:ollama \
  --submitter-handle \
  --device-tag
```

If the smoke call comes back with HTTP 200, you have a working runtime. Run the real thing.

### 5b. Full run

```bash
python3 harness/run_benchmark.py \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b,qwen3.5:4b \
  --cell-id-prefix mac-m1:ollama \
  --phases hello,5q,20q \
  --submitter-handle alice \
  --device-tag mac-m1-8gb
```

For the canonical full sweep across all six suites:

```bash
python3 harness/run_benchmark.py --phases all \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b \
  --cell-id-prefix mac-m1:ollama \
  --submitter-handle alice --device-tag mac-m1-8gb
```

Expect minutes per cell. The 20Q + edge suites are the long ones (~10–40 minutes per model on a small box). If the friend is time-bounded, drop edge_* and parallel_*, but record what you skipped.

### 5c. Resume on interrupt

If interrupted, the JSONL ledger is preserved (every line is fsync'd). To resume the same `run-id`:

```bash
python3 harness/run_benchmark.py --run-id ...
```

The resumed run writes to a new ledger, so you'll need to merge the two by hand (or just submit them as two separate runs sharing the same `device-tag`).

---

## 6. Generate `metadata.json` and `run.md`

### 6a. `metadata.json`: computed aggregates per cell

Schema (one row per (cell_id, phase) pair):

```json
{
  "schema_version": "metadata-1.0",
  "run_id": "",
  "submitter_handle": "alice",
  "device_tag": "mac-m1-8gb",
  "cells": [
    {
      "cell_id": "mac-m1:ollama:qwen3.5:0.8b",
      "phase": "20q",
      "n_calls": 20,
      "n_errors": 0,
      "duration_ms_p50": 9600,
      "duration_ms_p95": 24000,
      "duration_ms_mean": 11200,
      "tokens_per_sec_p50": 16.4,
      "tokens_per_sec_p95": 22.1,
      "tokens_per_sec_mean": 17.0,
      "tokens_per_sec_max": 24.8,
      "completion_tokens_total": 18234,
      "format_ok_rate": 0.85,
      "marker_hit_rate_mean": 0.72
    }
  ]
}
```

You can compute this with a small script or a quick Python REPL pass over `run.jsonl`. The catalogue builder on Sloba's side will recompute it anyway, but having it in the PR makes review fast.

### 6b. `run.md`: human-readable summary

Template (fill in every section honestly):

````markdown
# — — 

**Run ID:** ``
**Submitter:**
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.x (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b, qwen3.5:4b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite. RAM constraint: 4 GB free at run start was insufficient for parallel suites.

## Headline numbers

| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b 20q | 20 | 17.0 | 16.4 | 9.6 s | 85% |
| mac-m1:ollama:qwen3.5:4b 20q | 20 | 5.8 | 5.5 | 28.2 s | 70% |

## Methodology

Followed the canonical Pavilion methodology with these deviations:

- **NUM_PARALLEL=1** instead of canonical 3: 8 GB unified RAM doesn't fit two warm copies of qwen3.5:4b.
- **KEEP_ALIVE=5m** instead of 2400h: laptop, not server.
- **edge_* and parallel_* phases skipped**: RAM constraint (see Phases skipped above).

## Caveats

- Run started at 18% battery; one call (20q-q14, model qwen3.5:4b) coincided with macOS Spotlight indexing; flagged in run.jsonl with run_idx=14. That data point is high-variance.
- Network was on hotel wifi; model pull took ~6 minutes for qwen3.5:4b. Did not affect benchmark timing (model warm before any timed call).

## Reproducibility

```
python3 harness/run_benchmark.py \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b,qwen3.5:4b \
  --cell-id-prefix mac-m1:ollama \
  --phases hello,5q,20q \
  --submitter-handle alice \
  --device-tag mac-m1-8gb
```
````

---

## 7. Submit the PR

1. **Fork** `https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public` to the friend's Gitea account (Gitea web UI → "Fork").

2. **Add the friend's fork as a remote on the local clone:**

   ```bash
   git remote add fork ssh://gitea@git.weeyuga.com//weeyuga-benchmarks-public.git
   ```

3. **Create a topic branch** off `main`:

   ```bash
   git checkout -b submission/--
   ```

4. **Stage only the new files under `submissions///run-/`.** NEVER modify anything outside that directory in this PR.

   ```bash
   git add submissions///run-/
   git status  # confirm: only files under your run-/ are staged
   ```

5. **Commit** with a descriptive message:

   ```
   submit: alice / mac-m1-8gb / 2026-05-12 - qwen3.5 0.8b+4b, hello+5q+20q

   First contribution from a friend's hardware. M1 8 GB unified, Ollama 0.5.x.
   Skipped edge_* + parallel_* due to RAM.
   Headline: qwen3.5:0.8b ~17 tok/s, qwen3.5:4b ~5.8 tok/s on 20Q.
   ```

6. **Push to fork:**

   ```bash
   git push fork submission/--
   ```

7. **Open a PR on Gitea** with target = `slobodanmargetic988/weeyuga-benchmarks-public:main`. PR description should include:
   - One-paragraph what-and-why
   - Link to the friend's `run.md`
   - Any methodology deviations the reviewer should know
   - Privacy attestation: "I have reviewed run.jsonl and confirmed no PII / SSH keys / API tokens / personal home paths leaked"

Sloba reviews and merges.
**Nothing auto-merges.** A typical review surfaces 0–2 follow-ups; address them and force-push to the same branch.

---

## 8. Privacy guardrails: DO NOT submit any of these

- API keys (OpenAI, Anthropic, HuggingFace tokens, etc.)
- SSH private keys, `.ssh/` paths
- Personal home directory paths (`/Users/alice/secrets/...`)
- Real names if the friend prefers a handle
- Internal corporate IPs, hostnames, or SSO endpoints
- Bearer tokens in error messages (some runtimes echo headers in 4xx errors)

Before pushing, **scan the run.jsonl** for these patterns:

```bash
grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" submissions///run-/*.{jsonl,md,json}
```

If anything matches, redact it from `response_preview`. The JSONL stores only the first 240 chars of each response, so leaks are rare, but please scan.

---

## 9. What if you get stuck

- **`/v1/models` returns empty:** the runtime isn't OpenAI-compatible or no models are loaded. For Ollama: `ollama list`. For llama.cpp: historically it doesn't list models on `/v1/models`; pass `--models --target-url http://127.0.0.1:11436` and it'll work anyway.
- **Every call returns 500 / timeout:** the runtime is up but the model isn't loading. Check VRAM with `nvidia-smi` or memory pressure with `vm_stat` / `free`. Try a smaller model or a smaller `num_ctx`.
- **Tokens/sec absurdly low (<1 tok/s on hardware that should manage):** thermal throttling, swap thrashing, or the wrong quantization. Check `free -h` mid-run; if swap is being used, the model is too big for RAM.
- **One question keeps getting `format_ok=false`:** the model can't follow that instruction shape. NORMAL. Don't shorten the prompt or reword it. Document it in run.md and move on.
- **Ambiguous hardware setup (eGPU? VM? Container?):** ASK the friend. Container/VM resource caps make benchmarks misleading.

---

## 10. The methodology lineage

This harness mirrors `WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py` v3, Sloba's canonical Pavilion methodology established 2026-04-11.
The 21 reference runs in `runs/` (in this repo) are the comparable baseline. Sloba's locked catalogue lives at `catalogue.json` (this repo). When your run is merged, it'll be added to the catalogue under your `device-tag` and become a new comparison point.

The methodology and harness will evolve. Current canonical version: `HARNESS_VERSION = "public-1"`. Future versions will be additive; older ledgers stay valid forever.

---

## 11. Coordinate-while-running checklist

Before you start:

- [ ] Read this whole file
- [ ] Read `methodology.md` for the metric definitions (TTFT, p50/p95, format_ok, etc.)
- [ ] Verify the friend has ≥3 GB free disk for model files
- [ ] Verify network is OK for model pull (the GGUFs are 0.5–10 GB)

While running:

- [ ] Smoke first
- [ ] Full run
- [ ] Watch for thermal throttling on laptops / phones / mini-PCs
- [ ] Don't open Chrome / Slack / Zoom mid-run if you can avoid it (VRAM pressure)

After running:

- [ ] Generate `metadata.json` aggregates
- [ ] Write `run.md` honestly, including caveats
- [ ] Privacy-scan `run.jsonl`
- [ ] Fork → branch → push → PR

---

## Questions / blockers

If you hit something this runbook doesn't cover, the friend can email Sloba (slobodan@weeyuga.com) or open an issue on this repo. Don't burn an hour in a corner; ask. The whole point of crowdsourcing is the variance you'll see; that's data, not a problem.

Welcome aboard. 🦇

The Weeyuga team

---

> **Maintainer note:** if you edit this file, edit `AGENTS.md` to match
> (Codex loads `AGENTS.md`, Claude Code loads `CLAUDE.md`; identical
> content prevents two-tier rules).
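For the `metadata.json` aggregates described in §6a, a small script along these lines may be enough. It is only a sketch under stated assumptions: the per-call field names `cell_id`, `phase`, `duration_ms`, `tokens_per_sec`, and `error` are guesses at what `run.jsonl` contains, and the nearest-rank percentile is one common definition that may differ from the one in `methodology.md`. Check both against a real ledger before relying on the numbers.

```python
import json
import math
from collections import defaultdict
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(ledger_lines):
    """Group JSONL records by (cell_id, phase) and compute summary stats.

    Assumed (hypothetical) record fields: cell_id, phase, duration_ms,
    tokens_per_sec, and an optional truthy "error" on failed calls.
    """
    groups = defaultdict(list)
    for line in ledger_lines:
        if line.strip():
            rec = json.loads(line)
            groups[(rec["cell_id"], rec["phase"])].append(rec)

    cells = []
    for (cell_id, phase), recs in sorted(groups.items()):
        ok = [r for r in recs if not r.get("error")]
        cell = {
            "cell_id": cell_id,
            "phase": phase,
            "n_calls": len(recs),
            "n_errors": len(recs) - len(ok),
        }
        if ok:
            durations = [r["duration_ms"] for r in ok]
            rates = [r["tokens_per_sec"] for r in ok]
            cell.update({
                "duration_ms_p50": percentile(durations, 50),
                "duration_ms_p95": percentile(durations, 95),
                "duration_ms_mean": mean(durations),
                "tokens_per_sec_p50": percentile(rates, 50),
                "tokens_per_sec_mean": mean(rates),
                "tokens_per_sec_max": max(rates),
            })
        cells.append(cell)
    return cells
```

To use it, pass the open ledger file (for example `aggregate(open("run.jsonl"))`) and place the returned list under the `cells` key of `metadata.json`.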