weeyuga-benchmarks-public/CLAUDE.md

# CLAUDE.md — runbook for the friend's coding agent

> **You are the coding agent (Claude Code, Codex, Aider, Cursor, …) of
> someone who agreed to run the Weeyuga benchmark suite on their hardware.
> This file is your briefing. Read it completely before doing anything.**
>
> If your runtime auto-loads `AGENTS.md` instead of `CLAUDE.md`, that file
> is byte-identical — same content, different filename for tools that
> prefer one over the other.

---

## TL;DR (60 seconds)

You are going to:

1. **Probe** the friend's hardware (OS, CPU, RAM, GPU + VRAM).
2. **Pick** an OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX) and a model that fits.
3. **Adapt** the bundled runner to the friend's reality (this is the part you can't skip — see §3).
4. **Run** `harness/run_benchmark.py` against the runtime; it writes a JSONL ledger of every call.
5. **Document** the run in `submissions/<friend-handle>/<device-tag>/run-<uuid>/`.
6. **Submit** a PR back to this repo. Sloba reviews and merges manually — nothing auto-merges.

You have **read-only access** to this repo. To submit, fork on Gitea, push to your fork, open a PR.

The harness is a **template**, not a one-click button. The friend's hardware will not match Sloba's; you are expected to adapt parameters, research best-known values online when you hit unfamiliar constraints, and **write down what you changed and why** in the manifest. Documented deviation is fine. Silent deviation breaks comparability.

---

## 0. Read this completely before doing anything

The rest of this file is structured as the order you'll work in. Reading the whole thing first gives you the shape; then the friend can ask "go" and you execute without circling back.

If you hit something genuinely ambiguous, ASK THE FRIEND. Don't guess at hardware-specific values; either verify with measurement or research them from the project / model authors' recommended-settings docs.

---

## 1. What you are running, exactly

**Inputs (frozen across runs):**
- `harness/prompts.py` — three frozen prompts (P-EASY, P-MEDIUM, P-HARD). Never modified.
- `harness/suites/*.json` — six benchmark suites, all run sequentially per model:
  | phase key | suite file | what it tests |
  |---|---|---|
  | `5q` | `small_model_eval_questions.json` | 5 short-answer formatting + correctness questions |
  | `20q` | `python_task_suite_questions.json` | 20 realistic Python task prompts |
  | `parallel_same` | `parallel_qwen_same_model_20q_suite.json` | parallel-lane stress with one model |
  | `parallel_mixed` | `parallel_qwen_mixed_model_20q_suite.json` | parallel-lane stress with multiple models |
  | `edge_append` | `python_context_edge_append_questions.json` | long-context append behavior |
  | `edge_suite` | `python_context_edge_suite_only.json` | long-context whole-suite reasoning |

**Driver:** `harness/run_benchmark.py` — one process, sequential calls to your local OpenAI-compatible `/v1/chat/completions` endpoint, one JSONL line per call.

**Output:** `submissions/<handle>/<device-tag>/run-<uuid>/` containing:
- `run.jsonl` — every call recorded
- `manifest.json` — written automatically by the runner
- `hardware.json` — **you fill this** from the hardware probe (§2)
- `metadata.json` — computed aggregates (you generate, see §6)
- `run.md` — human-readable summary (you write, see §6)

**Run order:** ALL six suites run in sequence per model, per the canonical Pavilion methodology Sloba uses. Don't pick-and-choose unless the friend is explicitly time-constrained — partial runs are still useful but they're documented as "partial" in the manifest, and they show up as `archive-only` in Sloba's catalogue rather than full-grade runs.

---

## 2. Hardware probe — do this first, write `hardware.json` from the result

Before anything else, gather the friend's hardware truth. Pick the platform-appropriate commands; don't run all of them, just the ones that work on the friend's OS.

**macOS:**
```bash
system_profiler SPHardwareDataType SPDisplaysDataType
sysctl -n machdep.cpu.brand_string
sysctl -n hw.memsize
sw_vers
uname -a
```

**Linux:**
```bash
lscpu
cat /proc/meminfo | head -3
nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv  # if NVIDIA
lspci | grep -iE "vga|3d|display"
uname -a
cat /etc/os-release
```

**Windows (PowerShell):**
```powershell
Get-CimInstance Win32_ComputerSystem | Select Manufacturer, Model, TotalPhysicalMemory
Get-CimInstance Win32_Processor | Select Name, NumberOfCores, MaxClockSpeed
Get-CimInstance Win32_VideoController | Select Name, AdapterRAM, DriverVersion
$PSVersionTable.OS
```

Write the canonical findings to `hardware.json`. Schema (every field present; `null` if not applicable):

```json
{
  "schema_version": "hardware-1.0",
  "device_tag": "mac-m1-8gb",
  "manufacturer_model": "Apple MacBook Air (Mac14,2)",
  "os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
  "cpu": {"name": "Apple M1", "cores": 8, "threads": 8, "max_ghz": 3.2,
          "arch": "arm64", "isa": ["NEON"]},
  "memory_gb_total": 8,
  "memory_gb_available_at_run_start": 4.2,
  "gpu": [
    {"name": "Apple M1 GPU", "kind": "integrated", "vram_gb": null,
     "driver": "Metal/macOS-14", "compute_cap": null}
  ],
  "storage": {"kind": "ssd", "free_gb_at_run_start": 220},
  "thermal_or_power_notes": "default OS thermal mgmt; on AC power",
  "network_used_for_model_fetch": "wifi-100mbps",
  "container_or_vm": null
}
```

Honest mode flags to mention in `thermal_or_power_notes`:
- "battery-only, low-power-mode active" → results may be artificially slow
- "thermal throttling observed mid-run" → tag any affected calls in `run.md` caveats
- "GPU shared with display compositor" → expect 5-15% throughput hit vs headless

---

## 3. Adapt to hardware reality — this is the part you cannot skip

The harness uses Sloba's canonical knobs as defaults. They are **not** guaranteed to be optimal for the friend's hardware. Your job:

### 3a. Canonical knobs (Sloba's reference values)

```python
CANONICAL_OPTIONS = {
    "temperature": 0.1,    # near-deterministic; comparable across runs
    "num_ctx": 4096,       # context window
    "num_predict": 2048,   # max generated tokens per call
}
```

Plus runtime-level (Ollama-specific but apply equivalently to llama.cpp / vLLM):
- `KEEP_ALIVE` — how long the loaded model stays warm. Sloba uses **2400h** on cluster nodes (~100 days = effectively pinned). On a friend's laptop, **5m** is gentler if RAM is tight.
- `NUM_PARALLEL` — concurrent slots. Sloba uses **3** on Pavilion. **1** is fine on tight RAM.
- `MAX_LOADED_MODELS` — how many models held in VRAM. Sloba uses **3** on a 12 GB GPU; default to **1** on anything ≤ 8 GB.
- For llama.cpp: `--n-gpu-layers` (NGL) — number of model layers offloaded to GPU. **Critical** on borderline VRAM. NGL=99 is full offload; NGL=0 is CPU-only. Sloba's Predator (GTX 1060 6 GB) runs Qwen3.5:9B at NGL=6 because higher offloads OOM with the KV cache.

### 3b. Decision rules

| Friend's hardware | Likely runtime | Likely model size | Likely NGL | Likely NUM_PARALLEL |
|---|---|---|---|---|
| Apple Silicon (M1/M2/M3, ≥8 GB unified) | Ollama OR llama.cpp w/ Metal OR MLX | 0.5B – 4B | n/a (Metal handles offload) | 1–2 |
| Apple Silicon (M-Pro/M-Max, ≥16 GB) | same, MLX preferred for 8B+ | 4B – 14B | n/a | 2–3 |
| NVIDIA GPU 6 GB VRAM | llama.cpp + CUDA | 0.5B – 4B (or 8B at NGL ~10–20) | tuned per model | 1 |
| NVIDIA GPU 8–12 GB VRAM | llama.cpp + CUDA, or vLLM | 4B – 14B | high (60–99) | 1–2 |
| NVIDIA GPU 24+ GB VRAM | vLLM or llama.cpp | up to 32B | 99 (full) | 4+ |
| AMD GPU | llama.cpp + ROCm | conservative one tier below NVIDIA-equivalent | tuned | 1 |
| CPU only | llama.cpp + CPU | 0.5B – 2B (Q4_K_M) | 0 | 1 |

These are starting points. **Don't trust them blindly.** For any model + hardware combination you're uncertain about:

1. Check the model's HuggingFace card for "recommended quantization / hardware" notes.
2. Check the runtime's GitHub for known issues with this model family.
3. Look up llama.cpp issues for "VRAM OOM <model>" — community usually finds the NGL sweet spot.
4. If still uncertain, run a dry probe: `python3 harness/run_benchmark.py --probe --target-url ... --cell-id-prefix ... --models <model>` and observe RSS / VRAM / tokens-per-sec.

### 3c. Document every deviation in `manifest.json.canonical_options_overrides`

The runner records overrides automatically when you pass `--temperature` / `--num-ctx` / `--num-predict`. For runtime-level deviations (NGL, NUM_PARALLEL, KEEP_ALIVE), add them to `hardware.json.thermal_or_power_notes` or to `run.md` § Methodology Deviations. **Untracked deviations are the worst kind — silently make a run uncomparable.** Honest-and-deviated > silent-and-clean.

---

## 4. Pick a runtime and a model

Sloba's instruction: **use any model**. The harness doesn't ship a fixed model list — the friend's agent picks based on hardware. Suggestions, in increasing size:

| Model | Size | When |
|---|---|---|
| `qwen2.5-coder:0.5b` | ~400 MB | minimum-viable code benchmarks; runs anywhere |
| `qwen3.5:0.8b` | ~600 MB | Sloba's reference smallest; matches his catalogue runs |
| `qwen2.5-coder:1.5b` | ~1.1 GB | code-focused mid-tier |
| `qwen3.5:2b` | ~1.5 GB | conversational mid-tier |
| `qwen3.5:4b` | ~3 GB | flagship mid-tier; common comparison point |
| `qwen3.5:8b-q4km` | ~5 GB | mid-tier flagship |
| `qwen3.5:9b-q4km` | ~5.4 GB | Sloba's Predator flagship; 6 GB VRAM borderline (run with reduced NGL) |
| `qwen3.5:14b-q4km` | ~9 GB | needs ≥10 GB VRAM or Apple Silicon ≥16 GB unified |
| `gemma-4:e4b-it-q4km` | ~3 GB | non-Qwen comparison |
| `granite-4.1:8b-q4km` | ~5 GB | non-Qwen comparison |

Models are pulled from:
- **Ollama Hub:** `ollama pull qwen3.5:0.8b`, etc.
- **HuggingFace + llama.cpp:** download GGUF directly via `wget`/`hf-download`, then point `llama-server` at it.

Run more than one model in the same run if you can — comparability. The harness loops models inside one run; cell_ids encode the (node, engine, model) tuple.

---

## 5. Run the benchmark

### 5a. Smoke first (30 seconds)

```bash
python3 harness/run_benchmark.py --smoke \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac-m1:ollama \
    --submitter-handle <friend-gitea-handle> \
    --device-tag <short-device-tag>
```

If smoke 200s back, you have a working runtime. Run the real thing.

### 5b. Full run

```bash
python3 harness/run_benchmark.py \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b,qwen3.5:4b \
    --cell-id-prefix mac-m1:ollama \
    --phases hello,5q,20q \
    --submitter-handle alice \
    --device-tag mac-m1-8gb
```

For the canonical full sweep across all six suites:
```bash
python3 harness/run_benchmark.py --phases all \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac-m1:ollama \
    --submitter-handle alice --device-tag mac-m1-8gb
```

Expect minutes per cell. The 20Q + edge suites are the long ones (~10–40 minutes per model on a small box). If the friend is time-bounded, drop edge_* and parallel_* — but record what you skipped.

### 5c. Resume on interrupt

If interrupted, the JSONL ledger is preserved (every line is fsync'd). To resume the same `run-id`:
```bash
python3 harness/run_benchmark.py --run-id <previous-uuid> ...
```
This appends to a new ledger; you'll need to merge them by hand (or just submit them as two separate runs sharing the same `device-tag`).

---

## 6. Generate `metadata.json` and `run.md`

### 6a. `metadata.json` — computed aggregates per cell

Schema (one row per (cell_id, phase) pair):
```json
{
  "schema_version": "metadata-1.0",
  "run_id": "<uuid>",
  "submitter_handle": "alice",
  "device_tag": "mac-m1-8gb",
  "cells": [
    {
      "cell_id": "mac-m1:ollama:qwen3.5:0.8b",
      "phase": "20q",
      "n_calls": 20,
      "n_errors": 0,
      "duration_ms_p50": 9600,
      "duration_ms_p95": 24000,
      "duration_ms_mean": 11200,
      "tokens_per_sec_p50": 16.4,
      "tokens_per_sec_p95": 22.1,
      "tokens_per_sec_mean": 17.0,
      "tokens_per_sec_max": 24.8,
      "completion_tokens_total": 18234,
      "format_ok_rate": 0.85,
      "marker_hit_rate_mean": 0.72
    }
  ]
}
```

You can compute this in-line (small script) or use a quick Python REPL pass over `run.jsonl`. The catalogue builder on Sloba's side will recompute it anyway, but having it in the PR makes review fast.

### 6b. `run.md` — human-readable summary

Template (fill in every section honestly):

```markdown
# <device-tag> — <model-set> — <YYYY-MM-DD>

**Run ID:** `<uuid>`
**Submitter:** <handle>
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.x (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b, qwen3.5:4b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, 4 GB free at run start was insufficient for parallel suites.

## Headline numbers

| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b 20q | 20 | 17.0 | 16.4 | 9.6 s | 85% |
| mac-m1:ollama:qwen3.5:4b 20q | 20 | 5.8 | 5.5 | 28.2 s | 70% |

## Methodology

Followed the canonical Pavilion methodology with these deviations:

- **NUM_PARALLEL=1** instead of canonical 3 — 8 GB unified RAM doesn't fit two warm copies of qwen3.5:4b.
- **KEEP_ALIVE=5m** instead of 2400h — laptop, not server.
- **edge_* and parallel_* phases skipped** — friend's time budget.

## Caveats

- Run started at 18% battery; one call (20q-q14, model qwen3.5:4b) coincided with macOS Spotlight indexing; flagged in run.jsonl with run_idx=14 — that data point is high-variance.
- Network was on hotel wifi; model pull took ~6 minutes for qwen3.5:4b. Did not affect benchmark timing (model warm before any timed call).

## Reproducibility

```
python3 harness/run_benchmark.py \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b,qwen3.5:4b \
    --cell-id-prefix mac-m1:ollama \
    --phases hello,5q,20q \
    --submitter-handle alice \
    --device-tag mac-m1-8gb
```
```

---

## 7. Submit the PR

1. **Fork** `https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public` to the friend's Gitea account (Gitea web UI → "Fork").
2. **Add the friend's fork as a remote on the local clone:**
    ```bash
    git remote add fork ssh://gitea@git.weeyuga.com/<friend-handle>/weeyuga-benchmarks-public.git
    ```
3. **Create a topic branch** off `main`:
    ```bash
    git checkout -b submission/<handle>-<device-tag>-<short-date>
    ```
4. **Stage only the new files under `submissions/<handle>/<device-tag>/run-<uuid>/`.** NEVER modify anything outside that directory in this PR.
    ```bash
    git add submissions/<handle>/<device-tag>/run-<uuid>/
    git status   # confirm: only files under your run-<uuid>/ are staged
    ```
5. **Commit** with a descriptive message:
    ```
    submit: alice / mac-m1-8gb / 2026-05-12 — qwen3.5 0.8b+4b, hello+5q+20q

    First contribution from a friend's hardware. M1 8 GB unified, Ollama 0.5.x.
    Skipped edge_* + parallel_* due to RAM. Headline: qwen3.5:0.8b ~17 tok/s,
    qwen3.5:4b ~5.8 tok/s on 20Q.
    ```
6. **Push to fork:**
    ```bash
    git push fork submission/<handle>-<device-tag>-<short-date>
    ```
7. **Open a PR on Gitea** with target = `slobodanmargetic988/weeyuga-benchmarks-public:main`. PR description should include:
    - One-paragraph what-and-why
    - Link to the friend's `run.md`
    - Any methodology deviations the reviewer should know
    - Privacy attestation: "I have reviewed run.jsonl and confirmed no PII / SSH keys / API tokens / personal home paths leaked"

Sloba reviews and merges. **Nothing auto-merges.** A typical review surfaces 0–2 follow-ups; address and force-push to the same branch.

---

## 8. Privacy guardrails — DO NOT submit any of these

- API keys (OpenAI, Anthropic, HuggingFace tokens, etc.)
- SSH private keys, `.ssh/` paths
- Personal home directory paths (`/Users/alice/secrets/...`)
- Real names if the friend prefers a handle
- Internal corporate IPs, hostnames, or SSO endpoints
- Bearer tokens in error messages (some runtimes echo headers in 4xx errors)

Before pushing, **scan the run.jsonl** for these patterns:
```bash
grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" submissions/<handle>/<device-tag>/run-<uuid>/*.{jsonl,md,json}
```

If anything matches, redact it from `response_preview` (the JSONL stores only the first 240 chars of each response, so leaks are rare — but please scan).

---

## 9. What if you get stuck

- **`/v1/models` returns empty:** the runtime isn't OpenAI-compat or no models are loaded. For Ollama: `ollama list`. For llama.cpp: it doesn't list models on `/v1/models` historically; pass `--models <name> --target-url http://127.0.0.1:11436` and it'll work anyway.
- **Every call returns 500 / timeout:** runtime is up but model isn't loading. Check VRAM with `nvidia-smi` or memory pressure with `vm_stat` / `free`. Smaller model. Or smaller `num_ctx`.
- **Tokens/sec absurdly low (<1 tok/s on hardware that should manage):** thermal throttling, swap thrashing, or wrong-quantization. Check `free -h` mid-run; if swap is being used, model is too big for RAM.
- **One question keeps getting `format_ok=false`:** model can't follow that instruction shape. NORMAL. Don't shorten the prompt or reword. Document in run.md and move on.
- **Ambiguous hardware setup (eGPU? VM? Container?):** ASK the friend. Container/VM resource caps make benchmarks misleading.

---

## 10. The methodology lineage

This harness mirrors `WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py` v3 — Sloba's canonical Pavilion methodology established 2026-04-11. The 21 reference runs in `runs/` (in this repo) are the comparable baseline. Sloba's locked catalogue lives at `catalogue.json` (this repo). When your run is merged, it'll be added to the catalogue under your `device-tag` and become a new comparison point.

The methodology and harness will evolve. Current canonical version: `HARNESS_VERSION = "public-1"`. Future versions will be additive — older ledgers stay valid forever.

---

## 11. Coordinate-while-running checklist

Before you start:
- [ ] Read this whole file
- [ ] Read `methodology.md` for the metric definitions (TTFT, p50/p95, format_ok, etc.)
- [ ] Verify the friend has ≥3 GB free disk for model files
- [ ] Verify network is OK for model pull (the GGUFs are 0.5–10 GB)

While running:
- [ ] Smoke first
- [ ] Full run
- [ ] Watch for thermal throttling on laptops / phones / mini-PCs
- [ ] Don't open Chrome / Slack / Zoom mid-run if you can avoid it (VRAM pressure)

After running:
- [ ] Generate `metadata.json` aggregates
- [ ] Write `run.md` honestly — including caveats
- [ ] Privacy-scan `run.jsonl`
- [ ] Fork → branch → push → PR

---

## Questions / blockers

If you hit something this runbook doesn't cover, the friend can email Sloba (slobodan@weeyuga.com) or open an issue on this repo. Don't burn an hour in a corner — ask. The whole point of crowdsourcing is the variance you'll see; that's data, not a problem.

Welcome aboard. 🦇

— The Weeyuga team

---

> **Maintainer note:** if you edit this file, edit `AGENTS.md` to match
> (Codex loads `AGENTS.md`, Claude Code loads `CLAUDE.md`; identical
> content prevents two-tier rules).