weeyuga-benchmarks-public/CLAUDE.md
Slobodan Margetic 97a9245d9e feat: harness + agent runbook — flip repo from archive-only to
crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
  ├── README.md        Technical readme for the harness folder
  ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
  │                    WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
  │                    v3 with the cluster-internal IP defaults
  │                    (10.8.0.x) replaced by 127.0.0.1:11434, the
  │                    cluster /v1/cluster/* endpoints removed, the
  │                    canonical-suite paths under ~/Documents/MyServers
  │                    replaced by harness/suites/ paths, the git-sha
  │                    enforcement on WeeyugaWeb dropped, and the
  │                    output written under submissions/<handle>/<tag>/
  │                    instead of docs/BENCHMARKS/runs/. Supports all
  │                    six suite phases via --phases, plus 'all'.
  ├── prompts.py       Verbatim copy of the canonical 3 frozen prompts
  │                    (P-EASY/P-MEDIUM/P-HARD) from
  │                    WeeyugaWeb/scripts/benchmarks/prompts.py.
  ├── requirements.txt Empty by intent (stdlib-only); placeholder for
  │                    pip-tools / agent auto-install patterns.
  ├── .gitignore       __pycache__/ etc.
  └── suites/          Six bundled JSON suites copied verbatim from
       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
       small_model_eval_questions.json, python_task_suite_questions.json,
       parallel_qwen_same_model_20q_suite.json,
       parallel_qwen_mixed_model_20q_suite.json,
       python_context_edge_append_questions.json,
       python_context_edge_suite_only.json.

submissions/
  README.md            Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
       Synthetic-but-shape-complete contribution template:
       manifest.json, hardware.json, run.jsonl (5 example lines),
       metadata.json, run.md (with privacy attestation, methodology
       deviations, reproducibility command). Marked as synthetic at
       the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
  - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
    paths / tokens. Two prompts contain project names ("MyBoard" auth
    debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
    flagged in chat for Sloba's review. Otherwise clean.
  - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
    IPs leaked).
  - manifest.json captures host_hostname_short via socket.gethostname()
    .split('.')[0] — agent should review before PR if hostname is
    sensitive.
  - CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
  - py_compile run_benchmark.py: OK
  - --help renders cleanly
  - All 6 suite JSON files: valid
  - All 4 example JSON files: valid
  - Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
2026-05-06 19:05:22 +02:00

# CLAUDE.md — runbook for the friend's coding agent
> **You are the coding agent (Claude Code, Codex, Aider, Cursor, …) of
> someone who agreed to run the Weeyuga benchmark suite on their hardware.
> This file is your briefing. Read it completely before doing anything.**
>
> If your runtime auto-loads `AGENTS.md` instead of `CLAUDE.md`, that file
> is byte-identical — same content, different filename for tools that
> prefer one over the other.
---
## TL;DR (60 seconds)
You are going to:
1. **Probe** the friend's hardware (OS, CPU, RAM, GPU + VRAM).
2. **Pick** an OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX) and a model that fits.
3. **Adapt** the bundled runner to the friend's reality (this is the part you can't skip — see §3).
4. **Run** `harness/run_benchmark.py` against the runtime; it writes a JSONL ledger of every call.
5. **Document** the run in `submissions/<friend-handle>/<device-tag>/run-<uuid>/`.
6. **Submit** a PR back to this repo. Sloba reviews and merges manually — nothing auto-merges.
You have **read-only access** to this repo. To submit, fork on Gitea, push to your fork, open a PR.
The harness is a **template**, not a one-click button. The friend's hardware will not match Sloba's; you are expected to adapt parameters, research best-known values online when you hit unfamiliar constraints, and **write down what you changed and why** in the manifest. Documented deviation is fine. Silent deviation breaks comparability.
---
## 0. Read this completely before doing anything
The rest of this file is structured in the order you'll work in. Reading the whole thing first gives you the shape; then the friend can say "go" and you execute without circling back.
If you hit something genuinely ambiguous, ASK THE FRIEND. Don't guess at hardware-specific values; either verify with measurement or research them from the project / model authors' recommended-settings docs.
---
## 1. What you are running, exactly
**Inputs (frozen across runs):**
- `harness/prompts.py` — three frozen prompts (P-EASY, P-MEDIUM, P-HARD). Never modified.
- `harness/suites/*.json` — six benchmark suites, all run sequentially per model:
| phase key | suite file | what it tests |
|---|---|---|
| `5q` | `small_model_eval_questions.json` | 5 short-answer formatting + correctness questions |
| `20q` | `python_task_suite_questions.json` | 20 realistic Python task prompts |
| `parallel_same` | `parallel_qwen_same_model_20q_suite.json` | parallel-lane stress with one model |
| `parallel_mixed` | `parallel_qwen_mixed_model_20q_suite.json` | parallel-lane stress with multiple models |
| `edge_append` | `python_context_edge_append_questions.json` | long-context append behavior |
| `edge_suite` | `python_context_edge_suite_only.json` | long-context whole-suite reasoning |
**Driver:** `harness/run_benchmark.py` — one process, sequential calls to your local OpenAI-compatible `/v1/chat/completions` endpoint, one JSONL line per call.
**Output:** `submissions/<handle>/<device-tag>/run-<uuid>/` containing:
- `run.jsonl` — every call recorded
- `manifest.json` — written automatically by the runner
- `hardware.json`: **you fill this** from the hardware probe (§2)
- `metadata.json` — computed aggregates (you generate, see §6)
- `run.md` — human-readable summary (you write, see §6)
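Before moving on to the PR, it can help to confirm the run directory is complete. The following is a minimal sketch, not part of the shipped harness: `missing_submission_files` is a hypothetical helper name, and treating an empty file as missing is an assumption.

```python
from pathlib import Path

# The five files this runbook lists for a complete submission directory.
EXPECTED_FILES = ("run.jsonl", "manifest.json", "hardware.json",
                  "metadata.json", "run.md")


def missing_submission_files(run_dir):
    """Return the expected file names that are absent or empty in run_dir."""
    run_dir = Path(run_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = run_dir / name
        # Assumption: a zero-byte file is as good as missing for review purposes.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty return value means all five files are present and non-empty; it says nothing about whether their contents are valid.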
**Run order:** ALL six suites run in sequence per model, per the canonical Pavilion methodology Sloba uses. Don't pick-and-choose unless the friend is explicitly time-constrained — partial runs are still useful but they're documented as "partial" in the manifest, and they show up as `archive-only` in Sloba's catalogue rather than full-grade runs.
---
## 2. Hardware probe — do this first, write `hardware.json` from the result
Before anything else, gather the friend's hardware truth. Pick the platform-appropriate commands; don't run all of them, just the ones that work on the friend's OS.
**macOS:**
```bash
system_profiler SPHardwareDataType SPDisplaysDataType
sysctl -n machdep.cpu.brand_string
sysctl -n hw.memsize
sw_vers
uname -a
```
**Linux:**
```bash
lscpu
cat /proc/meminfo | head -3
nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv # if NVIDIA
lspci | grep -iE "vga|3d|display"
uname -a
cat /etc/os-release
```
**Windows (PowerShell):**
```powershell
Get-CimInstance Win32_ComputerSystem | Select Manufacturer, Model, TotalPhysicalMemory
Get-CimInstance Win32_Processor | Select Name, NumberOfCores, MaxClockSpeed
Get-CimInstance Win32_VideoController | Select Name, AdapterRAM, DriverVersion
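$PSVersionTable.OS   # PowerShell 6+ only; empty on Windows PowerShell 5.1
Get-CimInstance Win32_OperatingSystem | Select Caption, Version, OSArchitecture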
```
Write the canonical findings to `hardware.json`. Schema (every field present; `null` if not applicable):
```json
{
  "schema_version": "hardware-1.0",
  "device_tag": "mac-m1-8gb",
  "manufacturer_model": "Apple MacBook Air (MacBookAir10,1)",
  "os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
  "cpu": {"name": "Apple M1", "cores": 8, "threads": 8, "max_ghz": 3.2,
          "arch": "arm64", "isa": ["NEON"]},
  "memory_gb_total": 8,
  "memory_gb_available_at_run_start": 4.2,
  "gpu": [
    {"name": "Apple M1 GPU", "kind": "integrated", "vram_gb": null,
     "driver": "Metal/macOS-14", "compute_cap": null}
  ],
  "storage": {"kind": "ssd", "free_gb_at_run_start": 220},
  "thermal_or_power_notes": "default OS thermal mgmt; on AC power",
  "network_used_for_model_fetch": "wifi-100mbps",
  "container_or_vm": null
}
```
Honest mode flags to mention in `thermal_or_power_notes`:
- "battery-only, low-power-mode active" → results may be artificially slow
- "thermal throttling observed mid-run" → tag any affected calls in `run.md` caveats
- "GPU shared with display compositor" → expect 5-15% throughput hit vs headless
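If you would rather start from a skeleton than type the schema by hand, something like the sketch below may help. It is an illustration only: `hardware_skeleton` is a hypothetical helper, and it fills in just the fields Python's standard library can see portably. Everything left as `None` (and the empty `gpu` list) still has to come from the probe commands above.

```python
import json
import os
import platform
import shutil


def hardware_skeleton(device_tag, path="."):
    """Return a hardware.json-shaped dict with stdlib-visible fields filled in."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "schema_version": "hardware-1.0",
        "device_tag": device_tag,
        "manufacturer_model": None,
        "os": {
            # platform.system() reports "Darwin" on macOS; rename by hand if wanted.
            "name": platform.system(),
            "version": platform.mac_ver()[0] or None,  # only set on macOS
            "kernel": platform.release(),
        },
        "cpu": {
            "name": platform.processor() or None,
            "cores": None,              # physical cores: take from the probe
            "threads": os.cpu_count(),  # logical CPUs
            "max_ghz": None,
            "arch": platform.machine(),
            "isa": None,
        },
        "memory_gb_total": None,
        "memory_gb_available_at_run_start": None,
        "gpu": [],
        "storage": {"kind": None, "free_gb_at_run_start": round(free_gb, 1)},
        "thermal_or_power_notes": None,
        "network_used_for_model_fetch": None,
        "container_or_vm": None,
    }


if __name__ == "__main__":
    print(json.dumps(hardware_skeleton("mac-m1-8gb"), indent=2))
```

Review every value before committing; `platform.processor()` in particular returns vague strings on some systems.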
---
## 3. Adapt to hardware reality — this is the part you cannot skip
The harness uses Sloba's canonical knobs as defaults. They are **not** guaranteed to be optimal for the friend's hardware. Your job:
### 3a. Canonical knobs (Sloba's reference values)
```python
CANONICAL_OPTIONS = {
    "temperature": 0.1,   # near-deterministic; comparable across runs
    "num_ctx": 4096,      # context window
    "num_predict": 2048,  # max generated tokens per call
}
```
Plus runtime-level (Ollama-specific but apply equivalently to llama.cpp / vLLM):
- `KEEP_ALIVE` — how long the loaded model stays warm. Sloba uses **2400h** on cluster nodes (~100 days = effectively pinned). On a friend's laptop, **5m** is gentler if RAM is tight.
- `NUM_PARALLEL` — concurrent slots. Sloba uses **3** on Pavilion. **1** is fine on tight RAM.
- `MAX_LOADED_MODELS` — how many models held in VRAM. Sloba uses **3** on a 12 GB GPU; default to **1** on anything ≤ 8 GB.
- For llama.cpp: `--n-gpu-layers` (NGL) — number of model layers offloaded to GPU. **Critical** on borderline VRAM. NGL=99 is full offload; NGL=0 is CPU-only. Sloba's Predator (GTX 1060 6 GB) runs Qwen3.5:9B at NGL=6 because higher offloads OOM with the KV cache.
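Assuming Ollama is the runtime, the first three knobs above correspond to the environment variables `OLLAMA_KEEP_ALIVE`, `OLLAMA_NUM_PARALLEL`, and `OLLAMA_MAX_LOADED_MODELS`, which `ollama serve` reads at startup. A minimal sketch of building that environment follows; `ollama_env` is a hypothetical helper, and its defaults are the laptop-friendly values mentioned above, not Sloba's cluster values.

```python
import os


def ollama_env(keep_alive="5m", num_parallel=1, max_loaded_models=1, base_env=None):
    """Return a copy of the environment with the three Ollama knobs set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "OLLAMA_KEEP_ALIVE": str(keep_alive),
        "OLLAMA_NUM_PARALLEL": str(num_parallel),
        "OLLAMA_MAX_LOADED_MODELS": str(max_loaded_models),
    })
    return env


# Usage (not executed here):
#   subprocess.Popen(["ollama", "serve"], env=ollama_env(num_parallel=1))
```

Whatever values you end up using, record them as deviations if they differ from the canonical ones.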
### 3b. Decision rules
| Friend's hardware | Likely runtime | Likely model size | Likely NGL | Likely NUM_PARALLEL |
|---|---|---|---|---|
| Apple Silicon (M1/M2/M3, ≥8 GB unified) | Ollama OR llama.cpp w/ Metal OR MLX | 0.5B-4B | n/a (Metal handles offload) | 1-2 |
| Apple Silicon (M-Pro/M-Max, ≥16 GB) | same, MLX preferred for 8B+ | 4B-14B | n/a | 2-3 |
| NVIDIA GPU 6 GB VRAM | llama.cpp + CUDA | 0.5B-4B (or 8B at NGL ~10-20) | tuned per model | 1 |
| NVIDIA GPU 8-12 GB VRAM | llama.cpp + CUDA, or vLLM | 4B-14B | high (60-99) | 1-2 |
| NVIDIA GPU 24+ GB VRAM | vLLM or llama.cpp | up to 32B | 99 (full) | 4+ |
| AMD GPU | llama.cpp + ROCm | conservative: one tier below NVIDIA-equivalent | tuned | 1 |
| CPU only | llama.cpp + CPU | 0.5B-2B (Q4_K_M) | 0 | 1 |
These are starting points. **Don't trust them blindly.** For any model + hardware combination you're uncertain about:
1. Check the model's HuggingFace card for "recommended quantization / hardware" notes.
2. Check the runtime's GitHub for known issues with this model family.
3. Look up llama.cpp issues for "VRAM OOM <model>" — community usually finds the NGL sweet spot.
4. If still uncertain, run a dry probe: `python3 harness/run_benchmark.py --probe --target-url ... --cell-id-prefix ... --models <model>` and observe RSS / VRAM / tokens-per-sec.
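The decision table can also be read as a lookup. The sketch below is one hedged way to encode it, assuming the table's ranges are 0.5B-4B, 4B-14B, NGL 60-99, and so on. `starting_point` is a hypothetical helper, and mapping 12-24 GB NVIDIA cards onto the 8-12 GB row is an assumption, since the table leaves that gap open. The output is a starting point to verify, never a verdict.

```python
def starting_point(kind, mem_gb=0.0, pro_or_max=False):
    """Return the decision-table row for a device, or None if no row applies.

    kind: "apple", "nvidia", "amd", or "cpu".
    mem_gb: unified memory (Apple) or VRAM (NVIDIA) in GB; ignored otherwise.
    """
    def row(runtime, model_size, ngl, num_parallel):
        return {"runtime": runtime, "model_size": model_size,
                "ngl": ngl, "num_parallel": num_parallel}

    if kind == "apple":
        if pro_or_max and mem_gb >= 16:
            return row("Ollama, llama.cpp + Metal, or MLX (preferred for 8B+)",
                       "4B-14B", None, "2-3")
        if mem_gb >= 8:
            return row("Ollama, llama.cpp + Metal, or MLX", "0.5B-4B", None, "1-2")
        return None
    if kind == "nvidia":
        if mem_gb >= 24:
            return row("vLLM or llama.cpp", "up to 32B", "99", "4+")
        if mem_gb >= 8:  # assumption: 12-24 GB cards reuse the 8-12 GB row
            return row("llama.cpp + CUDA, or vLLM", "4B-14B", "60-99", "1-2")
        if mem_gb >= 6:
            return row("llama.cpp + CUDA", "0.5B-4B", "tuned per model", "1")
        return None
    if kind == "amd":
        return row("llama.cpp + ROCm", "one tier below the NVIDIA equivalent",
                   "tuned", "1")
    if kind == "cpu":
        return row("llama.cpp (CPU)", "0.5B-2B (Q4_K_M)", "0", "1")
    raise ValueError(f"unknown kind: {kind!r}")
```

A `None` result means the table has no row for that device, which is itself a cue to research or ask the friend.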
### 3c. Document every deviation in `manifest.json.canonical_options_overrides`
The runner records overrides automatically when you pass `--temperature` / `--num-ctx` / `--num-predict`. For runtime-level deviations (NGL, NUM_PARALLEL, KEEP_ALIVE), add them to `hardware.json.thermal_or_power_notes` or to `run.md` § Methodology Deviations. **Untracked deviations are the worst kind: they silently make a run incomparable.** Honest-and-deviated > silent-and-clean.
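To make the idea of a recorded override concrete, here is a small sketch of diffing the options you actually used against the canonical ones. The `{"canonical": ..., "used": ...}` shape and the name `options_overrides` are assumptions for illustration; the runner's real `canonical_options_overrides` layout may differ, so check a generated `manifest.json` before relying on it.

```python
# Canonical values copied from section 3a of this runbook.
CANONICAL_OPTIONS = {"temperature": 0.1, "num_ctx": 4096, "num_predict": 2048}


def options_overrides(used):
    """Return {knob: {"canonical": x, "used": y}} for every knob that differs."""
    return {
        key: {"canonical": CANONICAL_OPTIONS[key], "used": used[key]}
        for key in CANONICAL_OPTIONS
        if key in used and used[key] != CANONICAL_OPTIONS[key]
    }
```

For example, `options_overrides({"temperature": 0.1, "num_ctx": 2048})` reports only `num_ctx`, because the temperature matches the canonical value.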
---
## 4. Pick a runtime and a model
Sloba's instruction: **use any model**. The harness doesn't ship a fixed model list — the friend's agent picks based on hardware. Suggestions, in increasing size:
| Model | Size | When |
|---|---|---|
| `qwen2.5-coder:0.5b` | ~400 MB | minimum-viable code benchmarks; runs anywhere |
| `qwen3.5:0.8b` | ~600 MB | Sloba's reference smallest; matches his catalogue runs |
| `qwen2.5-coder:1.5b` | ~1.1 GB | code-focused mid-tier |
| `qwen3.5:2b` | ~1.5 GB | conversational mid-tier |
| `qwen3.5:4b` | ~3 GB | flagship mid-tier; common comparison point |
| `qwen3.5:8b-q4km` | ~5 GB | mid-tier flagship |
| `qwen3.5:9b-q4km` | ~5.4 GB | Sloba's Predator flagship; 6 GB VRAM borderline (run with reduced NGL) |
| `qwen3.5:14b-q4km` | ~9 GB | needs ≥10 GB VRAM or Apple Silicon ≥16 GB unified |
| `gemma-4:e4b-it-q4km` | ~3 GB | non-Qwen comparison |
| `granite-4.1:8b-q4km` | ~5 GB | non-Qwen comparison |
Models are pulled from:
- **Ollama Hub:** `ollama pull qwen3.5:0.8b`, etc.
- **HuggingFace + llama.cpp:** download the GGUF directly (for example with `wget` or `huggingface-cli download`), then point `llama-server` at it.
Run more than one model in the same run if you can, for comparability. The harness loops over models inside one run; cell_ids encode the (node, engine, model) tuple.
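Because model names such as `qwen3.5:0.8b` contain a colon themselves, splitting a cell_id needs a little care. The sketch below assumes the format visible in this runbook's examples (`--cell-id-prefix mac-m1:ollama` plus the model name giving `mac-m1:ollama:qwen3.5:0.8b`); both function names are hypothetical.

```python
def make_cell_id(prefix, model):
    """Join a 'node:engine' prefix and a model name into one cell_id."""
    return f"{prefix}:{model}"


def parse_cell_id(cell_id):
    """Split a cell_id into (node, engine, model).

    Only the first two colons are separators; the rest belong to the model tag.
    """
    node, engine, model = cell_id.split(":", 2)
    return node, engine, model
```

This breaks if a node or engine name ever contains a colon, so keep those two parts colon-free.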
---
## 5. Run the benchmark
### 5a. Smoke first (30 seconds)
```bash
python3 harness/run_benchmark.py --smoke \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac-m1:ollama \
--submitter-handle <friend-gitea-handle> \
--device-tag <short-device-tag>
```
If the smoke run comes back with HTTP 200, you have a working runtime. Run the real thing.
### 5b. Full run
```bash
python3 harness/run_benchmark.py \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b,qwen3.5:4b \
--cell-id-prefix mac-m1:ollama \
--phases hello,5q,20q \
--submitter-handle alice \
--device-tag mac-m1-8gb
```
For the canonical full sweep across all six suites:
```bash
python3 harness/run_benchmark.py --phases all \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac-m1:ollama \
--submitter-handle alice --device-tag mac-m1-8gb
```
Expect minutes per cell. The 20Q + edge suites are the long ones (~10-40 minutes per model on a small box). If the friend is time-bounded, drop `edge_*` and `parallel_*`, but record what you skipped.
### 5c. Resume on interrupt
If interrupted, the JSONL ledger is preserved (every line is fsync'd). To resume the same `run-id`:
```bash
python3 harness/run_benchmark.py --run-id <previous-uuid> ...
```
This appends to a new ledger; you'll need to merge them by hand (or just submit them as two separate runs sharing the same `device-tag`).
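If you do choose to merge by hand, a sketch like the following is one cautious way to do it. It assumes nothing about the ledger's fields: it concatenates the files in the order given, skips blank lines, and drops lines that are not valid JSON or that repeat an earlier line exactly. `merge_ledgers` is a hypothetical helper, not part of the harness.

```python
import json


def merge_ledgers(paths, out_path):
    """Concatenate JSONL ledgers in order and return (kept, dropped).

    Blank lines are skipped silently. Unparseable lines and exact duplicate
    lines are dropped and counted.
    """
    seen = set()
    kept = dropped = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        json.loads(line)
                    except json.JSONDecodeError:
                        dropped += 1  # e.g. a line cut off by the interrupt
                        continue
                    if line in seen:
                        dropped += 1
                        continue
                    seen.add(line)
                    out.write(line + "\n")
                    kept += 1
    return kept, dropped
```

Mention the merge, and the number of dropped lines, in the caveats section of `run.md` so the reviewer knows the ledger was stitched together.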
---
## 6. Generate `metadata.json` and `run.md`
### 6a. `metadata.json` — computed aggregates per cell
Schema (one row per (cell_id, phase) pair):
```json
{
  "schema_version": "metadata-1.0",
  "run_id": "<uuid>",
  "submitter_handle": "alice",
  "device_tag": "mac-m1-8gb",
  "cells": [
    {
      "cell_id": "mac-m1:ollama:qwen3.5:0.8b",
      "phase": "20q",
      "n_calls": 20,
      "n_errors": 0,
      "duration_ms_p50": 9600,
      "duration_ms_p95": 24000,
      "duration_ms_mean": 11200,
      "tokens_per_sec_p50": 16.4,
      "tokens_per_sec_p95": 22.1,
      "tokens_per_sec_mean": 17.0,
      "tokens_per_sec_max": 24.8,
      "completion_tokens_total": 18234,
      "format_ok_rate": 0.85,
      "marker_hit_rate_mean": 0.72
    }
  ]
}
```
You can compute this in-line (small script) or use a quick Python REPL pass over `run.jsonl`. The catalogue builder on Sloba's side will recompute it anyway, but having it in the PR makes review fast.
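As a starting point for that small script, here is a hedged sketch of the aggregation. The per-line field names (`cell_id`, `phase`, `duration_ms`, `tokens_per_sec`, `completion_tokens`, `error`) are assumptions; check them against an actual `run.jsonl` line before use. The nearest-rank percentile is also an assumption, so defer to `methodology.md` if it defines p50/p95 differently. `format_ok_rate` and `marker_hit_rate_mean` are left out because their inputs are not described here.

```python
import json
import math
from collections import defaultdict
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def load_jsonl(path):
    """Read one JSON object per non-blank line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def aggregate(records):
    """Group records by (cell_id, phase) and compute per-cell aggregates."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["cell_id"], rec["phase"])].append(rec)
    cells = []
    for (cell_id, phase), recs in sorted(groups.items()):
        ok = [r for r in recs if not r.get("error")]  # assumed error marker
        cell = {"cell_id": cell_id, "phase": phase,
                "n_calls": len(recs), "n_errors": len(recs) - len(ok)}
        if ok:
            durations = [r["duration_ms"] for r in ok]
            speeds = [r["tokens_per_sec"] for r in ok]
            cell.update({
                "duration_ms_p50": percentile(durations, 50),
                "duration_ms_p95": percentile(durations, 95),
                "duration_ms_mean": mean(durations),
                "tokens_per_sec_p50": percentile(speeds, 50),
                "tokens_per_sec_p95": percentile(speeds, 95),
                "tokens_per_sec_mean": mean(speeds),
                "tokens_per_sec_max": max(speeds),
                "completion_tokens_total": sum(r.get("completion_tokens", 0) for r in ok),
            })
        cells.append(cell)
    return cells
```

Wrap the returned list in the `schema_version` / `run_id` / `submitter_handle` / `device_tag` envelope shown above before writing `metadata.json`.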
### 6b. `run.md` — human-readable summary
Template (fill in every section honestly):
```markdown
# <device-tag> — <model-set> — <YYYY-MM-DD>
**Run ID:** `<uuid>`
**Submitter:** <handle>
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.x (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b, qwen3.5:4b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, 4 GB free at run start was insufficient for parallel suites.
## Headline numbers
| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b 20q | 20 | 17.0 | 16.4 | 9.6 s | 85% |
| mac-m1:ollama:qwen3.5:4b 20q | 20 | 5.8 | 5.5 | 28.2 s | 70% |
## Methodology
Followed the canonical Pavilion methodology with these deviations:
- **NUM_PARALLEL=1** instead of canonical 3 — 8 GB unified RAM doesn't fit two warm copies of qwen3.5:4b.
- **KEEP_ALIVE=5m** instead of 2400h — laptop, not server.
- **edge_* and parallel_* phases skipped** — friend's time budget.
## Caveats
- Run started at 18% battery; one call (20q-q14, model qwen3.5:4b) coincided with macOS Spotlight indexing; flagged in run.jsonl with run_idx=14 — that data point is high-variance.
- Network was on hotel wifi; model pull took ~6 minutes for qwen3.5:4b. Did not affect benchmark timing (model warm before any timed call).
## Reproducibility
~~~bash
python3 harness/run_benchmark.py \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b,qwen3.5:4b \
--cell-id-prefix mac-m1:ollama \
--phases hello,5q,20q \
--submitter-handle alice \
--device-tag mac-m1-8gb
~~~
```
---
## 7. Submit the PR
1. **Fork** `https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public` to the friend's Gitea account (Gitea web UI → "Fork").
2. **Add the friend's fork as a remote on the local clone:**
```bash
git remote add fork ssh://gitea@git.weeyuga.com/<friend-handle>/weeyuga-benchmarks-public.git
```
3. **Create a topic branch** off `main`:
```bash
git checkout -b submission/<handle>-<device-tag>-<short-date>
```
4. **Stage only the new files under `submissions/<handle>/<device-tag>/run-<uuid>/`.** NEVER modify anything outside that directory in this PR.
```bash
git add submissions/<handle>/<device-tag>/run-<uuid>/
git status # confirm: only files under your run-<uuid>/ are staged
```
5. **Commit** with a descriptive message:
```
submit: alice / mac-m1-8gb / 2026-05-12 — qwen3.5 0.8b+4b, hello+5q+20q
First contribution from a friend's hardware. M1 8 GB unified, Ollama 0.5.x.
Skipped edge_* + parallel_* due to RAM. Headline: qwen3.5:0.8b ~17 tok/s,
qwen3.5:4b ~5.8 tok/s on 20Q.
```
6. **Push to fork:**
```bash
git push fork submission/<handle>-<device-tag>-<short-date>
```
7. **Open a PR on Gitea** with target = `slobodanmargetic988/weeyuga-benchmarks-public:main`. PR description should include:
- One-paragraph what-and-why
- Link to the friend's `run.md`
- Any methodology deviations the reviewer should know
- Privacy attestation: "I have reviewed run.jsonl and confirmed no PII / SSH keys / API tokens / personal home paths leaked"
Sloba reviews and merges. **Nothing auto-merges.** A typical review surfaces 0-2 follow-ups; address and force-push to the same branch.
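The step 4 rule (stage nothing outside your run directory) can be checked mechanically before you commit. The sketch below uses `git diff --cached --name-only` to list staged paths; `paths_outside` and `staged_paths` are hypothetical helper names, and the script assumes it is run from the repository root.

```python
import subprocess


def paths_outside(paths, allowed_prefix):
    """Return the paths that do not live under allowed_prefix."""
    prefix = allowed_prefix.rstrip("/") + "/"
    return [p for p in paths if not p.startswith(prefix)]


def staged_paths():
    """List the files currently staged in git (paths relative to the repo root)."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


# Usage (not executed here):
#   stray = paths_outside(staged_paths(), "submissions/<handle>/<device-tag>/run-<uuid>")
#   if stray: unstage them before committing.
```

An empty `stray` list is what you want to see before step 5.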
---
## 8. Privacy guardrails — DO NOT submit any of these
- API keys (OpenAI, Anthropic, HuggingFace tokens, etc.)
- SSH private keys, `.ssh/` paths
- Personal home directory paths (`/Users/alice/secrets/...`)
- Real names if the friend prefers a handle
- Internal corporate IPs, hostnames, or SSO endpoints
- Bearer tokens in error messages (some runtimes echo headers in 4xx errors)
Before pushing, **scan the run.jsonl** for these patterns:
```bash
grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" submissions/<handle>/<device-tag>/run-<uuid>/*.{jsonl,md,json}
```
If anything matches, redact it from `response_preview` (the JSONL stores only the first 240 chars of each response, so leaks are rare — but please scan).
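On a machine without `grep` (a stock Windows setup, for instance), the same scan can be approximated in Python. This is a sketch using the exact pattern from the command above; `privacy_hits` is a hypothetical helper, and like the grep it will produce false positives (any `/home/` path, any mention of "password"), which you still need to review by eye.

```python
import re
from pathlib import Path

# Same alternation as the grep command in this section.
PATTERN = re.compile(
    r"Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519"
)
SCANNED_SUFFIXES = {".jsonl", ".md", ".json"}


def privacy_hits(run_dir):
    """Return (file name, line number, line excerpt) for every matching line."""
    hits = []
    for path in sorted(Path(run_dir).iterdir()):
        if not path.is_file() or path.suffix not in SCANNED_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((path.name, lineno, line.strip()[:120]))
    return hits
```

An empty list is necessary but not sufficient: the pattern does not know about the friend's real name or employer, so skim the files as well.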
---
## 9. What if you get stuck
- **`/v1/models` returns empty:** the runtime isn't OpenAI-compat or no models are loaded. For Ollama: `ollama list`. For llama.cpp: it doesn't list models on `/v1/models` historically; pass `--models <name> --target-url http://127.0.0.1:11436` and it'll work anyway.
- **Every call returns 500 / timeout:** runtime is up but model isn't loading. Check VRAM with `nvidia-smi` or memory pressure with `vm_stat` / `free`. Try a smaller model or a smaller `num_ctx`.
- **Tokens/sec absurdly low (<1 tok/s on hardware that should manage):** thermal throttling, swap thrashing, or wrong-quantization. Check `free -h` mid-run; if swap is being used, model is too big for RAM.
- **One question keeps getting `format_ok=false`:** model can't follow that instruction shape. NORMAL. Don't shorten the prompt or reword. Document in run.md and move on.
- **Ambiguous hardware setup (eGPU? VM? Container?):** ASK the friend. Container/VM resource caps make benchmarks misleading.
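For the first item in the list above, a quick way to see what the runtime reports is to query `/v1/models` yourself. The sketch assumes the usual OpenAI-style response shape (`{"data": [{"id": ...}, ...]}`); `model_ids` and `list_models` are hypothetical helper names, and the default URL is the Ollama address used throughout this runbook.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def list_models(base_url="http://127.0.0.1:11434", timeout=5):
    """Fetch /v1/models from the runtime and return the reported model ids."""
    url = f"{base_url.rstrip('/')}/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

An empty list from a runtime that is clearly serving requests is the llama.cpp case described above; pass `--models` explicitly and carry on.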
---
## 10. The methodology lineage
This harness mirrors `WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py` v3 — Sloba's canonical Pavilion methodology established 2026-04-11. The 21 reference runs in `runs/` (in this repo) are the comparable baseline. Sloba's locked catalogue lives at `catalogue.json` (this repo). When your run is merged, it'll be added to the catalogue under your `device-tag` and become a new comparison point.
The methodology and harness will evolve. Current canonical version: `HARNESS_VERSION = "public-1"`. Future versions will be additive — older ledgers stay valid forever.
---
## 11. Coordinate-while-running checklist
Before you start:
- [ ] Read this whole file
- [ ] Read `methodology.md` for the metric definitions (TTFT, p50/p95, format_ok, etc.)
- [ ] Verify the friend has ≥3 GB free disk for model files
- [ ] Verify network is OK for model pull (the GGUFs are 0.5-10 GB)
While running:
- [ ] Smoke first
- [ ] Full run
- [ ] Watch for thermal throttling on laptops / phones / mini-PCs
- [ ] Don't open Chrome / Slack / Zoom mid-run if you can avoid it (VRAM pressure)
After running:
- [ ] Generate `metadata.json` aggregates
- [ ] Write `run.md` honestly — including caveats
- [ ] Privacy-scan `run.jsonl`
- [ ] Fork → branch → push → PR
---
## Questions / blockers
If you hit something this runbook doesn't cover, the friend can email Sloba (slobodan@weeyuga.com) or open an issue on this repo. Don't burn an hour in a corner — ask. The whole point of crowdsourcing is the variance you'll see; that's data, not a problem.
Welcome aboard. 🦇
— The Weeyuga team
---
> **Maintainer note:** if you edit this file, edit `AGENTS.md` to match
> (Codex loads `AGENTS.md`, Claude Code loads `CLAUDE.md`; identical
> content prevents two-tier rules).