feat: harness + agent runbook — flip repo from archive-only to

crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
  ├── README.md        Technical readme for the harness folder
  ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
  │                    WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
  │                    v3 with the cluster-internal IP defaults
  │                    (10.8.0.x) replaced by 127.0.0.1:11434, the
  │                    cluster /v1/cluster/* endpoints removed, the
  │                    canonical-suite paths under ~/Documents/MyServers
  │                    replaced by harness/suites/ paths, the git-sha
  │                    enforcement on WeeyugaWeb dropped, and the
  │                    output written under submissions/<handle>/<tag>/
  │                    instead of docs/BENCHMARKS/runs/. Supports all
  │                    six suite phases via --phases, plus 'all'.
  ├── prompts.py       Verbatim copy of the canonical 3 frozen prompts
  │                    (P-EASY/P-MEDIUM/P-HARD) from
  │                    WeeyugaWeb/scripts/benchmarks/prompts.py.
  ├── requirements.txt Empty by intent (stdlib-only); placeholder for
  │                    pip-tools / agent auto-install patterns.
  ├── .gitignore       __pycache__/ etc.
  └── suites/          Six bundled JSON suites copied verbatim from
       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
       small_model_eval_questions.json, python_task_suite_questions.json,
       parallel_qwen_same_model_20q_suite.json,
       parallel_qwen_mixed_model_20q_suite.json,
       python_context_edge_append_questions.json,
       python_context_edge_suite_only.json.

submissions/
  README.md            Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
       Synthetic-but-shape-complete contribution template:
       manifest.json, hardware.json, run.jsonl (5 example lines),
       metadata.json, run.md (with privacy attestation, methodology
       deviations, reproducibility command). Marked as synthetic at
       the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
  - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
    paths / tokens. Two prompts contain project names ("MyBoard" auth
    debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
    flagged in chat for Sloba's review. Otherwise clean.
  - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
    IPs leaked).
  - manifest.json captures host_hostname_short via socket.gethostname()
    .split('.')[0] — agent should review before PR if hostname is
    sensitive.
  - CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
  - py_compile run_benchmark.py: OK
  - --help renders cleanly
  - All 6 suite JSON files: valid
  - All 4 example JSON files: valid
  - Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
This commit is contained in:
Slobodan Margetic
2026-05-06 11:07:55 +02:00
committed by slobodanmargetic988
parent ddc9626136
commit 97a9245d9e
22 changed files with 4400 additions and 47 deletions

436
AGENTS.md Normal file
View File

@@ -0,0 +1,436 @@
# CLAUDE.md — runbook for the friend's coding agent
> **You are the coding agent (Claude Code, Codex, Aider, Cursor, …) of
> someone who agreed to run the Weeyuga benchmark suite on their hardware.
> This file is your briefing. Read it completely before doing anything.**
>
> If your runtime auto-loads `AGENTS.md` instead of `CLAUDE.md`, that file
> is byte-identical — same content, different filename for tools that
> prefer one over the other.
---
## TL;DR (60 seconds)
You are going to:
1. **Probe** the friend's hardware (OS, CPU, RAM, GPU + VRAM).
2. **Pick** an OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX) and a model that fits.
3. **Adapt** the bundled runner to the friend's reality (this is the part you can't skip — see §3).
4. **Run** `harness/run_benchmark.py` against the runtime; it writes a JSONL ledger of every call.
5. **Document** the run in `submissions/<friend-handle>/<device-tag>/run-<uuid>/`.
6. **Submit** a PR back to this repo. Sloba reviews and merges manually — nothing auto-merges.
You have **read-only access** to this repo. To submit, fork on Gitea, push to your fork, open a PR.
The harness is a **template**, not a one-click button. The friend's hardware will not match Sloba's; you are expected to adapt parameters, research best-known values online when you hit unfamiliar constraints, and **write down what you changed and why** in the manifest. Documented deviation is fine. Silent deviation breaks comparability.
---
## 0. Read this completely before doing anything
The rest of this file is structured as the order you'll work in. Reading the whole thing first gives you the shape; then the friend can ask "go" and you execute without circling back.
If you hit something genuinely ambiguous, ASK THE FRIEND. Don't guess at hardware-specific values; either verify with measurement or research them from the project / model authors' recommended-settings docs.
---
## 1. What you are running, exactly
**Inputs (frozen across runs):**
- `harness/prompts.py` — three frozen prompts (P-EASY, P-MEDIUM, P-HARD). Never modified.
- `harness/suites/*.json` — six benchmark suites, all run sequentially per model:
| phase key | suite file | what it tests |
|---|---|---|
| `5q` | `small_model_eval_questions.json` | 5 short-answer formatting + correctness questions |
| `20q` | `python_task_suite_questions.json` | 20 realistic Python task prompts |
| `parallel_same` | `parallel_qwen_same_model_20q_suite.json` | parallel-lane stress with one model |
| `parallel_mixed` | `parallel_qwen_mixed_model_20q_suite.json` | parallel-lane stress with multiple models |
| `edge_append` | `python_context_edge_append_questions.json` | long-context append behavior |
| `edge_suite` | `python_context_edge_suite_only.json` | long-context whole-suite reasoning |
**Driver:** `harness/run_benchmark.py` — one process, sequential calls to your local OpenAI-compatible `/v1/chat/completions` endpoint, one JSONL line per call.
**Output:** `submissions/<handle>/<device-tag>/run-<uuid>/` containing:
- `run.jsonl` — every call recorded
- `manifest.json` — written automatically by the runner
- `hardware.json`**you fill this** from the hardware probe (§2)
- `metadata.json` — computed aggregates (you generate, see §6)
- `run.md` — human-readable summary (you write, see §6)
**Run order:** ALL six suites run in sequence per model, per the canonical Pavilion methodology Sloba uses. Don't pick-and-choose unless the friend is explicitly time-constrained — partial runs are still useful but they're documented as "partial" in the manifest, and they show up as `archive-only` in Sloba's catalogue rather than full-grade runs.
---
## 2. Hardware probe — do this first, write `hardware.json` from the result
Before anything else, gather the friend's hardware truth. Pick the platform-appropriate commands; don't run all of them, just the ones that work on the friend's OS.
**macOS:**
```bash
system_profiler SPHardwareDataType SPDisplaysDataType
sysctl -n machdep.cpu.brand_string
sysctl -n hw.memsize
sw_vers
uname -a
```
**Linux:**
```bash
lscpu
cat /proc/meminfo | head -3
nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv # if NVIDIA
lspci | grep -iE "vga|3d|display"
uname -a
cat /etc/os-release
```
**Windows (PowerShell):**
```powershell
Get-CimInstance Win32_ComputerSystem | Select Manufacturer, Model, TotalPhysicalMemory
Get-CimInstance Win32_Processor | Select Name, NumberOfCores, MaxClockSpeed
Get-CimInstance Win32_VideoController | Select Name, AdapterRAM, DriverVersion
$PSVersionTable.OS
```
Write the canonical findings to `hardware.json`. Schema (every field present; `null` if not applicable):
```json
{
"schema_version": "hardware-1.0",
"device_tag": "mac-m1-8gb",
"manufacturer_model": "Apple MacBook Air (Mac14,2)",
"os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
"cpu": {"name": "Apple M1", "cores": 8, "threads": 8, "max_ghz": 3.2,
"arch": "arm64", "isa": ["NEON"]},
"memory_gb_total": 8,
"memory_gb_available_at_run_start": 4.2,
"gpu": [
{"name": "Apple M1 GPU", "kind": "integrated", "vram_gb": null,
"driver": "Metal/macOS-14", "compute_cap": null}
],
"storage": {"kind": "ssd", "free_gb_at_run_start": 220},
"thermal_or_power_notes": "default OS thermal mgmt; on AC power",
"network_used_for_model_fetch": "wifi-100mbps",
"container_or_vm": null
}
```
Honest mode flags to mention in `thermal_or_power_notes`:
- "battery-only, low-power-mode active" → results may be artificially slow
- "thermal throttling observed mid-run" → tag any affected calls in `run.md` caveats
- "GPU shared with display compositor" → expect 5-15% throughput hit vs headless
---
## 3. Adapt to hardware reality — this is the part you cannot skip
The harness uses Sloba's canonical knobs as defaults. They are **not** guaranteed to be optimal for the friend's hardware. Your job:
### 3a. Canonical knobs (Sloba's reference values)
```python
CANONICAL_OPTIONS = {
"temperature": 0.1, # near-deterministic; comparable across runs
"num_ctx": 4096, # context window
"num_predict": 2048, # max generated tokens per call
}
```
Plus runtime-level (Ollama-specific but apply equivalently to llama.cpp / vLLM):
- `KEEP_ALIVE` — how long the loaded model stays warm. Sloba uses **2400h** on cluster nodes (~100 days = effectively pinned). On a friend's laptop, **5m** is gentler if RAM is tight.
- `NUM_PARALLEL` — concurrent slots. Sloba uses **3** on Pavilion. **1** is fine on tight RAM.
- `MAX_LOADED_MODELS` — how many models held in VRAM. Sloba uses **3** on a 12 GB GPU; default to **1** on anything ≤ 8 GB.
- For llama.cpp: `--n-gpu-layers` (NGL) — number of model layers offloaded to GPU. **Critical** on borderline VRAM. NGL=99 is full offload; NGL=0 is CPU-only. Sloba's Predator (GTX 1060 6 GB) runs Qwen3.5:9B at NGL=6 because higher offloads OOM with the KV cache.
### 3b. Decision rules
| Friend's hardware | Likely runtime | Likely model size | Likely NGL | Likely NUM_PARALLEL |
|---|---|---|---|---|
| Apple Silicon (M1/M2/M3, ≥8 GB unified) | Ollama OR llama.cpp w/ Metal OR MLX | 0.5B 4B | n/a (Metal handles offload) | 12 |
| Apple Silicon (M-Pro/M-Max, ≥16 GB) | same, MLX preferred for 8B+ | 4B 14B | n/a | 23 |
| NVIDIA GPU 6 GB VRAM | llama.cpp + CUDA | 0.5B 4B (or 8B at NGL ~1020) | tuned per model | 1 |
| NVIDIA GPU 812 GB VRAM | llama.cpp + CUDA, or vLLM | 4B 14B | high (6099) | 12 |
| NVIDIA GPU 24+ GB VRAM | vLLM or llama.cpp | up to 32B | 99 (full) | 4+ |
| AMD GPU | llama.cpp + ROCm | conservative one tier below NVIDIA-equivalent | tuned | 1 |
| CPU only | llama.cpp + CPU | 0.5B 2B (Q4_K_M) | 0 | 1 |
These are starting points. **Don't trust them blindly.** For any model + hardware combination you're uncertain about:
1. Check the model's HuggingFace card for "recommended quantization / hardware" notes.
2. Check the runtime's GitHub for known issues with this model family.
3. Look up llama.cpp issues for "VRAM OOM <model>" — community usually finds the NGL sweet spot.
4. If still uncertain, run a dry probe: `python3 harness/run_benchmark.py --probe --target-url ... --cell-id-prefix ... --models <model>` and observe RSS / VRAM / tokens-per-sec.
### 3c. Document every deviation in `manifest.json.canonical_options_overrides`
The runner records overrides automatically when you pass `--temperature` / `--num-ctx` / `--num-predict`. For runtime-level deviations (NGL, NUM_PARALLEL, KEEP_ALIVE), add them to `hardware.json.thermal_or_power_notes` or to `run.md` § Methodology Deviations. **Untracked deviations are the worst kind — silently make a run uncomparable.** Honest-and-deviated > silent-and-clean.
---
## 4. Pick a runtime and a model
Sloba's instruction: **use any model**. The harness doesn't ship a fixed model list — the friend's agent picks based on hardware. Suggestions, in increasing size:
| Model | Size | When |
|---|---|---|
| `qwen2.5-coder:0.5b` | ~400 MB | minimum-viable code benchmarks; runs anywhere |
| `qwen3.5:0.8b` | ~600 MB | Sloba's reference smallest; matches his catalogue runs |
| `qwen2.5-coder:1.5b` | ~1.1 GB | code-focused mid-tier |
| `qwen3.5:2b` | ~1.5 GB | conversational mid-tier |
| `qwen3.5:4b` | ~3 GB | flagship mid-tier; common comparison point |
| `qwen3.5:8b-q4km` | ~5 GB | mid-tier flagship |
| `qwen3.5:9b-q4km` | ~5.4 GB | Sloba's Predator flagship; 6 GB VRAM borderline (run with reduced NGL) |
| `qwen3.5:14b-q4km` | ~9 GB | needs ≥10 GB VRAM or Apple Silicon ≥16 GB unified |
| `gemma-4:e4b-it-q4km` | ~3 GB | non-Qwen comparison |
| `granite-4.1:8b-q4km` | ~5 GB | non-Qwen comparison |
Models are pulled from:
- **Ollama Hub:** `ollama pull qwen3.5:0.8b`, etc.
- **HuggingFace + llama.cpp:** download GGUF directly via `wget`/`hf-download`, then point `llama-server` at it.
Run more than one model in the same run if you can — comparability. The harness loops models inside one run; cell_ids encode the (node, engine, model) tuple.
---
## 5. Run the benchmark
### 5a. Smoke first (30 seconds)
```bash
python3 harness/run_benchmark.py --smoke \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac-m1:ollama \
--submitter-handle <friend-gitea-handle> \
--device-tag <short-device-tag>
```
If smoke 200s back, you have a working runtime. Run the real thing.
### 5b. Full run
```bash
python3 harness/run_benchmark.py \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b,qwen3.5:4b \
--cell-id-prefix mac-m1:ollama \
--phases hello,5q,20q \
--submitter-handle alice \
--device-tag mac-m1-8gb
```
For the canonical full sweep across all six suites:
```bash
python3 harness/run_benchmark.py --phases all \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac-m1:ollama \
--submitter-handle alice --device-tag mac-m1-8gb
```
Expect minutes per cell. The 20Q + edge suites are the long ones (~1040 minutes per model on a small box). If the friend is time-bounded, drop edge_* and parallel_* — but record what you skipped.
### 5c. Resume on interrupt
If interrupted, the JSONL ledger is preserved (every line is fsync'd). To resume the same `run-id`:
```bash
python3 harness/run_benchmark.py --run-id <previous-uuid> ...
```
This appends to a new ledger; you'll need to merge them by hand (or just submit them as two separate runs sharing the same `device-tag`).
---
## 6. Generate `metadata.json` and `run.md`
### 6a. `metadata.json` — computed aggregates per cell
Schema (one row per (cell_id, phase) pair):
```json
{
"schema_version": "metadata-1.0",
"run_id": "<uuid>",
"submitter_handle": "alice",
"device_tag": "mac-m1-8gb",
"cells": [
{
"cell_id": "mac-m1:ollama:qwen3.5:0.8b",
"phase": "20q",
"n_calls": 20,
"n_errors": 0,
"duration_ms_p50": 9600,
"duration_ms_p95": 24000,
"duration_ms_mean": 11200,
"tokens_per_sec_p50": 16.4,
"tokens_per_sec_p95": 22.1,
"tokens_per_sec_mean": 17.0,
"tokens_per_sec_max": 24.8,
"completion_tokens_total": 18234,
"format_ok_rate": 0.85,
"marker_hit_rate_mean": 0.72
}
]
}
```
You can compute this in-line (small script) or use a quick Python REPL pass over `run.jsonl`. The catalogue builder on Sloba's side will recompute it anyway, but having it in the PR makes review fast.
### 6b. `run.md` — human-readable summary
Template (fill in every section honestly):
```markdown
# <device-tag> — <model-set> — <YYYY-MM-DD>
**Run ID:** `<uuid>`
**Submitter:** <handle>
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.x (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b, qwen3.5:4b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, 4 GB free at run start was insufficient for parallel suites.
## Headline numbers
| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b 20q | 20 | 17.0 | 16.4 | 9.6 s | 85% |
| mac-m1:ollama:qwen3.5:4b 20q | 20 | 5.8 | 5.5 | 28.2 s | 70% |
## Methodology
Followed the canonical Pavilion methodology with these deviations:
- **NUM_PARALLEL=1** instead of canonical 3 — 8 GB unified RAM doesn't fit two warm copies of qwen3.5:4b.
- **KEEP_ALIVE=5m** instead of 2400h — laptop, not server.
- **edge_* and parallel_* phases skipped** — friend's time budget.
## Caveats
- Run started at 18% battery; one call (20q-q14, model qwen3.5:4b) coincided with macOS Spotlight indexing; flagged in run.jsonl with run_idx=14 — that data point is high-variance.
- Network was on hotel wifi; model pull took ~6 minutes for qwen3.5:4b. Did not affect benchmark timing (model warm before any timed call).
## Reproducibility
```
python3 harness/run_benchmark.py \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b,qwen3.5:4b \
--cell-id-prefix mac-m1:ollama \
--phases hello,5q,20q \
--submitter-handle alice \
--device-tag mac-m1-8gb
```
```
---
## 7. Submit the PR
1. **Fork** `https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public` to the friend's Gitea account (Gitea web UI → "Fork").
2. **Add the friend's fork as a remote on the local clone:**
```bash
git remote add fork ssh://gitea@git.weeyuga.com/<friend-handle>/weeyuga-benchmarks-public.git
```
3. **Create a topic branch** off `main`:
```bash
git checkout -b submission/<handle>-<device-tag>-<short-date>
```
4. **Stage only the new files under `submissions/<handle>/<device-tag>/run-<uuid>/`.** NEVER modify anything outside that directory in this PR.
```bash
git add submissions/<handle>/<device-tag>/run-<uuid>/
git status # confirm: only files under your run-<uuid>/ are staged
```
5. **Commit** with a descriptive message:
```
submit: alice / mac-m1-8gb / 2026-05-12 — qwen3.5 0.8b+4b, hello+5q+20q
First contribution from a friend's hardware. M1 8 GB unified, Ollama 0.5.x.
Skipped edge_* + parallel_* due to RAM. Headline: qwen3.5:0.8b ~17 tok/s,
qwen3.5:4b ~5.8 tok/s on 20Q.
```
6. **Push to fork:**
```bash
git push fork submission/<handle>-<device-tag>-<short-date>
```
7. **Open a PR on Gitea** with target = `slobodanmargetic988/weeyuga-benchmarks-public:main`. PR description should include:
- One-paragraph what-and-why
- Link to the friend's `run.md`
- Any methodology deviations the reviewer should know
- Privacy attestation: "I have reviewed run.jsonl and confirmed no PII / SSH keys / API tokens / personal home paths leaked"
Sloba reviews and merges. **Nothing auto-merges.** A typical review surfaces 02 follow-ups; address and force-push to the same branch.
---
## 8. Privacy guardrails — DO NOT submit any of these
- API keys (OpenAI, Anthropic, HuggingFace tokens, etc.)
- SSH private keys, `.ssh/` paths
- Personal home directory paths (`/Users/alice/secrets/...`)
- Real names if the friend prefers a handle
- Internal corporate IPs, hostnames, or SSO endpoints
- Bearer tokens in error messages (some runtimes echo headers in 4xx errors)
Before pushing, **scan the run.jsonl** for these patterns:
```bash
grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" submissions/<handle>/<device-tag>/run-<uuid>/*.{jsonl,md,json}
```
If anything matches, redact it from `response_preview` (the JSONL stores only the first 240 chars of each response, so leaks are rare — but please scan).
---
## 9. What if you get stuck
- **`/v1/models` returns empty:** the runtime isn't OpenAI-compat or no models are loaded. For Ollama: `ollama list`. For llama.cpp: it doesn't list models on `/v1/models` historically; pass `--models <name> --target-url http://127.0.0.1:11436` and it'll work anyway.
- **Every call returns 500 / timeout:** runtime is up but model isn't loading. Check VRAM with `nvidia-smi` or memory pressure with `vm_stat` / `free`. Smaller model. Or smaller `num_ctx`.
- **Tokens/sec absurdly low (<1 tok/s on hardware that should manage):** thermal throttling, swap thrashing, or wrong-quantization. Check `free -h` mid-run; if swap is being used, model is too big for RAM.
- **One question keeps getting `format_ok=false`:** model can't follow that instruction shape. NORMAL. Don't shorten the prompt or reword. Document in run.md and move on.
- **Ambiguous hardware setup (eGPU? VM? Container?):** ASK the friend. Container/VM resource caps make benchmarks misleading.
---
## 10. The methodology lineage
This harness mirrors `WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py` v3 — Sloba's canonical Pavilion methodology established 2026-04-11. The 21 reference runs in `runs/` (in this repo) are the comparable baseline. Sloba's locked catalogue lives at `catalogue.json` (this repo). When your run is merged, it'll be added to the catalogue under your `device-tag` and become a new comparison point.
The methodology and harness will evolve. Current canonical version: `HARNESS_VERSION = "public-1"`. Future versions will be additive — older ledgers stay valid forever.
---
## 11. Coordinate-while-running checklist
Before you start:
- [ ] Read this whole file
- [ ] Read `methodology.md` for the metric definitions (TTFT, p50/p95, format_ok, etc.)
- [ ] Verify the friend has ≥3 GB free disk for model files
- [ ] Verify network is OK for model pull (the GGUFs are 0.510 GB)
While running:
- [ ] Smoke first
- [ ] Full run
- [ ] Watch for thermal throttling on laptops / phones / mini-PCs
- [ ] Don't open Chrome / Slack / Zoom mid-run if you can avoid it (VRAM pressure)
After running:
- [ ] Generate `metadata.json` aggregates
- [ ] Write `run.md` honestly — including caveats
- [ ] Privacy-scan `run.jsonl`
- [ ] Fork → branch → push → PR
---
## Questions / blockers
If you hit something this runbook doesn't cover, the friend can email Sloba (slobodan@weeyuga.com) or open an issue on this repo. Don't burn an hour in a corner — ask. The whole point of crowdsourcing is the variance you'll see; that's data, not a problem.
Welcome aboard. 🦇
— The Weeyuga team
---
> **Maintainer note:** if you edit this file, edit `AGENTS.md` to match
> (Codex loads `AGENTS.md`, Claude Code loads `CLAUDE.md`; identical
> content prevents two-tier rules).