feat: harness + agent runbook — flip repo from archive-only to crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."
The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.
What landed:
CLAUDE.md + AGENTS.md (byte-identical, 436 lines each)
Full agent runbook: hardware probe, runtime + model selection,
canonical knob reference (Sloba's Pavilion methodology values),
hardware-adaptation decision rules, run-instructions, output-schema
templates for hardware.json + metadata.json + run.md, PR submission
flow (fork → branch → push → PR; nothing auto-merges), privacy
guardrails, methodology lineage. Per Sloba's Q3 directive: the
runbook explicitly tells the friend's agent to ADAPT to hardware
reality and document deviations rather than blindly run defaults.
CONTRIBUTING.md (~110 lines)
Human-readable companion for the friend (not the agent). What you
need, how it works, what we ask, what maintainers commit to,
license, code-of-conduct short version.
harness/
├── README.md          Technical readme for the harness folder
├── run_benchmark.py   ~520 LOC runner. Stdlib-only. Adapted from
│                      WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│                      v3 with the cluster-internal IP defaults
│                      (10.8.0.x) replaced by 127.0.0.1:11434, the
│                      cluster /v1/cluster/* endpoints removed, the
│                      canonical-suite paths under ~/Documents/MyServers
│                      replaced by harness/suites/ paths, the git-sha
│                      enforcement on WeeyugaWeb dropped, and the
│                      output written under submissions/<handle>/<tag>/
│                      instead of docs/BENCHMARKS/runs/. Supports all
│                      six suite phases via --phases, plus 'all'.
├── prompts.py         Verbatim copy of the canonical 3 frozen prompts
│                      (P-EASY/P-MEDIUM/P-HARD) from
│                      WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt   Empty by intent (stdlib-only); placeholder for
│                      pip-tools / agent auto-install patterns.
├── .gitignore         __pycache__/ etc.
└── suites/            Six bundled JSON suites copied verbatim from
                       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
                       small_model_eval_questions.json, python_task_suite_questions.json,
                       parallel_qwen_same_model_20q_suite.json,
                       parallel_qwen_mixed_model_20q_suite.json,
                       python_context_edge_append_questions.json,
                       python_context_edge_suite_only.json.
submissions/
    README.md          Folder convention + naming + reviewability rules
    EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
                       Synthetic-but-shape-complete contribution template:
                       manifest.json, hardware.json, run.jsonl (5 example lines),
                       metadata.json, run.md (with privacy attestation, methodology
                       deviations, reproducibility command). Marked as synthetic at
                       the top so future analysis doesn't accidentally cite it.
LICENSE-MIT
MIT for harness/*.py and future helper code. Existing LICENSE
(CC-BY-4.0) covers data files.
README.md (modified)
Updated to reflect dual purpose. Layout diagram updated.
Maintainer credits: Ben for catalogue/methodology + Bane for harness.
Contributor quick-start added. Status table extended.
Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
paths / tokens. Two prompts contain project names ("MyBoard" auth
debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
flagged in chat for Sloba's review. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
IPs leaked).
- manifest.json captures host_hostname_short via socket.gethostname()
.split('.')[0] — agent should review before PR if hostname is
sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.
Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid
This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
436
CLAUDE.md
Normal file
@@ -0,0 +1,436 @@
# CLAUDE.md — runbook for the friend's coding agent

> **You are the coding agent (Claude Code, Codex, Aider, Cursor, …) of
> someone who agreed to run the Weeyuga benchmark suite on their hardware.
> This file is your briefing. Read it completely before doing anything.**
>
> If your runtime auto-loads `AGENTS.md` instead of `CLAUDE.md`, that file
> is byte-identical — same content, different filename for tools that
> prefer one over the other.

---

## TL;DR (60 seconds)

You are going to:

1. **Probe** the friend's hardware (OS, CPU, RAM, GPU + VRAM).
2. **Pick** an OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX) and a model that fits.
3. **Adapt** the bundled runner to the friend's reality (this is the part you can't skip — see §3).
4. **Run** `harness/run_benchmark.py` against the runtime; it writes a JSONL ledger of every call.
5. **Document** the run in `submissions/<friend-handle>/<device-tag>/run-<uuid>/`.
6. **Submit** a PR back to this repo. Sloba reviews and merges manually — nothing auto-merges.

You have **read-only access** to this repo. To submit, fork on Gitea, push to your fork, open a PR.

The harness is a **template**, not a one-click button. The friend's hardware will not match Sloba's; you are expected to adapt parameters, research best-known values online when you hit unfamiliar constraints, and **write down what you changed and why** in the manifest. Documented deviation is fine. Silent deviation breaks comparability.

---

## 0. Read this completely before doing anything

The rest of this file is structured as the order you'll work in. Reading the whole thing first gives you the shape; then the friend can ask "go" and you execute without circling back.

If you hit something genuinely ambiguous, ASK THE FRIEND. Don't guess at hardware-specific values; either verify with measurement or research them from the project / model authors' recommended-settings docs.

---

## 1. What you are running, exactly

**Inputs (frozen across runs):**
- `harness/prompts.py` — three frozen prompts (P-EASY, P-MEDIUM, P-HARD). Never modified.
- `harness/suites/*.json` — six benchmark suites, all run sequentially per model:

| phase key | suite file | what it tests |
|---|---|---|
| `5q` | `small_model_eval_questions.json` | 5 short-answer formatting + correctness questions |
| `20q` | `python_task_suite_questions.json` | 20 realistic Python task prompts |
| `parallel_same` | `parallel_qwen_same_model_20q_suite.json` | parallel-lane stress with one model |
| `parallel_mixed` | `parallel_qwen_mixed_model_20q_suite.json` | parallel-lane stress with multiple models |
| `edge_append` | `python_context_edge_append_questions.json` | long-context append behavior |
| `edge_suite` | `python_context_edge_suite_only.json` | long-context whole-suite reasoning |

**Driver:** `harness/run_benchmark.py` — one process, sequential calls to your local OpenAI-compatible `/v1/chat/completions` endpoint, one JSONL line per call.

**Output:** `submissions/<handle>/<device-tag>/run-<uuid>/` containing:
- `run.jsonl` — every call recorded
- `manifest.json` — written automatically by the runner
- `hardware.json` — **you fill this** from the hardware probe (§2)
- `metadata.json` — computed aggregates (you generate, see §6)
- `run.md` — human-readable summary (you write, see §6)

**Run order:** ALL six suites run in sequence per model, per the canonical Pavilion methodology Sloba uses. Don't pick-and-choose unless the friend is explicitly time-constrained — partial runs are still useful but they're documented as "partial" in the manifest, and they show up as `archive-only` in Sloba's catalogue rather than full-grade runs.

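Before moving on, it can help to have a quick check that a run directory contains all five output files listed above. The following is a minimal sketch, not part of the harness: `missing_run_files` is a hypothetical helper name, and treating an empty file as missing is an assumption made here.

```python
from pathlib import Path

# File names taken from the "Output" list above.
REQUIRED_FILES = ("run.jsonl", "manifest.json", "hardware.json", "metadata.json", "run.md")


def missing_run_files(run_dir):
    """Return the required file names that are absent or empty in run_dir."""
    run_dir = Path(run_dir)
    missing = []
    for name in REQUIRED_FILES:
        path = run_dir / name
        # Assumption: a zero-byte file counts as missing.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical path; substitute the real handle, device tag and run id.
    print(missing_run_files("submissions/alice/mac-m1-8gb/run-<uuid>"))
```
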

---

## 2. Hardware probe — do this first, write `hardware.json` from the result

Before anything else, gather the friend's hardware truth. Pick the platform-appropriate commands; don't run all of them, just the ones that work on the friend's OS.

**macOS:**
```bash
system_profiler SPHardwareDataType SPDisplaysDataType
sysctl -n machdep.cpu.brand_string
sysctl -n hw.memsize   # total RAM in bytes
sw_vers
uname -a
```

**Linux:**
```bash
lscpu
head -3 /proc/meminfo
nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv  # if NVIDIA
lspci | grep -iE "vga|3d|display"
uname -a
cat /etc/os-release
```

**Windows (PowerShell):**
```powershell
Get-CimInstance Win32_ComputerSystem | Select Manufacturer, Model, TotalPhysicalMemory
Get-CimInstance Win32_Processor | Select Name, NumberOfCores, MaxClockSpeed
# AdapterRAM is a 32-bit field, so it under-reports GPUs with more than 4 GB of VRAM
Get-CimInstance Win32_VideoController | Select Name, AdapterRAM, DriverVersion
# Works in both Windows PowerShell 5.1 and PowerShell 7 ($PSVersionTable.OS is empty on 5.1)
Get-CimInstance Win32_OperatingSystem | Select Caption, Version, BuildNumber
```

Write the canonical findings to `hardware.json`. Schema (every field present; `null` if not applicable):

```json
{
  "schema_version": "hardware-1.0",
  "device_tag": "mac-m1-8gb",
  "manufacturer_model": "Apple MacBook Air (MacBookAir10,1)",
  "os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
  "cpu": {"name": "Apple M1", "cores": 8, "threads": 8, "max_ghz": 3.2,
          "arch": "arm64", "isa": ["NEON"]},
  "memory_gb_total": 8,
  "memory_gb_available_at_run_start": 4.2,
  "gpu": [
    {"name": "Apple M1 GPU", "kind": "integrated", "vram_gb": null,
     "driver": "Metal/macOS-14", "compute_cap": null}
  ],
  "storage": {"kind": "ssd", "free_gb_at_run_start": 220},
  "thermal_or_power_notes": "default OS thermal mgmt; on AC power",
  "network_used_for_model_fetch": "wifi-100mbps",
  "container_or_vm": null
}
```

Honest mode flags to mention in `thermal_or_power_notes`:
- "battery-only, low-power-mode active" → results may be artificially slow
- "thermal throttling observed mid-run" → tag any affected calls in `run.md` caveats
- "GPU shared with display compositor" → expect 5–15% throughput hit vs headless

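As a starting point for `hardware.json`, the fields that Python's standard library can detect on its own can be pre-filled, leaving everything else as `null` to be completed from the probe commands above. This is a hedged sketch: `hardware_skeleton` is a hypothetical helper, not something the harness ships, and the choice of decimal gigabytes is an assumption.

```python
import json
import os
import platform
import shutil


def hardware_skeleton(device_tag, path="."):
    """Pre-fill what the stdlib can detect; the rest stays None for the probe output."""
    free_gb = shutil.disk_usage(path).free / 1e9  # assumption: decimal GB
    return {
        "schema_version": "hardware-1.0",
        "device_tag": device_tag,
        "manufacturer_model": None,
        # platform.system() reports "Darwin" on macOS; rename by hand if desired.
        "os": {"name": platform.system(), "version": None, "kernel": platform.release()},
        "cpu": {"name": None, "cores": None, "threads": os.cpu_count(),
                "max_ghz": None, "arch": platform.machine(), "isa": None},
        "memory_gb_total": None,
        "memory_gb_available_at_run_start": None,
        "gpu": [],
        "storage": {"kind": None, "free_gb_at_run_start": round(free_gb, 1)},
        "thermal_or_power_notes": None,
        "network_used_for_model_fetch": None,
        "container_or_vm": None,
    }


if __name__ == "__main__":
    print(json.dumps(hardware_skeleton("mac-m1-8gb"), indent=2))
```
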

---

## 3. Adapt to hardware reality — this is the part you cannot skip

The harness uses Sloba's canonical knobs as defaults. They are **not** guaranteed to be optimal for the friend's hardware. Your job is to check them against the friend's machine (§3a, §3b) and document every change you make (§3c).

### 3a. Canonical knobs (Sloba's reference values)

```python
CANONICAL_OPTIONS = {
    "temperature": 0.1,    # near-deterministic; comparable across runs
    "num_ctx": 4096,       # context window
    "num_predict": 2048,   # max generated tokens per call
}
```

Plus runtime-level knobs. The names below are Ollama's (set as environment variables with an `OLLAMA_` prefix, e.g. `OLLAMA_KEEP_ALIVE`); the same ideas apply to llama.cpp / vLLM under their own flag names:
- `KEEP_ALIVE` — how long the loaded model stays warm. Sloba uses **2400h** on cluster nodes (~100 days = effectively pinned). On a friend's laptop, **5m** is gentler if RAM is tight.
- `NUM_PARALLEL` — concurrent slots. Sloba uses **3** on Pavilion. **1** is fine on tight RAM.
- `MAX_LOADED_MODELS` — how many models held in VRAM. Sloba uses **3** on a 12 GB GPU; default to **1** on anything ≤ 8 GB.
- For llama.cpp: `--n-gpu-layers` (NGL) — number of model layers offloaded to GPU. **Critical** on borderline VRAM. NGL=99 is full offload; NGL=0 is CPU-only. Sloba's Predator (GTX 1060 6 GB) runs Qwen3.5:9B at NGL=6 because higher offloads OOM with the KV cache.

### 3b. Decision rules

| Friend's hardware | Likely runtime | Likely model size | Likely NGL | Likely NUM_PARALLEL |
|---|---|---|---|---|
| Apple Silicon (M1/M2/M3, ≥8 GB unified) | Ollama OR llama.cpp w/ Metal OR MLX | 0.5B – 4B | n/a (Metal handles offload) | 1–2 |
| Apple Silicon (M-Pro/M-Max, ≥16 GB) | same, MLX preferred for 8B+ | 4B – 14B | n/a | 2–3 |
| NVIDIA GPU 6 GB VRAM | llama.cpp + CUDA | 0.5B – 4B (or 8B at NGL ~10–20) | tuned per model | 1 |
| NVIDIA GPU 8–12 GB VRAM | llama.cpp + CUDA, or vLLM | 4B – 14B | high (60–99) | 1–2 |
| NVIDIA GPU 24+ GB VRAM | vLLM or llama.cpp | up to 32B | 99 (full) | 4+ |
| AMD GPU | llama.cpp + ROCm | conservative one tier below NVIDIA-equivalent | tuned | 1 |
| CPU only | llama.cpp + CPU | 0.5B – 2B (Q4_K_M) | 0 | 1 |

These are starting points. **Don't trust them blindly.** For any model + hardware combination you're uncertain about:

1. Check the model's HuggingFace card for "recommended quantization / hardware" notes.
2. Check the runtime's GitHub for known issues with this model family.
3. Look up llama.cpp issues for "VRAM OOM <model>" — community usually finds the NGL sweet spot.
4. If still uncertain, run a dry probe: `python3 harness/run_benchmark.py --probe --target-url ... --cell-id-prefix ... --models <model>` and observe RSS / VRAM / tokens-per-sec.

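The table above can be read as a simple lookup. The sketch below encodes it that way, purely as an illustration: `starting_point` is a hypothetical function, and the handling of memory sizes that fall between table rows (they drop to the nearest lower row, and NVIDIA cards under 6 GB fall back to the CPU-only row) is an assumption of this sketch, not something the table states.

```python
def starting_point(kind, memory_gb=0.0):
    """Look up a starting configuration from the decision table.

    kind: "apple", "nvidia", "amd" or "cpu".
    memory_gb: unified RAM for Apple Silicon, VRAM for NVIDIA; ignored otherwise.
    """
    cpu_row = {"runtime": "llama.cpp + CPU", "model_size": "0.5B-2B (Q4_K_M)",
               "ngl": 0, "num_parallel": "1"}
    if kind == "cpu":
        return cpu_row
    if kind == "apple":
        # Assumption: the >=16 GB row is applied by memory size alone.
        if memory_gb >= 16:
            return {"runtime": "Ollama, llama.cpp w/ Metal, or MLX (MLX preferred for 8B+)",
                    "model_size": "4B-14B", "ngl": None, "num_parallel": "2-3"}
        return {"runtime": "Ollama, llama.cpp w/ Metal, or MLX",
                "model_size": "0.5B-4B", "ngl": None, "num_parallel": "1-2"}
    if kind == "nvidia":
        if memory_gb >= 24:
            return {"runtime": "vLLM or llama.cpp", "model_size": "up to 32B",
                    "ngl": 99, "num_parallel": "4+"}
        if memory_gb >= 8:
            return {"runtime": "llama.cpp + CUDA, or vLLM", "model_size": "4B-14B",
                    "ngl": "60-99", "num_parallel": "1-2"}
        if memory_gb >= 6:
            return {"runtime": "llama.cpp + CUDA", "model_size": "0.5B-4B",
                    "ngl": "tuned per model", "num_parallel": "1"}
        return cpu_row  # assumption: below 6 GB VRAM, start from the CPU-only row
    if kind == "amd":
        return {"runtime": "llama.cpp + ROCm",
                "model_size": "one tier below the NVIDIA equivalent",
                "ngl": "tuned", "num_parallel": "1"}
    raise ValueError(f"unknown hardware kind: {kind!r}")
```

For example, `starting_point("nvidia", 6)` returns the 6 GB row with `num_parallel` of `"1"`. Whatever the lookup returns is still only a first guess; the four verification steps above apply to it unchanged.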
### 3c. Document every deviation in `manifest.json.canonical_options_overrides`

The runner records overrides automatically when you pass `--temperature` / `--num-ctx` / `--num-predict`. For runtime-level deviations (NGL, NUM_PARALLEL, KEEP_ALIVE), add them to `hardware.json.thermal_or_power_notes` or to `run.md` § Methodology Deviations. **Untracked deviations are the worst kind: they silently make a run incomparable.** Honest-and-deviated > silent-and-clean.

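To double-check what ended up recorded, the options actually used can be diffed against the canonical values from §3a. This is a sketch only: the shape of the returned dictionary is an assumption and may not match what the runner writes under `canonical_options_overrides`, and `options_overrides` is a hypothetical helper name.

```python
# Canonical values copied from §3a.
CANONICAL_OPTIONS = {"temperature": 0.1, "num_ctx": 4096, "num_predict": 2048}


def options_overrides(used):
    """Return {knob: {"canonical": x, "used": y}} for every knob that differs."""
    overrides = {}
    for key, canonical in CANONICAL_OPTIONS.items():
        value = used.get(key, canonical)  # a knob that was not passed keeps its default
        if value != canonical:
            overrides[key] = {"canonical": canonical, "used": value}
    return overrides


if __name__ == "__main__":
    # Example: context window halved to fit a tight-RAM laptop.
    print(options_overrides({"num_ctx": 2048}))
```
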

---

## 4. Pick a runtime and a model

Sloba's instruction: **use any model**. The harness doesn't ship a fixed model list — the friend's agent picks based on hardware. Suggestions, in increasing size:

| Model | Size | When |
|---|---|---|
| `qwen2.5-coder:0.5b` | ~400 MB | minimum-viable code benchmarks; runs anywhere |
| `qwen3.5:0.8b` | ~600 MB | Sloba's reference smallest; matches his catalogue runs |
| `qwen2.5-coder:1.5b` | ~1.1 GB | code-focused mid-tier |
| `qwen3.5:2b` | ~1.5 GB | conversational mid-tier |
| `qwen3.5:4b` | ~3 GB | flagship mid-tier; common comparison point |
| `qwen3.5:8b-q4km` | ~5 GB | mid-tier flagship |
| `qwen3.5:9b-q4km` | ~5.4 GB | Sloba's Predator flagship; 6 GB VRAM borderline (run with reduced NGL) |
| `qwen3.5:14b-q4km` | ~9 GB | needs ≥10 GB VRAM or Apple Silicon ≥16 GB unified |
| `gemma-4:e4b-it-q4km` | ~3 GB | non-Qwen comparison |
| `granite-4.1:8b-q4km` | ~5 GB | non-Qwen comparison |

Models are pulled from:
- **Ollama Hub:** `ollama pull qwen3.5:0.8b`, etc.
- **HuggingFace + llama.cpp:** download the GGUF directly via `wget` or `huggingface-cli download`, then point `llama-server` at it.

Run more than one model in the same run if you can; results from a single run are the easiest to compare. The harness loops models inside one run; cell_ids encode the (node, engine, model) tuple.

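The cell_id format used in this runbook's examples (`mac-m1:ollama:qwen3.5:0.8b`) joins the `--cell-id-prefix` and the model name with a colon. Because model tags contain colons themselves, splitting a cell_id back into its tuple needs a bounded split. The helpers below are a hypothetical sketch and assume that neither the node nor the engine part contains a colon.

```python
def make_cell_id(prefix, model):
    """Join a "node:engine" prefix and a model tag into a cell_id."""
    return f"{prefix}:{model}"


def split_cell_id(cell_id):
    """Split a cell_id into (node, engine, model).

    maxsplit=2 keeps colons inside the model tag (e.g. "qwen3.5:0.8b") intact.
    Assumption: node and engine contain no colons.
    """
    node, engine, model = cell_id.split(":", 2)
    return node, engine, model


if __name__ == "__main__":
    print(split_cell_id(make_cell_id("mac-m1:ollama", "qwen3.5:0.8b")))
```
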

---

## 5. Run the benchmark

### 5a. Smoke first (30 seconds)

```bash
python3 harness/run_benchmark.py --smoke \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b \
  --cell-id-prefix mac-m1:ollama \
  --submitter-handle <friend-gitea-handle> \
  --device-tag <short-device-tag>
```

If the smoke call comes back with HTTP 200, you have a working runtime. Run the real thing.

### 5b. Full run

```bash
python3 harness/run_benchmark.py \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b,qwen3.5:4b \
  --cell-id-prefix mac-m1:ollama \
  --phases hello,5q,20q \
  --submitter-handle alice \
  --device-tag mac-m1-8gb
```

For the canonical full sweep across all six suites:
```bash
python3 harness/run_benchmark.py --phases all \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b \
  --cell-id-prefix mac-m1:ollama \
  --submitter-handle alice --device-tag mac-m1-8gb
```

Expect minutes per cell. The 20Q + edge suites are the long ones (~10–40 minutes per model on a small box). If the friend is time-bounded, drop `edge_*` and `parallel_*` — but record what you skipped.

### 5c. Resume on interrupt

If interrupted, the JSONL ledger is preserved (every line is fsync'd). To resume the same `run-id`:
```bash
python3 harness/run_benchmark.py --run-id <previous-uuid> ...
```
This appends to a new ledger; you'll need to merge them by hand (or just submit them as two separate runs sharing the same `device-tag`).

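Merging two ledgers by hand amounts to concatenating their lines in order while dropping any line that was cut off by the interrupt. The following is a minimal sketch under that assumption; `merge_ledgers` is a hypothetical helper, it does not deduplicate calls that were repeated after the resume, and the output path must differ from every input path.

```python
import json
import os


def merge_ledgers(paths, out_path):
    """Concatenate JSONL ledgers in order, skipping blank or truncated lines.

    Returns (kept, dropped). out_path must not be one of the inputs.
    """
    kept = dropped = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        json.loads(line)  # only complete JSON objects survive
                    except json.JSONDecodeError:
                        dropped += 1
                        continue
                    out.write(line + "\n")
                    kept += 1
        out.flush()
        os.fsync(out.fileno())  # same durability idea as the runner's per-line fsync
    return kept, dropped
```
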

---

## 6. Generate `metadata.json` and `run.md`

### 6a. `metadata.json` — computed aggregates per cell

Schema (one row per (cell_id, phase) pair):
```json
{
  "schema_version": "metadata-1.0",
  "run_id": "<uuid>",
  "submitter_handle": "alice",
  "device_tag": "mac-m1-8gb",
  "cells": [
    {
      "cell_id": "mac-m1:ollama:qwen3.5:0.8b",
      "phase": "20q",
      "n_calls": 20,
      "n_errors": 0,
      "duration_ms_p50": 9600,
      "duration_ms_p95": 24000,
      "duration_ms_mean": 11200,
      "tokens_per_sec_p50": 16.4,
      "tokens_per_sec_p95": 22.1,
      "tokens_per_sec_mean": 17.0,
      "tokens_per_sec_max": 24.8,
      "completion_tokens_total": 18234,
      "format_ok_rate": 0.85,
      "marker_hit_rate_mean": 0.72
    }
  ]
}
```

You can compute this in-line (small script) or use a quick Python REPL pass over `run.jsonl`. The catalogue builder on Sloba's side will recompute it anyway, but having it in the PR makes review fast.

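One possible shape for that small script is sketched below. It is hedged heavily: the per-call field names it reads from `run.jsonl` (`cell_id`, `phase`, `duration_ms`, `tokens_per_sec`, `completion_tokens`, `format_ok`, `marker_hit_rate`, `error`) are assumptions, so check them against the real ledger first, and the nearest-rank percentile is only one of several common definitions (`methodology.md` is the authority on which one the catalogue uses).

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile; interpolating definitions give slightly different numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(records):
    """Group ledger records by (cell_id, phase) and build the metadata.json cell rows."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["cell_id"], rec["phase"])].append(rec)

    cells = []
    for (cell_id, phase), recs in sorted(groups.items()):
        # Assumption: "error" is absent or empty on a successful call.
        ok = [r for r in recs if not r.get("error")]
        cell = {"cell_id": cell_id, "phase": phase,
                "n_calls": len(recs), "n_errors": len(recs) - len(ok)}
        if ok:
            durations = [r["duration_ms"] for r in ok]
            speeds = [r["tokens_per_sec"] for r in ok]
            cell.update({
                "duration_ms_p50": percentile(durations, 50),
                "duration_ms_p95": percentile(durations, 95),
                "duration_ms_mean": mean(durations),
                "tokens_per_sec_p50": percentile(speeds, 50),
                "tokens_per_sec_p95": percentile(speeds, 95),
                "tokens_per_sec_mean": mean(speeds),
                "tokens_per_sec_max": max(speeds),
                "completion_tokens_total": sum(r["completion_tokens"] for r in ok),
                "format_ok_rate": sum(bool(r["format_ok"]) for r in ok) / len(ok),
                "marker_hit_rate_mean": mean(r["marker_hit_rate"] for r in ok),
            })
        cells.append(cell)
    return cells
```

To use it, load the ledger with one `json.loads` per non-empty line of `run.jsonl`, pass the list to `aggregate`, and place the result under `"cells"` next to the `run_id`, `submitter_handle` and `device_tag` fields shown in the schema.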
### 6b. `run.md` — human-readable summary

Template (fill in every section honestly):

````markdown
# <device-tag> — <model-set> — <YYYY-MM-DD>

**Run ID:** `<uuid>`
**Submitter:** <handle>
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.x (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b, qwen3.5:4b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, 4 GB free at run start was insufficient for parallel suites.

## Headline numbers

| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b 20q | 20 | 17.0 | 16.4 | 9.6 s | 85% |
| mac-m1:ollama:qwen3.5:4b 20q | 20 | 5.8 | 5.5 | 28.2 s | 70% |

## Methodology

Followed the canonical Pavilion methodology with these deviations:

- **NUM_PARALLEL=1** instead of canonical 3 — 8 GB unified RAM doesn't fit two warm copies of qwen3.5:4b.
- **KEEP_ALIVE=5m** instead of 2400h — laptop, not server.
- **edge_* and parallel_* phases skipped** — RAM constraint (see "Phases skipped" above).

## Caveats

- Run started at 18% battery; one call (20q-q14, model qwen3.5:4b) coincided with macOS Spotlight indexing; flagged in run.jsonl with run_idx=14 — that data point is high-variance.
- Network was on hotel wifi; model pull took ~6 minutes for qwen3.5:4b. Did not affect benchmark timing (model warm before any timed call).

## Reproducibility

```
python3 harness/run_benchmark.py \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b,qwen3.5:4b \
  --cell-id-prefix mac-m1:ollama \
  --phases hello,5q,20q \
  --submitter-handle alice \
  --device-tag mac-m1-8gb
```
````

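The "Headline numbers" table in the template can be generated from the `cells` list in `metadata.json` instead of being typed by hand, which avoids transcription slips. A minimal sketch, assuming the field names from the §6a schema; `headline_table` is a hypothetical helper and the column selection simply mirrors the template.

```python
def headline_table(cells):
    """Render metadata.json cell rows as the Markdown table used in run.md."""
    lines = [
        "| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |",
        "|---|---|---|---|---|---|",
    ]
    for c in cells:
        lines.append(
            f"| {c['cell_id']} {c['phase']} | {c['n_calls']} "
            f"| {c['tokens_per_sec_mean']:.1f} | {c['tokens_per_sec_p50']:.1f} "
            f"| {c['duration_ms_p50'] / 1000:.1f} s "  # milliseconds to seconds
            f"| {c['format_ok_rate']:.0%} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Values taken from the example cell in the §6a schema.
    print(headline_table([{
        "cell_id": "mac-m1:ollama:qwen3.5:0.8b", "phase": "20q", "n_calls": 20,
        "tokens_per_sec_mean": 17.0, "tokens_per_sec_p50": 16.4,
        "duration_ms_p50": 9600, "format_ok_rate": 0.85,
    }]))
```
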

---

## 7. Submit the PR

1. **Fork** `https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public` to the friend's Gitea account (Gitea web UI → "Fork").
2. **Add the friend's fork as a remote on the local clone:**
   ```bash
   git remote add fork ssh://gitea@git.weeyuga.com/<friend-handle>/weeyuga-benchmarks-public.git
   ```
3. **Create a topic branch** off `main`:
   ```bash
   git checkout -b submission/<handle>-<device-tag>-<short-date>
   ```
4. **Stage only the new files under `submissions/<handle>/<device-tag>/run-<uuid>/`.** NEVER modify anything outside that directory in this PR.
   ```bash
   git add submissions/<handle>/<device-tag>/run-<uuid>/
   git status   # confirm: only files under your run-<uuid>/ are staged
   ```
5. **Commit** with a descriptive message:
   ```
   submit: alice / mac-m1-8gb / 2026-05-12 — qwen3.5 0.8b+4b, hello+5q+20q

   First contribution from a friend's hardware. M1 8 GB unified, Ollama 0.5.x.
   Skipped edge_* + parallel_* due to RAM. Headline: qwen3.5:0.8b ~17 tok/s,
   qwen3.5:4b ~5.8 tok/s on 20Q.
   ```
6. **Push to fork:**
   ```bash
   git push fork submission/<handle>-<device-tag>-<short-date>
   ```
7. **Open a PR on Gitea** with target = `slobodanmargetic988/weeyuga-benchmarks-public:main`. PR description should include:
   - One-paragraph what-and-why
   - Link to the friend's `run.md`
   - Any methodology deviations the reviewer should know
   - Privacy attestation: "I have reviewed run.jsonl and confirmed no PII / SSH keys / API tokens / personal home paths leaked"

Sloba reviews and merges. **Nothing auto-merges.** A typical review surfaces 0–2 follow-ups; address and force-push to the same branch.

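Step 4's rule (nothing staged outside the run directory) can be checked mechanically rather than by eye. The sketch below assumes the staged paths come from `git diff --cached --name-only`, which prints repo-relative POSIX paths; `paths_outside_run_dir` is a hypothetical helper, not part of the harness.

```python
import subprocess
from pathlib import PurePosixPath


def paths_outside_run_dir(staged_paths, run_dir):
    """Return the staged paths that are not inside run_dir (repo-relative, POSIX style)."""
    root = PurePosixPath(run_dir)
    # A path is inside run_dir exactly when run_dir is one of its parent directories.
    return [p for p in staged_paths if root not in PurePosixPath(p).parents]


def staged_paths():
    """List the files currently staged in the repository at the working directory."""
    out = subprocess.run(["git", "diff", "--cached", "--name-only"],
                         check=True, capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if line]


if __name__ == "__main__":
    # Hypothetical run directory; substitute the real one.
    stray = paths_outside_run_dir(staged_paths(), "submissions/alice/mac-m1-8gb/run-<uuid>")
    print("stray files:", stray or "none")
```
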

---

## 8. Privacy guardrails — DO NOT submit any of these

- API keys (OpenAI, Anthropic, HuggingFace tokens, etc.)
- SSH private keys, `.ssh/` paths
- Personal home directory paths (`/Users/alice/secrets/...`)
- Real names if the friend prefers a handle
- Internal corporate IPs, hostnames, or SSO endpoints
- Bearer tokens in error messages (some runtimes echo headers in 4xx errors)

Before pushing, **scan the run directory** (`run.jsonl` plus the `.md` and `.json` files) for these patterns:
```bash
grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" submissions/<handle>/<device-tag>/run-<uuid>/*.{jsonl,md,json}
```

If anything matches, redact it from `response_preview` (the JSONL stores only the first 240 chars of each response, so leaks are rare — but please scan).

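On systems without `grep` (or without bash brace expansion), the same scan can be done with the standard library. This sketch reuses the exact alternation from the command above; `privacy_hits` is a hypothetical helper, and like the grep it will report harmless matches (for instance any word containing `sk-`), so every hit still needs a human look.

```python
import re
from pathlib import Path

# Same alternation as the grep command above.
PATTERN = re.compile(r"Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519")
SCANNED_SUFFIXES = {".jsonl", ".md", ".json"}


def privacy_hits(run_dir):
    """Return (file name, line number, line excerpt) for every matching line in run_dir."""
    hits = []
    for path in sorted(Path(run_dir).iterdir()):
        if not path.is_file() or path.suffix not in SCANNED_SUFFIXES:
            continue
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                if PATTERN.search(line):
                    hits.append((path.name, lineno, line.strip()[:80]))
    return hits
```
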

---

## 9. What if you get stuck

- **`/v1/models` returns empty:** the runtime isn't OpenAI-compatible or no models are loaded. For Ollama: `ollama list`. For llama.cpp: historically its server did not list models on `/v1/models`; pass `--models <name> --target-url http://127.0.0.1:11436` and it'll work anyway.
- **Every call returns 500 / timeout:** runtime is up but model isn't loading. Check VRAM with `nvidia-smi` or memory pressure with `vm_stat` / `free`. Try a smaller model or a smaller `num_ctx`.
- **Tokens/sec absurdly low (<1 tok/s on hardware that should manage):** thermal throttling, swap thrashing, or the wrong quantization. Check `free -h` mid-run; if swap is being used, the model is too big for RAM.
- **One question keeps getting `format_ok=false`:** the model can't follow that instruction shape. NORMAL. Don't shorten the prompt or reword it. Document in run.md and move on.
- **Ambiguous hardware setup (eGPU? VM? Container?):** ASK the friend. Container/VM resource caps make benchmarks misleading.

---

## 10. The methodology lineage

This harness mirrors `WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py` v3 — Sloba's canonical Pavilion methodology established 2026-04-11. The 21 reference runs in `runs/` (in this repo) are the comparable baseline. Sloba's locked catalogue lives at `catalogue.json` (this repo). When your run is merged, it'll be added to the catalogue under your `device-tag` and become a new comparison point.

The methodology and harness will evolve. Current canonical version: `HARNESS_VERSION = "public-1"`. Future versions will be additive — older ledgers stay valid forever.

---

## 11. Coordinate-while-running checklist

Before you start:
- [ ] Read this whole file
- [ ] Read `methodology.md` for the metric definitions (TTFT, p50/p95, format_ok, etc.)
- [ ] Verify the friend has ≥3 GB free disk for model files
- [ ] Verify network is OK for model pull (the GGUFs are 0.5–10 GB)

While running:
- [ ] Smoke first
- [ ] Full run
- [ ] Watch for thermal throttling on laptops / phones / mini-PCs
- [ ] Don't open Chrome / Slack / Zoom mid-run if you can avoid it (VRAM pressure)

After running:
- [ ] Generate `metadata.json` aggregates
- [ ] Write `run.md` honestly — including caveats
- [ ] Privacy-scan `run.jsonl`
- [ ] Fork → branch → push → PR

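The free-disk item in the checklist is easy to verify from Python. A small sketch, with the 3 GB threshold taken from the checklist and decimal gigabytes assumed; `has_free_disk` is a hypothetical helper, and larger models from the §4 table will need a higher threshold than the default.

```python
import shutil


def has_free_disk(path=".", min_free_gb=3.0):
    """Return True if the filesystem holding `path` has at least min_free_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9  # assumption: decimal GB
    return free_gb >= min_free_gb


if __name__ == "__main__":
    print("enough disk for model files:", has_free_disk())
```
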

---

## Questions / blockers

If you hit something this runbook doesn't cover, the friend can email Sloba (slobodan@weeyuga.com) or open an issue on this repo. Don't burn an hour in a corner — ask. The whole point of crowdsourcing is the variance you'll see; that's data, not a problem.

Welcome aboard. 🦇

— The Weeyuga team

---

> **Maintainer note:** if you edit this file, edit `AGENTS.md` to match
> (Codex loads `AGENTS.md`, Claude Code loads `CLAUDE.md`; identical
> content prevents two-tier rules).