feat: harness + agent runbook — flip repo from archive-only to crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, 436 lines each)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
  ├── README.md        Technical readme for the harness folder
  ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
  │                    WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
  │                    v3 with the cluster-internal IP defaults
  │                    (10.8.0.x) replaced by 127.0.0.1:11434, the
  │                    cluster /v1/cluster/* endpoints removed, the
  │                    canonical-suite paths under ~/Documents/MyServers
  │                    replaced by harness/suites/ paths, the git-sha
  │                    enforcement on WeeyugaWeb dropped, and the
  │                    output written under submissions/<handle>/<tag>/
  │                    instead of docs/BENCHMARKS/runs/. Supports all
  │                    six suite phases via --phases, plus 'all'.
  ├── prompts.py       Verbatim copy of the canonical 3 frozen prompts
  │                    (P-EASY/P-MEDIUM/P-HARD) from
  │                    WeeyugaWeb/scripts/benchmarks/prompts.py.
  ├── requirements.txt Empty by intent (stdlib-only); placeholder for
  │                    pip-tools / agent auto-install patterns.
  ├── .gitignore       __pycache__/ etc.
  └── suites/          Six bundled JSON suites copied verbatim from
       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
       small_model_eval_questions.json, python_task_suite_questions.json,
       parallel_qwen_same_model_20q_suite.json,
       parallel_qwen_mixed_model_20q_suite.json,
       python_context_edge_append_questions.json,
       python_context_edge_suite_only.json.

submissions/
  README.md            Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
       Synthetic-but-shape-complete contribution template:
       manifest.json, hardware.json, run.jsonl (5 example lines),
       metadata.json, run.md (with privacy attestation, methodology
       deviations, reproducibility command). Marked as synthetic at
       the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
  - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
    paths / tokens. Two prompts contain project names ("MyBoard" auth
    debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
    flagged in chat for Sloba's review. Otherwise clean.
  - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
    IPs leaked).
  - manifest.json captures host_hostname_short via socket.gethostname()
    .split('.')[0] — agent should review before PR if hostname is
    sensitive.
  - CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
  - py_compile run_benchmark.py: OK
  - --help renders cleanly
  - All 6 suite JSON files: valid
  - All 4 example JSON files: valid
  - Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
Slobodan Margetic
2026-05-06 11:07:55 +02:00
committed by slobodanmargetic988
parent ddc9626136
commit 97a9245d9e
22 changed files with 4400 additions and 47 deletions

AGENTS.md Normal file

@@ -0,0 +1,436 @@
# CLAUDE.md — runbook for the friend's coding agent
> **You are the coding agent (Claude Code, Codex, Aider, Cursor, …) of
> someone who agreed to run the Weeyuga benchmark suite on their hardware.
> This file is your briefing. Read it completely before doing anything.**
>
> If your runtime auto-loads `AGENTS.md` instead of `CLAUDE.md`, that file
> is byte-identical — same content, different filename for tools that
> prefer one over the other.
---
## TL;DR (60 seconds)
You are going to:
1. **Probe** the friend's hardware (OS, CPU, RAM, GPU + VRAM).
2. **Pick** an OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX) and a model that fits.
3. **Adapt** the bundled runner to the friend's reality (this is the part you can't skip — see §3).
4. **Run** `harness/run_benchmark.py` against the runtime; it writes a JSONL ledger of every call.
5. **Document** the run in `submissions/<friend-handle>/<device-tag>/run-<uuid>/`.
6. **Submit** a PR back to this repo. Sloba reviews and merges manually — nothing auto-merges.
You have **read-only access** to this repo. To submit, fork on Gitea, push to your fork, open a PR.
The harness is a **template**, not a one-click button. The friend's hardware will not match Sloba's; you are expected to adapt parameters, research best-known values online when you hit unfamiliar constraints, and **write down what you changed and why** in the manifest. Documented deviation is fine. Silent deviation breaks comparability.
---
## 0. Read this completely before doing anything
The rest of this file follows the order you'll work in. Reading the whole thing first gives you the shape; then the friend can say "go" and you execute without circling back.
If you hit something genuinely ambiguous, ASK THE FRIEND. Don't guess at hardware-specific values; either verify with measurement or research them from the project / model authors' recommended-settings docs.
---
## 1. What you are running, exactly
**Inputs (frozen across runs):**
- `harness/prompts.py` — three frozen prompts (P-EASY, P-MEDIUM, P-HARD). Never modified.
- `harness/suites/*.json` — six benchmark suites, all run sequentially per model:
| phase key | suite file | what it tests |
|---|---|---|
| `5q` | `small_model_eval_questions.json` | 5 short-answer formatting + correctness questions |
| `20q` | `python_task_suite_questions.json` | 20 realistic Python task prompts |
| `parallel_same` | `parallel_qwen_same_model_20q_suite.json` | parallel-lane stress with one model |
| `parallel_mixed` | `parallel_qwen_mixed_model_20q_suite.json` | parallel-lane stress with multiple models |
| `edge_append` | `python_context_edge_append_questions.json` | long-context append behavior |
| `edge_suite` | `python_context_edge_suite_only.json` | long-context whole-suite reasoning |
**Driver:** `harness/run_benchmark.py` — one process, sequential calls to your local OpenAI-compatible `/v1/chat/completions` endpoint, one JSONL line per call.
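The "one JSONL line per call" pattern can be sketched as follows. This is a minimal illustration, not the runner's actual code: the function names and the record fields (`model`, `duration_ms`, `error`, `response`) are assumptions made for the example, and `send` stands in for whatever function posts the payload to `/v1/chat/completions`.

```python
import json
import os
import time
from typing import Callable


def timed_call(send: Callable[[dict], dict], payload: dict) -> dict:
    """Run one request through `send` and return a ledger record (hypothetical shape)."""
    started = time.monotonic()
    try:
        response = send(payload)
        error = None
    except Exception as exc:  # record the failure instead of aborting the run
        response, error = None, repr(exc)
    duration_ms = round((time.monotonic() - started) * 1000, 1)
    return {"model": payload.get("model"), "duration_ms": duration_ms,
            "error": error, "response": response}


def append_ledger_line(path: str, record: dict) -> None:
    """Append one JSON object as a single line and force it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # the line survives an interrupt right after this call
```

Writing and syncing one line at a time is what lets an interrupted run keep every completed call.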
**Output:** `submissions/<handle>/<device-tag>/run-<uuid>/` containing:
- `run.jsonl` — every call recorded
- `manifest.json` — written automatically by the runner
- `hardware.json` – **you fill this** from the hardware probe (§2)
- `metadata.json` — computed aggregates (you generate, see §6)
- `run.md` — human-readable summary (you write, see §6)
**Run order:** ALL six suites run in sequence per model, per the canonical Pavilion methodology Sloba uses. Don't pick-and-choose unless the friend is explicitly time-constrained — partial runs are still useful but they're documented as "partial" in the manifest, and they show up as `archive-only` in Sloba's catalogue rather than full-grade runs.
---
## 2. Hardware probe — do this first, write `hardware.json` from the result
Before anything else, gather the friend's hardware truth. Pick the platform-appropriate commands; don't run all of them, just the ones that work on the friend's OS.
**macOS:**
```bash
system_profiler SPHardwareDataType SPDisplaysDataType
sysctl -n machdep.cpu.brand_string
sysctl -n hw.memsize
sw_vers
uname -a
```
**Linux:**
```bash
lscpu
head -3 /proc/meminfo
nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv # if NVIDIA
lspci | grep -iE "vga|3d|display"
uname -a
cat /etc/os-release
```
**Windows (PowerShell):**
```powershell
Get-CimInstance Win32_ComputerSystem | Select Manufacturer, Model, TotalPhysicalMemory
Get-CimInstance Win32_Processor | Select Name, NumberOfCores, MaxClockSpeed
Get-CimInstance Win32_VideoController | Select Name, AdapterRAM, DriverVersion
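Get-CimInstance Win32_OperatingSystem | Select Caption, Version, BuildNumber  # $PSVersionTable.OS exists only in PowerShell 6+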
```
Write the canonical findings to `hardware.json`. Schema (every field present; `null` if not applicable):
```json
{
  "schema_version": "hardware-1.0",
  "device_tag": "mac-m1-8gb",
  "manufacturer_model": "Apple MacBook Air (MacBookAir10,1)",
  "os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
  "cpu": {"name": "Apple M1", "cores": 8, "threads": 8, "max_ghz": 3.2,
          "arch": "arm64", "isa": ["NEON"]},
  "memory_gb_total": 8,
  "memory_gb_available_at_run_start": 4.2,
  "gpu": [
    {"name": "Apple M1 GPU", "kind": "integrated", "vram_gb": null,
     "driver": "Metal/macOS-14", "compute_cap": null}
  ],
  "storage": {"kind": "ssd", "free_gb_at_run_start": 220},
  "thermal_or_power_notes": "default OS thermal mgmt; on AC power",
  "network_used_for_model_fetch": "wifi-100mbps",
  "container_or_vm": null
}
```
Honest mode flags to mention in `thermal_or_power_notes`:
- "battery-only, low-power-mode active" → results may be artificially slow
- "thermal throttling observed mid-run" → tag any affected calls in `run.md` caveats
- "GPU shared with display compositor" → expect 5-15% throughput hit vs headless
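As a starting point for `hardware.json`, the fields the Python standard library can see can be filled in automatically. The sketch below is an assumption-laden helper (the name `probe_hardware` is invented here); everything it cannot determine is left as `None` and still has to be filled in from the platform commands above.

```python
import json
import os
import platform
import shutil


def probe_hardware(device_tag: str) -> dict:
    """Fill what the stdlib can see; leave the rest as None for manual entry."""
    free_bytes = shutil.disk_usage(os.getcwd()).free
    return {
        "schema_version": "hardware-1.0",
        "device_tag": device_tag,
        "manufacturer_model": None,  # from system_profiler / lscpu / Get-CimInstance
        # platform.system() reports "Darwin" on macOS; rename by hand if needed.
        "os": {"name": platform.system(), "version": None,
               "kernel": platform.release()},
        "cpu": {"name": platform.processor() or None, "cores": None,
                "threads": os.cpu_count(), "max_ghz": None,
                "arch": platform.machine(), "isa": None},
        "memory_gb_total": None,
        "memory_gb_available_at_run_start": None,
        "gpu": [],  # add one entry per GPU from the probe output
        "storage": {"kind": None,
                    "free_gb_at_run_start": round(free_bytes / 1e9, 1)},
        "thermal_or_power_notes": None,
        "network_used_for_model_fetch": None,
        "container_or_vm": None,
    }


if __name__ == "__main__":
    print(json.dumps(probe_hardware("my-device-tag"), indent=2))
```

Every schema key is present in the output, so the remaining work is replacing `None` values rather than remembering which fields exist.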
---
## 3. Adapt to hardware reality — this is the part you cannot skip
The harness uses Sloba's canonical knobs as defaults. They are **not** guaranteed to be optimal for the friend's hardware. Your job:
### 3a. Canonical knobs (Sloba's reference values)
```python
CANONICAL_OPTIONS = {
    "temperature": 0.1,   # near-deterministic; comparable across runs
    "num_ctx": 4096,      # context window
    "num_predict": 2048,  # max generated tokens per call
}
```
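Plus runtime-level knobs (named here after Ollama's `OLLAMA_KEEP_ALIVE`, `OLLAMA_NUM_PARALLEL`, and `OLLAMA_MAX_LOADED_MODELS` environment variables; the same ideas apply to llama.cpp / vLLM):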
- `KEEP_ALIVE` — how long the loaded model stays warm. Sloba uses **2400h** on cluster nodes (~100 days = effectively pinned). On a friend's laptop, **5m** is gentler if RAM is tight.
- `NUM_PARALLEL` — concurrent slots. Sloba uses **3** on Pavilion. **1** is fine on tight RAM.
- `MAX_LOADED_MODELS` — how many models held in VRAM. Sloba uses **3** on a 12 GB GPU; default to **1** on anything ≤ 8 GB.
- For llama.cpp: `--n-gpu-layers` (NGL) — number of model layers offloaded to GPU. **Critical** on borderline VRAM. NGL=99 is full offload; NGL=0 is CPU-only. Sloba's Predator (GTX 1060 6 GB) runs Qwen3.5:9B at NGL=6 because higher offloads OOM with the KV cache.
### 3b. Decision rules
| Friend's hardware | Likely runtime | Likely model size | Likely NGL | Likely NUM_PARALLEL |
|---|---|---|---|---|
| Apple Silicon (M1/M2/M3, ≥8 GB unified) | Ollama OR llama.cpp w/ Metal OR MLX | 0.5B–4B | n/a (Metal handles offload) | 1–2 |
| Apple Silicon (M-Pro/M-Max, ≥16 GB) | same, MLX preferred for 8B+ | 4B–14B | n/a | 2–3 |
| NVIDIA GPU 6 GB VRAM | llama.cpp + CUDA | 0.5B–4B (or 8B at NGL ~10–20) | tuned per model | 1 |
| NVIDIA GPU 8–12 GB VRAM | llama.cpp + CUDA, or vLLM | 4B–14B | high (60–99) | 1–2 |
| NVIDIA GPU 24+ GB VRAM | vLLM or llama.cpp | up to 32B | 99 (full) | 4+ |
| AMD GPU | llama.cpp + ROCm | conservative – one tier below NVIDIA-equivalent | tuned | 1 |
| CPU only | llama.cpp + CPU | 0.5B–2B (Q4_K_M) | 0 | 1 |
These are starting points. **Don't trust them blindly.** For any model + hardware combination you're uncertain about:
1. Check the model's HuggingFace card for "recommended quantization / hardware" notes.
2. Check the runtime's GitHub for known issues with this model family.
3. Look up llama.cpp issues for "VRAM OOM <model>" — community usually finds the NGL sweet spot.
4. If still uncertain, run a dry probe: `python3 harness/run_benchmark.py --probe --target-url ... --cell-id-prefix ... --models <model>` and observe RSS / VRAM / tokens-per-sec.
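The decision table can also be written down as a small lookup, which makes the starting point explicit before any tuning. This is only a sketch: the row labels (`"apple"`, `"nvidia-6gb"`, ...) are invented for the example, and the handling of VRAM sizes the table does not list is an assumption flagged in the comments.

```python
from typing import Optional

# Rows copied from the decision table; the dictionary keys are hypothetical labels.
STARTING_POINTS = {
    "apple": {"runtime": "Ollama, llama.cpp w/ Metal, or MLX",
              "model_size": "0.5B-4B", "ngl": None, "num_parallel": "1-2"},
    "apple-pro": {"runtime": "same, MLX preferred for 8B+",
                  "model_size": "4B-14B", "ngl": None, "num_parallel": "2-3"},
    "nvidia-6gb": {"runtime": "llama.cpp + CUDA",
                   "model_size": "0.5B-4B", "ngl": "tuned per model", "num_parallel": "1"},
    "nvidia-8-12gb": {"runtime": "llama.cpp + CUDA, or vLLM",
                      "model_size": "4B-14B", "ngl": "60-99", "num_parallel": "1-2"},
    "nvidia-24gb+": {"runtime": "vLLM or llama.cpp",
                     "model_size": "up to 32B", "ngl": "99", "num_parallel": "4+"},
    "amd": {"runtime": "llama.cpp + ROCm",
            "model_size": "one tier below NVIDIA-equivalent", "ngl": "tuned", "num_parallel": "1"},
    "cpu": {"runtime": "llama.cpp + CPU",
            "model_size": "0.5B-2B (Q4_K_M)", "ngl": "0", "num_parallel": "1"},
}


def starting_point(kind: str, vram_gb: Optional[float] = None) -> dict:
    """Return the table row for a hardware class; NVIDIA rows are picked by VRAM."""
    if kind != "nvidia":
        return STARTING_POINTS[kind]
    if vram_gb is None:
        raise ValueError("vram_gb is required for NVIDIA GPUs")
    if vram_gb >= 24:
        return STARTING_POINTS["nvidia-24gb+"]
    if vram_gb >= 8:
        # Assumption: cards between 12 and 24 GB fall back to the 8-12 GB row.
        return STARTING_POINTS["nvidia-8-12gb"]
    # Assumption: anything below 8 GB uses the 6 GB row.
    return STARTING_POINTS["nvidia-6gb"]
```

Treat the returned values as the first thing to try, then confirm them with the research steps and the dry probe.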
### 3c. Document every deviation in `manifest.json.canonical_options_overrides`
The runner records overrides automatically when you pass `--temperature` / `--num-ctx` / `--num-predict`. For runtime-level deviations (NGL, NUM_PARALLEL, KEEP_ALIVE), add them to `hardware.json.thermal_or_power_notes` or to `run.md` § Methodology Deviations. **Untracked deviations are the worst kind: they silently make a run incomparable.** Honest-and-deviated > silent-and-clean.
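To double-check what ended up overridden, the options actually used can be diffed against the canonical values. The sketch below is illustrative only: the output shape (`{"canonical": ..., "actual": ...}`) is an assumption, and the real layout of `canonical_options_overrides` is whatever the runner writes.

```python
CANONICAL_OPTIONS = {"temperature": 0.1, "num_ctx": 4096, "num_predict": 2048}


def compute_overrides(actual: dict) -> dict:
    """Return only the options that differ from the canonical values."""
    return {
        key: {"canonical": CANONICAL_OPTIONS.get(key), "actual": value}
        for key, value in actual.items()
        if CANONICAL_OPTIONS.get(key) != value
    }
```

For example, `compute_overrides({"temperature": 0.1, "num_ctx": 2048})` reports only `num_ctx`, which is the line that belongs in the deviations section of `run.md`.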
---
## 4. Pick a runtime and a model
Sloba's instruction: **use any model**. The harness doesn't ship a fixed model list — the friend's agent picks based on hardware. Suggestions, in increasing size:
| Model | Size | When |
|---|---|---|
| `qwen2.5-coder:0.5b` | ~400 MB | minimum-viable code benchmarks; runs anywhere |
| `qwen3.5:0.8b` | ~600 MB | Sloba's reference smallest; matches his catalogue runs |
| `qwen2.5-coder:1.5b` | ~1.1 GB | code-focused mid-tier |
| `qwen3.5:2b` | ~1.5 GB | conversational mid-tier |
| `qwen3.5:4b` | ~3 GB | flagship mid-tier; common comparison point |
| `qwen3.5:8b-q4km` | ~5 GB | mid-tier flagship |
| `qwen3.5:9b-q4km` | ~5.4 GB | Sloba's Predator flagship; 6 GB VRAM borderline (run with reduced NGL) |
| `qwen3.5:14b-q4km` | ~9 GB | needs ≥10 GB VRAM or Apple Silicon ≥16 GB unified |
| `gemma-4:e4b-it-q4km` | ~3 GB | non-Qwen comparison |
| `granite-4.1:8b-q4km` | ~5 GB | non-Qwen comparison |
Models are pulled from:
- **Ollama Hub:** `ollama pull qwen3.5:0.8b`, etc.
- **HuggingFace + llama.cpp:** download the GGUF directly via `wget` or `huggingface-cli download`, then point `llama-server` at it.
Run more than one model in the same run if you can, for comparability. The harness loops over models inside one run; each cell_id encodes the (node, engine, model) tuple.
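Because model names contain colons themselves (`qwen3.5:0.8b`), splitting a cell_id needs a little care. A minimal sketch, assuming the cell_id is simply the `--cell-id-prefix` value joined to the model name with a colon, as the example `mac-m1:ollama:qwen3.5:0.8b` suggests; the helper names are invented here.

```python
def make_cell_id(prefix: str, model: str) -> str:
    """Join a 'node:engine' prefix and a model name into one cell_id."""
    return f"{prefix}:{model}"


def split_cell_id(cell_id: str) -> tuple:
    """Split into (node, engine, model); the model keeps any colons of its own."""
    node, engine, model = cell_id.split(":", 2)
    return node, engine, model
```

Limiting the split to two separators keeps `qwen3.5:0.8b` intact as the model part.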
---
## 5. Run the benchmark
### 5a. Smoke first (30 seconds)
```bash
python3 harness/run_benchmark.py --smoke \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac-m1:ollama \
--submitter-handle <friend-gitea-handle> \
--device-tag <short-device-tag>
```
If the smoke run comes back with HTTP 200, you have a working runtime. Run the real thing.
### 5b. Full run
```bash
python3 harness/run_benchmark.py \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b,qwen3.5:4b \
--cell-id-prefix mac-m1:ollama \
--phases hello,5q,20q \
--submitter-handle alice \
--device-tag mac-m1-8gb
```
For the canonical full sweep across all six suites:
```bash
python3 harness/run_benchmark.py --phases all \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac-m1:ollama \
--submitter-handle alice --device-tag mac-m1-8gb
```
Expect minutes per cell. The 20Q + edge suites are the long ones (~10–40 minutes per model on a small box). If the friend is time-bounded, drop edge_* and parallel_* — but record what you skipped.
### 5c. Resume on interrupt
If interrupted, the JSONL ledger is preserved (every line is fsync'd). To resume the same `run-id`:
```bash
python3 harness/run_benchmark.py --run-id <previous-uuid> ...
```
This writes to a new ledger rather than extending the old one; you'll need to merge the two by hand (or just submit them as two separate runs sharing the same `device-tag`).
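Merging two ledgers by hand amounts to concatenating them while dropping exact duplicates and unparseable lines. The helper below is a sketch under that assumption (`merge_ledgers` is not part of the harness); it reports how many invalid lines it skipped so that nothing disappears silently.

```python
import json


def merge_ledgers(paths, out_path):
    """Concatenate JSONL ledgers in order; return (kept, skipped_invalid)."""
    seen = set()
    kept = skipped = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as fh:
                for raw in fh:
                    line = raw.strip()
                    if not line:
                        continue
                    try:
                        record = json.loads(line)
                    except json.JSONDecodeError:
                        skipped += 1  # e.g. a line cut short by the interrupt
                        continue
                    key = json.dumps(record, sort_keys=True)
                    if key in seen:
                        continue  # identical record already written
                    seen.add(key)
                    out.write(line + "\n")
                    kept += 1
    return kept, skipped
```

If the skipped count is not zero, mention it in the `run.md` caveats.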
---
## 6. Generate `metadata.json` and `run.md`
### 6a. `metadata.json` — computed aggregates per cell
Schema (one row per (cell_id, phase) pair):
```json
{
  "schema_version": "metadata-1.0",
  "run_id": "<uuid>",
  "submitter_handle": "alice",
  "device_tag": "mac-m1-8gb",
  "cells": [
    {
      "cell_id": "mac-m1:ollama:qwen3.5:0.8b",
      "phase": "20q",
      "n_calls": 20,
      "n_errors": 0,
      "duration_ms_p50": 9600,
      "duration_ms_p95": 24000,
      "duration_ms_mean": 11200,
      "tokens_per_sec_p50": 16.4,
      "tokens_per_sec_p95": 22.1,
      "tokens_per_sec_mean": 17.0,
      "tokens_per_sec_max": 24.8,
      "completion_tokens_total": 18234,
      "format_ok_rate": 0.85,
      "marker_hit_rate_mean": 0.72
    }
  ]
}
```
You can compute this in-line (small script) or use a quick Python REPL pass over `run.jsonl`. The catalogue builder on Sloba's side will recompute it anyway, but having it in the PR makes review fast.
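A small script for the duration aggregates could look like the sketch below. It assumes each ledger line carries `cell_id`, `phase`, `duration_ms`, and an `error` field that is empty on success; check those names against your actual `run.jsonl` before relying on it. It uses the nearest-rank percentile definition, which is one of several in common use and may differ slightly from what the catalogue builder computes.

```python
import json
import math
import statistics
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(ledger_path):
    """Group ledger lines by (cell_id, phase) and summarise call durations."""
    groups = defaultdict(list)
    with open(ledger_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                rec = json.loads(line)
                groups[(rec["cell_id"], rec["phase"])].append(rec)
    cells = []
    for (cell_id, phase), recs in sorted(groups.items()):
        ok = [r for r in recs if not r.get("error")]
        durations = [r["duration_ms"] for r in ok]
        cell = {"cell_id": cell_id, "phase": phase,
                "n_calls": len(recs), "n_errors": len(recs) - len(ok)}
        if durations:  # errored calls are excluded from the timing statistics
            cell["duration_ms_p50"] = percentile(durations, 50)
            cell["duration_ms_p95"] = percentile(durations, 95)
            cell["duration_ms_mean"] = round(statistics.mean(durations), 1)
        cells.append(cell)
    return cells
```

The `tokens_per_sec_*` and rate fields follow the same pattern once you know which ledger fields hold them.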
### 6b. `run.md` — human-readable summary
Template (fill in every section honestly):
````markdown
# <device-tag> — <model-set> — <YYYY-MM-DD>
**Run ID:** `<uuid>`
**Submitter:** <handle>
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.x (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b, qwen3.5:4b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, 4 GB free at run start was insufficient for parallel suites.
## Headline numbers
| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b 20q | 20 | 17.0 | 16.4 | 9.6 s | 85% |
| mac-m1:ollama:qwen3.5:4b 20q | 20 | 5.8 | 5.5 | 28.2 s | 70% |
## Methodology
Followed the canonical Pavilion methodology with these deviations:
- **NUM_PARALLEL=1** instead of canonical 3 — 8 GB unified RAM leaves no headroom for extra parallel slots (each adds KV-cache memory) alongside qwen3.5:4b.
- **KEEP_ALIVE=5m** instead of 2400h — laptop, not server.
- **edge_* and parallel_* phases skipped** — RAM constraint (see "Phases skipped" above).
## Caveats
- Run started at 18% battery; one call (20q-q14, model qwen3.5:4b) coincided with macOS Spotlight indexing; flagged in run.jsonl with run_idx=14 — that data point is high-variance.
- Network was on hotel wifi; model pull took ~6 minutes for qwen3.5:4b. Did not affect benchmark timing (model warm before any timed call).
## Reproducibility
```
python3 harness/run_benchmark.py \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b,qwen3.5:4b \
--cell-id-prefix mac-m1:ollama \
--phases hello,5q,20q \
--submitter-handle alice \
--device-tag mac-m1-8gb
```
````
---
## 7. Submit the PR
1. **Fork** `https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public` to the friend's Gitea account (Gitea web UI → "Fork").
2. **Add the friend's fork as a remote on the local clone:**
```bash
git remote add fork ssh://gitea@git.weeyuga.com/<friend-handle>/weeyuga-benchmarks-public.git
```
3. **Create a topic branch** off `main`:
```bash
git checkout -b submission/<handle>-<device-tag>-<short-date>
```
4. **Stage only the new files under `submissions/<handle>/<device-tag>/run-<uuid>/`.** NEVER modify anything outside that directory in this PR.
```bash
git add submissions/<handle>/<device-tag>/run-<uuid>/
git status # confirm: only files under your run-<uuid>/ are staged
```
5. **Commit** with a descriptive message:
```
submit: alice / mac-m1-8gb / 2026-05-12 — qwen3.5 0.8b+4b, hello+5q+20q
First contribution from a friend's hardware. M1 8 GB unified, Ollama 0.5.x.
Skipped edge_* + parallel_* due to RAM. Headline: qwen3.5:0.8b ~17 tok/s,
qwen3.5:4b ~5.8 tok/s on 20Q.
```
6. **Push to fork:**
```bash
git push fork submission/<handle>-<device-tag>-<short-date>
```
7. **Open a PR on Gitea** with target = `slobodanmargetic988/weeyuga-benchmarks-public:main`. PR description should include:
- One-paragraph what-and-why
- Link to the friend's `run.md`
- Any methodology deviations the reviewer should know
- Privacy attestation: "I have reviewed run.jsonl and confirmed no PII / SSH keys / API tokens / personal home paths leaked"
Sloba reviews and merges. **Nothing auto-merges.** A typical review surfaces 0–2 follow-ups; address and force-push to the same branch.
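The "only files under your run directory" rule from step 4 is easy to check mechanically. The sketch below takes the list of staged paths (for example, the output of `git diff --cached --name-only`) and returns anything that falls outside the run directory; `paths_outside` is a hypothetical helper, not part of the harness.

```python
from pathlib import PurePosixPath


def paths_outside(staged, run_dir):
    """Return the staged paths that are not inside run_dir."""
    root = PurePosixPath(run_dir)
    return [p for p in staged if root not in PurePosixPath(p).parents]
```

An empty result means the commit touches nothing but the submission; anything else should be unstaged before pushing.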
---
## 8. Privacy guardrails — DO NOT submit any of these
- API keys (OpenAI, Anthropic, HuggingFace tokens, etc.)
- SSH private keys, `.ssh/` paths
- Personal home directory paths (`/Users/alice/secrets/...`)
- Real names if the friend prefers a handle
- Internal corporate IPs, hostnames, or SSO endpoints
- Bearer tokens in error messages (some runtimes echo headers in 4xx errors)
Before pushing, **scan the run.jsonl** for these patterns:
```bash
grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" submissions/<handle>/<device-tag>/run-<uuid>/*.{jsonl,md,json}
```
If anything matches, redact it from `response_preview` (the JSONL stores only the first 240 chars of each response, so leaks are rare — but please scan).
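On systems without `grep` (a stock Windows shell, for instance), the same scan can be done in Python. This sketch reuses the exact pattern from the command above; the function name and the shape of the returned tuples are choices made for the example.

```python
import re
from pathlib import Path

# Same alternatives as the grep pattern above.
LEAK_PATTERN = re.compile(
    r"Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519"
)


def scan_run_dir(run_dir):
    """Return (file name, line number, line) for every suspicious line."""
    hits = []
    for path in sorted(Path(run_dir).iterdir()):
        if not path.is_file() or path.suffix not in {".jsonl", ".md", ".json"}:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if LEAK_PATTERN.search(line):
                hits.append((path.name, lineno, line.strip()))
    return hits
```

A match is a prompt to look, not proof of a leak: `password` can legitimately appear in a benchmark prompt, so review each hit before redacting.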
---
## 9. What if you get stuck
- **`/v1/models` returns empty:** the runtime isn't OpenAI-compatible or no models are loaded. For Ollama: `ollama list`. For llama.cpp: older `llama-server` builds did not list models on `/v1/models`; pass `--models <name> --target-url http://127.0.0.1:11436` (substitute the port your `llama-server` actually listens on) and it'll work anyway.
- **Every call returns 500 / timeout:** runtime is up but the model isn't loading. Check VRAM with `nvidia-smi` or memory pressure with `vm_stat` / `free -h`. Try a smaller model or a smaller `num_ctx`.
- **Tokens/sec absurdly low (<1 tok/s on hardware that should manage):** thermal throttling, swap thrashing, or wrong-quantization. Check `free -h` mid-run; if swap is being used, model is too big for RAM.
- **One question keeps getting `format_ok=false`:** model can't follow that instruction shape. NORMAL. Don't shorten the prompt or reword. Document in run.md and move on.
- **Ambiguous hardware setup (eGPU? VM? Container?):** ASK the friend. Container/VM resource caps make benchmarks misleading.
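For the first item, it helps to see exactly what the runtime reports. The sketch below assumes the standard OpenAI list shape (`{"object": "list", "data": [{"id": ...}]}`) and a runtime reachable at the given base URL; both function names are invented for the example.

```python
import json
import urllib.request


def model_ids(models_body: str) -> list:
    """Extract model ids from the body of a GET /v1/models response."""
    data = json.loads(models_body).get("data") or []
    return [entry["id"] for entry in data if "id" in entry]


def fetch_model_ids(base_url: str, timeout: float = 10.0) -> list:
    """Ask a running OpenAI-compatible server which models it exposes."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return model_ids(resp.read().decode("utf-8"))
```

An empty list from `fetch_model_ids("http://127.0.0.1:11434")` is the "returns empty" case described above; a connection error means the runtime is not listening on that port at all.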
---
## 10. The methodology lineage
This harness mirrors `WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py` v3 — Sloba's canonical Pavilion methodology established 2026-04-11. The 21 reference runs in `runs/` (in this repo) are the comparable baseline. Sloba's locked catalogue lives at `catalogue.json` (this repo). When your run is merged, it'll be added to the catalogue under your `device-tag` and become a new comparison point.
The methodology and harness will evolve. Current canonical version: `HARNESS_VERSION = "public-1"`. Future versions will be additive — older ledgers stay valid forever.
---
## 11. Coordinate-while-running checklist
Before you start:
- [ ] Read this whole file
- [ ] Read `methodology.md` for the metric definitions (TTFT, p50/p95, format_ok, etc.)
- [ ] Verify the friend has ≥3 GB free disk for model files
- [ ] Verify network is OK for model pull (the GGUFs are 0.5–10 GB)
While running:
- [ ] Smoke first
- [ ] Full run
- [ ] Watch for thermal throttling on laptops / phones / mini-PCs
- [ ] Don't open Chrome / Slack / Zoom mid-run if you can avoid it (VRAM pressure)
After running:
- [ ] Generate `metadata.json` aggregates
- [ ] Write `run.md` honestly — including caveats
- [ ] Privacy-scan `run.jsonl`
- [ ] Fork → branch → push → PR
---
## Questions / blockers
If you hit something this runbook doesn't cover, the friend can email Sloba (slobodan@weeyuga.com) or open an issue on this repo. Don't burn an hour in a corner — ask. The whole point of crowdsourcing is the variance you'll see; that's data, not a problem.
Welcome aboard. 🦇
— The Weeyuga team
---
> **Maintainer note:** if you edit this file, edit `AGENTS.md` to match
> (Codex loads `AGENTS.md`, Claude Code loads `CLAUDE.md`; identical
> content prevents two-tier rules).

CLAUDE.md Normal file

@@ -0,0 +1,436 @@
# CLAUDE.md — runbook for the friend's coding agent
> **You are the coding agent (Claude Code, Codex, Aider, Cursor, …) of
> someone who agreed to run the Weeyuga benchmark suite on their hardware.
> This file is your briefing. Read it completely before doing anything.**
>
> If your runtime auto-loads `AGENTS.md` instead of `CLAUDE.md`, that file
> is byte-identical — same content, different filename for tools that
> prefer one over the other.
---
## TL;DR (60 seconds)
You are going to:
1. **Probe** the friend's hardware (OS, CPU, RAM, GPU + VRAM).
2. **Pick** an OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX) and a model that fits.
3. **Adapt** the bundled runner to the friend's reality (this is the part you can't skip — see §3).
4. **Run** `harness/run_benchmark.py` against the runtime; it writes a JSONL ledger of every call.
5. **Document** the run in `submissions/<friend-handle>/<device-tag>/run-<uuid>/`.
6. **Submit** a PR back to this repo. Sloba reviews and merges manually — nothing auto-merges.
You have **read-only access** to this repo. To submit, fork on Gitea, push to your fork, open a PR.
The harness is a **template**, not a one-click button. The friend's hardware will not match Sloba's; you are expected to adapt parameters, research best-known values online when you hit unfamiliar constraints, and **write down what you changed and why** in the manifest. Documented deviation is fine. Silent deviation breaks comparability.
---
## 0. Read this completely before doing anything
The rest of this file follows the order you'll work in. Reading the whole thing first gives you the shape; then the friend can say "go" and you execute without circling back.
If you hit something genuinely ambiguous, ASK THE FRIEND. Don't guess at hardware-specific values; either verify with measurement or research them from the project / model authors' recommended-settings docs.
---
## 1. What you are running, exactly
**Inputs (frozen across runs):**
- `harness/prompts.py` — three frozen prompts (P-EASY, P-MEDIUM, P-HARD). Never modified.
- `harness/suites/*.json` — six benchmark suites, all run sequentially per model:
| phase key | suite file | what it tests |
|---|---|---|
| `5q` | `small_model_eval_questions.json` | 5 short-answer formatting + correctness questions |
| `20q` | `python_task_suite_questions.json` | 20 realistic Python task prompts |
| `parallel_same` | `parallel_qwen_same_model_20q_suite.json` | parallel-lane stress with one model |
| `parallel_mixed` | `parallel_qwen_mixed_model_20q_suite.json` | parallel-lane stress with multiple models |
| `edge_append` | `python_context_edge_append_questions.json` | long-context append behavior |
| `edge_suite` | `python_context_edge_suite_only.json` | long-context whole-suite reasoning |
**Driver:** `harness/run_benchmark.py` — one process, sequential calls to your local OpenAI-compatible `/v1/chat/completions` endpoint, one JSONL line per call.
**Output:** `submissions/<handle>/<device-tag>/run-<uuid>/` containing:
- `run.jsonl` — every call recorded
- `manifest.json` — written automatically by the runner
- `hardware.json` – **you fill this** from the hardware probe (§2)
- `metadata.json` — computed aggregates (you generate, see §6)
- `run.md` — human-readable summary (you write, see §6)
**Run order:** ALL six suites run in sequence per model, per the canonical Pavilion methodology Sloba uses. Don't pick-and-choose unless the friend is explicitly time-constrained — partial runs are still useful but they're documented as "partial" in the manifest, and they show up as `archive-only` in Sloba's catalogue rather than full-grade runs.
---
## 2. Hardware probe — do this first, write `hardware.json` from the result
Before anything else, gather the friend's hardware truth. Pick the platform-appropriate commands; don't run all of them, just the ones that work on the friend's OS.
**macOS:**
```bash
system_profiler SPHardwareDataType SPDisplaysDataType
sysctl -n machdep.cpu.brand_string
sysctl -n hw.memsize
sw_vers
uname -a
```
**Linux:**
```bash
lscpu
head -3 /proc/meminfo
nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv # if NVIDIA
lspci | grep -iE "vga|3d|display"
uname -a
cat /etc/os-release
```
**Windows (PowerShell):**
```powershell
Get-CimInstance Win32_ComputerSystem | Select Manufacturer, Model, TotalPhysicalMemory
Get-CimInstance Win32_Processor | Select Name, NumberOfCores, MaxClockSpeed
Get-CimInstance Win32_VideoController | Select Name, AdapterRAM, DriverVersion
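Get-CimInstance Win32_OperatingSystem | Select Caption, Version, BuildNumber  # $PSVersionTable.OS exists only in PowerShell 6+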
```
Write the canonical findings to `hardware.json`. Schema (every field present; `null` if not applicable):
```json
{
  "schema_version": "hardware-1.0",
  "device_tag": "mac-m1-8gb",
  "manufacturer_model": "Apple MacBook Air (MacBookAir10,1)",
  "os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
  "cpu": {"name": "Apple M1", "cores": 8, "threads": 8, "max_ghz": 3.2,
          "arch": "arm64", "isa": ["NEON"]},
  "memory_gb_total": 8,
  "memory_gb_available_at_run_start": 4.2,
  "gpu": [
    {"name": "Apple M1 GPU", "kind": "integrated", "vram_gb": null,
     "driver": "Metal/macOS-14", "compute_cap": null}
  ],
  "storage": {"kind": "ssd", "free_gb_at_run_start": 220},
  "thermal_or_power_notes": "default OS thermal mgmt; on AC power",
  "network_used_for_model_fetch": "wifi-100mbps",
  "container_or_vm": null
}
```
Honest mode flags to mention in `thermal_or_power_notes`:
- "battery-only, low-power-mode active" → results may be artificially slow
- "thermal throttling observed mid-run" → tag any affected calls in `run.md` caveats
- "GPU shared with display compositor" → expect 5-15% throughput hit vs headless
---
## 3. Adapt to hardware reality — this is the part you cannot skip
The harness uses Sloba's canonical knobs as defaults. They are **not** guaranteed to be optimal for the friend's hardware. Your job:
### 3a. Canonical knobs (Sloba's reference values)
```python
CANONICAL_OPTIONS = {
    "temperature": 0.1,   # near-deterministic; comparable across runs
    "num_ctx": 4096,      # context window
    "num_predict": 2048,  # max generated tokens per call
}
```
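Plus runtime-level knobs (named here after Ollama's `OLLAMA_KEEP_ALIVE`, `OLLAMA_NUM_PARALLEL`, and `OLLAMA_MAX_LOADED_MODELS` environment variables; the same ideas apply to llama.cpp / vLLM):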
- `KEEP_ALIVE` — how long the loaded model stays warm. Sloba uses **2400h** on cluster nodes (~100 days = effectively pinned). On a friend's laptop, **5m** is gentler if RAM is tight.
- `NUM_PARALLEL` — concurrent slots. Sloba uses **3** on Pavilion. **1** is fine on tight RAM.
- `MAX_LOADED_MODELS` — how many models held in VRAM. Sloba uses **3** on a 12 GB GPU; default to **1** on anything ≤ 8 GB.
- For llama.cpp: `--n-gpu-layers` (NGL) — number of model layers offloaded to GPU. **Critical** on borderline VRAM. NGL=99 is full offload; NGL=0 is CPU-only. Sloba's Predator (GTX 1060 6 GB) runs Qwen3.5:9B at NGL=6 because higher offloads OOM with the KV cache.
### 3b. Decision rules
| Friend's hardware | Likely runtime | Likely model size | Likely NGL | Likely NUM_PARALLEL |
|---|---|---|---|---|
| Apple Silicon (M1/M2/M3, ≥8 GB unified) | Ollama OR llama.cpp w/ Metal OR MLX | 0.5B–4B | n/a (Metal handles offload) | 1–2 |
| Apple Silicon (M-Pro/M-Max, ≥16 GB) | same, MLX preferred for 8B+ | 4B–14B | n/a | 2–3 |
| NVIDIA GPU 6 GB VRAM | llama.cpp + CUDA | 0.5B–4B (or 8B at NGL ~10–20) | tuned per model | 1 |
| NVIDIA GPU 8–12 GB VRAM | llama.cpp + CUDA, or vLLM | 4B–14B | high (60–99) | 1–2 |
| NVIDIA GPU 24+ GB VRAM | vLLM or llama.cpp | up to 32B | 99 (full) | 4+ |
| AMD GPU | llama.cpp + ROCm | conservative – one tier below NVIDIA-equivalent | tuned | 1 |
| CPU only | llama.cpp + CPU | 0.5B–2B (Q4_K_M) | 0 | 1 |
These are starting points. **Don't trust them blindly.** For any model + hardware combination you're uncertain about:
1. Check the model's HuggingFace card for "recommended quantization / hardware" notes.
2. Check the runtime's GitHub for known issues with this model family.
3. Look up llama.cpp issues for "VRAM OOM <model>" — community usually finds the NGL sweet spot.
4. If still uncertain, run a dry probe: `python3 harness/run_benchmark.py --probe --target-url ... --cell-id-prefix ... --models <model>` and observe RSS / VRAM / tokens-per-sec.
### 3c. Document every deviation in `manifest.json.canonical_options_overrides`
The runner records overrides automatically when you pass `--temperature` / `--num-ctx` / `--num-predict`. For runtime-level deviations (NGL, NUM_PARALLEL, KEEP_ALIVE), add them to `hardware.json.thermal_or_power_notes` or to `run.md` § Methodology Deviations. **Untracked deviations are the worst kind: they silently make a run incomparable.** Honest-and-deviated > silent-and-clean.
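As a rough illustration of what an override record could look like, the sketch below diffs the options actually used against the canonical values. The helper name `compute_overrides` and the `{"canonical": ..., "used": ...}` shape are assumptions made for this example; the runner's real manifest layout may differ.

```python
# Canonical Pavilion options as listed in this runbook.
CANONICAL_OPTIONS = {"temperature": 0.1, "num_ctx": 4096, "num_predict": 2048}


def compute_overrides(effective):
    """Return one entry per canonical knob whose value was changed.

    Hypothetical helper: the record shape is an assumption for illustration,
    not the runner's actual manifest schema.
    """
    return {
        key: {"canonical": CANONICAL_OPTIONS[key], "used": value}
        for key, value in effective.items()
        if key in CANONICAL_OPTIONS and value != CANONICAL_OPTIONS[key]
    }
```

Calling `compute_overrides({"temperature": 0.1, "num_ctx": 2048, "num_predict": 2048})` yields a single `num_ctx` entry, which is the kind of record a reviewer needs to judge comparability.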
---
## 4. Pick a runtime and a model
Sloba's instruction: **use any model**. The harness doesn't ship a fixed model list — the friend's agent picks based on hardware. Suggestions, in increasing size:
| Model | Size | When |
|---|---|---|
| `qwen2.5-coder:0.5b` | ~400 MB | minimum-viable code benchmarks; runs anywhere |
| `qwen3.5:0.8b` | ~600 MB | Sloba's reference smallest; matches his catalogue runs |
| `qwen2.5-coder:1.5b` | ~1.1 GB | code-focused mid-tier |
| `qwen3.5:2b` | ~1.5 GB | conversational mid-tier |
| `qwen3.5:4b` | ~3 GB | flagship mid-tier; common comparison point |
| `qwen3.5:8b-q4km` | ~5 GB | mid-tier flagship |
| `qwen3.5:9b-q4km` | ~5.4 GB | Sloba's Predator flagship; 6 GB VRAM borderline (run with reduced NGL) |
| `qwen3.5:14b-q4km` | ~9 GB | needs ≥10 GB VRAM or Apple Silicon ≥16 GB unified |
| `gemma-4:e4b-it-q4km` | ~3 GB | non-Qwen comparison |
| `granite-4.1:8b-q4km` | ~5 GB | non-Qwen comparison |
Models are pulled from:
- **Ollama Hub:** `ollama pull qwen3.5:0.8b`, etc.
- **HuggingFace + llama.cpp:** download GGUF directly via `wget`/`hf-download`, then point `llama-server` at it.
Run more than one model in the same run if you can, because models measured in one session are directly comparable. The harness loops over models inside one run; each `cell_id` encodes the (node, engine, model) tuple.
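Because model names themselves contain colons (`qwen3.5:0.8b`), splitting a `cell_id` needs a bounded split. The sketch below assumes the `<node-tag>:<engine>:<model>` layout seen in this runbook's examples; the function names are hypothetical and not part of the runner.

```python
def make_cell_id(prefix, model):
    """Join a '<node-tag>:<engine>' prefix with a model name."""
    return f"{prefix}:{model}"


def split_cell_id(cell_id):
    """Split into (node, engine, model), keeping colons inside the model name."""
    node, engine, model = cell_id.split(":", 2)
    return node, engine, model
```

The `maxsplit=2` argument is what keeps `qwen3.5:0.8b` intact as the third element.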
---
## 5. Run the benchmark
### 5a. Smoke first (30 seconds)
```bash
python3 harness/run_benchmark.py --smoke \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac-m1:ollama \
--submitter-handle <friend-gitea-handle> \
--device-tag <short-device-tag>
```
If the smoke call comes back with HTTP 200, you have a working runtime. Run the real thing.
### 5b. Full run
```bash
python3 harness/run_benchmark.py \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b,qwen3.5:4b \
--cell-id-prefix mac-m1:ollama \
--phases hello,5q,20q \
--submitter-handle alice \
--device-tag mac-m1-8gb
```
For the canonical full sweep across all six suites:
```bash
python3 harness/run_benchmark.py --phases all \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac-m1:ollama \
--submitter-handle alice --device-tag mac-m1-8gb
```
Expect minutes per cell. The 20Q + edge suites are the long ones (~10–40 minutes per model on a small box). If the friend is time-bounded, drop edge_* and parallel_*, but record what you skipped.
### 5c. Resume on interrupt
If interrupted, the JSONL ledger is preserved (every line is fsync'd). To resume the same `run-id`:
```bash
python3 harness/run_benchmark.py --run-id <previous-uuid> ...
```
This appends to a new ledger; you'll need to merge them by hand (or just submit them as two separate runs sharing the same `device-tag`).
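If you do merge by hand, a small script is less error-prone than copy-paste. The sketch below is one possible approach, assuming each ledger holds one JSON object per line and that byte-identical lines are true duplicates; `merge_ledgers` is a hypothetical helper, not something the harness ships.

```python
import json
from pathlib import Path


def merge_ledgers(paths, out_path):
    """Concatenate JSONL ledgers in order, dropping blank and duplicate lines.

    Returns the number of events written. Raises ValueError on a corrupt line
    so a truncated ledger is noticed instead of silently merged.
    """
    seen = set()
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            for line in Path(path).read_text(encoding="utf-8").splitlines():
                line = line.strip()
                if not line or line in seen:
                    continue
                json.loads(line)  # validation only; the original text is kept
                seen.add(line)
                out.write(line + "\n")
                kept += 1
    return kept
```

Keep the original ledgers next to the merged file and mention the merge in `run.md`, so the reviewer can trace what happened.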
---
## 6. Generate `metadata.json` and `run.md`
### 6a. `metadata.json` — computed aggregates per cell
Schema (one row per (cell_id, phase) pair):
```json
{
"schema_version": "metadata-1.0",
"run_id": "<uuid>",
"submitter_handle": "alice",
"device_tag": "mac-m1-8gb",
"cells": [
{
"cell_id": "mac-m1:ollama:qwen3.5:0.8b",
"phase": "20q",
"n_calls": 20,
"n_errors": 0,
"duration_ms_p50": 9600,
"duration_ms_p95": 24000,
"duration_ms_mean": 11200,
"tokens_per_sec_p50": 16.4,
"tokens_per_sec_p95": 22.1,
"tokens_per_sec_mean": 17.0,
"tokens_per_sec_max": 24.8,
"completion_tokens_total": 18234,
"format_ok_rate": 0.85,
"marker_hit_rate_mean": 0.72
}
]
}
```
You can compute this in-line (small script) or use a quick Python REPL pass over `run.jsonl`. The catalogue builder on Sloba's side will recompute it anyway, but having it in the PR makes review fast.
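A minimal sketch of such an aggregation pass follows. The ledger field names (`cell_id`, `phase`, `duration_ms`, `tokens_per_sec`, `error`) and the nearest-rank percentile are assumptions for illustration; check them against your actual `run.jsonl` and `methodology.md` before relying on the numbers.

```python
import json
import math
from collections import defaultdict
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate(lines):
    """Group ledger events by (cell_id, phase) and compute per-cell aggregates."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        # Assumed field names; events without them (e.g. a meta record) are skipped.
        if "cell_id" in event and "phase" in event:
            groups[(event["cell_id"], event["phase"])].append(event)

    cells = []
    for (cell_id, phase), events in sorted(groups.items()):
        ok = [e for e in events if not e.get("error")]
        durations = [e["duration_ms"] for e in ok if "duration_ms" in e]
        speeds = [e["tokens_per_sec"] for e in ok if "tokens_per_sec" in e]
        cell = {
            "cell_id": cell_id,
            "phase": phase,
            "n_calls": len(events),
            "n_errors": len(events) - len(ok),
        }
        if durations:
            cell["duration_ms_p50"] = percentile(durations, 50)
            cell["duration_ms_p95"] = percentile(durations, 95)
            cell["duration_ms_mean"] = mean(durations)
        if speeds:
            cell["tokens_per_sec_p50"] = percentile(speeds, 50)
            cell["tokens_per_sec_p95"] = percentile(speeds, 95)
            cell["tokens_per_sec_mean"] = mean(speeds)
            cell["tokens_per_sec_max"] = max(speeds)
        cells.append(cell)
    return cells
```

Feed it the lines of `run.jsonl` (for example `aggregate(open(path, encoding="utf-8"))`) and place the result under `"cells"`. Fields such as `format_ok_rate` are left out here because their ledger representation is not shown in this section.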
### 6b. `run.md` — human-readable summary
Template (fill in every section honestly):
```markdown
# <device-tag> — <model-set> — <YYYY-MM-DD>
**Run ID:** `<uuid>`
**Submitter:** <handle>
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.x (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b, qwen3.5:4b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, 4 GB free at run start was insufficient for parallel suites.
## Headline numbers
| cell | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b 20q | 20 | 17.0 | 16.4 | 9.6 s | 85% |
| mac-m1:ollama:qwen3.5:4b 20q | 20 | 5.8 | 5.5 | 28.2 s | 70% |
## Methodology
Followed the canonical Pavilion methodology with these deviations:
- **NUM_PARALLEL=1** instead of canonical 3 — 8 GB unified RAM doesn't fit two warm copies of qwen3.5:4b.
- **KEEP_ALIVE=5m** instead of 2400h — laptop, not server.
- **edge_* and parallel_* phases skipped** — friend's time budget.
## Caveats
- Run started at 18% battery; one call (20q-q14, model qwen3.5:4b) coincided with macOS Spotlight indexing; flagged in run.jsonl with run_idx=14 — that data point is high-variance.
- Network was on hotel wifi; model pull took ~6 minutes for qwen3.5:4b. Did not affect benchmark timing (model warm before any timed call).
## Reproducibility
~~~
python3 harness/run_benchmark.py \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b,qwen3.5:4b \
--cell-id-prefix mac-m1:ollama \
--phases hello,5q,20q \
--submitter-handle alice \
--device-tag mac-m1-8gb
~~~
```
---
## 7. Submit the PR
1. **Fork** `https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public` to the friend's Gitea account (Gitea web UI → "Fork").
2. **Add the friend's fork as a remote on the local clone:**
```bash
git remote add fork ssh://gitea@git.weeyuga.com/<friend-handle>/weeyuga-benchmarks-public.git
```
3. **Create a topic branch** off `main`:
```bash
git checkout -b submission/<handle>-<device-tag>-<short-date>
```
4. **Stage only the new files under `submissions/<handle>/<device-tag>/run-<uuid>/`.** NEVER modify anything outside that directory in this PR.
```bash
git add submissions/<handle>/<device-tag>/run-<uuid>/
git status # confirm: only files under your run-<uuid>/ are staged
```
5. **Commit** with a descriptive message:
```
submit: alice / mac-m1-8gb / 2026-05-12 — qwen3.5 0.8b+4b, hello+5q+20q
First contribution from a friend's hardware. M1 8 GB unified, Ollama 0.5.x.
Skipped edge_* + parallel_* due to RAM. Headline: qwen3.5:0.8b ~17 tok/s,
qwen3.5:4b ~5.8 tok/s on 20Q.
```
6. **Push to fork:**
```bash
git push fork submission/<handle>-<device-tag>-<short-date>
```
7. **Open a PR on Gitea** with target = `slobodanmargetic988/weeyuga-benchmarks-public:main`. PR description should include:
- One-paragraph what-and-why
- Link to the friend's `run.md`
- Any methodology deviations the reviewer should know
- Privacy attestation: "I have reviewed run.jsonl and confirmed no PII / SSH keys / API tokens / personal home paths leaked"
Sloba reviews and merges. **Nothing auto-merges.** A typical review surfaces 0–2 follow-ups; address them and force-push to the same branch.
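The "only files under your run directory are staged" check in step 4 can also be scripted. This is a sketch under the assumption that you run it from the repository root; both helper names are hypothetical. `git diff --cached --name-only` is the standard way to list staged paths.

```python
import subprocess


def paths_outside(paths, run_dir):
    """Return the paths that do not live under run_dir."""
    prefix = run_dir.rstrip("/") + "/"
    return [p for p in paths if not p.startswith(prefix)]


def list_staged_paths():
    """List currently staged files using `git diff --cached --name-only`."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line for line in out.splitlines() if line]
```

If `paths_outside(list_staged_paths(), "submissions/<handle>/<device-tag>/run-<uuid>")` returns anything, unstage those files before committing.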
---
## 8. Privacy guardrails — DO NOT submit any of these
- API keys (OpenAI, Anthropic, HuggingFace tokens, etc.)
- SSH private keys, `.ssh/` paths
- Personal home directory paths (`/Users/alice/secrets/...`)
- Real names if the friend prefers a handle
- Internal corporate IPs, hostnames, or SSO endpoints
- Bearer tokens in error messages (some runtimes echo headers in 4xx errors)
Before pushing, **scan the run.jsonl** for these patterns:
```bash
grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" submissions/<handle>/<device-tag>/run-<uuid>/*.{jsonl,md,json}
```
If anything matches, redact it from `response_preview` (the JSONL stores only the first 240 chars of each response, so leaks are rare — but please scan).
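On systems without `grep` (or for agents that prefer one toolchain), the same scan can be approximated in Python. The sketch below reuses the regular expression from the command above; the helper names and the choice to scan only files directly inside the run directory are assumptions for this example.

```python
import re
from pathlib import Path

# Same alternation as the grep command in this section.
LEAK_PATTERN = re.compile(
    r"Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519"
)
SCANNED_SUFFIXES = {".jsonl", ".md", ".json"}


def find_leaks(text):
    """Return (line_number, line) pairs that match any leak pattern."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if LEAK_PATTERN.search(line)
    ]


def scan_run_dir(run_dir):
    """Scan every .jsonl/.md/.json file directly inside run_dir."""
    hits = []
    for path in sorted(Path(run_dir).iterdir()):
        if path.is_file() and path.suffix in SCANNED_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="replace")
            hits.extend((path.name, n, line) for n, line in find_leaks(text))
    return hits
```

Expect false positives: `sk-` also matches harmless words such as "risk-free". Treat every hit as something to read, not something to delete automatically.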
---
## 9. What if you get stuck
- **`/v1/models` returns empty:** the runtime isn't OpenAI-compatible or no models are loaded. For Ollama, check `ollama list`. llama.cpp historically hasn't listed models on `/v1/models`; pass `--models <name> --target-url http://127.0.0.1:11436` and it will work anyway.
- **Every call returns 500 / timeout:** runtime is up but model isn't loading. Check VRAM with `nvidia-smi` or memory pressure with `vm_stat` / `free`. Try a smaller model or a smaller `num_ctx`.
- **Tokens/sec absurdly low (<1 tok/s on hardware that should manage):** thermal throttling, swap thrashing, or wrong-quantization. Check `free -h` mid-run; if swap is being used, model is too big for RAM.
- **One question keeps getting `format_ok=false`:** model can't follow that instruction shape. NORMAL. Don't shorten the prompt or reword. Document in run.md and move on.
- **Ambiguous hardware setup (eGPU? VM? Container?):** ASK the friend. Container/VM resource caps make benchmarks misleading.
---
## 10. The methodology lineage
This harness mirrors `WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py` v3 — Sloba's canonical Pavilion methodology established 2026-04-11. The 21 reference runs in `runs/` (in this repo) are the comparable baseline. Sloba's locked catalogue lives at `catalogue.json` (this repo). When your run is merged, it'll be added to the catalogue under your `device-tag` and become a new comparison point.
The methodology and harness will evolve. Current canonical version: `HARNESS_VERSION = "public-1"`. Future versions will be additive — older ledgers stay valid forever.
---
## 11. Coordinate-while-running checklist
Before you start:
- [ ] Read this whole file
- [ ] Read `methodology.md` for the metric definitions (TTFT, p50/p95, format_ok, etc.)
- [ ] Verify the friend has ≥3 GB free disk for model files
- [ ] Verify network is OK for model pull (the GGUFs are 0.5–10 GB)
While running:
- [ ] Smoke first
- [ ] Full run
- [ ] Watch for thermal throttling on laptops / phones / mini-PCs
- [ ] Don't open Chrome / Slack / Zoom mid-run if you can avoid it (VRAM pressure)
After running:
- [ ] Generate `metadata.json` aggregates
- [ ] Write `run.md` honestly — including caveats
- [ ] Privacy-scan `run.jsonl`
- [ ] Fork → branch → push → PR
---
## Questions / blockers
If you hit something this runbook doesn't cover, the friend can email Sloba (slobodan@weeyuga.com) or open an issue on this repo. Don't burn an hour in a corner — ask. The whole point of crowdsourcing is the variance you'll see; that's data, not a problem.
Welcome aboard. 🦇
— The Weeyuga team
---
> **Maintainer note:** if you edit this file, edit `AGENTS.md` to match
> (Codex loads `AGENTS.md`, Claude Code loads `CLAUDE.md`; identical
> content prevents two-tier rules).

CONTRIBUTING.md Normal file

@@ -0,0 +1,99 @@
# Contributing benchmarks from your hardware
Welcome — and thank you for adding a data point. This file is the
human-readable companion to `CLAUDE.md` / `AGENTS.md`. The agent files have
the full mechanical detail; this one has the humans-in-the-loop story.
## What you're contributing
You're running the same Weeyuga benchmark suite that powers
[benchmarks.weeyuga.com](https://benchmarks.weeyuga.com), on your hardware,
and submitting the raw output as a PR. Your numbers join Sloba's
cluster numbers as a comparison point. More devices = more honest
ladder.
## What you need
- A device with ≥3 GB free disk space (model files are 0.5–10 GB depending on size)
- An OpenAI-compatible LLM runtime — Ollama (easiest), llama.cpp, vLLM, or MLX
- A coding agent — Claude Code, Codex, Aider, Cursor, etc. — to read the runbook and adapt the harness to your hardware
- A Gitea account on `git.weeyuga.com` (free; you'll need it to fork + PR)
- Maybe 1–4 hours wall-clock, mostly idle while the benchmark runs
## How it works
1. **You clone** this repo:
```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
cd weeyuga-benchmarks-public
```
2. **You point your agent at it** with one sentence:
> "Read `CLAUDE.md` (or `AGENTS.md`), then run the Weeyuga benchmark on
> my hardware. Pick a model that fits, document everything, and prepare
> a PR. I'll review before pushing."
3. **The agent does the work** — probes your hardware, picks a model size,
adapts the runner's parameters, runs the suite, generates a JSONL ledger
plus a human-readable summary.
4. **You review** what the agent produced — especially `run.md` for honesty
and `run.jsonl` for any accidentally-leaked secrets.
5. **You fork → push → open PR** (the agent can do this too; you
confirm the PR on Gitea afterwards).
6. **Sloba reviews the PR** and merges. He may have a question or two; the
agent can address them on the same branch.
## Why crowdsource benchmarks?
Hardware variance is huge. Sloba's published numbers come from a Mac M1, a
laptop with a GTX 1060, and three VPSes — that's a thin slice of the world.
Your RTX 4090, your Snapdragon X Elite, your Ryzen 7950X3D, your old Xeon
all have stories worth recording.
Same prompts. Same suites. Different hardware. The numbers compose.
## What we ask
- Run the canonical suite as-is (5q + 20q minimum; the rest if you have time)
- **Document deviations honestly.** If you had to skip parallel suites because
of RAM, say so. If you tweaked NGL because the default OOM'd, say so. The
point is comparable runs, not perfect runs.
- **Privacy-scan before pushing.** `run.jsonl` stores response previews — if
the model echoed your home directory or an API key from your shell history,
redact before PR.
- **One PR per device per session.** Don't bundle "my laptop AND my desktop
AND my friend's PC" — separate PRs are easier to review.
## What the maintainers (Sloba + team) commit to
- We respond to PRs within ~3 days
- We don't merge without reading; if your run.md has clear caveats we'll usually merge
- We credit you by handle in `catalogue.json` if/when your run becomes a flagship
- We never expose anything from your `run.md` or `manifest.json` beyond what
you submitted; if you used a pseudonym, that's the name that ships
- If we ask for a re-run with different parameters, that's a separate dispatch — we don't silently reinterpret your run
## License of your contribution
By PR-ing data into this repo, you license that data under
[CC-BY-4.0](LICENSE) and any harness/runner code you contribute under
[MIT](LICENSE-MIT). Attribution stays with you (your handle becomes part of
the run record).
## What this repo is NOT
- A leaderboard with prizes
- A way to "win" against other devices (the point is honest measurement, not bragging rights)
- A vehicle for marketing claims (vendor PR runs need a separate flow we
haven't designed yet — please don't astroturf the catalogue)
## Found a bug or methodology gap?
Open an issue. We'd rather hear about a flawed prompt or a misleading metric
than ship more data using it.
## Code of conduct, the short version
Be kind, be honest about your data, don't try to game the catalogue, and
don't dox other contributors. Sloba reserves the right to remove submissions
that violate spirit-of-the-thing — but we'll say why.
— The Weeyuga team

LICENSE-MIT Normal file

@@ -0,0 +1,31 @@
MIT License
Copyright (c) 2026 Slobodan Margetić and contributors
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
---
This MIT license applies to:
- harness/run_benchmark.py
- harness/prompts.py
- any future Python helper code under harness/
Data files (runs/, catalogue.json, methodology.md, harness/suites/*.json) are
licensed under CC-BY-4.0; see LICENSE.

README.md

@@ -1,54 +1,112 @@
# weeyuga-benchmarks-public
> **Status: PRIVATE STAGING** — this repo is not yet public. Flips to anonymous-read after [Miljan + Stevan's pre-launch security audit](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination/messages) signs off. If you're reading this and you're not on the Weeyuga team, you got here too early.
> **Status: PRIVATE STAGING** — this repo is not yet anonymous-readable.
> Flips to public after the pre-launch security audit signs off.
> If you got here too early, please hold; you'll be invited soon.
Canonical raw-data archive for **[benchmarks.weeyuga.com](https://benchmarks.weeyuga.com)** — every benchmark run we publish on the site is mirrored here as raw JSONL + log + human summary so anyone can clone, re-analyse, or cite.
Open benchmarks for local LLMs — same prompts, same suites, run on
whatever hardware you've got, results compose into one ladder.
This repo is two things in one:
1. **Canonical archive** — every benchmark Sloba's team publishes on
[benchmarks.weeyuga.com](https://benchmarks.weeyuga.com) lives here as
raw JSONL + computed metadata + human summary, so anyone can clone,
re-analyse, or cite. This is the original purpose; see `runs/`.
2. **Crowdsourced runner** — a portable harness + agent runbook so a
friend's coding agent (Claude Code, Codex, Aider, …) can clone this
repo, read `CLAUDE.md` / `AGENTS.md`, run the same suite on the
friend's hardware, and submit the result back as a PR. This is the
newer purpose; see `harness/` + `submissions/`.
Both purposes share one schema, one prompt set, one methodology.
## Quick start — for friends contributing a benchmark
```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
cd weeyuga-benchmarks-public
# Hand this file path to your coding agent and say:
# "Read CLAUDE.md, then run the Weeyuga benchmark on my hardware."
```
Then read `CONTRIBUTING.md` for the human-side flow (what to expect,
how reviews work, license, code of conduct).
## Layout
```
.
├── README.md — this file
├── LICENSE — CC-BY-4.0 (data) + MIT (helper code)
├── catalogue.json — index of every published benchmark (mirror of the site catalogue)
├── methodology.md — how we benchmark + fairness rules + reproducibility notes
└── runs/
└── <run-id>/
├── run.jsonl — canonical raw event stream (one JSON object per line)
├── run.log — tee'd stdout/stderr from the harness (when captured)
├── run.md — human-readable summary (when synthesis exists)
└── metadata.json — computed snapshot: meta record + per-cell aggregates + status
├── README.md — this file
├── CLAUDE.md — full agent runbook (read this if you're an LLM-driven agent)
├── AGENTS.md — byte-identical to CLAUDE.md (Codex / other tools that prefer this name)
├── CONTRIBUTING.md — human-readable contribution guide
├── LICENSE — CC-BY-4.0 for data
├── LICENSE-MIT — MIT for harness/runner code
├── catalogue.json — index of every published benchmark (canonical archive)
├── methodology.md — how we benchmark + fairness rules + reproducibility notes
├── harness/ — portable runner + suites + prompts (the crowdsourced piece)
│ ├── README.md
│ ├── run_benchmark.py
│ ├── prompts.py
│ ├── requirements.txt
│ └── suites/ — six bundled JSON suites (5Q, 20Q, parallel × 2, edge × 2)
├── runs/ — canonical archive: 21 reference runs from Sloba's cluster
│ └── <run-id>/
│ ├── run.jsonl
│ ├── run.log — when captured
│ ├── run.md — when synthesis exists
│ └── metadata.json
└── submissions/ — community contributions land here
├── README.md
├── EXAMPLE/ — one fully-filled-out template you can read
│ └── mac-m1-8gb/
│ └── run-<uuid>/
│ ├── manifest.json
│ ├── hardware.json
│ ├── run.jsonl
│ ├── metadata.json
│ └── run.md
└── <handle>/ — your contributions
└── <device-tag>/
└── run-<uuid>/...
```
## Run-ID format
Every run gets a UUID v4 `<run-id>` assigned at harness startup. Run IDs are stable across re-runs of synthesis — the same run-id always points to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and computed metadata (`metadata.json`) can be regenerated from the canonical jsonl at any time.
Every run gets a UUID v4 `<run-id>` assigned at harness startup. Run IDs
are stable across re-runs of synthesis — the same run-id always points
to the same raw `run.jsonl` event stream. Synthesis docs (`run.md`) and
computed metadata (`metadata.json`) can be regenerated from the
canonical jsonl at any time.
## Schema
The `catalogue.json` index follows `schema_version = "1.0-draft"` (or later — check the value at the top of the file). Per-benchmark entries include:
The `catalogue.json` index follows `schema_version = "1.0-draft"` (or
later — check the value at the top of the file). Per-benchmark entries
include:
- `id` — run-id
- `title`, `headline`, `date`
- `hardware` (pavilion / predator / mac / vps50 / runpod)
- `hardware` (pavilion / predator / mac / vps50 / runpod / community-`<device-tag>`)
- `engine` (llamacpp / ollama / vllm / mlx / cpu)
- `harness` (which harness produced this — see `methodology.md` for the matrix)
- `harness` (which harness produced this — see `methodology.md`)
- `model_family`, `model_sizes`
- `cells[]` — per-(machine × engine × model) summary: n_calls, n_errors, duration_ms (mean + p50), tokens_per_sec (mean + max)
- `synthesis_doc` — filename of the synthesis prose for this run, if one exists
- `tags`, `status`, `visibility`
- `cells[]` — per-(machine × engine × model) summary
- `synthesis_doc` — filename of the synthesis prose, if one exists
- `tags`, `status`, `visibility`, `site_grade`
Per-run `metadata.json` adds `cells_full[]` with the full call list inline.
## How to consume
## How to consume the archive
### Just download a single run
### Single run
```bash
curl -O https://benchmarks.weeyuga.com/data/runs/<run-id>/run.jsonl
curl -O https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public/raw/branch/main/runs/<run-id>/run.jsonl
```
### Clone the whole archive
### Whole archive
```bash
git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
@@ -56,48 +114,57 @@ git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.
### Re-build catalogue from raw
The canonical builder lives in [WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py) and runs against `runs/*.jsonl`. If you want to regenerate the catalogue from your own clone of this repo:
The canonical builder lives in
[WeeyugaWeb/scripts/benchmarks/build_catalogue.py](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/scripts/benchmarks/build_catalogue.py)
and runs against `runs/*/run.jsonl`.
```bash
git clone https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb.git
cd WeeyugaWeb
python3 scripts/benchmarks/build_catalogue.py
```
## How to contribute a benchmark
See `CLAUDE.md` (or `AGENTS.md` — same content) for the full agent
runbook, and `CONTRIBUTING.md` for the human-side flow. Short version:
1. Clone repo
2. Hand `CLAUDE.md` to your coding agent
3. Agent probes hardware, picks a model, runs benchmark, writes results
4. You review what the agent produced
5. You fork → push → open PR
6. Maintainers review and merge
Read access is open. **Write access is via PR only — nothing auto-merges.**
## Citation
If you use this data, please cite as:
```
Margetić, S. & contributors. (2026). Weeyuga cluster benchmarks (raw data archive).
Margetić, S. & contributors. (2026). Weeyuga local-LLM benchmarks.
https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public
```
(A more formal citation form will land here once Mila weighs in on academic-attribution conventions.)
## License
- **Data** (`runs/`, `catalogue.json`, `methodology.md`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE)
- **Helper code** (any future scripts inside this repo): [MIT](LICENSE-MIT) (separate file added if/when code lands here)
- **Data** (`runs/`, `submissions/`, `catalogue.json`, `methodology.md`,
`harness/suites/`): [Creative Commons Attribution 4.0 International (CC-BY-4.0)](LICENSE).
- **Helper code** (`harness/*.py`, future scripts): [MIT](LICENSE-MIT).
You are free to share, re-host, re-analyse, and remix the data with attribution.
You're free to share, re-host, re-analyse, and remix the data with attribution.
## What's in here vs what's NOT
## Reporting issues
This repo contains **bench-run output only**. No source code. No infrastructure config. No application internals. Reproducing a run requires the [WeeyugaWeb](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb) main repo (also Gitea-hosted; visibility separate).
## Reporting an issue with the data
If you spot a bench number that looks wrong, a methodology gap, or a privacy slip in published metadata: open an issue on this repo, or email the Weeyuga team. We'd rather know.
If you spot a bench number that looks wrong, a methodology gap, or a
privacy slip in published metadata: open an issue on this repo, or
email the team at `slobodan@weeyuga.com`. We'd rather know.
## Status
| What | State |
|---|---|
| Repo created | 2026-05-05 |
| First 21 runs landed | 2026-05-05 |
| Miljan + Stevan security audit | scheduled |
| Canonical archive landed (21 runs) | 2026-05-05 |
| Harness + agent runbook landed | 2026-05-06 |
| Pre-launch security audit | scheduled |
| Visibility flipped to public | pending audit sign-off |
| Site `benchmarks.weeyuga.com` live | pending Bane DNS + nginx + Tomas site |
| First friend's submission merged | pending |
Owner: `mac/benchmark-tester-ben` (Ben). For coordination, see the [WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).
Maintainers: `mac/benchmark-tester-ben` (catalogue + methodology),
`mac/devops-bane` (harness + runner + this README change).
For coordination, see the
[WeeyugaWeb coordination bus](https://git.weeyuga.com/slobodanmargetic988/WeeyugaWeb/src/branch/main/coordination).

harness/.gitignore vendored Normal file

@@ -0,0 +1,4 @@
__pycache__/
*.pyc
*.pyo
.pytest_cache/

harness/README.md Normal file

@@ -0,0 +1,151 @@
# `harness/` — runner + suites + prompts
Self-contained, dependency-free Python 3 benchmark runner. Drives any
OpenAI-compatible `/v1/chat/completions` endpoint with the canonical
Weeyuga prompt set; emits a JSONL event ledger.
## Files
```
harness/
├── README.md — this file
├── run_benchmark.py — the runner (one Python 3 process, stdlib only)
├── prompts.py — 3 frozen reference prompts (P-EASY/P-MEDIUM/P-HARD)
├── requirements.txt — empty by intent (stdlib only); listed for tooling
└── suites/
├── small_model_eval_questions.json — 5Q (5 short tasks, format-checked)
├── python_task_suite_questions.json — 20Q (20 realistic Python prompts)
├── parallel_qwen_same_model_20q_suite.json — same-model parallel-lane stress
├── parallel_qwen_mixed_model_20q_suite.json — mixed-model parallel-lane stress
├── python_context_edge_append_questions.json — long-context append behavior
└── python_context_edge_suite_only.json — long-context whole-suite reasoning
```
## Quick reference
```bash
# Smoke (one hello call, end-to-end runtime check)
python3 harness/run_benchmark.py --smoke \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac:ollama \
--submitter-handle alice --device-tag mac-m1-8gb
# Default phases (hello + 5q + 20q)
python3 harness/run_benchmark.py \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac:ollama \
--submitter-handle alice --device-tag mac-m1-8gb
# Full sweep (all six suites)
python3 harness/run_benchmark.py --phases all \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac:ollama \
--submitter-handle alice --device-tag mac-m1-8gb
# Probe (list models + one hello, no ledger written)
python3 harness/run_benchmark.py --probe \
--target-url http://127.0.0.1:11434 \
--cell-id-prefix mac:ollama
```
## Output layout
The runner writes to `submissions/<submitter-handle>/<device-tag>/run-<uuid>/`:
```
submissions/alice/mac-m1-8gb/run-<uuid>/
├── run.jsonl — event ledger; one JSON object per line
├── manifest.json — automatic; written at run start
├── hardware.json — agent fills from hardware probe (see CLAUDE.md §2)
├── metadata.json — agent fills from aggregates (see CLAUDE.md §6)
└── run.md — agent fills from template (see CLAUDE.md §6)
```
## Knobs
CLI flags (see `--help`):
- `--target-url` — OpenAI-compat base URL (default `http://127.0.0.1:11434`)
- `--models` — comma-separated, or `auto` for `/v1/models` discovery
- `--cell-id-prefix` — `<node-tag>:<engine>` for the JSONL `cell_id` field
- `--phases` — subset of `hello, frozen, 5q, 20q, parallel_same, parallel_mixed, edge_append, edge_suite`, or `all`
- `--timeout` — per-call wall-clock cap (default 360 s)
- `--temperature` / `--num-ctx` / `--num-predict` — override canonical knobs
- `--probe` / `--smoke` — health-check shortcuts
- `--run-id` / `--out-dir` — resume / custom output
Canonical defaults (in code):
```python
CANONICAL_OPTIONS = {
"temperature": 0.1,
"num_ctx": 4096,
"num_predict": 2048,
}
```
Any deviation is recorded automatically in `manifest.json.canonical_options_overrides`.
## Dependencies
**None beyond Python 3 stdlib.** `urllib.request` does the HTTP, `json` does
serde, `uuid` makes the run-id. The empty `requirements.txt` exists so tools
like `pip-tools` and reproducibility scripts have a hook; if a future version
adds dependencies they'll land there with pinned versions.
Tested against Python 3.10, 3.11, 3.12. Earlier 3.x may work but isn't tested.
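To make the stdlib-only claim concrete, here is a minimal sketch of a chat-completions call built with `urllib.request` and `json` alone. It is not the runner's code: the function names are hypothetical, and mapping `num_predict` onto the OpenAI-style `max_tokens` field is an assumption for this example.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, temperature=0.1, max_tokens=2048):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,  # assumed equivalent of num_predict
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=360):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending keeps the payload inspectable without a live runtime; `send(build_chat_request("http://127.0.0.1:11434", "qwen3.5:0.8b", "hi"))` would perform the actual call.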
## Suite shapes
All six `suites/*.json` follow the same top-level shape:
```json
{
"suite_name": "...",
"version": "1",
"purpose": "...",
"models": ["..."], // advisory; runner uses --models flag
"questions": [
{
"id": "Q01",
"prompt": "...",
"required_markers": ["..."], // optional; lower-cased substring matches
"format_rule": "..." // optional; one of: bash_code, python_code, shell_lines, four_numbered_steps, five_bullets, json_dict, pytest_code
}
]
}
```
`required_markers` and `format_rule` are heuristic — they exist to flag
"obviously wrong shape" answers without claiming semantic correctness. Don't
treat them as ground truth; treat them as a sanity check.
The parallel and edge suites add more top-level fields (`run_mode`, `lanes`,
`question_assignment`, etc.) for advisory context; the runner reads only
`questions[]` from any suite.
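The marker check described above ("lower-cased substring matches") can be pictured with a few lines of Python. This is a sketch of the idea, not the runner's implementation; the function name and the choice to score a question without markers as 1.0 are assumptions.

```python
def marker_hit_rate(response, required_markers):
    """Fraction of required markers found in the response, case-insensitively."""
    if not required_markers:
        return 1.0  # assumption: nothing required means nothing missed
    haystack = response.lower()
    hits = sum(1 for marker in required_markers if marker.lower() in haystack)
    return hits / len(required_markers)
```

A response can hit every marker and still be wrong, which is why these scores are a sanity check and not a correctness grade.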
## Adding a new suite
For now: don't. The six suites are stable and adding more in this branch
breaks comparability with the existing 21 reference runs. If you want a new
suite, open an issue on this repo proposing it; we'll discuss whether it
warrants a `HARNESS_VERSION = public-2` bump (suites would still need to be
backwards-compatible — adding new phase keys is fine, redefining existing
ones is not).
## Why no `pip install -e` / no Python package?
This is a **scripts directory**, not a library. The runner is one file. The
suites are data files. Friends running this from a fresh clone shouldn't have
to deal with packaging, virtualenvs (beyond what their agent recommends), or
upgrade flows. If/when this grows past one runner, we'll split it.
## License
`prompts.py`, `run_benchmark.py`, and any future `harness/*.py` code: MIT
(see [LICENSE-MIT](../LICENSE-MIT)).
`suites/*.json`: CC-BY-4.0 (see [LICENSE](../LICENSE)) — same as the bench
data they test against.

harness/prompts.py Normal file

@@ -0,0 +1,26 @@
"""Frozen canonical prompts for the Weeyuga benchmark harness.

These three prompts NEVER change once shipped. New prompts get new
IDs (P-NEW1, P-NEW2, ...). See docs/BENCHMARKS/HARNESS.md §3.
"""

PROMPTS = {
    "P-EASY": {
        "intent": "trivial — single-token response space, near-zero work",
        "prompt": "hi",
        "max_tokens": 64,
    },
    "P-MEDIUM": {
        "intent": "bounded structured task — 4 sentences on a known topic",
        "prompt": "Explain in 4 sentences why the sky appears blue at noon.",
        "max_tokens": 512,
    },
    "P-HARD": {
        "intent": "open-ended creative — 200-word generation",
        "prompt": (
            "Write a 200-word story about a fisherman who discovers a "
            "coin from a sunken empire."
        ),
        "max_tokens": 1024,
    },
}

harness/requirements.txt Normal file

@@ -0,0 +1,12 @@
# Empty by intent — the harness uses Python 3 stdlib only.
#
# This file exists so:
#   - pip-tools / poetry / uv have a hook if future versions add deps
#   - reproducibility scripts can say "see requirements.txt" without exception
#   - friends' agents that auto-`pip install -r requirements.txt` will succeed
#     trivially rather than fail because the file is missing
#
# If you're contributing a benchmark run and you need a non-stdlib package
# (e.g. for hardware probing — psutil, nvidia-ml-py), install it in your own
# environment but DON'T add it here in your PR — the runner itself must stay
# stdlib-only so it works on a friend's bare Python 3 install.

harness/run_benchmark.py Normal file

@@ -0,0 +1,645 @@
#!/usr/bin/env python3
"""weeyuga-benchmarks-public — generic runner.

A portable, agent-driven version of the Weeyuga benchmark harness. Mirrors
the canonical Pavilion methodology (temperature=0.1, num_ctx=4096,
num_predict=2048, single-loaded-model single-parallel-slot) and runs the
suites selected by --phases from `harness/suites/` against every model the
agent declares, in sequence. One JSONL ledger captures every call.

This script is a TEMPLATE, not a one-click button.

The friend running this is expected to be working with a coding agent (Claude
Code, Codex, Aider, etc.). The agent reads CLAUDE.md / AGENTS.md to learn the
methodology, probes the friend's hardware, picks a target model + an
OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX / etc.), then
adapts this script to the friend's reality before running.

Read CLAUDE.md before reading this code.

The runner does ONE thing well: drive an OpenAI-compat /v1/chat/completions
endpoint with the canonical prompts, write a JSONL ledger of every call, and
emit a manifest. Everything else (hardware probe, model pull, runtime tuning,
output curation, PR submission) is the agent's job.

Usage examples:

    # smoke (one cell × hello call only — verify the runtime is reachable)
    python3 harness/run_benchmark.py --smoke \\
        --target-url http://127.0.0.1:11434 \\
        --models qwen3.5:0.8b \\
        --cell-id-prefix mac:ollama \\
        --submitter-handle alice \\
        --device-tag mac-m1-8gb

    # the default first-run phases (hello,5q,20q) for each model;
    # add `--phases all` for the canonical full suite
    python3 harness/run_benchmark.py \\
        --target-url http://127.0.0.1:11434 \\
        --models qwen3.5:0.8b,qwen3.5:4b \\
        --cell-id-prefix mac:ollama \\
        --submitter-handle alice \\
        --device-tag mac-m1-8gb

Required output goes under:

    submissions/<submitter-handle>/<device-tag>/run-<benchmark-run-id>/
        run.jsonl      — canonical event stream (one JSON object per line)
        manifest.json  — who, what, when, where (no PII)
        hardware.json  — what device the run happened on (filled by agent)
        metadata.json  — computed aggregates (not written by this runner;
                         filled by the agent or by hand)
        run.md         — human-readable summary (filled by agent or by
                         `harness/render_summary.py`)

This runner itself writes only run.jsonl and manifest.json.
"""
from __future__ import annotations
import argparse
import datetime as dt
import json
import os
import platform
import re
import socket
import sys
import time
import urllib.error
import urllib.request
import uuid
from pathlib import Path
from typing import Any
HARNESS_VERSION = "public-1"
REPO_ROOT = Path(__file__).resolve().parents[1]
SUITES_DIR = REPO_ROOT / "harness" / "suites"
SUBMISSIONS_DIR = REPO_ROOT / "submissions"
DEFAULT_TARGET_URL = os.environ.get(
    "WEEYUGA_TARGET_URL", "http://127.0.0.1:11434"
)
# Per-call timeout of 6 minutes. It is passed to urlopen(), so it bounds each
# blocking socket operation rather than acting as a strict wall-clock limit.
DEFAULT_TIMEOUT_S = 360
HELLO_PROMPT = "hi can you help me?"

# Canonical knobs (Sloba's reference values from the Pavilion methodology).
# An agent running this on different hardware MAY override via flags or by
# editing this file — but the override has to be recorded in manifest.json
# `canonical_options_overrides` so the run is honestly comparable.
CANONICAL_OPTIONS = {
    "temperature": 0.1,
    "num_ctx": 4096,
    "num_predict": 2048,
}

# Suite files in canonical run order (the order used by `--phases all`).
# Each selected suite runs sequentially, per model. Don't reorder —
# comparability across runs depends on stable ordering. To skip a suite,
# use --phases.
SUITES = {
    "5q": "small_model_eval_questions.json",
    "20q": "python_task_suite_questions.json",
    "parallel_same": "parallel_qwen_same_model_20q_suite.json",
    "parallel_mixed": "parallel_qwen_mixed_model_20q_suite.json",
    "edge_append": "python_context_edge_append_questions.json",
    "edge_suite": "python_context_edge_suite_only.json",
}
DEFAULT_PHASES = "hello,5q,20q"  # the safest "first run" set


# ── tiny utilities ──────────────────────────────────────────────────
def utc_now() -> str:
    return dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def utc_filename_stamp() -> str:
    return dt.datetime.now(dt.timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def load_avg() -> tuple[float, float, float]:
    # os.getloadavg() is missing on some platforms (e.g. Windows).
    try:
        return os.getloadavg()
    except (OSError, AttributeError):
        return (0.0, 0.0, 0.0)


def log(message: str) -> None:
    print(f"{utc_now()} {message}", flush=True)


def write_jsonl(handle, record: dict[str, Any]) -> None:
    handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    handle.flush()
    # Best-effort fsync so a crash loses at most the in-flight record.
    try:
        os.fsync(handle.fileno())
    except (OSError, ValueError):
        pass


# ── HTTP layer ──────────────────────────────────────────────────────
def list_models(target_url: str, timeout: int = 15) -> list[str]:
    """Read /v1/models from the target. Standard OpenAI-compat endpoint."""
    req = urllib.request.Request(f"{target_url.rstrip('/')}/v1/models")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError, OSError) as exc:
        log(f"WARN: could not list models on {target_url} ({exc!r}); "
            "agent must pass --models explicitly.")
        return []
    out: list[str] = []
    for item in body.get("data", []):
        if item.get("status") == "placeholder":
            continue
        mid = item.get("id")
        if mid:
            out.append(mid)
    return out


def chat_completion(
    target_url: str,
    model: str,
    user_prompt: str,
    timeout: int = DEFAULT_TIMEOUT_S,
    max_tokens_override: int | None = None,
    canonical_options: dict | None = None,
) -> dict[str, Any]:
    """One non-streaming call. Returns a dict with timing + content + error."""
    opts = dict(canonical_options or CANONICAL_OPTIONS)
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "stream": False,
        "max_tokens": max_tokens_override or opts["num_predict"],
        "temperature": opts["temperature"],
        # Intended as Ollama-flavored extras. Caveat: `extra_body` is an
        # OpenAI-SDK client-side convention, so when sent literally like this
        # most servers (likely including Ollama's /v1 endpoint) ignore it.
        # Verify that num_ctx is actually applied by your runtime.
        "extra_body": {"options": opts},
    }
    req = urllib.request.Request(
        f"{target_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    started = time.perf_counter()
    out: dict[str, Any] = {
        "duration_seconds": None,
        "response_text": "",
        "response_chars": 0,
        "prompt_tokens": None,
        "completion_tokens": None,
        "total_tokens": None,
        "tokens_per_second": None,
        "finish_reason": None,
        "status_code": None,
        "error": None,
    }
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body_bytes = resp.read()
            out["status_code"] = resp.status
        elapsed = time.perf_counter() - started
        out["duration_seconds"] = round(elapsed, 3)
        payload = json.loads(body_bytes.decode("utf-8"))
        choices = payload.get("choices") or []
        if choices:
            msg = choices[0].get("message") or {}
            out["response_text"] = msg.get("content") or ""
            out["finish_reason"] = choices[0].get("finish_reason")
        out["response_chars"] = len(out["response_text"])
        usage = payload.get("usage") or {}
        out["prompt_tokens"] = usage.get("prompt_tokens")
        out["completion_tokens"] = usage.get("completion_tokens")
        out["total_tokens"] = usage.get("total_tokens")
        # End-to-end rate: includes prompt processing and any model load time.
        if (
            out["completion_tokens"]
            and out["duration_seconds"]
            and out["duration_seconds"] > 0
        ):
            out["tokens_per_second"] = round(
                out["completion_tokens"] / out["duration_seconds"], 2
            )
    except urllib.error.HTTPError as exc:
        # Non-2xx response: keep the status code so the ledger records it.
        out["status_code"] = exc.code
        out["duration_seconds"] = round(time.perf_counter() - started, 3)
        out["error"] = repr(exc)
    except (urllib.error.URLError, TimeoutError, OSError) as exc:
        out["duration_seconds"] = round(time.perf_counter() - started, 3)
        out["error"] = repr(exc)
    except json.JSONDecodeError as exc:
        out["duration_seconds"] = round(time.perf_counter() - started, 3)
        out["error"] = f"JSONDecodeError: {exc}"
    return out


# ── eval helpers (mirrored from canonical Pavilion harness) ─────────
def marker_hits(required: list[str], text: str) -> list[str]:
    lowered = (text or "").lower()
    return [m for m in required if m.lower() in lowered]


def format_ok(rule: str, text: str) -> bool:
    """Lightweight heuristic format-checker. Mirrors capture_small_model_eval.py."""
    stripped = (text or "").strip()
    lines = [line.strip() for line in stripped.splitlines() if line.strip()]
    if rule == "bash_code":
        return stripped.startswith("#!/usr/bin/env bash")
    if rule == "python_code":
        # NOTE: this check is specific to an IPv4-validator task. Questions in
        # the 20-question suites are also tagged python_code, so they report
        # format_ok=False unless the answer happens to define is_valid_ipv4
        # and three tests. Left unchanged to mirror the canonical checker.
        return (
            "def is_valid_ipv4" in stripped
            and sum(1 for line in lines if line.startswith("def test_")) >= 3
        )
    if rule == "shell_lines":
        return (
            all(not line.startswith(("1.", "- ", "* ")) for line in lines[:3])
            and "nginx -t" in stripped
        )
    if rule == "four_numbered_steps":
        return sum(1 for line in lines if re.match(r"^[1-4]\. ", line)) >= 4
    if rule == "five_bullets":
        return sum(1 for line in lines if line.startswith(("- ", "* "))) >= 5
    if rule == "json_dict":
        try:
            parsed = json.loads(stripped)
            return isinstance(parsed, dict)
        except (json.JSONDecodeError, ValueError):
            return False
    if rule == "pytest_code":
        return (
            "def test_" in stripped
            and ("import pytest" in stripped or "pytest" in stripped)
        )
    # Unknown rule names are treated as a failed check.
    return False


# ── per-call record builder ─────────────────────────────────────────
def make_call_record(
    cell_id: str,
    model: str,
    phase: str,
    question_id: str,
    prompt: str,
    run_idx: int,
    required_markers: list[str],
    format_rule: str,
    target_url: str,
    timeout: int,
    canonical_options: dict | None = None,
) -> dict[str, Any]:
    result = chat_completion(
        target_url, model, prompt, timeout=timeout,
        canonical_options=canonical_options,
    )
    text = result["response_text"]
    hits = marker_hits(required_markers, text)
    return {
        "type": "call",
        "ts_utc": utc_now(),
        "cell_id": cell_id,
        "model": model,
        "phase": phase,
        "question_id": question_id,
        "run_idx": run_idx,
        "duration_seconds": result["duration_seconds"],
        "prompt_tokens": result["prompt_tokens"],
        "completion_tokens": result["completion_tokens"],
        "tokens_per_second": result["tokens_per_second"],
        "finish_reason": result["finish_reason"],
        "status_code": result["status_code"],
        "response_chars": result["response_chars"],
        "response_preview": (text or "")[:240],
        "required_markers": required_markers,
        "markers_hit": hits,
        "marker_hit_rate": (
            round(len(hits) / len(required_markers), 3)
            if required_markers
            else None
        ),
        "format_rule": format_rule,
        "format_ok": format_ok(format_rule, text) if format_rule else None,
        "usable_answer": bool((text or "").strip()),
        "error": result["error"],
    }


# ── phase runners ───────────────────────────────────────────────────
def run_hello(handle, model, target_url, timeout, cell_id_prefix, options):
    cell_id = f"{cell_id_prefix}:{model}"
    log(f"[hello] {model}")
    rec = make_call_record(
        cell_id=cell_id, model=model, phase="hello",
        question_id="hello_check", prompt=HELLO_PROMPT, run_idx=0,
        required_markers=[], format_rule="",
        target_url=target_url, timeout=timeout,
        canonical_options=options,
    )
    write_jsonl(handle, rec)
    log(f"{rec['duration_seconds']}s "
        f"completion_tokens={rec['completion_tokens']} "
        f"finish={rec['finish_reason']} err={rec['error']}")


def run_frozen_prompts(handle, model, target_url, timeout, cell_id_prefix, options):
    """3 frozen prompts (P-EASY/P-MEDIUM/P-HARD) from harness/prompts.py."""
    sys.path.insert(0, str(REPO_ROOT / "harness"))
    from prompts import PROMPTS  # type: ignore[import-not-found]
    cell_id = f"{cell_id_prefix}:{model}"
    for q_idx, (pid, p) in enumerate(PROMPTS.items()):
        log(f"[frozen] {model} {pid}")
        rec = make_call_record(
            cell_id=cell_id, model=model, phase="frozen",
            question_id=pid, prompt=p["prompt"], run_idx=q_idx,
            required_markers=[], format_rule="",
            target_url=target_url, timeout=timeout,
            canonical_options=options,
        )
        # Record the max_tokens actually sent. NOTE: the per-prompt
        # `max_tokens` values in prompts.py are not forwarded by
        # make_call_record, so every call uses num_predict.
        rec["max_tokens_used"] = options.get("num_predict") if options else CANONICAL_OPTIONS["num_predict"]
        write_jsonl(handle, rec)
        log(f"{rec['duration_seconds']}s "
            f"completion_tokens={rec['completion_tokens']} err={rec['error']}")


def run_suite(handle, model, target_url, timeout, suite_path, phase, cell_id_prefix, options):
    """Drive any of the 6 .json suites (5q / 20q / parallel_* / edge_*)."""
    suite = json.loads(suite_path.read_text(encoding="utf-8"))
    questions = suite.get("questions") or []
    if not questions:
        log(f"WARN: suite {suite_path.name} has no questions; skipping")
        return
    cell_id = f"{cell_id_prefix}:{model}"
    for q_idx, q in enumerate(questions):
        log(f"[{phase}] {model} {q.get('id', f'q{q_idx}')} "
            f"({q_idx + 1}/{len(questions)})")
        rec = make_call_record(
            cell_id=cell_id, model=model, phase=phase,
            question_id=q.get("id", f"q{q_idx}"), prompt=q["prompt"],
            run_idx=q_idx,
            required_markers=q.get("required_markers") or [],
            format_rule=q.get("format_rule") or "",
            target_url=target_url, timeout=timeout,
            canonical_options=options,
        )
        write_jsonl(handle, rec)
        log(f"{rec['duration_seconds']}s "
            f"completion_tokens={rec['completion_tokens']} "
            f"format_ok={rec['format_ok']} markers={rec['marker_hit_rate']} "
            f"err={rec['error']}")


# ── CLI ─────────────────────────────────────────────────────────────
def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(
        description="Run the Weeyuga benchmark suite against a local model.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=(
            "Read CLAUDE.md / AGENTS.md before running. The friend's coding\n"
            "agent is expected to adapt this script to the friend's hardware\n"
            "(GPU vs CPU, VRAM constraints, cold-start tuning) and record any\n"
            "deviations in manifest.json."
        ),
    )
    p.add_argument("--target-url", default=DEFAULT_TARGET_URL,
                   help="OpenAI-compat base URL (default http://127.0.0.1:11434).")
    p.add_argument("--models", default="auto",
                   help="Comma-separated model list, or 'auto' to pull from /v1/models.")
    p.add_argument("--cell-id-prefix", required=False, default="local:ollama",
                   help="Prefix for cell_id. Pattern: <node-tag>:<engine>. "
                        "Examples: mac-m1:ollama, predator:llamacpp, vps:cpu.")
    p.add_argument("--phases", default=DEFAULT_PHASES,
                   help=("Comma-separated subset of: hello, frozen, "
                         + ", ".join(SUITES.keys())
                         + ". Default: hello,5q,20q. Pass 'all' for the full suite."))
    p.add_argument("--timeout", type=int, default=DEFAULT_TIMEOUT_S,
                   help="Per-call timeout in seconds (default 360).")
    p.add_argument("--probe", action="store_true",
                   help="Health-check only: list models, do one hello call, exit.")
    p.add_argument("--smoke", action="store_true",
                   help="One model × hello-only. End-to-end runtime validation.")
    p.add_argument("--submitter-handle", required=False, default=None,
                   help="Your Gitea (or any) handle. Used in submissions/<handle>/...")
    p.add_argument("--device-tag", required=False, default=None,
                   help="Short device tag. Examples: mac-m1-8gb, rtx-4090-pc, "
                        "predator-gtx-1060-6gb. Used in submissions/<handle>/<tag>/...")
    p.add_argument("--run-id", default=None,
                   help="Override auto-generated run UUID. Reuses the same output "
                        "directory, but run.jsonl is rewritten from scratch, "
                        "not appended to.")
    p.add_argument("--out-dir", default=None,
                   help="Override the submissions/<handle>/<tag>/run-<run-id>/ output dir.")
    p.add_argument("--temperature", type=float, default=None,
                   help="Override canonical temperature 0.1.")
    p.add_argument("--num-ctx", type=int, default=None,
                   help="Override canonical num_ctx 4096.")
    p.add_argument("--num-predict", type=int, default=None,
                   help="Override canonical num_predict 2048.")
    return p.parse_args()


def resolve_phases(phases_arg: str) -> list[str]:
    if phases_arg.strip().lower() == "all":
        return ["hello", "frozen"] + list(SUITES.keys())
    raw = [p.strip() for p in phases_arg.split(",") if p.strip()]
    out = []
    for r in raw:
        if r in ("hello", "frozen") or r in SUITES:
            out.append(r)
        else:
            raise SystemExit(
                f"unknown phase {r!r}; valid: hello, frozen, "
                + ", ".join(SUITES.keys()) + ", or 'all'"
            )
    return out


def resolve_models(models_arg: str, target_url: str) -> list[str]:
    available = list_models(target_url)
    log(f" /v1/models reports {len(available)}: {available}")
    if models_arg == "auto":
        if not available:
            raise SystemExit(
                "auto-list found no models on target; pass --models explicitly "
                "(e.g. --models qwen3.5:0.8b,qwen3.5:4b)."
            )
        return available
    wanted = [m.strip() for m in models_arg.split(",") if m.strip()]
    if available:
        missing = [m for m in wanted if m not in available]
        if missing:
            log(f"WARN: requested but not on target: {missing}. "
                "The runner will try anyway — your runtime may auto-pull.")
    return wanted


def write_manifest(out_dir: Path, args, run_id: str, phases: list[str], models: list[str], options: dict) -> None:
    manifest = {
        "schema_version": "manifest-1.0",
        "run_id": run_id,
        "harness_version": HARNESS_VERSION,
        "submitter_handle": args.submitter_handle,
        "device_tag": args.device_tag,
        "cell_id_prefix": args.cell_id_prefix,
        "target_url": args.target_url,
        "phases_run": phases,
        "models_run": models,
        "canonical_options": dict(CANONICAL_OPTIONS),
        # Only knobs set explicitly on the command line are listed here.
        "canonical_options_overrides": {
            k: v for k, v in {
                "temperature": args.temperature,
                "num_ctx": args.num_ctx,
                "num_predict": args.num_predict,
            }.items() if v is not None
        },
        "timeout_seconds": args.timeout,
        "started_at_utc": utc_now(),
        "host_hostname_short": socket.gethostname().split(".")[0],
        "platform_system": platform.system(),
        "platform_release": platform.release(),
        "python_version": platform.python_version(),
    }
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


def main() -> int:
    args = parse_args()
    target_url = args.target_url.rstrip("/")
    log(f"target_url = {target_url}")

    options = dict(CANONICAL_OPTIONS)
    if args.temperature is not None:
        options["temperature"] = args.temperature
    if args.num_ctx is not None:
        options["num_ctx"] = args.num_ctx
    if args.num_predict is not None:
        options["num_predict"] = args.num_predict

    log("listing /v1/models on target …")
    models = resolve_models(args.models, target_url)

    if args.probe:
        if not models:
            log("no models available; abort")
            return 1

        # Pick the smallest model by the parameter-count tag in its name
        # (e.g. "0.8b"). Plain substring tests such as `"2b" in m` would also
        # match "32b" or "72b", so the number is parsed instead. Names with
        # no size tag sort last.
        def size_in_b(name: str) -> float:
            match = re.search(r"(\d+(?:\.\d+)?)b\b", name.lower())
            return float(match.group(1)) if match else float("inf")

        smallest = min(models, key=size_in_b)
        log(f"probe: hello call against smallest = {smallest}")
        rec = make_call_record(
            cell_id=f"{args.cell_id_prefix}:{smallest}",
            model=smallest, phase="probe",
            question_id="hello_check", prompt=HELLO_PROMPT, run_idx=0,
            required_markers=[], format_rule="",
            target_url=target_url, timeout=min(args.timeout, 60),
            canonical_options=options,
        )
        log(json.dumps(rec, indent=2, ensure_ascii=False))
        return 0 if not rec["error"] else 3

    if args.smoke:
        models = [models[0]] if models else []
        if not models:
            log("no models available for smoke; abort")
            return 1
        log(f"smoke: hello-only against {models[0]}")
        phases = ["hello"]
    else:
        phases = resolve_phases(args.phases)

    if not models:
        log("no models to run; abort")
        return 1
    if not args.submitter_handle and not args.out_dir:
        raise SystemExit(
            "--submitter-handle is required (or pass --out-dir for ad-hoc runs)."
        )
    if not args.device_tag and not args.out_dir:
        raise SystemExit(
            "--device-tag is required (or pass --out-dir for ad-hoc runs)."
        )

    run_id = args.run_id or str(uuid.uuid4())
    out_dir = (
        Path(args.out_dir) if args.out_dir
        else SUBMISSIONS_DIR / args.submitter_handle / args.device_tag / f"run-{run_id}"
    )
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "run.jsonl"
    log(f"writing JSONL ledger to {out_path}")
    write_manifest(out_dir, args, run_id, phases, models, options)

    started_at = utc_now()
    started_load = load_avg()
    meta = {
        "type": "meta",
        "benchmark_run_id": run_id,
        "harness_version": HARNESS_VERSION,
        "started_at_utc": started_at,
        "host_hostname_short": socket.gethostname().split(".")[0],
        "load_avg_start": started_load,
        "target_url": target_url,
        "cell_id_prefix": args.cell_id_prefix,
        "submitter_handle": args.submitter_handle,
        "device_tag": args.device_tag,
        "execution_shape": "per-model-block",
        "phases_planned": phases,
        "models_planned": models,
        "canonical_options": dict(CANONICAL_OPTIONS),
        "canonical_options_effective": options,
        "timeout_seconds": args.timeout,
        "platform_system": platform.system(),
        "platform_release": platform.release(),
        "python_version": platform.python_version(),
    }

    # Mode "w": an existing run.jsonl in this directory is overwritten.
    with out_path.open("w", encoding="utf-8") as fh:
        write_jsonl(fh, meta)
        try:
            for model in models:
                log(f"=== model block: {model} ===")
                for phase in phases:
                    if phase == "hello":
                        run_hello(fh, model, target_url, args.timeout,
                                  args.cell_id_prefix, options)
                    elif phase == "frozen":
                        run_frozen_prompts(fh, model, target_url, args.timeout,
                                           args.cell_id_prefix, options)
                    else:
                        suite_path = SUITES_DIR / SUITES[phase]
                        if not suite_path.exists():
                            log(f"WARN: suite missing at {suite_path}; skip")
                            continue
                        run_suite(fh, model, target_url, args.timeout,
                                  suite_path, phase, args.cell_id_prefix, options)
        except KeyboardInterrupt:
            log("interrupted — partial ledger written")
            write_jsonl(fh, {
                "type": "interrupted",
                "ts_utc": utc_now(),
                "reason": "KeyboardInterrupt",
            })
            return 130
        finally:
            # The footer is written on both normal completion and interrupt.
            write_jsonl(fh, {
                "type": "footer",
                "ts_utc": utc_now(),
                "finished_at_utc": utc_now(),
                "load_avg_end": load_avg(),
            })

    log(f"done — ledger at {out_path}")
    log("next: agent fills hardware.json + run.md, then opens a PR.")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())


@@ -0,0 +1,363 @@
{
"generated_at": "2026-04-11T19:00:02Z",
"suite_name": "parallel-qwen-mixed-model-20q-v1",
"version": "1.0",
"purpose": "Run the shared 20-question Python benchmark in two-question batches against qwen size pairs. Within each batch the first model answers the odd-numbered question and the second model answers the even-numbered question, while Ollama keeps two models loaded with two parallel request slots and a 32K request context.",
"run_mode": "mixed_model_pairs",
"question_batch_size": 2,
"question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
"ollama_runtime": {
"max_loaded_models": 2,
"num_parallel": 2,
"num_ctx": 32768,
"keep_alive": "24h"
},
"lanes": [
{
"lane_id": "qwen2_5_coder_0_5b__qwen2_5_coder_3b",
"display_name": "Qwen2.5 Coder 0.5B plus Qwen2.5 Coder 3B",
"kind": "mixed_model",
"question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
"models": [
{
"role": "odd_questions",
"model": "qwen2.5-coder:0.5b",
"display_name": "Qwen2.5 Coder 0.5B",
"family": "qwen2.5-coder",
"size_label": "0.5B",
"max_context_tokens": 32768,
"requested_num_ctx": 32768,
"mode": "mixed_model"
},
{
"role": "even_questions",
"model": "qwen2.5-coder:3b",
"display_name": "Qwen2.5 Coder 3B",
"family": "qwen2.5-coder",
"size_label": "3B",
"max_context_tokens": 32768,
"requested_num_ctx": 32768,
"mode": "mixed_model"
}
]
},
{
"lane_id": "qwen2_5_coder_0_5b__qwen2_5_coder_1_5b",
"display_name": "Qwen2.5 Coder 0.5B plus Qwen2.5 Coder 1.5B",
"kind": "mixed_model",
"question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
"models": [
{
"role": "odd_questions",
"model": "qwen2.5-coder:0.5b",
"display_name": "Qwen2.5 Coder 0.5B",
"family": "qwen2.5-coder",
"size_label": "0.5B",
"max_context_tokens": 32768,
"requested_num_ctx": 32768,
"mode": "mixed_model"
},
{
"role": "even_questions",
"model": "qwen2.5-coder:1.5b",
"display_name": "Qwen2.5 Coder 1.5B",
"family": "qwen2.5-coder",
"size_label": "1.5B",
"max_context_tokens": 32768,
"requested_num_ctx": 32768,
"mode": "mixed_model"
}
]
},
{
"lane_id": "qwen2_5_coder_1_5b__qwen2_5_coder_3b",
"display_name": "Qwen2.5 Coder 1.5B plus Qwen2.5 Coder 3B",
"kind": "mixed_model",
"question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
"models": [
{
"role": "odd_questions",
"model": "qwen2.5-coder:1.5b",
"display_name": "Qwen2.5 Coder 1.5B",
"family": "qwen2.5-coder",
"size_label": "1.5B",
"max_context_tokens": 32768,
"requested_num_ctx": 32768,
"mode": "mixed_model"
},
{
"role": "even_questions",
"model": "qwen2.5-coder:3b",
"display_name": "Qwen2.5 Coder 3B",
"family": "qwen2.5-coder",
"size_label": "3B",
"max_context_tokens": 32768,
"requested_num_ctx": 32768,
"mode": "mixed_model"
}
]
}
],
"questions": [
{
"id": "py_csv_parse",
"title": "CSV Parser",
"category": "parsing",
"prompt": "Return only Python code. Write a function that reads CSV text, skips blank lines, and returns a list of dicts keyed by the header row.",
"required_markers": [
"csv",
"Dict",
"reader",
"header"
],
"format_rule": "python_code"
},
{
"id": "py_file_scan",
"title": "File Scanner",
"category": "file_io",
"prompt": "Return only Python code. Write a script that walks a directory tree and prints the paths of files larger than 5 MB.",
"required_markers": [
"os.walk",
"5 * 1024 * 1024",
"print",
"path"
],
"format_rule": "python_code"
},
{
"id": "py_cli_args",
"title": "CLI Arguments",
"category": "cli",
"prompt": "Return only Python code. Build a small argparse CLI with one required path argument and one optional verbose flag.",
"required_markers": [
"argparse",
"--verbose",
"ArgumentParser",
"path"
],
"format_rule": "python_code"
},
{
"id": "py_typing_dataclass",
"title": "Typed Dataclass",
"category": "typing",
"prompt": "Return only Python code. Define a typed dataclass for a job record with id, name, created_at, and is_active fields.",
"required_markers": [
"@dataclass",
"created_at",
"is_active",
"str"
],
"format_rule": "python_code"
},
{
"id": "py_pytest_fixture",
"title": "Pytest Fixture",
"category": "tests",
"prompt": "Return only Python code. Write a pytest fixture and one test that uses it to verify a function converting Celsius to Fahrenheit.",
"required_markers": [
"@pytest.fixture",
"def test_",
"assert",
"fahrenheit"
],
"format_rule": "python_code"
},
{
"id": "py_async_fetch",
"title": "Async Fetch",
"category": "async",
"prompt": "Return only Python code. Write an async function that fetches two URLs concurrently with asyncio.gather and returns both bodies.",
"required_markers": [
"async def",
"asyncio.gather",
"await",
"aiohttp"
],
"format_rule": "python_code"
},
{
"id": "py_http_retry",
"title": "HTTP Retry",
"category": "http",
"prompt": "Return only Python code. Write a requests wrapper that retries HTTP 429 with exponential backoff and a maximum attempt count.",
"required_markers": [
"requests",
"429",
"backoff",
"max_attempts"
],
"format_rule": "python_code"
},
{
"id": "py_json_validate",
"title": "JSON Validation",
"category": "validation",
"prompt": "Return only Python code. Validate a JSON object against a schema and raise ValueError when required keys are missing.",
"required_markers": [
"jsonschema",
"ValueError",
"required",
"schema"
],
"format_rule": "python_code"
},
{
"id": "py_sqlite_store",
"title": "SQLite Store",
"category": "sqlite",
"prompt": "Return only Python code. Create a SQLite table for events and write a function that inserts one event row safely.",
"required_markers": [
"sqlite3",
"CREATE TABLE",
"INSERT INTO",
"commit"
],
"format_rule": "python_code"
},
{
"id": "py_fastapi_handler",
"title": "FastAPI Handler",
"category": "web",
"prompt": "Return only Python code. Write a FastAPI route that returns a JSON health response with status and version fields.",
"required_markers": [
"FastAPI",
"@app.get",
"status",
"version"
],
"format_rule": "python_code"
},
{
"id": "py_config_dataclass",
"title": "Config Dataclass",
"category": "config",
"prompt": "Return only Python code. Build a dataclass-based config loader that reads environment variables and supplies defaults.",
"required_markers": [
"dataclass",
"os.environ",
"default",
"load"
],
"format_rule": "python_code"
},
{
"id": "py_logging_setup",
"title": "Logging Setup",
"category": "logging",
"prompt": "Return only Python code. Configure structured logging with a timestamped formatter and a reusable setup function.",
"required_markers": [
"logging",
"Formatter",
"timestamp",
"basicConfig"
],
"format_rule": "python_code"
},
{
"id": "py_thread_pool",
"title": "Thread Pool",
"category": "concurrency",
"prompt": "Return only Python code. Use concurrent.futures to run a small CPU-bound function across a list of inputs and collect results.",
"required_markers": [
"concurrent.futures",
"ThreadPoolExecutor",
"map",
"results"
],
"format_rule": "python_code"
},
{
"id": "py_package_layout",
"title": "Package Layout",
"category": "package",
"prompt": "Return only Python code. Show a minimal package layout with __init__.py and a helper module that can be imported from tests.",
"required_markers": [
"__init__.py",
"import",
"helper",
"tests"
],
"format_rule": "python_code"
},
{
"id": "py_debug_stacktrace",
"title": "Debug Stacktrace",
"category": "debugging",
"prompt": "Return only Python code. Fix a function that crashes on None input by adding an early return and a clear exception message.",
"required_markers": [
"None",
"return",
"raise",
"message"
],
"format_rule": "python_code"
},
{
"id": "py_refactor_split",
"title": "Refactor Split",
"category": "refactor",
"prompt": "Return only Python code. Refactor a large function into two smaller helpers while preserving behavior.",
"required_markers": [
"def",
"helper",
"return",
"preserve"
],
"format_rule": "python_code"
},
{
"id": "py_csv_summary",
"title": "CSV Summary",
"category": "analysis",
"prompt": "Return only Python code. Read a CSV file and produce a summary with row count and a count of unique values in one column.",
"required_markers": [
"csv",
"row_count",
"unique",
"Counter"
],
"format_rule": "python_code"
},
{
"id": "py_pathlib_clean",
"title": "Pathlib Cleaner",
"category": "filesystem",
"prompt": "Return only Python code. Use pathlib to remove empty files from a directory tree and print each deleted path.",
"required_markers": [
"pathlib",
"rglob",
"unlink",
"print"
],
"format_rule": "python_code"
},
{
"id": "py_pydantic_model",
"title": "Pydantic Model",
"category": "validation",
"prompt": "Return only Python code. Define a Pydantic model for a user profile with email validation and an age field.",
"required_markers": [
"BaseModel",
"EmailStr",
"age",
"validation"
],
"format_rule": "python_code"
},
{
"id": "py_regex_log_parser",
"title": "Regex Log Parser",
"category": "parsing",
"prompt": "Return only Python code. Parse web server log lines with regex and return a list of status codes and request paths.",
"required_markers": [
"re",
"status",
"path",
"findall"
],
"format_rule": "python_code"
}
]
}


@@ -0,0 +1,347 @@
{
"generated_at": "2026-04-11T19:00:02Z",
"suite_name": "parallel-qwen-same-model-20q-v1",
"version": "1.0",
"purpose": "Run the shared 20-question Python benchmark in two-question batches against one model at a time. Questions 1+2 run together, then 3+4, and so on, while Ollama stays on one loaded model with two parallel request slots and a 32K request context.",
"run_mode": "same_model_pairs",
"question_batch_size": 2,
"question_assignment": "same_model_receives_both_questions_in_each_batch",
"ollama_runtime": {
"max_loaded_models": 1,
"num_parallel": 2,
"num_ctx": 32768,
"keep_alive": "24h"
},
"lanes": [
{
"lane_id": "qwen2_5_coder_0_5b_same_model",
"display_name": "Qwen2.5 Coder 0.5B same-model 2-up",
"kind": "same_model",
"models": [
{
"role": "shared",
"model": "qwen2.5-coder:0.5b",
"display_name": "Qwen2.5 Coder 0.5B",
"family": "qwen2.5-coder",
"size_label": "0.5B",
"max_context_tokens": 32768,
"requested_num_ctx": 32768,
"mode": "same_model"
}
]
},
{
"lane_id": "qwen2_5_coder_1_5b_same_model",
"display_name": "Qwen2.5 Coder 1.5B same-model 2-up",
"kind": "same_model",
"models": [
{
"role": "shared",
"model": "qwen2.5-coder:1.5b",
"display_name": "Qwen2.5 Coder 1.5B",
"family": "qwen2.5-coder",
"size_label": "1.5B",
"max_context_tokens": 32768,
"requested_num_ctx": 32768,
"mode": "same_model"
}
]
},
{
"lane_id": "qwen2_5_coder_3b_same_model",
"display_name": "Qwen2.5 Coder 3B same-model 2-up",
"kind": "same_model",
"models": [
{
"role": "shared",
"model": "qwen2.5-coder:3b",
"display_name": "Qwen2.5 Coder 3B",
"family": "qwen2.5-coder",
"size_label": "3B",
"max_context_tokens": 32768,
"requested_num_ctx": 32768,
"mode": "same_model"
}
]
},
{
"lane_id": "qwen3_5_0_8b_same_model",
"display_name": "Qwen3.5 0.8B same-model 2-up",
"kind": "same_model",
"models": [
{
"role": "shared",
"model": "qwen3.5:0.8b",
"display_name": "Qwen3.5 0.8B",
"family": "qwen3.5",
"size_label": "0.8B",
"max_context_tokens": 262144,
"requested_num_ctx": 32768,
"mode": "same_model"
}
]
}
],
"questions": [
{
"id": "py_csv_parse",
"title": "CSV Parser",
"category": "parsing",
"prompt": "Return only Python code. Write a function that reads CSV text, skips blank lines, and returns a list of dicts keyed by the header row.",
"required_markers": [
"csv",
"Dict",
"reader",
"header"
],
"format_rule": "python_code"
},
{
"id": "py_file_scan",
"title": "File Scanner",
"category": "file_io",
"prompt": "Return only Python code. Write a script that walks a directory tree and prints the paths of files larger than 5 MB.",
"required_markers": [
"os.walk",
"5 * 1024 * 1024",
"print",
"path"
],
"format_rule": "python_code"
},
{
"id": "py_cli_args",
"title": "CLI Arguments",
"category": "cli",
"prompt": "Return only Python code. Build a small argparse CLI with one required path argument and one optional verbose flag.",
"required_markers": [
"argparse",
"--verbose",
"ArgumentParser",
"path"
],
"format_rule": "python_code"
},
{
"id": "py_typing_dataclass",
"title": "Typed Dataclass",
"category": "typing",
"prompt": "Return only Python code. Define a typed dataclass for a job record with id, name, created_at, and is_active fields.",
"required_markers": [
"@dataclass",
"created_at",
"is_active",
"str"
],
"format_rule": "python_code"
},
{
"id": "py_pytest_fixture",
"title": "Pytest Fixture",
"category": "tests",
"prompt": "Return only Python code. Write a pytest fixture and one test that uses it to verify a function converting Celsius to Fahrenheit.",
"required_markers": [
"@pytest.fixture",
"def test_",
"assert",
"fahrenheit"
],
"format_rule": "python_code"
},
{
"id": "py_async_fetch",
"title": "Async Fetch",
"category": "async",
"prompt": "Return only Python code. Write an async function that fetches two URLs concurrently with asyncio.gather and returns both bodies.",
"required_markers": [
"async def",
"asyncio.gather",
"await",
"aiohttp"
],
"format_rule": "python_code"
},
{
"id": "py_http_retry",
"title": "HTTP Retry",
"category": "http",
"prompt": "Return only Python code. Write a requests wrapper that retries HTTP 429 with exponential backoff and a maximum attempt count.",
"required_markers": [
"requests",
"429",
"backoff",
"max_attempts"
],
"format_rule": "python_code"
},
{
"id": "py_json_validate",
"title": "JSON Validation",
"category": "validation",
"prompt": "Return only Python code. Validate a JSON object against a schema and raise ValueError when required keys are missing.",
"required_markers": [
"jsonschema",
"ValueError",
"required",
"schema"
],
"format_rule": "python_code"
},
{
"id": "py_sqlite_store",
"title": "SQLite Store",
"category": "sqlite",
"prompt": "Return only Python code. Create a SQLite table for events and write a function that inserts one event row safely.",
"required_markers": [
"sqlite3",
"CREATE TABLE",
"INSERT INTO",
"commit"
],
"format_rule": "python_code"
},
{
"id": "py_fastapi_handler",
"title": "FastAPI Handler",
"category": "web",
"prompt": "Return only Python code. Write a FastAPI route that returns a JSON health response with status and version fields.",
"required_markers": [
"FastAPI",
"@app.get",
"status",
"version"
],
"format_rule": "python_code"
},
{
"id": "py_config_dataclass",
"title": "Config Dataclass",
"category": "config",
"prompt": "Return only Python code. Build a dataclass-based config loader that reads environment variables and supplies defaults.",
"required_markers": [
"dataclass",
"os.environ",
"default",
"load"
],
"format_rule": "python_code"
},
{
"id": "py_logging_setup",
"title": "Logging Setup",
"category": "logging",
"prompt": "Return only Python code. Configure structured logging with a timestamped formatter and a reusable setup function.",
"required_markers": [
"logging",
"Formatter",
"timestamp",
"basicConfig"
],
"format_rule": "python_code"
},
{
"id": "py_thread_pool",
"title": "Thread Pool",
"category": "concurrency",
"prompt": "Return only Python code. Use concurrent.futures to run a small CPU-bound function across a list of inputs and collect results.",
"required_markers": [
"concurrent.futures",
"ThreadPoolExecutor",
"map",
"results"
],
"format_rule": "python_code"
},
{
"id": "py_package_layout",
"title": "Package Layout",
"category": "package",
"prompt": "Return only Python code. Show a minimal package layout with __init__.py and a helper module that can be imported from tests.",
"required_markers": [
"__init__.py",
"import",
"helper",
"tests"
],
"format_rule": "python_code"
},
{
"id": "py_debug_stacktrace",
"title": "Debug Stacktrace",
"category": "debugging",
"prompt": "Return only Python code. Fix a function that crashes on None input by adding an early return and a clear exception message.",
"required_markers": [
"None",
"return",
"raise",
"message"
],
"format_rule": "python_code"
},
{
"id": "py_refactor_split",
"title": "Refactor Split",
"category": "refactor",
"prompt": "Return only Python code. Refactor a large function into two smaller helpers while preserving behavior.",
"required_markers": [
"def",
"helper",
"return",
"preserve"
],
"format_rule": "python_code"
},
{
"id": "py_csv_summary",
"title": "CSV Summary",
"category": "analysis",
"prompt": "Return only Python code. Read a CSV file and produce a summary with row count and a count of unique values in one column.",
"required_markers": [
"csv",
"row_count",
"unique",
"Counter"
],
"format_rule": "python_code"
},
{
"id": "py_pathlib_clean",
"title": "Pathlib Cleaner",
"category": "filesystem",
"prompt": "Return only Python code. Use pathlib to remove empty files from a directory tree and print each deleted path.",
"required_markers": [
"pathlib",
"rglob",
"unlink",
"print"
],
"format_rule": "python_code"
},
{
"id": "py_pydantic_model",
"title": "Pydantic Model",
"category": "validation",
"prompt": "Return only Python code. Define a Pydantic model for a user profile with email validation and an age field.",
"required_markers": [
"BaseModel",
"EmailStr",
"age",
"validation"
],
"format_rule": "python_code"
},
{
"id": "py_regex_log_parser",
"title": "Regex Log Parser",
"category": "parsing",
"prompt": "Return only Python code. Parse web server log lines with regex and return a list of status codes and request paths.",
"required_markers": [
"re",
"status",
"path",
"findall"
],
"format_rule": "python_code"
}
]
}
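
Reviewer note: the `purpose` field of this suite says questions 1+2 run together, then 3+4, and so on, with `question_batch_size` set to 2. A minimal sketch of that pairing, assuming the runner simply slices the `questions` array in file order; `batch_questions` is a hypothetical name and not a function from the harness.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def batch_questions(questions: Sequence[T], batch_size: int = 2) -> List[List[T]]:
    """Split questions into consecutive batches, preserving file order.

    Assumption: a trailing partial batch is kept rather than dropped.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(questions[start:start + batch_size])
        for start in range(0, len(questions), batch_size)
    ]


# The first four question ids from the suite above.
question_ids = ["py_csv_parse", "py_file_scan", "py_cli_args", "py_typing_dataclass"]

for batch in batch_questions(question_ids, batch_size=2):
    print(batch)
```

With the 20 questions in this suite and a batch size of 2, this slicing yields 10 batches per lane, which matches the two parallel request slots declared under `ollama_runtime`.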


@@ -0,0 +1,505 @@
{
"suite_name": "python-context-edge-append-v1",
"version": "1.0",
"append_mode": "questions_only",
"purpose": "Append-only long-context stress questions for the overnight Python suite. The runner expands context bands and renders model-specific packets near the configured benchmark context caps.",
"questions": [
{
"id": "context_edge_release_wave_planner",
"title": "Context Edge Release Wave Planner",
"category": "orchestration",
"format_rule": "json_dict",
"num_predict": 650,
"required_markers": [
"auth/session.py",
"contracts/user-profile.json",
"FLAG_REQUIRE_NEW_TOKEN_CACHE",
"db_migrate --lock-timeout 120",
"billing-webhook",
"search-reindex",
"09:30"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your orchestration answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"WG-02",
"CHK-27",
"BUS-03"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are the release orchestrator for a multi-service Python deployment train. Read the full packet carefully because the decisive blockers are spread across the early, middle, and late parts of the context.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a deployment packet with output keys exactly: objective, blocking_dependencies, execution_waves, owner_handoffs, validation_gates, rollback_triggers. Constraints: mention auth/session.py, contracts/user-profile.json, FLAG_REQUIRE_NEW_TOKEN_CACHE, db_migrate --lock-timeout 120, billing-webhook, search-reindex, and 09:30 customer demo.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 1100,
"minimum_context_tokens": 2048,
"record_prefix": "Release packet item",
"context_intro": "Release train packet for the April wave. Every line came from a planning note, test summary, operator handoff, or business constraint. Treat the packet as authoritative and do not invent hidden systems.",
"anchors": {
"early": [
"WG-02: identity-api and admin-web are both changing auth/session.py and contracts/user-profile.json. If those branches merge out of order, cookie version mismatches break post-login redirects for tenant-scoped routes.",
"OPS-14: db_migrate --lock-timeout 120 must run while FLAG_REQUIRE_NEW_TOKEN_CACHE is disabled. The flag flips cache key shape and makes rollback harder once migration starts."
],
"middle": [
"CHK-27: search-reindex can lag by 35 minutes after the API deploy. That lag is acceptable for customer search results but should not block the release acceptance gate.",
"SEC-09: deploy key rotation already happened. Do not roll back to images older than 2026.04.07-3 because those images still reference the retired package registry key."
],
"late": [
"BUS-03: the billing-webhook queue must keep draining during the 09:30 customer demo. A pause longer than 90 seconds will surface stale invoice state in the live walkthrough.",
"QA-41: mobile login smoke is only meaningful after edge-proxy and identity-api are both serving the same cookie version. Running it earlier produces false failures."
]
},
"records": [
"Service identity-api is green on unit tests but still has one open canary note about tenant header normalization. The branch owner says the change only touches cookie parsing and the response contract for auth bootstrap.",
"Service edge-proxy passes lint and integration tests. The remaining note says cookie-version forwarding was renamed from cookie_build to cookie_version to match the new auth contract.",
"Service admin-web updated the post-login redirect helper and now reads project context after auth bootstrap. QA notes that login, logout, and tenant-missing flows all need one shared smoke pass after deploy.",
"Worker queue-scheduler has no code changes in this train but its cron definitions were regenerated yesterday. Operators want to avoid overlapping scheduler restarts with the migration step.",
"Billing service is not changing code in this train. The operational risk is backlog accumulation in the billing-webhook consumer if the identity rollout accidentally stalls shared Redis access.",
"Search service is receiving a schema-compatible event rename. The reindex job can backfill eventually, and product already accepted a temporary lag in search freshness during the train.",
"QA note: the fastest critical path is identity-api, then edge-proxy, then admin-web, then mobile smoke, then billing observation, then search verification. They do not want optional checks in front of auth safety checks.",
"Rollback note: if cookie validation fails after the proxy deploy, revert edge-proxy first and hold admin-web. Reverting admin-web alone leaves the browser storing the wrong redirect metadata.",
"Observability note: dashboard ORCH-REL-12 tracks tenant-scoped login success, billing-webhook lag, and search event age in one board. Release managers prefer those metrics over raw pod restart counts.",
"Dependency note: the deploy tool can stage identity-api and edge-proxy in separate waves, but shared contract changes mean contracts/user-profile.json must land before admin-web is exposed to users.",
"Comms note: support has a saved macro for minor search delay, but no macro for failed billing state during a customer demo. Business risk is therefore asymmetric toward queue health.",
"Infra note: the release train uses one database migration transaction and one feature-flag flip. Operators only want one irreversible step, and they want it late enough that rollback still exists before then."
]
}
},
{
"id": "context_edge_worker_dispatch_matrix",
"title": "Context Edge Worker Dispatch Matrix",
"category": "worker_coordination",
"format_rule": "json_dict",
"num_predict": 650,
"required_markers": [
"resolve_context.go",
"20260408_add_job_owner.sql",
"toolset_registry.py",
"status.json",
"rebase",
"ops-2"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your worker-coordination answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"orch-3",
"worker-8",
"ops-2"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are coordinating a mixed worker wave across Earth, TruthGraph, and MyServers. The packet is intentionally long because the real risk is file overlap and sequencing, not raw task count.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a dispatch packet with output keys exactly: stalled_workstreams, safe_parallel_groups, files_with_conflict_risk, required_rebases, first_messages_to_send, done_definition. Constraints: mention resolve_context.go, 20260408_add_job_owner.sql, toolset_registry.py, status.json, rebase order, and the ops-2 bottleneck.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 1000,
"minimum_context_tokens": 2048,
"record_prefix": "Dispatch packet item",
"context_intro": "This packet merges worker handoffs, dirty-file reports, and operator availability notes. Every worker owns a different slice, but shared files and sequencing make or break the wave.",
"anchors": {
"early": [
"orch-3: branch tg-query-cleanup and orch-7: branch tg-doc-ingest both touch TruthGraph/internal/query/resolve_context.go. They cannot land independently without a reconciliation pass.",
"worker-5: scheduler ownership cleanup depends on migration 20260408_add_job_owner.sql. Any code merge before the migration lands will leave mixed owner semantics in runtime views."
],
"middle": [
"worker-2: tests are green, but docs/contracts/status.json still reflects the old rollout states. Snapshot tests downstream will churn if the contract file is not refreshed before merge.",
"worker-8: already cherry-picked part of worker-1. Rebase order matters now because tool names were renamed in one branch and only documented in the other."
],
"late": [
"ops-2: the only human with production shell access before 08:00. Anything needing live verification or cron edits must line up behind that window.",
"worker-4: can unblock three others by landing toolset_registry.py first. Until that file stabilizes, downstream command manifests will keep conflicting."
]
},
"records": [
"worker-1 is updating telemetry/build_python_overnight_mini_report.py and the JSON summary contract consumed by the manual page. Their branch also renames one latency field used by dashboards.",
"worker-2 is on TruthGraph/docs plus a small code touch in cmd/truthgraph/status.go. The branch is mostly docs but accidentally edits one shared enum name in the CLI output helper.",
"worker-3 is improving the MyServers cron installer for one-time jobs. Their changes are isolated except for touching a shared helper that prints UTC timestamps for wrapper scripts.",
"worker-4 is consolidating tool declarations in toolset_registry.py. Multiple downstream branches imported old names directly instead of using the registry.",
"worker-5 is adding explicit owner fields to scheduler jobs and matching database rows. The migration is written but has not been reviewed against existing null rows.",
"worker-6 is editing operator docs and runbooks. They do not block code merges directly, but they own the wording that gets copied into incident channels during rollout.",
"worker-7 is adjusting model-routing defaults for Hermes and Discord. Their branch changes both config defaults and one reconnect warning string in gateway/run.py.",
"worker-8 is on lightweight dashboard polish but already cherry-picked worker-1's field rename to unblock local screenshots. Their branch now contains an older copy of the report schema.",
"orch-1 wants the final wave to preserve linear, reviewable commits. They explicitly do not want one mega-merge that hides ordering mistakes.",
"orch-2 notes that MyServers and Earth can merge independently unless the status contract is changed. If status.json shifts shape, the report builder and dashboards need to move together.",
"test note: the riskiest shared files are resolve_context.go, toolset_registry.py, status.json, and the migration plus scheduler read path. Everything else is secondary.",
"communications note: developers are online all morning, but only ops-2 can approve production crontab edits before the normal business day starts."
]
}
},
{
"id": "context_edge_scheduler_incident_forensics",
"title": "Context Edge Scheduler Incident Forensics",
"category": "debugging",
"format_rule": "json_dict",
"num_predict": 700,
"required_markers": [
"ACK_AFTER_WRITE",
"deadline_seconds=45",
"clock_skew_ms",
"retry_id",
"ack",
"duplicate",
"2026.04.08-rc3"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your incident answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"INC-104",
"TRACE-22",
"DB-19"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are reading a long incident packet for a Python scheduler service that produced duplicate downstream outputs. Several clues are noisy; only some of them matter.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce an incident packet with output keys exactly: primary_failure, evidence_chain, misleading_signals, immediate_mitigation, durable_fix, verification_sequence. Constraints: mention ACK_AFTER_WRITE, deadline_seconds=45, clock_skew_ms, retry_id, duplicate ack, and deploy 2026.04.08-rc3.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 1100,
"minimum_context_tokens": 2048,
"record_prefix": "Incident packet item",
"context_intro": "This packet combines deploy notes, logs, traces, metrics, and operator comments from a duplicate-output incident in a Python scheduler pipeline. Not every warning is causal.",
"anchors": {
"early": [
"INC-104: the first duplicate outputs started only after deploy 2026.04.08-rc3 changed ACK_AFTER_WRITE from false to true in the scheduler worker configuration.",
"LOG-13: clock_skew_ms spiked to 1820 on one host, but duplicates had already begun before NTP finished correcting the clock warning."
],
"middle": [
"TRACE-22: retry_id increments before the original worker calls ack(), so two workers can believe they own the same logical job when the deadline expires.",
"CFG-07: deadline_seconds=45 replaced the previous value of 90 in the same deploy, shrinking the time between write completion and retry pickup."
],
"late": [
"DB-19: the database write commits successfully, then a second worker acks the same logical job after the retry path already re-issued it.",
"OPS-31: restarting Redis reduced queue noise and warning volume, but duplicate downstream outputs continued afterward."
]
},
"records": [
"The scheduler processes one logical job per payload_id and writes a completion row before acking the queue lease. Prior to rc3, ack happened first and the write path was shorter.",
"Metrics packet: queue lag rose mildly during the incident, but CPU and memory stayed within normal range. The most visible symptom to customers was duplicate email and webhook fan-out.",
"Operator note: one host emitted noisy clock warnings, which pulled attention toward NTP first. A later cross-host trace showed duplicate ownership on hosts without clock issues.",
"Deploy note: rc3 also changed retry logging verbosity and added one trace span around downstream fan-out. That made the incident look larger in logs but was not itself causal.",
"Trace note: original worker wrote success, paused in a post-write hook, then attempted ack. Retry worker acquired the lease after deadline expiration and re-issued fan-out with a new retry_id.",
"Safety note: the downstream consumer is idempotent for storage writes but not for customer notifications, which is why duplicates surfaced in email and webhook channels first.",
"Redis note: one operator restart reduced pending command backlog and made queue metrics calmer. No code paths changed and the duplicate symptom persisted.",
"Config note: deadline_seconds and ACK_AFTER_WRITE were rolled out together. There is no experiment isolating one from the other in production.",
"Postmortem draft: the service lacks a single ownership fence between write completion and lease acknowledgment. Retry semantics assume that ack or durable ownership happens first.",
"Verification note: operators want a fix that can be tested under a fake clock and a delayed post-write hook so the race becomes deterministic in CI.",
"Rollback note: reverting only the retry logging changes would be meaningless. The risky part of rc3 is the ordering change plus the tighter deadline.",
"Customer note: the biggest harm was duplicate human-facing notifications, not raw queue delay. Mitigation must stop duplicate fan-out quickly even if throughput drops."
]
}
},
{
"id": "context_edge_ingest_requirements_contract",
"title": "Context Edge Ingest Requirements Contract",
"category": "structured_extraction",
"format_rule": "json_dict",
"num_predict": 600,
"required_markers": [
"ingestion_mode",
"retry_budget",
"quarantine_rule",
"required_artifacts",
"owner_escalation",
"privacy_constraint",
"rollout_gate",
"kill_switch"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your contract extraction answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"REQ-03",
"POL-11",
"OPS-28"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are extracting a deployment contract for a Python ingestion pipeline from a long mixed packet of requirements, policy notes, rollout notes, and operator reminders.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: return a JSON object with exactly these keys and no extras: ingestion_mode, retry_budget, quarantine_rule, required_artifacts, owner_escalation, privacy_constraint, rollout_gate, kill_switch. Constraints: stay literal, prefer exact values over paraphrase, and do not invent unstated defaults.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 900,
"minimum_context_tokens": 2048,
"record_prefix": "Contract packet item",
"context_intro": "Mixed contract packet for a Python ingestion system. Some details are binding requirements, some are background. The task is to extract only the binding contract values.",
"anchors": {
"early": [
"REQ-03: retry budget is exactly 2 automatic retries after the first failed attempt. A third retry is forbidden because it can duplicate external side effects.",
"REQ-05: ingestion_mode is shadow_then_stream. The pipeline must begin in shadow mode, prove parity, and only then flip to streamed writes."
],
"middle": [
"POL-11: any payload containing raw customer email addresses goes to quarantine bucket pii-review and must not be summarized into human-readable incident reports.",
"ART-07: required_artifacts are manifest.json, validation_report.json, and trace.txt for every promoted ingest run."
],
"late": [
"OPS-28: the kill switch is environment variable INGEST_STOP_AFTER_DOWNLOAD=1 and it must stop promotion before parsing or persistence.",
"OWN-04: owner escalation goes to platform-oncall first, then data-infra lead only if the incident lasts more than 30 minutes."
]
},
"records": [
"Design note: the system ingests partner dumps, validates rows, stages transformed objects, and only later publishes promoted records. Teams want one contract that product and ops can both read.",
"Rollout note: shadow mode exists because partner dumps are often messy. The team wants hard evidence that counts and hashes line up before streamed writes go live.",
"Validation note: row-level errors may be sampled for debugging, but privacy guidance forbids copying raw customer email into broad operator summaries or public incident channels.",
"Ops note: when promotion is blocked, the run still needs complete artifacts so responders can debug without rerunning the partner dump immediately.",
"Noise note: one historical document recommended three retries for network flaps, but that advice pre-dates the current side-effect model and is no longer authoritative.",
"Escalation note: data-infra lead helps on sustained incidents, but the first operational owner is always the platform-oncall rotation because they control the promotion switch.",
"Runbook note: the kill switch exists to stop damage after download if a bad dump arrives. It should preserve downloaded evidence while preventing parse and write phases.",
"Compliance note: the quarantine bucket name is fixed because downstream cleanup tooling keys off pii-review and nothing else.",
"Artifact note: analysts depend on manifest.json, validation_report.json, and trace.txt when they compare shadow and stream runs. Missing any one of them blocks promotion approval.",
"Product note: streamed writes are the end state, but leadership explicitly wants a visible shadow phase first, not an immediate cutover."
]
}
},
{
"id": "context_edge_ollama_runbook_migration_brief",
"title": "Context Edge Ollama Runbook Migration Brief",
"category": "documentation",
"format_rule": "numbered_plan_4",
"num_predict": 480,
"required_markers": [
"OLLAMA_HOST=0.0.0.0:11434",
"OLLAMA_MAX_LOADED_MODELS=3",
"curl http://SERVER_IP:11434/api/tags",
"hermes gateway restart"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your migration brief. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"CFG-21",
"NET-08",
"BOT-14"
],
"followup_format_rule": "three_bullets",
"prompt": "Return exactly four numbered lines and nothing else. Each line must be one migration step for an operator moving from local-only Ollama to a remote Ollama plus Discord/Hermes setup.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nRequirements: Step 1 must be a precheck, step 2 must be the server cutover, step 3 must be verification from the developer machine, and step 4 must be the bot/client reconnection step. Mention OLLAMA_HOST=0.0.0.0:11434, OLLAMA_MAX_LOADED_MODELS=3, curl http://SERVER_IP:11434/api/tags, and hermes gateway restart.",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 700,
"minimum_context_tokens": 2048,
"record_prefix": "Runbook packet item",
"context_intro": "Migration packet for exposing a previously local-only Ollama server to remote clients while keeping the setup supportable for Aider and Hermes.",
"anchors": {
"early": [
"CFG-21: systemd override must set OLLAMA_HOST=0.0.0.0:11434, OLLAMA_NUM_PARALLEL=1 or another deliberate value, OLLAMA_MAX_LOADED_MODELS=3, and OLLAMA_KEEP_ALIVE=24h.",
"CFG-24: after editing override.conf, operators must run systemctl daemon-reload and restart ollama before testing anything else."
],
"middle": [
"NET-08: open port 11434 only from the developer machine IP when possible. A wide-open firewall rule is simpler but explicitly less safe.",
"NET-11: curl http://localhost:11434/api/tags on the server is not enough; the runbook must also include curl http://SERVER_IP:11434/api/tags from the developer machine."
],
"late": [
"BOT-14: Hermes should not be restarted until the remote tags endpoint works. Otherwise Discord symptoms look like bot errors when the real issue is Ollama reachability.",
"BOT-19: after the endpoint is healthy, hermes gateway restart is the final reconnect step so Discord and custom endpoint settings are refreshed."
]
},
"records": [
"Server baseline: Ollama is installed and running, but historically bound only to localhost. The operator wants to serve remote Aider and Hermes without turning the box into an open relay.",
"Model baseline: the desired operating set is a small router, a medium orchestrator, and one heavier coding worker. OLLAMA_MAX_LOADED_MODELS=3 exists to keep the three hottest models around without pretending all can stay resident.",
"Firewall note: UFW may be inactive on a fresh VPS, in which case adding the rule alone changes nothing until UFW is enabled or provider-side firewall rules are also correct.",
"Developer-machine note: direct curl smoke tests are faster and less ambiguous than jumping straight into Hermes, because they isolate network reachability from agent wrapper behavior.",
"Aider note: ~/.aider.conf.yml should point at http://SERVER_IP:11434/v1 with API key set to ollama. That config proves the remote OpenAI-compatible surface is working before complex agents are blamed.",
"Hermes note: custom endpoint setup requires the same base URL and a model string. Discord is only useful after the base endpoint already responds from the laptop.",
"Rollback note: if remote access fails, revert the systemd override and firewall rule before touching client configs. Otherwise client debugging starts from a broken server assumption.",
"Verification note: the strongest smoke test order is server-local tags, laptop-visible tags, one short chat completion, then Aider or Hermes reconnect."
]
}
},
{
"id": "context_edge_python_context_budget_module",
"title": "Context Edge Python Context Budget Module",
"category": "coding",
"format_rule": "python_module",
"num_predict": 900,
"required_markers": [
"def utc_now",
"def estimate_token_count",
"def target_prompt_tokens",
"def assemble_context_packet",
"def prompt_sha256",
"hashlib",
"typing"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your code answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"REQ-CTX-01",
"FAIL-CTX-07",
"OPS-CTX-12"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only Python code. Write one self-contained module named context_budget.py. The module must expose utc_now(), estimate_token_count(text), target_prompt_tokens(max_context_tokens, band_fraction, reserved_output_tokens), assemble_context_packet(intro, early, middle, late, records, target_tokens, record_prefix='Packet item'), and prompt_sha256(text). Use only the standard library, include type hints, keep the behavior deterministic, and do not emit markdown fences.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 1400,
"minimum_context_tokens": 2048,
"record_prefix": "Design packet item",
"context_intro": "Design packet for a reusable context-budget helper module intended for benchmark runners and agent wrappers that need deterministic long-prompt assembly plus debuggable metadata.",
"anchors": {
"early": [
"REQ-CTX-01: the module must expose target_prompt_tokens(max_context_tokens, band_fraction, reserved_output_tokens) so context bands are reproducible instead of hand-tuned.",
"REQ-CTX-03: estimate_token_count can be approximate but must be deterministic, cheap, and based only on the input text."
],
"middle": [
"FAIL-CTX-07: a previous Hermes replay consumed a huge prompt, returned finish_reason stop, and produced empty content. Debugging required a prompt hash plus preview and tail slices.",
"FAIL-CTX-09: repeated records are acceptable when stretching a packet, but their ordering must be deterministic or telemetry comparisons become meaningless."
],
"late": [
"OPS-CTX-12: helper output and timestamps must stay human-facing and UTC-friendly because operators debug these suites from terminal logs, not notebooks.",
"OPS-CTX-14: no third-party tokenizer dependency is allowed on the server path because benchmark scripts must run on a clean VPS without pip installs."
]
},
"records": [
"Implementation note: assemble_context_packet should accept intro plus early, middle, late anchor lists and a pool of repeating records. The output should grow until it roughly hits a target token budget.",
"Debug note: prompt_sha256 exists because storing every rendered prompt verbatim can waste disk. A hash plus preview and tail slices gives traceability without keeping giant files by default.",
"Operator note: utc_now should be a tiny helper returning one stable UTC format so benchmark logs across scripts line up naturally.",
"Reliability note: target_prompt_tokens should guard against impossible inputs such as negative reserved output tokens or a band fraction outside the open interval from 0 to 1.",
"Performance note: estimate_token_count should be good enough for shaping packets but not so clever that it becomes the slowest part of the run.",
"Code style note: type hints matter because downstream scripts may import this helper. A small dataclass is fine, but the interface should remain simple and standard-library only.",
"Telemetry note: deterministic packet assembly makes it possible to compare models honestly because the prompt content is the same for every model once the cap and band are fixed.",
"Failure note: previous runs showed that long prompts can fail in clean-looking ways, including empty assistant text. The module therefore needs affordances for reproducible reconstruction."
]
}
},
{
"id": "context_edge_pytest_scheduler_retry_regression",
"title": "Context Edge Pytest Scheduler Retry Regression",
"category": "tests",
"format_rule": "pytest_code",
"num_predict": 1000,
"required_markers": [
"def test_",
"monkeypatch",
"retry_id",
"ACK_AFTER_WRITE",
"deadline_seconds",
"assert"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your test answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"BUG-ACK-01",
"TRACE-ACK-09",
"VERIFY-ACK-22"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only Python code. Write one focused pytest module for the duplicate-ack scheduler regression described in the packet. Requirements: include one deterministic test with monkeypatch or fakes, model the retry_id race, assert that only one logical job commit wins, and make the failure impossible to miss in CI. Use only standard pytest patterns and do not wrap the answer in markdown fences.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 1500,
"minimum_context_tokens": 2048,
"record_prefix": "Regression packet item",
"context_intro": "Regression packet for a Python scheduler service where retry timing and ack ordering can duplicate downstream side effects. The target is one surgical pytest module, not a whole test suite.",
"anchors": {
"early": [
"BUG-ACK-01: duplicate downstream outputs started after ACK_AFTER_WRITE=true shipped in rc3. The regression test must exercise that ordering change directly.",
"BUG-ACK-03: deadline_seconds tightened from 90 to 45 in the same release, making the retry pickup easier to trigger."
],
"middle": [
"TRACE-ACK-09: the retry worker increments retry_id before the original worker calls ack(), so the test needs two ownership paths and one delayed ack.",
"TRACE-ACK-14: the original write succeeds before the retry starts fan-out, which is why the bug is duplicate side effects rather than missing persistence."
],
"late": [
"VERIFY-ACK-22: the regression test must prove that only one logical job commit and one notify path are treated as authoritative after the fix.",
"VERIFY-ACK-24: a fake clock or explicit delay hook is required so the race is deterministic instead of relying on sleeping threads."
]
},
"records": [
"Service behavior: one worker writes job completion and a downstream notify record, then acknowledges the queue lease. Retry logic watches deadline expiry and can spawn a second worker for the same logical payload.",
"Historical assumption: ack happened before write, so retry pickup rarely overlapped a durable write. After rc3 that assumption no longer holds.",
"Testing note: a good regression harness can stub the notifier and collect emitted payload_ids. Duplicate notification is easier to assert than raw queue internals.",
"Operator note: Redis restarts and clock warnings were noisy but non-causal. The test should focus on ordering and ownership, not infrastructure flakiness.",
"Implementation note: a fake clock or injectable now() hook is preferred over thread sleeps because CI latency is too variable for a race test.",
"Acceptance note: if the fix works, either the retry worker or the original worker should stand down cleanly, but never both proceed to external notify.",
"CI note: the test should fail loudly with a short diff if duplicate notification happens. Silent counting helpers are harder to trust in review.",
"Code review note: a focused module with one strong regression test is worth more than many weak permutations for this specific benchmark."
]
}
},
{
"id": "context_edge_change_review_packet",
"title": "Context Edge Change Review Packet",
"category": "review",
"format_rule": "json_dict",
"num_predict": 700,
"required_markers": [
"gateway/run.py",
"run_python_task_suite.py",
"status.json",
"20260408_add_job_owner.sql",
"discord",
"migration"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your review answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"REV-03",
"REV-17",
"REV-29"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are reviewing a large mixed diff packet that spans Python services, telemetry tooling, Discord gateway behavior, documentation, and one database migration.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a review packet with output keys exactly: likely_regressions, riskiest_files, missing_tests, rollout_risk, safe_merge_condition. Constraints: mention gateway/run.py, run_python_task_suite.py, status.json, 20260408_add_job_owner.sql, and Discord or gateway behavior where relevant.",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 1100,
"minimum_context_tokens": 2048,
"record_prefix": "Review packet item",
"context_intro": "Mixed diff packet assembled from review summaries, file notes, test output snippets, and rollout comments. The challenge is to surface concrete regressions instead of repeating generic code-review advice.",
"anchors": {
"early": [
"REV-03: migration 20260408_add_job_owner.sql adds a non-null job_owner column without a documented backfill for existing rows. That can fail immediately on populated databases.",
"REV-06: scheduler read paths were updated in code, but one admin query still selects the old nullable shape in a report view."
],
"middle": [
"REV-17: gateway/run.py changed reconnect behavior for stale provider responses, but no test proves how Discord handles empty assistant content or one clean timeout followed by a retry.",
"REV-21: run_python_task_suite.py now records prompt hashes and context metadata, yet no report-level check verifies the new keys are preserved."
],
"late": [
"REV-29: the report builder renamed one status field, but dashboards and status.json examples in docs still use the old key. That will silently break HTML rendering if merged together.",
"REV-34: rollout notes assume the migration and report schema can deploy independently, but the dashboard pull path still reads both in the same morning workflow."
]
},
"records": [
"Diff summary: gateway/run.py now waits longer before declaring the provider stale, and the Discord adapter emits one new reconnect warning line. No new fixture captures an empty successful response body.",
"Telemetry summary: run_python_task_suite.py expanded to support context-stress prompts, custom follow-up prompts, and prompt metadata files. The status markdown and report builder were only partially updated.",
"Migration summary: 20260408_add_job_owner.sql introduces explicit ownership on scheduled jobs so the UI can attribute work cleanly. The migration note mentions new writes, not legacy rows.",
"Dashboard summary: one HTML manual still expects the previous status key name from status.json and has no assertion on unknown-field fallback.",
"Docs summary: operator docs were refreshed for the new Discord reconnect wording, but one screenshots guide still references the prior warning text verbatim.",
"Review note: the riskiest interactions are schema plus runtime reads, and telemetry JSON plus dashboard consumption. Pure doc edits are comparatively safe.",
"Testing note: there are unit tests around scheduler ownership writes and a separate smoke test for Discord login, but nothing that exercises both the new reconnect path and empty assistant content.",
"Rollout note: support wants the dashboard alive on the same morning the migration lands. That makes silent telemetry-key drift more expensive than a normal internal-only contract change."
]
}
}
]
}

@@ -0,0 +1,561 @@
{
"suite_name": "python-context-edge-append-v1",
"version": "1.0",
"purpose": "Append-only long-context stress questions for the overnight Python suite. The runner expands context bands and renders model-specific packets near the configured benchmark context caps.",
"models": [
{
"model": "qwen32-coder-32k",
"display_name": "Qwen32 Coder 32k",
"size_label": "32b"
},
{
"model": "qwen14-coder-32k",
"display_name": "Qwen14 Coder 32k",
"size_label": "14b"
},
{
"model": "codestral-32k",
"display_name": "Codestral 32k",
"size_label": "22b"
},
{
"model": "codellama34-16k",
"display_name": "CodeLlama 34 16k",
"size_label": "34b"
},
{
"model": "phind34-16k",
"display_name": "Phind 34 16k",
"size_label": "34b"
},
{
"model": "qwen14-general-32k",
"display_name": "Qwen14 General 32k",
"size_label": "14b"
},
{
"model": "qwen2.5-coder:3b",
"display_name": "Qwen2.5 Coder 3B",
"size_label": "3b"
},
{
"model": "qwen2.5-coder:1.5b",
"display_name": "Qwen2.5 Coder 1.5B",
"size_label": "1.5b"
},
{
"model": "qwen2.5:3b",
"display_name": "Qwen2.5 3B",
"size_label": "3b"
},
{
"model": "llama3.2:3b",
"display_name": "Llama 3.2 3B",
"size_label": "3b"
},
{
"model": "phi3",
"display_name": "Phi-3 Mini",
"size_label": "3.8b"
}
],
"questions": [
{
"id": "context_edge_release_wave_planner",
"title": "Context Edge Release Wave Planner",
"category": "orchestration",
"format_rule": "json_dict",
"num_predict": 650,
"required_markers": [
"auth/session.py",
"contracts/user-profile.json",
"FLAG_REQUIRE_NEW_TOKEN_CACHE",
"db_migrate --lock-timeout 120",
"billing-webhook",
"search-reindex",
"09:30"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your orchestration answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"WG-02",
"CHK-27",
"BUS-03"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are the release orchestrator for a multi-service Python deployment train. Read the full packet carefully because the decisive blockers are spread across the early, middle, and late parts of the context.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a deployment packet with output keys exactly: objective, blocking_dependencies, execution_waves, owner_handoffs, validation_gates, rollback_triggers. Constraints: mention auth/session.py, contracts/user-profile.json, FLAG_REQUIRE_NEW_TOKEN_CACHE, db_migrate --lock-timeout 120, billing-webhook, search-reindex, and 09:30 customer demo.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 1100,
"minimum_context_tokens": 2048,
"record_prefix": "Release packet item",
"context_intro": "Release train packet for the April wave. Every line came from a planning note, test summary, operator handoff, or business constraint. Treat the packet as authoritative and do not invent hidden systems.",
"anchors": {
"early": [
"WG-02: identity-api and admin-web are both changing auth/session.py and contracts/user-profile.json. If those branches merge out of order, cookie version mismatches break post-login redirects for tenant-scoped routes.",
"OPS-14: db_migrate --lock-timeout 120 must run while FLAG_REQUIRE_NEW_TOKEN_CACHE is disabled. The flag flips cache key shape and makes rollback harder once migration starts."
],
"middle": [
"CHK-27: search-reindex can lag by 35 minutes after the API deploy. That lag is acceptable for customer search results but should not block the release acceptance gate.",
"SEC-09: deploy key rotation already happened. Do not roll back to images older than 2026.04.07-3 because those images still reference the retired package registry key."
],
"late": [
"BUS-03: the billing-webhook queue must keep draining during the 09:30 customer demo. A pause longer than 90 seconds will surface stale invoice state in the live walkthrough.",
"QA-41: mobile login smoke is only meaningful after edge-proxy and identity-api are both serving the same cookie version. Running it earlier produces false failures."
]
},
"records": [
"Service identity-api is green on unit tests but still has one open canary note about tenant header normalization. The branch owner says the change only touches cookie parsing and the response contract for auth bootstrap.",
"Service edge-proxy passes lint and integration tests. The remaining note says cookie-version forwarding was renamed from cookie_build to cookie_version to match the new auth contract.",
"Service admin-web updated the post-login redirect helper and now reads project context after auth bootstrap. QA notes that login, logout, and tenant-missing flows all need one shared smoke pass after deploy.",
"Worker queue-scheduler has no code changes in this train but its cron definitions were regenerated yesterday. Operators want to avoid overlapping scheduler restarts with the migration step.",
"Billing service is not changing code in this train. The operational risk is backlog accumulation in the billing-webhook consumer if the identity rollout accidentally stalls shared Redis access.",
"Search service is receiving a schema-compatible event rename. The reindex job can backfill eventually, and product already accepted a temporary lag in search freshness during the train.",
"QA note: the fastest critical path is identity-api, then edge-proxy, then admin-web, then mobile smoke, then billing observation, then search verification. They do not want optional checks in front of auth safety checks.",
"Rollback note: if cookie validation fails after the proxy deploy, revert edge-proxy first and hold admin-web. Reverting admin-web alone leaves the browser storing the wrong redirect metadata.",
"Observability note: dashboard ORCH-REL-12 tracks tenant-scoped login success, billing-webhook lag, and search event age in one board. Release managers prefer those metrics over raw pod restart counts.",
"Dependency note: the deploy tool can stage identity-api and edge-proxy in separate waves, but shared contract changes mean contracts/user-profile.json must land before admin-web is exposed to users.",
"Comms note: support has a saved macro for minor search delay, but no macro for failed billing state during a customer demo. Business risk is therefore asymmetric toward queue health.",
"Infra note: the release train uses one database migration transaction and one feature-flag flip. Operators only want one irreversible step, and they want it late enough that rollback still exists before then."
]
}
},
{
"id": "context_edge_worker_dispatch_matrix",
"title": "Context Edge Worker Dispatch Matrix",
"category": "worker_coordination",
"format_rule": "json_dict",
"num_predict": 650,
"required_markers": [
"resolve_context.go",
"20260408_add_job_owner.sql",
"toolset_registry.py",
"status.json",
"rebase",
"ops-2"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your worker-coordination answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"orch-3",
"worker-8",
"ops-2"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are coordinating a mixed worker wave across Earth, TruthGraph, and MyServers. The packet is intentionally long because the real risk is file overlap and sequencing, not raw task count.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a dispatch packet with output keys exactly: stalled_workstreams, safe_parallel_groups, files_with_conflict_risk, required_rebases, first_messages_to_send, done_definition. Constraints: mention resolve_context.go, 20260408_add_job_owner.sql, toolset_registry.py, status.json, rebase order, and the ops-2 bottleneck.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 1000,
"minimum_context_tokens": 2048,
"record_prefix": "Dispatch packet item",
"context_intro": "This packet merges worker handoffs, dirty-file reports, and operator availability notes. Every worker owns a different slice, but shared files and sequencing make or break the wave.",
"anchors": {
"early": [
"orch-3: branch tg-query-cleanup and orch-7: branch tg-doc-ingest both touch TruthGraph/internal/query/resolve_context.go. They cannot land independently without a reconciliation pass.",
"worker-5: scheduler ownership cleanup depends on migration 20260408_add_job_owner.sql. Any code merge before the migration lands will leave mixed owner semantics in runtime views."
],
"middle": [
"worker-2: tests are green, but docs/contracts/status.json still reflects the old rollout states. Snapshot tests downstream will churn if the contract file is not refreshed before merge.",
"worker-8: already cherry-picked part of worker-1. Rebase order matters now because tool names were renamed in one branch and only documented in the other."
],
"late": [
"ops-2: the only human with production shell access before 08:00. Anything needing live verification or cron edits must line up behind that window.",
"worker-4: can unblock three others by landing toolset_registry.py first. Until that file stabilizes, downstream command manifests will keep conflicting."
]
},
"records": [
"worker-1 is updating telemetry/build_python_overnight_mini_report.py and the JSON summary contract consumed by the manual page. Their branch also renames one latency field used by dashboards.",
"worker-2 is on TruthGraph/docs plus a small code touch in cmd/truthgraph/status.go. The branch is mostly docs but accidentally edits one shared enum name in the CLI output helper.",
"worker-3 is improving the MyServers cron installer for one-time jobs. Their changes are isolated except for touching a shared helper that prints UTC timestamps for wrapper scripts.",
"worker-4 is consolidating tool declarations in toolset_registry.py. Multiple downstream branches imported old names directly instead of using the registry.",
"worker-5 is adding explicit owner fields to scheduler jobs and matching database rows. The migration is written but has not been reviewed against existing null rows.",
"worker-6 is editing operator docs and runbooks. They do not block code merges directly, but they own the wording that gets copied into incident channels during rollout.",
"worker-7 is adjusting model-routing defaults for Hermes and Discord. Their branch changes both config defaults and one reconnect warning string in gateway/run.py.",
"worker-8 is on lightweight dashboard polish but already cherry-picked worker-1's field rename to unblock local screenshots. Their branch now contains an older copy of the report schema.",
"orch-1 wants the final wave to preserve linear, reviewable commits. They explicitly do not want one mega-merge that hides ordering mistakes.",
"orch-2 notes that MyServers and Earth can merge independently unless the status contract is changed. If status.json shifts shape, the report builder and dashboards need to move together.",
"Test note: the riskiest shared files are resolve_context.go, toolset_registry.py, status.json, and the migration plus scheduler read path. Everything else is secondary.",
"Communications note: developers are online all morning, but only ops-2 can approve production crontab edits before the normal business day starts."
]
}
},
{
"id": "context_edge_scheduler_incident_forensics",
"title": "Context Edge Scheduler Incident Forensics",
"category": "debugging",
"format_rule": "json_dict",
"num_predict": 700,
"required_markers": [
"ACK_AFTER_WRITE",
"deadline_seconds=45",
"clock_skew_ms",
"retry_id",
"ack",
"duplicate",
"2026.04.08-rc3"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your incident answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"INC-104",
"TRACE-22",
"DB-19"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are reading a long incident packet for a Python scheduler service that produced duplicate downstream outputs. Several clues are noisy; only some of them matter.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce an incident packet with output keys exactly: primary_failure, evidence_chain, misleading_signals, immediate_mitigation, durable_fix, verification_sequence. Constraints: mention ACK_AFTER_WRITE, deadline_seconds=45, clock_skew_ms, retry_id, duplicate ack, and deploy 2026.04.08-rc3.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 1100,
"minimum_context_tokens": 2048,
"record_prefix": "Incident packet item",
"context_intro": "This packet combines deploy notes, logs, traces, metrics, and operator comments from a duplicate-output incident in a Python scheduler pipeline. Not every warning is causal.",
"anchors": {
"early": [
"INC-104: the first duplicate outputs started only after deploy 2026.04.08-rc3 changed ACK_AFTER_WRITE from false to true in the scheduler worker configuration.",
"LOG-13: clock_skew_ms spiked to 1820 on one host, but duplicates had already begun before NTP finished correcting the clock warning."
],
"middle": [
"TRACE-22: retry_id increments before the original worker calls ack(), so two workers can believe they own the same logical job when the deadline expires.",
"CFG-07: deadline_seconds=45 replaced the previous value of 90 in the same deploy, shrinking the time between write completion and retry pickup."
],
"late": [
"DB-19: the database write commits successfully, then a second worker acks the same logical job after the retry path already re-issued it.",
"OPS-31: restarting Redis reduced queue noise and warning volume, but duplicate downstream outputs continued afterward."
]
},
"records": [
"The scheduler processes one logical job per payload_id and writes a completion row before acking the queue lease. Prior to rc3, ack happened first and the write path was shorter.",
"Metrics packet: queue lag rose mildly during the incident, but CPU and memory stayed within normal range. The most visible symptom to customers was duplicate email and webhook fan-out.",
"Operator note: one host emitted noisy clock warnings, which pulled attention toward NTP first. A later cross-host trace showed duplicate ownership on hosts without clock issues.",
"Deploy note: rc3 also changed retry logging verbosity and added one trace span around downstream fan-out. That made the incident look larger in logs but was not itself causal.",
"Trace note: original worker wrote success, paused in a post-write hook, then attempted ack. Retry worker acquired the lease after deadline expiration and re-issued fan-out with a new retry_id.",
"Safety note: the downstream consumer is idempotent for storage writes but not for customer notifications, which is why duplicates surfaced in email and webhook channels first.",
"Redis note: one operator restart reduced pending command backlog and made queue metrics calmer. No code paths changed and the duplicate symptom persisted.",
"Config note: deadline_seconds and ACK_AFTER_WRITE were rolled out together. There is no experiment isolating one from the other in production.",
"Postmortem draft: the service lacks a single ownership fence between write completion and lease acknowledgment. Retry semantics assume that ack or durable ownership happens first.",
"Verification note: operators want a fix that can be tested under a fake clock and a delayed post-write hook so the race becomes deterministic in CI.",
"Rollback note: reverting only the retry logging changes would be meaningless. The risky part of rc3 is the ordering change plus the tighter deadline.",
"Customer note: the biggest harm was duplicate human-facing notifications, not raw queue delay. Mitigation must stop duplicate fan-out quickly even if throughput drops."
]
}
},
{
"id": "context_edge_ingest_requirements_contract",
"title": "Context Edge Ingest Requirements Contract",
"category": "structured_extraction",
"format_rule": "json_dict",
"num_predict": 600,
"required_markers": [
"ingestion_mode",
"retry_budget",
"quarantine_rule",
"required_artifacts",
"owner_escalation",
"privacy_constraint",
"rollout_gate",
"kill_switch"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your contract extraction answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"REQ-03",
"POL-11",
"OPS-28"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are extracting a deployment contract for a Python ingestion pipeline from a long mixed packet of requirements, policy notes, rollout notes, and operator reminders.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: return a JSON object with exactly these keys and no extras: ingestion_mode, retry_budget, quarantine_rule, required_artifacts, owner_escalation, privacy_constraint, rollout_gate, kill_switch. Constraints: stay literal, prefer exact values over paraphrase, and do not invent unstated defaults.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 900,
"minimum_context_tokens": 2048,
"record_prefix": "Contract packet item",
"context_intro": "Mixed contract packet for a Python ingestion system. Some details are binding requirements, some are background. The task is to extract only the binding contract values.",
"anchors": {
"early": [
"REQ-03: retry budget is exactly 2 automatic retries after the first failed attempt. A third retry is forbidden because it can duplicate external side effects.",
"REQ-05: ingestion_mode is shadow_then_stream. The pipeline must begin in shadow mode, prove parity, and only then flip to streamed writes."
],
"middle": [
"POL-11: any payload containing raw customer email addresses goes to quarantine bucket pii-review and must not be summarized into human-readable incident reports.",
"ART-07: required_artifacts are manifest.json, validation_report.json, and trace.txt for every promoted ingest run."
],
"late": [
"OPS-28: the kill switch is environment variable INGEST_STOP_AFTER_DOWNLOAD=1 and it must stop promotion before parsing or persistence.",
"OWN-04: owner escalation goes to platform-oncall first, then data-infra lead only if the incident lasts more than 30 minutes."
]
},
"records": [
"Design note: the system ingests partner dumps, validates rows, stages transformed objects, and only later publishes promoted records. Teams want one contract that product and ops can both read.",
"Rollout note: shadow mode exists because partner dumps are often messy. The team wants hard evidence that counts and hashes line up before streamed writes go live.",
"Validation note: row-level errors may be sampled for debugging, but privacy guidance forbids copying raw customer email into broad operator summaries or public incident channels.",
"Ops note: when promotion is blocked, the run still needs complete artifacts so responders can debug without rerunning the partner dump immediately.",
"Noise note: one historical document recommended three retries for network flaps, but that advice pre-dates the current side-effect model and is no longer authoritative.",
"Escalation note: data-infra lead helps on sustained incidents, but the first operational owner is always the platform-oncall rotation because they control the promotion switch.",
"Runbook note: the kill switch exists to stop damage after download if a bad dump arrives. It should preserve downloaded evidence while preventing parse and write phases.",
"Compliance note: the quarantine bucket name is fixed because downstream cleanup tooling keys off pii-review and nothing else.",
"Artifact note: analysts depend on manifest.json, validation_report.json, and trace.txt when they compare shadow and stream runs. Missing any one of them blocks promotion approval.",
"Product note: streamed writes are the end state, but leadership explicitly wants a visible shadow phase first, not an immediate cutover."
]
}
},
{
"id": "context_edge_ollama_runbook_migration_brief",
"title": "Context Edge Ollama Runbook Migration Brief",
"category": "documentation",
"format_rule": "numbered_plan_4",
"num_predict": 480,
"required_markers": [
"OLLAMA_HOST=0.0.0.0:11434",
"OLLAMA_MAX_LOADED_MODELS=3",
"curl http://SERVER_IP:11434/api/tags",
"hermes gateway restart"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your migration brief. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"CFG-21",
"NET-08",
"BOT-14"
],
"followup_format_rule": "three_bullets",
"prompt": "Return exactly four numbered lines and nothing else. Each line must be one migration step for an operator moving from local-only Ollama to a remote Ollama plus Discord/Hermes setup.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nRequirements: Step 1 must be a precheck, step 2 must be the server cutover, step 3 must be verification from the developer machine, and step 4 must be the bot/client reconnection step. Mention OLLAMA_HOST=0.0.0.0:11434, OLLAMA_MAX_LOADED_MODELS=3, curl http://SERVER_IP:11434/api/tags, and hermes gateway restart.",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 700,
"minimum_context_tokens": 2048,
"record_prefix": "Runbook packet item",
"context_intro": "Migration packet for exposing a previously local-only Ollama server to remote clients while keeping the setup supportable for Aider and Hermes.",
"anchors": {
"early": [
"CFG-21: systemd override must set OLLAMA_HOST=0.0.0.0:11434, OLLAMA_NUM_PARALLEL=1 or another deliberate value, OLLAMA_MAX_LOADED_MODELS=3, and OLLAMA_KEEP_ALIVE=24h.",
"CFG-24: after editing override.conf, operators must run systemctl daemon-reload and restart ollama before testing anything else."
],
"middle": [
"NET-08: open port 11434 only from the developer machine IP when possible. A wide-open firewall rule is simpler but explicitly less safe.",
"NET-11: curl http://localhost:11434/api/tags on the server is not enough; the runbook must also include curl http://SERVER_IP:11434/api/tags from the developer machine."
],
"late": [
"BOT-14: Hermes should not be restarted until the remote tags endpoint works. Otherwise Discord symptoms look like bot errors when the real issue is Ollama reachability.",
"BOT-19: after the endpoint is healthy, hermes gateway restart is the final reconnect step so Discord and custom endpoint settings are refreshed."
]
},
"records": [
"Server baseline: Ollama is installed and running, but historically bound only to localhost. The operator wants to serve remote Aider and Hermes without turning the box into an open relay.",
"Model baseline: the desired operating set is a small router, a medium orchestrator, and one heavier coding worker. OLLAMA_MAX_LOADED_MODELS=3 exists to keep the three hottest models around without pretending all can stay resident.",
"Firewall note: UFW may be inactive on a fresh VPS, in which case adding the rule alone changes nothing until UFW is enabled or provider-side firewall rules are also correct.",
"Developer-machine note: direct curl smoke tests are faster and less ambiguous than jumping straight into Hermes, because they isolate network reachability from agent wrapper behavior.",
"Aider note: ~/.aider.conf.yml should point at http://SERVER_IP:11434/v1 with API key set to ollama. That config proves the remote OpenAI-compatible surface is working before complex agents are blamed.",
"Hermes note: custom endpoint setup requires the same base URL and a model string. Discord is only useful after the base endpoint already responds from the laptop.",
"Rollback note: if remote access fails, revert the systemd override and firewall rule before touching client configs. Otherwise client debugging starts from a broken server assumption.",
"Verification note: the strongest smoke test order is server-local tags, laptop-visible tags, one short chat completion, then Aider or Hermes reconnect."
]
}
},
{
"id": "context_edge_python_context_budget_module",
"title": "Context Edge Python Context Budget Module",
"category": "coding",
"format_rule": "python_module",
"num_predict": 900,
"required_markers": [
"def utc_now",
"def estimate_token_count",
"def target_prompt_tokens",
"def assemble_context_packet",
"def prompt_sha256",
"hashlib",
"typing"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your code answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"REQ-CTX-01",
"FAIL-CTX-07",
"OPS-CTX-12"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only Python code. Write one self-contained module named context_budget.py. The module must expose utc_now(), estimate_token_count(text), target_prompt_tokens(max_context_tokens, band_fraction, reserved_output_tokens), assemble_context_packet(intro, early, middle, late, records, target_tokens, record_prefix='Packet item'), and prompt_sha256(text). Use only the standard library, include type hints, keep the behavior deterministic, and do not emit markdown fences.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 1400,
"minimum_context_tokens": 2048,
"record_prefix": "Design packet item",
"context_intro": "Design packet for a reusable context-budget helper module intended for benchmark runners and agent wrappers that need deterministic long-prompt assembly plus debuggable metadata.",
"anchors": {
"early": [
"REQ-CTX-01: the module must expose target_prompt_tokens(max_context_tokens, band_fraction, reserved_output_tokens) so context bands are reproducible instead of hand-tuned.",
"REQ-CTX-03: estimate_token_count can be approximate but must be deterministic, cheap, and based only on the input text."
],
"middle": [
"FAIL-CTX-07: a previous Hermes replay consumed a huge prompt, returned finish_reason stop, and produced empty content. Debugging required a prompt hash plus preview and tail slices.",
"FAIL-CTX-09: repeated records are acceptable when stretching a packet, but their ordering must be deterministic or telemetry comparisons become meaningless."
],
"late": [
"OPS-CTX-12: helper output and timestamps must stay human-facing and UTC-friendly because operators debug these suites from terminal logs, not notebooks.",
"OPS-CTX-14: no third-party tokenizer dependency is allowed on the server path because benchmark scripts must run on a clean VPS without pip installs."
]
},
"records": [
"Implementation note: assemble_context_packet should accept intro plus early, middle, late anchor lists and a pool of repeating records. The output should grow until it roughly hits a target token budget.",
"Debug note: prompt_sha256 exists because storing every rendered prompt verbatim can waste disk. A hash plus preview and tail slices gives traceability without keeping giant files by default.",
"Operator note: utc_now should be a tiny helper returning one stable UTC format so benchmark logs across scripts line up naturally.",
"Reliability note: target_prompt_tokens should guard against impossible inputs such as negative reserved output tokens or a band fraction outside the open interval from 0 to 1.",
"Performance note: estimate_token_count should be good enough for shaping packets but not so clever that it becomes the slowest part of the run.",
"Code style note: type hints matter because downstream scripts may import this helper. A small dataclass is fine, but the interface should remain simple and standard-library only.",
"Telemetry note: deterministic packet assembly makes it possible to compare models honestly because the prompt content is the same for every model once the cap and band are fixed.",
"Failure note: previous runs showed that long prompts can fail in clean-looking ways, including empty assistant text. The module therefore needs affordances for reproducible reconstruction."
]
}
},
{
"id": "context_edge_pytest_scheduler_retry_regression",
"title": "Context Edge Pytest Scheduler Retry Regression",
"category": "tests",
"format_rule": "pytest_code",
"num_predict": 1000,
"required_markers": [
"def test_",
"monkeypatch",
"retry_id",
"ACK_AFTER_WRITE",
"deadline_seconds",
"assert"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your test answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"BUG-ACK-01",
"TRACE-ACK-09",
"VERIFY-ACK-22"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only Python code. Write one focused pytest module for the duplicate-ack scheduler regression described in the packet. Requirements: include one deterministic test with monkeypatch or fakes, model the retry_id race, assert that only one logical job commit wins, and make the failure impossible to miss in CI. Use only standard pytest patterns and do not wrap the answer in markdown fences.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 1500,
"minimum_context_tokens": 2048,
"record_prefix": "Regression packet item",
"context_intro": "Regression packet for a Python scheduler service where retry timing and ack ordering can duplicate downstream side effects. The target is one surgical pytest module, not a whole test suite.",
"anchors": {
"early": [
"BUG-ACK-01: duplicate downstream outputs started after ACK_AFTER_WRITE=true shipped in rc3. The regression test must exercise that ordering change directly.",
"BUG-ACK-03: deadline_seconds tightened from 90 to 45 in the same release, making the retry pickup easier to trigger."
],
"middle": [
"TRACE-ACK-09: the retry worker increments retry_id before the original worker calls ack(), so the test needs two ownership paths and one delayed ack.",
"TRACE-ACK-14: the original write succeeds before the retry starts fan-out, which is why the bug is duplicate side effects rather than missing persistence."
],
"late": [
"VERIFY-ACK-22: the regression test must prove that only one logical job commit and one notify path are treated as authoritative after the fix.",
"VERIFY-ACK-24: a fake clock or explicit delay hook is required so the race is deterministic instead of relying on sleeping threads."
]
},
"records": [
"Service behavior: one worker writes job completion and a downstream notify record, then acknowledges the queue lease. Retry logic watches deadline expiry and can spawn a second worker for the same logical payload.",
"Historical assumption: ack happened before write, so retry pickup rarely overlapped a durable write. After rc3 that assumption no longer holds.",
"Testing note: a good regression harness can stub the notifier and collect emitted payload_ids. Duplicate notification is easier to assert than raw queue internals.",
"Operator note: Redis restarts and clock warnings were noisy but non-causal. The test should focus on ordering and ownership, not infrastructure flakiness.",
"Implementation note: a fake clock or injectable now() hook is preferred over thread sleeps because CI latency is too variable for a race test.",
"Acceptance note: if the fix works, either the retry worker or the original worker should stand down cleanly, but never both proceed to external notify.",
"CI note: the test should fail loudly with a short diff if duplicate notification happens. Silent counting helpers are harder to trust in review.",
"Code review note: a focused module with one strong regression test is worth more than many weak permutations for this specific benchmark."
]
}
},
{
"id": "context_edge_change_review_packet",
"title": "Context Edge Change Review Packet",
"category": "review",
"format_rule": "json_dict",
"num_predict": 700,
"required_markers": [
"gateway/run.py",
"run_python_task_suite.py",
"status.json",
"20260408_add_job_owner.sql",
"discord",
"migration"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your review answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"REV-03",
"REV-17",
"REV-29"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are reviewing a large mixed diff packet that spans Python services, telemetry tooling, Discord gateway behavior, documentation, and one database migration.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a review packet with output keys exactly: likely_regressions, riskiest_files, missing_tests, rollout_risk, safe_merge_condition. Constraints: mention gateway/run.py, run_python_task_suite.py, status.json, 20260408_add_job_owner.sql, and Discord or gateway behavior where relevant.",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 1100,
"minimum_context_tokens": 2048,
"record_prefix": "Review packet item",
"context_intro": "Mixed diff packet assembled from review summaries, file notes, test output snippets, and rollout comments. The challenge is to surface concrete regressions instead of repeating generic code-review advice.",
"anchors": {
"early": [
"REV-03: migration 20260408_add_job_owner.sql adds a non-null job_owner column without a documented backfill for existing rows. That can fail immediately on populated databases.",
"REV-06: scheduler read paths were updated in code, but one admin query still selects the old nullable shape in a report view."
],
"middle": [
"REV-17: gateway/run.py changed reconnect behavior for stale provider responses, but no test proves how Discord handles empty assistant content or one clean timeout followed by a retry.",
"REV-21: run_python_task_suite.py now records prompt hashes and context metadata, yet no report-level check verifies the new keys are preserved."
],
"late": [
"REV-29: the report builder renamed one status field, but dashboards and status.json examples in docs still use the old key. That will silently break HTML rendering if merged together.",
"REV-34: rollout notes assume the migration and report schema can deploy independently, but the dashboard pull path still reads both in the same morning workflow."
]
},
"records": [
"Diff summary: gateway/run.py now waits longer before declaring the provider stale, and the Discord adapter emits one new reconnect warning line. No new fixture captures an empty successful response body.",
"Telemetry summary: run_python_task_suite.py expanded to support context-stress prompts, custom follow-up prompts, and prompt metadata files. The status markdown and report builder were only partially updated.",
"Migration summary: 20260408_add_job_owner.sql introduces explicit ownership on scheduled jobs so the UI can attribute work cleanly. The migration note mentions new writes, not legacy rows.",
"Dashboard summary: one HTML manual still expects the previous status key name from status.json and has no assertion on unknown-field fallback.",
"Docs summary: operator docs were refreshed for the new Discord reconnect wording, but one screenshots guide still references the prior warning text verbatim.",
"Review note: the riskiest interactions are schema plus runtime reads, and telemetry JSON plus dashboard consumption. Pure doc edits are comparatively safe.",
"Testing note: there are unit tests around scheduler ownership writes and a separate smoke test for Discord login, but nothing that exercises both the new reconnect path and empty assistant content.",
"Rollout note: support wants the dashboard alive on the same morning the migration lands. That makes silent telemetry-key drift more expensive than a normal internal-only contract change."
]
}
}
]
}

@@ -0,0 +1,310 @@
{
"suite_name": "overnight-python-telemetry-v2-real-context",
"version": "2.0",
"purpose": "A deterministic overnight suite for evaluating big and small Ollama models on vps50 with harder multi-file prompts shaped after Slobodan's real implementation, review, debugging, and orchestration asks.",
"models": [
{
"model": "qwen32-coder-32k",
"display_name": "Qwen32 Coder 32k",
"size_label": "32b"
},
{
"model": "qwen14-coder-32k",
"display_name": "Qwen14 Coder 32k",
"size_label": "14b"
},
{
"model": "codestral-32k",
"display_name": "Codestral 32k",
"size_label": "22b"
},
{
"model": "codellama34-16k",
"display_name": "CodeLlama 34 16k",
"size_label": "34b"
},
{
"model": "phind34-16k",
"display_name": "Phind 34 16k",
"size_label": "34b"
},
{
"model": "qwen14-general-32k",
"display_name": "Qwen14 General 32k",
"size_label": "14b"
},
{
"model": "qwen2.5-coder:3b",
"display_name": "Qwen2.5 Coder 3B",
"size_label": "3b"
},
{
"model": "qwen2.5-coder:1.5b",
"display_name": "Qwen2.5 Coder 1.5B",
"size_label": "1.5b"
},
{
"model": "qwen2.5:3b",
"display_name": "Qwen2.5 3B",
"size_label": "3b"
},
{
"model": "llama3.2:3b",
"display_name": "Llama 3.2 3B",
"size_label": "3b"
},
{
"model": "phi3",
"display_name": "Phi-3 Mini",
"size_label": "3.8b"
}
],
"questions": [
{
"id": "myboard_auth_redirect_triage",
"title": "MyBoard Auth Redirect Triage",
"category": "debugging",
"format_rule": "json_dict",
"num_predict": 700,
"context_files": [
"MyBoard/app/api.py",
"MyBoard/web/nuxt-app/app/composables/useSession.ts",
"MyBoard/web/nuxt-app/app/middleware/auth.global.ts",
"MyBoard/web/nuxt-app/app/pages/login.vue",
"MyBoard/tests/browser/flow-coverage-manifest.json",
"MyBoard/docs/user flows/access-onboarding-and-account-flows.md"
],
"prompt": "Return only JSON. You are debugging a real MyBoard issue where password login can succeed but the user still lands on the wrong route or loses tenant context.\nTask: produce a repo-grounded triage packet.\nContext files:\n1. MyBoard/app/api.py: exposes /auth/login and /auth/me.\n2. MyBoard/web/nuxt-app/app/composables/useSession.ts: loginWithPassword(), setSession(), fetchContext(), applyTenant(), applyProject().\n3. MyBoard/web/nuxt-app/app/middleware/auth.global.ts: redirects to /login, /tenant-missing, /403 and reads myboard:post-login-redirect.\n4. MyBoard/web/nuxt-app/app/pages/login.vue: saveRedirectTarget(), redirectAfterLogin(), resolveOidcErrorMessage().\n5. MyBoard/tests/browser/flow-coverage-manifest.json: login and tenant-missing flows are acceptance coverage.\n6. MyBoard/docs/user flows/access-onboarding-and-account-flows.md: expected post-login workspace behavior.\nOutput keys exactly: issue_summary, likely_root_causes, backend_touchpoints, frontend_touchpoints, tests_to_run, safe_fix_plan.\nConstraints: mention /auth/login, /auth/me, myboard:post-login-redirect, tenant-missing, X-Tenant, and do not invent files outside the context list.",
"required_markers": [
"/auth/login",
"/auth/me",
"useSession.ts",
"auth.global.ts",
"login.vue",
"tenant-missing",
"myboard:post-login-redirect",
"X-Tenant"
]
},
{
"id": "myboard_board_snapshot_regression_test",
"title": "Board Snapshot Regression Test",
"category": "tests",
"format_rule": "pytest_code",
"num_predict": 900,
"context_files": [
"MyBoard/app/api.py",
"MyBoard/app/models.py",
"MyBoard/tests/api/test_board_snapshot.py",
"MyBoard/tests/api/test_task_bulk_jobs.py",
"MyBoard/web/nuxt-app/app/composables/queries/boards.ts",
"MyBoard/web/nuxt-app/app/pages/board/[[projectSlug]].vue"
],
"prompt": "Return only Python code. Write one focused pytest module for a real MyBoard regression around /board/snapshot after bulk task assignment and lane movement.\nContext files:\n1. MyBoard/app/api.py: exposes /board/snapshot, /tasks/{task_id}, and bulk job endpoints.\n2. MyBoard/app/models.py: workflow statuses and lane ordering are defined here.\n3. MyBoard/tests/api/test_board_snapshot.py: current lane-count and project-scope coverage.\n4. MyBoard/tests/api/test_task_bulk_jobs.py: helper patterns for creating stories/tasks and checking snapshot sync after assign/move.\n5. MyBoard/web/nuxt-app/app/composables/queries/boards.ts: frontend expects board data to stay lane-consistent.\n6. MyBoard/web/nuxt-app/app/pages/board/[[projectSlug]].vue: board page consumes the snapshot.\nRequirements: include async auth helper, create at least one story and one task, move the task to session, verify /board/snapshot returns the task in the session lane, and assert assignee_ids survive the move. Do not invent endpoints outside the context list.",
"required_markers": [
"def test_",
"/board/snapshot",
"/tasks/",
"session",
"assignee_ids",
"story_id",
"api_client"
]
},
{
"id": "myboard_lane_config_patch_plan",
"title": "Lane Config Patch Plan",
"category": "planning",
"format_rule": "json_dict",
"num_predict": 700,
"context_files": [
"MyBoard/app/models.py",
"MyBoard/app/api.py",
"MyBoard/tests/api/test_lane_config.py",
"MyBoard/web/nuxt-app/app/composables/queries/lane-config.ts",
"MyBoard/web/nuxt-app/app/lib/workflow.ts",
"MyBoard/web/nuxt-app/app/pages/board/[[projectSlug]].vue"
],
"prompt": "Return only JSON. A new regression report says project lane overrides can drift from the canonical workflow and confuse the board page.\nTask: prepare a concrete patch plan.\nContext files:\n1. MyBoard/app/models.py: default_lane_sequence() and workflow enums are canonical.\n2. MyBoard/app/api.py: organization and project lane-config endpoints live here.\n3. MyBoard/tests/api/test_lane_config.py: round-trip and inheritance tests already exist.\n4. MyBoard/web/nuxt-app/app/composables/queries/lane-config.ts: frontend query behavior.\n5. MyBoard/web/nuxt-app/app/lib/workflow.ts: frontend lane semantics.\n6. MyBoard/web/nuxt-app/app/pages/board/[[projectSlug]].vue: board rendering depends on effective lanes.\nOutput keys exactly: regression_summary, invariants_to_protect, backend_changes, frontend_changes, tests_to_add, rollout_checks.\nConstraints: mention default_lane_sequence, use_organization_default, effective_lanes, /organizations/{organization_id}/lane-config, /projects/{project_id}/lane-config, and do not invent new persistence layers.",
"required_markers": [
"default_lane_sequence",
"use_organization_default",
"effective_lanes",
"/organizations/{organization_id}/lane-config",
"/projects/{project_id}/lane-config",
"test_lane_config.py"
]
},
{
"id": "myboard_api_token_audit_regression_test",
"title": "API Token Audit Regression Test",
"category": "tests",
"format_rule": "pytest_code",
"num_predict": 900,
"context_files": [
"MyBoard/app/api.py",
"MyBoard/app/models.py",
"MyBoard/tests/api/test_api_tokens.py",
"MyBoard/web/nuxt-app/app/composables/queries/api-tokens.ts",
"MyBoard/web/nuxt-app/app/pages/settings/api-tokens.vue",
"MyBoard/contracts/myboard-api.openapi.json"
],
"prompt": "Return only Python code. Write one pytest module that hardens the API token lifecycle against an audit-ordering regression.\nContext files:\n1. MyBoard/app/api.py: /api-tokens endpoints, regenerate, revoke, and audits.\n2. MyBoard/app/models.py: APIToken, APITokenAudit, APITokenAction.\n3. MyBoard/tests/api/test_api_tokens.py: existing lifecycle coverage and auth header pattern.\n4. MyBoard/web/nuxt-app/app/composables/queries/api-tokens.ts: frontend sorts audits descending by created timestamp.\n5. MyBoard/web/nuxt-app/app/pages/settings/api-tokens.vue: UI expects regenerated and revoked tokens to refresh correctly.\n6. MyBoard/contracts/myboard-api.openapi.json: contract surface must stay aligned.\nRequirements: include create, machine-use, regenerate, revoke, audit fetch, and assertions that CREATED, REGENERATED, and REVOKED are all present in audit history and the revoked token is inactive. Do not use code fences.",
"required_markers": [
"def test_",
"/api-tokens",
"/auth/me",
"APITokenAudit",
"REGENERATED",
"REVOKED",
"CREATED"
]
},
{
"id": "myboard_announcements_state_sync_review",
"title": "Announcements State Sync Review",
"category": "review",
"format_rule": "json_dict",
"num_predict": 700,
"context_files": [
"MyBoard/app/api.py",
"MyBoard/app/models.py",
"MyBoard/tests/api/test_announcements.py",
"MyBoard/web/nuxt-app/app/composables/queries/announcements.ts",
"MyBoard/web/nuxt-app/app/pages/announcements.vue",
"MyBoard/docs/app-feature-inventory.md"
],
"prompt": "Return only JSON. Review a suspected frontend/backend sync bug where announcement read and dismiss state can diverge after mark-all-read and list refresh.\nContext files:\n1. MyBoard/app/api.py: CRUD, read, unread, dismiss, undismiss, and mark-all-read endpoints.\n2. MyBoard/app/models.py: announcement read/dismiss persistence objects live here.\n3. MyBoard/tests/api/test_announcements.py: current API coverage.\n4. MyBoard/web/nuxt-app/app/composables/queries/announcements.ts: mergeAnnouncement(), mark-all-read cache behavior, include_dismissed handling.\n5. MyBoard/web/nuxt-app/app/pages/announcements.vue: UI depends on query cache correctness.\n6. MyBoard/docs/app-feature-inventory.md: announcements are a user-visible feature surface.\nOutput keys exactly: failure_modes, most_suspicious_cache_paths, backend_contract_checks, frontend_fix_options, regression_tests, rollout_risk.\nConstraints: mention mergeAnnouncement, include_dismissed, /announcements/mark-all-read, /announcements/{announcement_id}/dismiss, /announcements/{announcement_id}/read, and dismissed.",
"required_markers": [
"mergeAnnouncement",
"include_dismissed",
"/announcements/mark-all-read",
"/announcements/{announcement_id}/dismiss",
"/announcements/{announcement_id}/read",
"dismissed"
]
},
{
"id": "myboard_feature_flag_lifecycle_test",
"title": "Feature Flag Lifecycle Test",
"category": "tests",
"format_rule": "pytest_code",
"num_predict": 900,
"context_files": [
"MyBoard/app/api.py",
"MyBoard/app/models.py",
"MyBoard/tests/api/test_feature_flags.py",
"MyBoard/web/nuxt-app/app/composables/queries/feature-flags.ts",
"MyBoard/web/nuxt-app/app/pages/admin/index.vue",
"MyBoard/contracts/myboard-api.openapi.json"
],
"prompt": "Return only Python code. Write one pytest module for a real MyBoard feature-flag regression around environment toggles and history.\nContext files:\n1. MyBoard/app/api.py: feature-flag and feature-flag-environment endpoints.\n2. MyBoard/app/models.py: FeatureFlag, FeatureFlagEnvironment, FeatureFlagHistory, FeatureFlagState.\n3. MyBoard/tests/api/test_feature_flags.py: current lifecycle coverage.\n4. MyBoard/web/nuxt-app/app/composables/queries/feature-flags.ts: frontend expects detail and history cache to stay aligned.\n5. MyBoard/web/nuxt-app/app/pages/admin/index.vue: admin console consumes this data.\n6. MyBoard/contracts/myboard-api.openapi.json: response shapes must remain stable.\nRequirements: create two environments, create one flag, toggle dev to enabled with rollout percentage, fetch history, verify latest history action is toggle, verify non-admin toggle is rejected with 403, and verify delete cleanup. Do not invent helper libraries.",
"required_markers": [
"def test_",
"/feature-flags",
"/feature-flag-environments",
"/history",
"rollout_percentage",
"403",
"toggle"
]
},
{
"id": "myboard_task_bulk_job_debug_packet",
"title": "Task Bulk Job Debug Packet",
"category": "debugging",
"format_rule": "json_dict",
"num_predict": 750,
"context_files": [
"MyBoard/app/api.py",
"MyBoard/app/models.py",
"MyBoard/tests/api/test_task_bulk_jobs.py",
"MyBoard/web/nuxt-app/app/composables/queries/task-bulk.ts",
"MyBoard/web/nuxt-app/app/composables/queries/boards.ts",
"MyBoard/web/nuxt-app/app/components/ui/productivity/BulkActionToolbar.vue"
],
"prompt": "Return only JSON. A production-like report says bulk assign jobs complete, but some task detail panels and board lanes stay stale until a hard refresh.\nTask: produce a debug packet grounded in the repo.\nContext files:\n1. MyBoard/app/api.py: /tasks/bulk/jobs, /tasks/bulk/preview, /tasks/{task_id}, /board/snapshot.\n2. MyBoard/app/models.py: TaskBulkJob, TaskBulkJobEntry, task assignment fields.\n3. MyBoard/tests/api/test_task_bulk_jobs.py: happy-path completion and board sync tests.\n4. MyBoard/web/nuxt-app/app/composables/queries/task-bulk.ts: invalidateBulkAffectedTaskCaches() and polling behavior.\n5. MyBoard/web/nuxt-app/app/composables/queries/boards.ts: board query cache consumers.\n6. MyBoard/web/nuxt-app/app/components/ui/productivity/BulkActionToolbar.vue: user trigger surface.\nOutput keys exactly: suspected_root_causes, cache_invalidation_gaps, backend_checks, frontend_checks, additional_tests, smallest_safe_fix.\nConstraints: mention invalidateBulkAffectedTaskCaches, queryKeys.bulkJobs.detail, queryKeys.boards.lanes, /tasks/bulk/jobs/{job_id}, /board/snapshot, and assignee_ids.",
"required_markers": [
"invalidateBulkAffectedTaskCaches",
"queryKeys.bulkJobs.detail",
"queryKeys.boards.lanes",
"/tasks/bulk/jobs/{job_id}",
"/board/snapshot",
"assignee_ids"
]
},
{
"id": "myboard_user_preferences_contract_test",
"title": "User Preferences Contract Test",
"category": "tests",
"format_rule": "pytest_code",
"num_predict": 950,
"context_files": [
"MyBoard/app/api.py",
"MyBoard/app/models.py",
"MyBoard/app/services/preferences.py",
"MyBoard/tests/api/test_user_preferences.py",
"MyBoard/web/nuxt-app/app/stores/preferences.ts",
"MyBoard/web/nuxt-app/app/plugins/00-preferences-bootstrap.ts"
],
"prompt": "Return only Python code. Write one pytest module that strengthens the user-preferences contract around nested payload updates and theme preview.\nContext files:\n1. MyBoard/app/api.py: /user/preferences and /user/preferences/theme-preview endpoints.\n2. MyBoard/app/models.py: ThemeMode, ThemePreset, BoardViewPreference, UserPreferences.\n3. MyBoard/app/services/preferences.py: normalization and validation live here.\n4. MyBoard/tests/api/test_user_preferences.py: existing nested payload coverage.\n5. MyBoard/web/nuxt-app/app/stores/preferences.ts: frontend consumes the persisted shape.\n6. MyBoard/web/nuxt-app/app/plugins/00-preferences-bootstrap.ts: bootstrap path depends on stable defaults.\nRequirements: include auth helper, one successful nested update assertion, one invalid timezone assertion, one theme-preview non-persistence assertion, and direct checks for locale, theme preset, and board default lane. Do not emit markdown fences.",
"required_markers": [
"def test_",
"/user/preferences",
"/user/preferences/theme-preview",
"ThemePreset",
"locale",
"timezone",
"default_lane"
]
},
{
"id": "myboard_orchestration_timeline_forensics",
"title": "Orchestration Timeline Forensics",
"category": "forensics",
"format_rule": "json_dict",
"num_predict": 800,
"context_files": [
"MyBoard/app/api.py",
"MyBoard/app/models.py",
"MyBoard/tests/api/test_orchestration_events.py",
"MyBoard/docs/user flows/orchestration-and-dependency-api-flows.md",
"MyBoard/web/nuxt-app/app/pages/admin/index.vue",
"MyBoard/web/nuxt-app/app/composables/queries/meta.ts"
],
"prompt": "Return only JSON. You are investigating a real operator complaint: run history exists, but retry chains and handoff evidence are hard to explain from the admin surface.\nTask: produce a forensics packet.\nContext files:\n1. MyBoard/app/api.py: orchestration event, run, dependency, failure, and timeline endpoints.\n2. MyBoard/app/models.py: OrchestrationRun and related enums and evidence structures.\n3. MyBoard/tests/api/test_orchestration_events.py: canonical event ingestion, retry, and handoff timeline expectations.\n4. MyBoard/docs/user flows/orchestration-and-dependency-api-flows.md: user-visible operator flows.\n5. MyBoard/web/nuxt-app/app/pages/admin/index.vue: admin console surface.\n6. MyBoard/web/nuxt-app/app/composables/queries/meta.ts: operator metadata fetch patterns.\nOutput keys exactly: operator_problem_statement, timeline_questions_to_answer, endpoints_to_query, evidence_fields_that_matter, missing_tests, recommended_ui_improvements.\nConstraints: mention /orchestration/events, /orchestration/runs, /orchestration/dependencies, handoff_requested, run_failed, and retry chain.",
"required_markers": [
"/orchestration/events",
"/orchestration/runs",
"/orchestration/dependencies",
"handoff_requested",
"run_failed",
"retry"
]
},
{
"id": "truthgraph_ingest_log_triage",
"title": "TruthGraph Ingest Log Triage",
"category": "cross_repo_debugging",
"format_rule": "json_dict",
"num_predict": 800,
"context_files": [
"Earth/PHASE2_PROMPT_COMPLEXITY_METRIC_V1.md",
"TruthGraph/docs/TRUTHGRAPH_DOC_INGESTION_CONTRACT.md",
"TruthGraph/contracts/doc_ingest_manifest.schema.json",
"TruthGraph/internal/query/resolve_context.go",
"TruthGraph/internal/truthgraph/ingest/preflight/preflight.go",
"TruthGraph/cmd/truthgraph/status.go"
],
"prompt": "Return only JSON. This task mirrors Slobodan's real cross-repo debugging asks. Given a TruthGraph ingest run that discovers repositories but later produces stale or incomplete query answers, produce a triage packet.\nContext files:\n1. Earth/PHASE2_PROMPT_COMPLEXITY_METRIC_V1.md: prompts above threshold should be decomposed.\n2. TruthGraph/docs/TRUTHGRAPH_DOC_INGESTION_CONTRACT.md: intended doc-ingest behavior.\n3. TruthGraph/contracts/doc_ingest_manifest.schema.json: manifest contract surface.\n4. TruthGraph/internal/query/resolve_context.go: context resolution path.\n5. TruthGraph/internal/truthgraph/ingest/preflight/preflight.go: ingest preflight checks.\n6. TruthGraph/cmd/truthgraph/status.go: operator-visible status reporting.\nOutput keys exactly: observed_symptoms, likely_failure_surfaces, preflight_checks, status_gaps, code_paths_to_review, follow_up_commands.\nConstraints: mention doc_ingest_manifest.schema.json, resolve_context, preflight, status, stale index, and prompt complexity. Do not invent files outside the context list.",
"required_markers": [
"doc_ingest_manifest.schema.json",
"resolve_context",
"preflight",
"status",
"stale",
"prompt complexity"
]
}
]
}


@@ -0,0 +1,74 @@
{
"suite_name": "small-model-coding-eval-v1",
"version": "1.0",
"purpose": "A deterministic five-question coding and DevOps comparison suite for smaller Ollama models on vps50.",
"models": [
{
"model": "qwen2.5-coder:3b",
"display_name": "Qwen2.5 Coder 3B",
"size_label": "3b"
},
{
"model": "qwen2.5-coder:1.5b",
"display_name": "Qwen2.5 Coder 1.5B",
"size_label": "1.5b"
},
{
"model": "qwen2.5:3b",
"display_name": "Qwen2.5 3B",
"size_label": "3b"
},
{
"model": "llama3.2:3b",
"display_name": "Llama 3.2 3B",
"size_label": "3b"
},
{
"model": "phi3",
"display_name": "Phi-3 Mini",
"size_label": "3.8b"
}
],
"questions": [
{
"id": "disk_guard_bash",
"title": "Disk Guard Script",
"category": "shell",
"prompt": "Return only Bash code. Write a script that checks disk usage for /, prints a human-readable warning, and exits with status 1 when usage is above 85 percent. Requirements: include a shebang, use df -P /, parse the numeric percentage, and keep the script production-safe.",
"required_markers": ["#!/usr/bin/env bash", "df -P /", "85", "exit 1"],
"format_rule": "bash_code"
},
{
"id": "ipv4_python_tests",
"title": "IPv4 Validator",
"category": "python",
"prompt": "Return only Python code. Write a function named is_valid_ipv4(value: str) -> bool and include exactly three pytest tests that cover a valid address, an out-of-range octet, and a non-numeric input.",
"required_markers": ["def is_valid_ipv4", "def test_", "assert", "split('.')"],
"format_rule": "python_code"
},
{
"id": "nginx_safe_reload",
"title": "Nginx Safe Reload",
"category": "ops",
"prompt": "Return only Bash commands, one per line. Back up /etc/nginx/nginx.conf, validate nginx config, and reload nginx only if validation passes.",
"required_markers": ["cp /etc/nginx/nginx.conf", "nginx -t", "systemctl reload nginx", "&&"],
"format_rule": "shell_lines"
},
{
"id": "yaml_cli_plan",
"title": "YAML Validator Plan",
"category": "planning",
"prompt": "Return exactly four numbered steps. Plan a Python CLI that scans a git repo for changed YAML files, validates them against a JSON schema, and exits nonzero on failure.",
"required_markers": ["1.", "2.", "3.", "4.", "JSON schema", "git"],
"format_rule": "four_numbered_steps"
},
{
"id": "ssh_lockout_triage",
"title": "SSH Lockout Triage",
"category": "debugging",
"prompt": "Return exactly five bullet points. After hardening, SSH started returning Permission denied (publickey,password). List the safest first checks before changing config. Mention sshd_config, authorized_keys, journalctl, rollback, and PasswordAuthentication.",
"required_markers": ["sshd_config", "authorized_keys", "journalctl", "rollback", "PasswordAuthentication"],
"format_rule": "five_bullets"
}
]
}


@@ -0,0 +1,29 @@
{
"schema_version": "hardware-1.0",
"device_tag": "mac-m1-8gb",
"manufacturer_model": "Apple MacBook Air (MacBookAir10,1); example, not a real submission",
"os": {"name": "macOS", "version": "14.5", "kernel": "23.5.0"},
"cpu": {
"name": "Apple M1",
"cores": 8,
"threads": 8,
"max_ghz": 3.2,
"arch": "arm64",
"isa": ["NEON"]
},
"memory_gb_total": 8,
"memory_gb_available_at_run_start": 4.2,
"gpu": [
{
"name": "Apple M1 GPU",
"kind": "integrated",
"vram_gb": null,
"driver": "Metal/macOS-14",
"compute_cap": null
}
],
"storage": {"kind": "ssd", "free_gb_at_run_start": 220},
"thermal_or_power_notes": "default OS thermal mgmt; on AC power throughout the run; no swap pressure observed",
"network_used_for_model_fetch": "wifi-100mbps (only used for `ollama pull` before benchmark; not on the timing path)",
"container_or_vm": null
}


@@ -0,0 +1,23 @@
{
"schema_version": "manifest-1.0",
"run_id": "00000000-0000-0000-0000-000000000000",
"harness_version": "public-1",
"submitter_handle": "EXAMPLE",
"device_tag": "mac-m1-8gb",
"cell_id_prefix": "mac-m1:ollama",
"target_url": "http://127.0.0.1:11434",
"phases_run": ["hello", "5q", "20q"],
"models_run": ["qwen3.5:0.8b"],
"canonical_options": {
"temperature": 0.1,
"num_ctx": 4096,
"num_predict": 2048
},
"canonical_options_overrides": {},
"timeout_seconds": 360,
"started_at_utc": "2026-05-12T14:32:11Z",
"host_hostname_short": "alices-mbp",
"platform_system": "Darwin",
"platform_release": "23.5.0",
"python_version": "3.12.4"
}


@@ -0,0 +1,58 @@
{
"schema_version": "metadata-1.0",
"run_id": "00000000-0000-0000-0000-000000000000",
"submitter_handle": "EXAMPLE",
"device_tag": "mac-m1-8gb",
"computed_at_utc": "2026-05-12T14:48:30Z",
"computed_by": "agent (Claude Code 4.6) — see run.md §Methodology",
"cells": [
{
"cell_id": "mac-m1:ollama:qwen3.5:0.8b",
"phase": "hello",
"n_calls": 1,
"n_errors": 0,
"duration_ms_p50": 1847,
"duration_ms_p95": 1847,
"duration_ms_mean": 1847,
"tokens_per_sec_p50": 22.74,
"tokens_per_sec_p95": 22.74,
"tokens_per_sec_mean": 22.74,
"tokens_per_sec_max": 22.74,
"completion_tokens_total": 42,
"format_ok_rate": null,
"marker_hit_rate_mean": null
},
{
"cell_id": "mac-m1:ollama:qwen3.5:0.8b",
"phase": "5q",
"n_calls": 5,
"n_errors": 0,
"duration_ms_p50": 4210,
"duration_ms_p95": 7800,
"duration_ms_mean": 4900,
"tokens_per_sec_p50": 22.3,
"tokens_per_sec_p95": 18.7,
"tokens_per_sec_mean": 21.4,
"tokens_per_sec_max": 23.1,
"completion_tokens_total": 487,
"format_ok_rate": 0.8,
"marker_hit_rate_mean": 0.92
},
{
"cell_id": "mac-m1:ollama:qwen3.5:0.8b",
"phase": "20q",
"n_calls": 20,
"n_errors": 0,
"duration_ms_p50": 9612,
"duration_ms_p95": 41200,
"duration_ms_mean": 12180,
"tokens_per_sec_p50": 20.9,
"tokens_per_sec_p95": 7.4,
"tokens_per_sec_mean": 17.0,
"tokens_per_sec_max": 24.8,
"completion_tokens_total": 4280,
"format_ok_rate": 0.7,
"marker_hit_rate_mean": 0.78
}
]
}


@@ -0,0 +1,6 @@
{"type":"meta","benchmark_run_id":"00000000-0000-0000-0000-000000000000","harness_version":"public-1","started_at_utc":"2026-05-12T14:32:11Z","host_hostname_short":"alices-mbp","load_avg_start":[1.2,1.4,1.6],"target_url":"http://127.0.0.1:11434","cell_id_prefix":"mac-m1:ollama","submitter_handle":"EXAMPLE","device_tag":"mac-m1-8gb","execution_shape":"per-model-block","phases_planned":["hello","5q","20q"],"models_planned":["qwen3.5:0.8b"],"canonical_options":{"temperature":0.1,"num_ctx":4096,"num_predict":2048},"canonical_options_effective":{"temperature":0.1,"num_ctx":4096,"num_predict":2048},"timeout_seconds":360,"platform_system":"Darwin","platform_release":"23.5.0","python_version":"3.12.4"}
{"type":"call","ts_utc":"2026-05-12T14:32:13Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"hello","question_id":"hello_check","run_idx":0,"duration_seconds":1.847,"prompt_tokens":17,"completion_tokens":42,"tokens_per_second":22.74,"finish_reason":"stop","status_code":200,"response_chars":167,"response_preview":"Hello! Of course, I'd be happy to help. What can I assist you with today? Whether it's a question, a task, or just a chat, I'm here to help.","required_markers":[],"markers_hit":[],"marker_hit_rate":null,"format_rule":"","format_ok":null,"usable_answer":true,"error":null}
{"type":"call","ts_utc":"2026-05-12T14:32:21Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"5q","question_id":"Q1","run_idx":0,"duration_seconds":4.21,"prompt_tokens":58,"completion_tokens":94,"tokens_per_second":22.32,"finish_reason":"stop","status_code":200,"response_chars":312,"response_preview":"#!/usr/bin/env bash\nset -euo pipefail\n\nif [[ ! -d \"$1\" ]]; then\n echo \"err: $1 not a directory\" >&2\n exit 1\nfi\n\nfor f in \"$1\"/*.log; do\n [[ -e \"$f\" ]] || continue\n gzip -k \"$f\"\ndone","required_markers":["gzip","#!/usr/bin/env bash"],"markers_hit":["gzip","#!/usr/bin/env bash"],"marker_hit_rate":1.0,"format_rule":"bash_code","format_ok":true,"usable_answer":true,"error":null}
{"type":"call","ts_utc":"2026-05-12T14:33:48Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"20q","question_id":"Q01","run_idx":0,"duration_seconds":9.612,"prompt_tokens":142,"completion_tokens":201,"tokens_per_second":20.91,"finish_reason":"stop","status_code":200,"response_chars":784,"response_preview":"def is_valid_ipv4(addr: str) -> bool:\n parts = addr.split('.')\n if len(parts) != 4:\n return False\n for p in parts:\n if not p.isdigit():\n return False\n n = int(p)","required_markers":["is_valid_ipv4","def test_"],"markers_hit":["is_valid_ipv4","def test_"],"marker_hit_rate":1.0,"format_rule":"python_code","format_ok":true,"usable_answer":true,"error":null}
{"type":"call","ts_utc":"2026-05-12T14:36:42Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"20q","question_id":"Q14","run_idx":13,"duration_seconds":42.118,"prompt_tokens":189,"completion_tokens":312,"tokens_per_second":7.41,"finish_reason":"stop","status_code":200,"response_chars":1240,"response_preview":"To debug this MyBoard auth issue, the triage should focus on…","required_markers":["/auth/login","/auth/me","myboard:post-login-redirect","tenant-missing"],"markers_hit":["/auth/login","/auth/me","tenant-missing"],"marker_hit_rate":0.75,"format_rule":"json_dict","format_ok":false,"usable_answer":true,"error":null}
{"type":"footer","ts_utc":"2026-05-12T14:48:03Z","finished_at_utc":"2026-05-12T14:48:03Z","load_avg_end":[1.6,1.5,1.6]}


@@ -0,0 +1,92 @@
# EXAMPLE — mac-m1-8gb — qwen3.5:0.8b — 2026-05-12
> **This is a synthetic example so contributors can see the shape of a
> submission end-to-end. The numbers are plausible but not from a real run.
> Don't cite this directory in analysis. Don't copy-paste these numbers.
> Real submissions live alongside this folder under `submissions/<handle>/`.**
**Run ID:** `00000000-0000-0000-0000-000000000000`
**Submitter:** EXAMPLE (synthetic)
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.13 (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite. The parallel suites need ≥2 warm copies of the model, which did not fit in 8 GB of unified memory; the edge suites were skipped for time budget (they would have added ~30 min).
## Headline numbers
| Cell | Phase | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b | hello | 1 | 22.7 | 22.7 | 1.8 s | n/a |
| mac-m1:ollama:qwen3.5:0.8b | 5q | 5 | 21.4 | 22.3 | 4.2 s | 80% |
| mac-m1:ollama:qwen3.5:0.8b | 20q | 20 | 17.0 | 20.9 | 9.6 s | 70% |
## What I observed (qualitative)
- **Hello-call cold-start was fast** — 1.8 s including initial model load.
Ollama reports the 0.8B GGUF as ~600 MB; on Apple Silicon unified memory
this loads in well under 2 s.
- **5Q tasks were mostly handled cleanly**: four of the five formats
(bash, python, shell, four-numbered-steps, five-bullets) parsed
correctly. The exception was Q3 (`shell_lines`), where the model opened
with a `1.` numbered list instead of raw shell commands.
- **20Q tasks bifurcated** — the simple ones (Q01-Q08) ran at full
~20 tok/s with high format-correctness; the longer ones (Q09+ with
multi-paragraph context) saw throughput drop to ~12-15 tok/s, with
format_ok dropping to ~60%. p95 duration of 41 s was Q14 (the MyBoard
triage prompt — long context, mixed format).
- **No errors, no timeouts.** The run was on AC power throughout, with no
swap pressure observed.
## Methodology
Followed the canonical Pavilion methodology with these deviations:
- **NUM_PARALLEL=1** instead of canonical 3 — laptop, not server; one slot
is enough for sequential per-model-block execution.
- **KEEP_ALIVE=5m** instead of canonical 2400h — laptop, no need to pin.
- **Phases `parallel_same`, `parallel_mixed`, `edge_append`, `edge_suite`
skipped** — see top of file. Run not eligible for `flagship` grade,
intended as `standard`.
## Caveats
- 8 GB unified RAM is below the comfort floor for parallel suites with this
model; results above are NOT a refutation of the canonical parallel
numbers — they're from a different shape of run.
- macOS Spotlight indexing was disabled before the run started. If you
rerun without disabling, expect ~5-10% additional variance from
background I/O.
- `format_ok` rate of 70% on 20Q is consistent with Sloba's flagship 20Q
numbers for qwen3.5:0.8b on Pavilion (~74-78% in the v1 baseline) within
measurement noise.
## Reproducibility
```
ollama pull qwen3.5:0.8b
ollama serve # in a separate terminal
python3 harness/run_benchmark.py \
--target-url http://127.0.0.1:11434 \
--models qwen3.5:0.8b \
--cell-id-prefix mac-m1:ollama \
--phases hello,5q,20q \
--submitter-handle EXAMPLE \
--device-tag mac-m1-8gb
```
Took ~16 minutes wall-clock on this hardware.
## Privacy attestation
I scanned `run.jsonl` for personal paths, API tokens, SSH keys, and
home-directory leakage:
```
grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" \
submissions/EXAMPLE/mac-m1-8gb/run-00000000-0000-0000-0000-000000000000/*
```
No matches outside the SSH-troubleshooting prompt in 5Q (Q5), which is
intentional curriculum. Safe to ship.
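For contributors without `grep` on hand (for example on Windows), the same scan can be sketched in Python. This is a minimal illustration and not part of the harness: `scan_for_leaks` and `LEAK_PATTERN` are hypothetical names, and the pattern simply mirrors the grep expression above.

```python
import re
from pathlib import Path

# Mirrors the grep -E pattern above; case-sensitive, like grep without -i.
LEAK_PATTERN = re.compile(
    r"Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519"
)


def scan_for_leaks(run_dir):
    """Return (file name, line number, line) for every line matching LEAK_PATTERN."""
    hits = []
    for path in sorted(Path(run_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if LEAK_PATTERN.search(line):
                hits.append((path.name, lineno, line.strip()))
    return hits
```

Calling `scan_for_leaks(...)` on the run directory and printing each tuple gives the same `file:line` view as `grep -n`; every hit still needs a human decision about whether it is curriculum text or a real leak.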
— EXAMPLE (synthetic; not a real contributor)

submissions/README.md Normal file

@@ -0,0 +1,78 @@
# `submissions/`
Friends' benchmark contributions land here, one directory per submitter,
one subdirectory per device, one sub-subdirectory per run.
## Layout
```
submissions/
├── README.md                      # this file
├── EXAMPLE/                       # template; see below
│   └── mac-m1-8gb/
│       └── run-00000000-...-000000000000/
│           ├── manifest.json
│           ├── hardware.json
│           ├── run.jsonl
│           ├── metadata.json
│           └── run.md
├── alice/                         # first real friend's contributions
│   └── mac-m1-8gb/
│       └── run-<uuid>/...
└── bob/                           # etc.
    └── rtx-4090-pc/
        └── run-<uuid>/...
```
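As a rough illustration of the layout above, the sketch below builds the expected run directory path and reports which of the five files in the tree are missing. `run_dir_for` and `missing_files` are hypothetical helpers written for this README, not functions from `run_benchmark.py`.

```python
from pathlib import Path

# File names taken from the tree above.
EXPECTED_FILES = (
    "manifest.json",
    "hardware.json",
    "run.jsonl",
    "metadata.json",
    "run.md",
)


def run_dir_for(root, handle, device_tag, run_id):
    """Build <root>/submissions/<handle>/<device-tag>/run-<uuid>."""
    return Path(root) / "submissions" / handle / device_tag / f"run-{run_id}"


def missing_files(run_dir):
    """Return the expected file names that are absent from run_dir."""
    return [name for name in EXPECTED_FILES if not (Path(run_dir) / name).is_file()]
```

An empty list from `missing_files(...)` means the directory has the full set; it says nothing about whether the contents are valid.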
## Per-submission contents
Five files inside each `run-<uuid>/`:
- **`manifest.json`** — automatic; `run_benchmark.py` writes it at run start. Contains submitter handle, device tag, target URL, model list, phase plan, canonical-options overrides, host hostname (short), platform, started-at timestamp.
- **`hardware.json`** — agent fills from a hardware probe (see `CLAUDE.md` §2). Schema version `hardware-1.0`.
- **`run.jsonl`** — automatic; the canonical event ledger. Line 1 is `type=meta`; subsequent lines are `type=call` or `type=skipped`; final line is `type=footer`.
- **`metadata.json`** — agent fills with computed aggregates per `(cell_id, phase)` cell. Schema version `metadata-1.0`. The catalogue builder will recompute on Sloba's side; having it in the PR makes review fast.
- **`run.md`** — agent fills using the `CLAUDE.md` §6b template. Honest narrative — methodology deviations, caveats, headline numbers.
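To make the relationship between `run.jsonl` and `metadata.json` concrete, here is a minimal sketch that groups `type=call` lines by `(cell_id, phase)` and computes a few of the aggregates. It is an illustration under assumptions, not the catalogue builder's code: `aggregate_calls` is a hypothetical name, only a subset of the `metadata-1.0` fields is produced, and because the percentile method the real builder uses is not specified here, only the median is shown.

```python
import json
from collections import defaultdict
from statistics import mean, median


def aggregate_calls(jsonl_lines):
    """Group type=call records by (cell_id, phase) and summarise each group."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # meta, skipped, and footer lines carry no timing data.
        if record.get("type") == "call":
            groups[(record["cell_id"], record["phase"])].append(record)

    cells = []
    for (cell_id, phase), calls in groups.items():
        ok = [c for c in calls if c.get("error") is None]
        tps = [c["tokens_per_second"] for c in ok
               if c.get("tokens_per_second") is not None]
        # format_ok and marker_hit_rate are null for unscored calls (e.g. hello).
        fmt = [c["format_ok"] for c in ok if c.get("format_ok") is not None]
        hits = [c["marker_hit_rate"] for c in ok
                if c.get("marker_hit_rate") is not None]
        cells.append({
            "cell_id": cell_id,
            "phase": phase,
            "n_calls": len(calls),
            "n_errors": len(calls) - len(ok),
            "duration_ms_p50": (
                round(median(c["duration_seconds"] for c in ok) * 1000) if ok else None
            ),
            "tokens_per_sec_mean": mean(tps) if tps else None,
            "tokens_per_sec_max": max(tps) if tps else None,
            "completion_tokens_total": sum(c.get("completion_tokens") or 0 for c in ok),
            "format_ok_rate": sum(fmt) / len(fmt) if fmt else None,
            "marker_hit_rate_mean": mean(hits) if hits else None,
        })
    return cells
```

Passing an open `run.jsonl` file object works, since iterating a file yields its lines. Treat the output as a cross-check for the agent-written `metadata.json`, not as a replacement for it.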
## Why per-submitter folders?
- **Attribution** — your handle lives next to your data
- **Reviewability** — a PR adds files only under `submissions/<your-handle>/...`; reviewer can see the whole contribution at a glance
- **No collisions** — two friends submitting from "macbook-pro" don't overwrite each other
- **History stays clean** — re-runs go into new `run-<uuid>/` subdirs, not on top of the old one
## Naming conventions
- **`<submitter-handle>`** — your Gitea username, or any other handle you'd like to be credited as. Lowercase; ASCII letters / digits / hyphens only.
- **`<device-tag>`** — short descriptor of the hardware. Pattern: `<chip-or-platform>-<key-spec>`. Examples:
  - `mac-m1-8gb`, `mac-m2-pro-16gb`, `mac-m3-max-64gb`
  - `rtx-4090-pc`, `rtx-3060-laptop`, `gtx-1060-6gb`
  - `ryzen-7950x-cpu`, `intel-i9-13900k-cpu`
  - `pixel-8-pro`, `samsung-s24-ultra` (yes, phones, if you've got Termux working)
  - `runpod-h100-pcie`, `runpod-rtx-a6000`
- **`run-<uuid>`** — `run-` prefix + a UUID v4 from `run_benchmark.py`. Don't shorten.
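The conventions above can be expressed as a small validator. This is a sketch under stated assumptions: `validate_submission_path` is a hypothetical name, the handle rule follows the text (lowercase ASCII letters, digits, hyphens), the same character set is assumed for device tags because every example above uses it, and the all-zero run ID of the `EXAMPLE` folder is deliberately not accepted as a UUID v4.

```python
import re
import uuid

# Lowercase ASCII letters, digits, and hyphens only.
NAME_RE = re.compile(r"[a-z0-9-]+")


def validate_submission_path(handle, device_tag, run_dir_name):
    """Return a list of problems; an empty list means the names look fine."""
    problems = []
    if not NAME_RE.fullmatch(handle):
        problems.append("handle: use lowercase ASCII letters, digits, and hyphens")
    if not NAME_RE.fullmatch(device_tag):
        problems.append("device tag: use lowercase ASCII letters, digits, and hyphens")
    if not run_dir_name.startswith("run-"):
        problems.append("run directory: missing 'run-' prefix")
        return problems
    raw = run_dir_name[len("run-"):]
    try:
        parsed = uuid.UUID(raw)
    except ValueError:
        problems.append("run directory: not a full UUID")
        return problems
    # Require the canonical hyphenated form and version 4.
    if str(parsed) != raw.lower() or parsed.version != 4:
        problems.append("run directory: expected a canonical UUID v4")
    return problems
```

Note that the template folder itself fails this check on two counts (uppercase `EXAMPLE`, all-zero run ID), which is intended: it is a template, not a submission.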
## What the EXAMPLE folder is for
A complete-but-tiny submission you can read end-to-end to understand the
shapes. **Don't modify the EXAMPLE folder in a benchmark-submission PR**; if
you spot a bug in the example, that's a separate PR with the title
`fix: submissions/EXAMPLE/...`.
## When a submission is merged
Sloba reviews and merges manually. After merge:
1. The catalogue builder on Sloba's side picks up your run, computes a
`cell_id` from your `device-tag` + model, and assigns it a `site_grade`
(flagship / standard / archive-only based on the criteria in
`methodology.md`).
2. Janie (the benchmarks blogger) may write a `janie_blurb_md` for it.
3. It appears on `benchmarks.weeyuga.com` (when the site is live).
4. Your `device-tag` becomes a permanent comparison axis on the catalogue.
## What if I want to delete a submission later?
Open an issue and we'll honor the request promptly. We'll keep the run
directory but mark it `visibility: redacted` in the catalogue overlay so
the data still validates historical analysis claims while disappearing
from the browse surface.