# `harness/` — runner + suites + prompts

Self-contained, dependency-free Python 3 benchmark runner. Drives any OpenAI-compatible `/v1/chat/completions` endpoint with the canonical Weeyuga prompt set; emits a JSONL event ledger.

## Files

```
harness/
├── README.md          — this file
├── run_benchmark.py   — the runner (one Python 3 process, stdlib only)
├── prompts.py         — 3 frozen reference prompts (P-EASY/P-MEDIUM/P-HARD)
├── requirements.txt   — empty by intent (stdlib only); listed for tooling
└── suites/
    ├── small_model_eval_questions.json            — 5Q (5 short tasks, format-checked)
    ├── python_task_suite_questions.json           — 20Q (20 realistic Python prompts)
    ├── parallel_qwen_same_model_20q_suite.json    — same-model parallel-lane stress
    ├── parallel_qwen_mixed_model_20q_suite.json   — mixed-model parallel-lane stress
    ├── python_context_edge_append_questions.json  — long-context append behavior
    └── python_context_edge_suite_only.json        — long-context whole-suite reasoning
```

## Quick reference

```bash
# Smoke (one hello call, end-to-end runtime check)
python3 harness/run_benchmark.py --smoke \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b \
  --cell-id-prefix mac:ollama \
  --submitter-handle alice --device-tag mac-m1-8gb

# Default phases (hello + 5q + 20q)
python3 harness/run_benchmark.py \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b \
  --cell-id-prefix mac:ollama \
  --submitter-handle alice --device-tag mac-m1-8gb

# Full sweep (all six suites)
python3 harness/run_benchmark.py --phases all \
  --target-url http://127.0.0.1:11434 \
  --models qwen3.5:0.8b \
  --cell-id-prefix mac:ollama \
  --submitter-handle alice --device-tag mac-m1-8gb

# Probe (list models + one hello, no ledger written)
python3 harness/run_benchmark.py --probe \
  --target-url http://127.0.0.1:11434 \
  --cell-id-prefix mac:ollama
```

## Output layout

The runner writes to `submissions///run-/`:

```
submissions/alice/mac-m1-8gb/run-/
├── run.jsonl       — event ledger; one JSON object per line
├── manifest.json   — automatic; written at run start
├── hardware.json   — agent fills from hardware probe (see CLAUDE.md §2)
├── metadata.json   — agent fills from aggregates (see CLAUDE.md §6)
└── run.md          — agent fills from template (see CLAUDE.md §6)
```

## Knobs

CLI flags (see `--help`):

- `--target-url` — OpenAI-compat base URL (default `http://127.0.0.1:11434`)
- `--models` — comma-separated, or `auto` for `/v1/models` discovery
- `--cell-id-prefix` — `:` for the JSONL `cell_id` field
- `--phases` — subset of `hello, frozen, 5q, 20q, parallel_same, parallel_mixed, edge_append, edge_suite`, or `all`
- `--timeout` — per-call wall-clock cap (default 360 s)
- `--temperature` / `--num-ctx` / `--num-predict` — override canonical knobs
- `--probe` / `--smoke` — health-check shortcuts
- `--run-id` / `--out-dir` — resume / custom output

Canonical defaults (in code):

```python
CANONICAL_OPTIONS = {
    "temperature": 0.1,
    "num_ctx": 4096,
    "num_predict": 2048,
}
```

Any deviation is recorded automatically in `manifest.json.canonical_options_overrides`.

## Dependencies

**None beyond Python 3 stdlib.** `urllib.request` does the HTTP, `json` does serde, `uuid` makes the run-id.

The empty `requirements.txt` exists so tools like `pip-tools` and reproducibility scripts have a hook; if a future version adds dependencies they'll land there with pinned versions.

Tested against Python 3.10, 3.11, 3.12. Earlier 3.x may work but isn't tested.

## Suite shapes

All six `suites/*.json` follow the same top-level shape:

```json
{
  "suite_name": "...",
  "version": "1",
  "purpose": "...",
  "models": ["..."],                // advisory; runner uses --models flag
  "questions": [
    {
      "id": "Q01",
      "prompt": "...",
      "required_markers": ["..."],  // optional; lower-cased substring matches
      "format_rule": "..."          // optional; one of: bash_code, python_code, shell_lines, four_numbered_steps, five_bullets, json_dict, pytest_code
    }
  ]
}
```

`required_markers` and `format_rule` are heuristic — they exist to flag "obviously wrong shape" answers without claiming semantic correctness. Don't treat them as ground truth; treat them as a sanity check.

The parallel and edge suites add more top-level fields (`run_mode`, `lanes`, `question_assignment`, etc.) for advisory context; the runner reads only `questions[]` from any suite.

## Adding a new suite

For now: don't. The six suites are stable and adding more in this branch breaks comparability with the existing 21 reference runs.

If you want a new suite, open an issue on this repo proposing it; we'll discuss whether it warrants a `HARNESS_VERSION = public-2` bump (suites would still need to be backwards-compatible — adding new phase keys is fine, redefining existing ones is not).

## Why no `pip install -e` / no Python package?

This is a **scripts directory**, not a library. The runner is one file. The suites are data files. Friends running this from a fresh clone shouldn't have to deal with packaging, virtualenvs (beyond what their agent recommends), or upgrade flows. If/when this grows past one runner, we'll split it.

## License

`prompts.py`, `run_benchmark.py`, and any future `harness/*.py` code: MIT (see [LICENSE-MIT](../LICENSE-MIT)).

`suites/*.json`: CC-BY-4.0 (see [LICENSE](../LICENSE)) — same as the bench data they test against.
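
## Example: sanity-checking a suite file

The suite shape described under "Suite shapes" can be checked with a few lines of stdlib Python. The sketch below is illustrative and is not part of `run_benchmark.py`: the helper names `check_suite` and `missing_markers` are hypothetical, and treating `id` and `prompt` as required non-empty strings is an assumption inferred from the other two question fields being marked optional. The marker check mirrors the "lower-cased substring matches" description and may differ in detail from what the runner actually does.

```python
# Format rules listed in the suite-shape description.
KNOWN_FORMAT_RULES = {
    "bash_code", "python_code", "shell_lines", "four_numbered_steps",
    "five_bullets", "json_dict", "pytest_code",
}


def check_suite(suite: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks fine."""
    questions = suite.get("questions")
    if not isinstance(questions, list):
        return ["'questions' must be a list"]

    problems = []
    seen_ids = set()
    for i, q in enumerate(questions):
        if not isinstance(q, dict):
            problems.append(f"index {i}: question is not an object")
            continue
        label = q.get("id") or f"index {i}"

        # Assumption: id and prompt are required, non-empty strings.
        for key in ("id", "prompt"):
            if not isinstance(q.get(key), str) or not q[key]:
                problems.append(f"{label}: missing or empty '{key}'")

        if q.get("id") in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(q.get("id"))

        # format_rule is optional, but if present it should be a known rule.
        rule = q.get("format_rule")
        if rule is not None and rule not in KNOWN_FORMAT_RULES:
            problems.append(f"{label}: unknown format_rule {rule!r}")

        # required_markers is optional; if present it should be a list of strings.
        markers = q.get("required_markers", [])
        if not isinstance(markers, list) or not all(isinstance(m, str) for m in markers):
            problems.append(f"{label}: 'required_markers' must be a list of strings")
    return problems


def missing_markers(answer: str, markers: list[str]) -> list[str]:
    """Markers absent from the answer, compared as lower-cased substrings."""
    text = answer.lower()
    return [m for m in markers if m.lower() not in text]
```

To try it on a real suite, load the file with `json.load(open("harness/suites/small_model_eval_questions.json"))` and pass the resulting dict to `check_suite`. Like the markers themselves, this is a sanity check on shape, not a correctness guarantee.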