weeyuga-benchmarks-public/harness/README.md

# `harness/` — runner + suites + prompts

Self-contained, dependency-free Python 3 benchmark runner. Drives any
OpenAI-compatible `/v1/chat/completions` endpoint with the canonical
Weeyuga prompt set; emits a JSONL event ledger.

## Files

```
harness/
├── README.md                       — this file
├── run_benchmark.py                — the runner (one Python 3 process, stdlib only)
├── prompts.py                      — 3 frozen reference prompts (P-EASY/P-MEDIUM/P-HARD)
├── requirements.txt                — empty by intent (stdlib only); listed for tooling
└── suites/
    ├── small_model_eval_questions.json          — 5Q (5 short tasks, format-checked)
    ├── python_task_suite_questions.json         — 20Q (20 realistic Python prompts)
    ├── parallel_qwen_same_model_20q_suite.json  — same-model parallel-lane stress
    ├── parallel_qwen_mixed_model_20q_suite.json — mixed-model parallel-lane stress
    ├── python_context_edge_append_questions.json — long-context append behavior
    └── python_context_edge_suite_only.json      — long-context whole-suite reasoning
```

## Quick reference

```bash
# Smoke (one hello call, end-to-end runtime check)
python3 harness/run_benchmark.py --smoke \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac:ollama \
    --submitter-handle alice --device-tag mac-m1-8gb

# Default phases (hello + 5q + 20q)
python3 harness/run_benchmark.py \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac:ollama \
    --submitter-handle alice --device-tag mac-m1-8gb

# Full sweep (all six suites)
python3 harness/run_benchmark.py --phases all \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac:ollama \
    --submitter-handle alice --device-tag mac-m1-8gb

# Probe (list models + one hello, no ledger written)
python3 harness/run_benchmark.py --probe \
    --target-url http://127.0.0.1:11434 \
    --cell-id-prefix mac:ollama
```

## Output layout

The runner writes to `submissions/<submitter-handle>/<device-tag>/run-<uuid>/`:

```
submissions/alice/mac-m1-8gb/run-<uuid>/
├── run.jsonl       — event ledger; one JSON object per line
├── manifest.json   — automatic; written at run start
├── hardware.json   — agent fills from hardware probe (see CLAUDE.md §2)
├── metadata.json   — agent fills from aggregates (see CLAUDE.md §6)
└── run.md          — agent fills from template (see CLAUDE.md §6)
```

## Knobs

CLI flags (see `--help`):
- `--target-url` — OpenAI-compat base URL (default `http://127.0.0.1:11434`)
- `--models` — comma-separated, or `auto` for `/v1/models` discovery
- `--cell-id-prefix` — `<node-tag>:<engine>` for the JSONL `cell_id` field
- `--phases` — subset of `hello, frozen, 5q, 20q, parallel_same, parallel_mixed, edge_append, edge_suite`, or `all`
- `--timeout` — per-call wall-clock cap (default 360 s)
- `--temperature` / `--num-ctx` / `--num-predict` — override canonical knobs
- `--probe` / `--smoke` — health-check shortcuts
- `--run-id` / `--out-dir` — resume / custom output

Canonical defaults (in code):

```python
CANONICAL_OPTIONS = {
    "temperature": 0.1,
    "num_ctx":     4096,
    "num_predict": 2048,
}
```

Any deviation is recorded automatically in `manifest.json.canonical_options_overrides`.

## Dependencies

**None beyond Python 3 stdlib.** `urllib.request` does the HTTP, `json` does
serde, `uuid` makes the run-id. The empty `requirements.txt` exists so tools
like `pip-tools` and reproducibility scripts have a hook; if a future version
adds dependencies they'll land there with pinned versions.

Tested against Python 3.10, 3.11, 3.12. Earlier 3.x may work but isn't tested.

## Suite shapes

All six `suites/*.json` follow the same top-level shape:

```json
{
  "suite_name": "...",
  "version": "1",
  "purpose": "...",
  "models": ["..."],          // advisory; runner uses --models flag
  "questions": [
    {
      "id": "Q01",
      "prompt": "...",
      "required_markers": ["..."],   // optional; lower-cased substring matches
      "format_rule": "..."           // optional; one of: bash_code, python_code, shell_lines, four_numbered_steps, five_bullets, json_dict, pytest_code
    }
  ]
}
```

`required_markers` and `format_rule` are heuristic — they exist to flag
"obviously wrong shape" answers without claiming semantic correctness. Don't
treat them as ground truth; treat them as a sanity check.

The parallel and edge suites add more top-level fields (`run_mode`, `lanes`,
`question_assignment`, etc.) for advisory context; the runner reads only
`questions[]` from any suite.

## Adding a new suite

For now: don't. The six suites are stable and adding more in this branch
breaks comparability with the existing 21 reference runs. If you want a new
suite, open an issue on this repo proposing it; we'll discuss whether it
warrants a `HARNESS_VERSION = public-2` bump (suites would still need to be
backwards-compatible — adding new phase keys is fine, redefining existing
ones is not).

## Why no `pip install -e` / no Python package?

This is a **scripts directory**, not a library. The runner is one file. The
suites are data files. Friends running this from a fresh clone shouldn't have
to deal with packaging, virtualenvs (beyond what their agent recommends), or
upgrade flows. If/when this grows past one runner, we'll split it.

## License

`prompts.py`, `run_benchmark.py`, and any future `harness/*.py` code: MIT
(see [LICENSE-MIT](../LICENSE-MIT)).

`suites/*.json`: CC-BY-4.0 (see [LICENSE](../LICENSE)) — same as the bench
data they test against.