slobodanmargetic988/weeyuga-benchmarks-public

Files

Slobodan Margetic 97a9245d9e feat: harness + agent runbook — flip repo from archive-only to

crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
  ├── README.md        Technical readme for the harness folder
  ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
  │                    WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
  │                    v3 with the cluster-internal IP defaults
  │                    (10.8.0.x) replaced by 127.0.0.1:11434, the
  │                    cluster /v1/cluster/* endpoints removed, the
  │                    canonical-suite paths under ~/Documents/MyServers
  │                    replaced by harness/suites/ paths, the git-sha
  │                    enforcement on WeeyugaWeb dropped, and the
  │                    output written under submissions/<handle>/<tag>/
  │                    instead of docs/BENCHMARKS/runs/. Supports all
  │                    six suite phases via --phases, plus 'all'.
  ├── prompts.py       Verbatim copy of the canonical 3 frozen prompts
  │                    (P-EASY/P-MEDIUM/P-HARD) from
  │                    WeeyugaWeb/scripts/benchmarks/prompts.py.
  ├── requirements.txt Empty by intent (stdlib-only); placeholder for
  │                    pip-tools / agent auto-install patterns.
  ├── .gitignore       __pycache__/ etc.
  └── suites/          Six bundled JSON suites copied verbatim from
       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
       small_model_eval_questions.json, python_task_suite_questions.json,
       parallel_qwen_same_model_20q_suite.json,
       parallel_qwen_mixed_model_20q_suite.json,
       python_context_edge_append_questions.json,
       python_context_edge_suite_only.json.

submissions/
  README.md            Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
       Synthetic-but-shape-complete contribution template:
       manifest.json, hardware.json, run.jsonl (5 example lines),
       metadata.json, run.md (with privacy attestation, methodology
       deviations, reproducibility command). Marked as synthetic at
       the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
  - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
    paths / tokens. Two prompts contain project names ("MyBoard" auth
    debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
    flagged in chat for Sloba's review. Otherwise clean.
  - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
    IPs leaked).
  - manifest.json captures host_hostname_short via socket.gethostname()
    .split('.')[0] — agent should review before PR if hostname is
    sensitive.
  - CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
  - py_compile run_benchmark.py: OK
  - --help renders cleanly
  - All 6 suite JSON files: valid
  - All 4 example JSON files: valid
  - Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.

2026-05-06 19:05:22 +02:00

suites

feat: harness + agent runbook — flip repo from archive-only to

2026-05-06 19:05:22 +02:00

.gitignore

feat: harness + agent runbook — flip repo from archive-only to

2026-05-06 19:05:22 +02:00

prompts.py

feat: harness + agent runbook — flip repo from archive-only to

2026-05-06 19:05:22 +02:00

README.md

feat: harness + agent runbook — flip repo from archive-only to

2026-05-06 19:05:22 +02:00

requirements.txt

feat: harness + agent runbook — flip repo from archive-only to

2026-05-06 19:05:22 +02:00

run_benchmark.py

feat: harness + agent runbook — flip repo from archive-only to

2026-05-06 19:05:22 +02:00

README.md

`harness/` — runner + suites + prompts

Self-contained, dependency-free Python 3 benchmark runner. Drives any OpenAI-compatible /v1/chat/completions endpoint with the canonical Weeyuga prompt set; emits a JSONL event ledger.

Files

harness/
├── README.md                       — this file
├── run_benchmark.py                — the runner (one Python 3 process, stdlib only)
├── prompts.py                      — 3 frozen reference prompts (P-EASY/P-MEDIUM/P-HARD)
├── requirements.txt                — empty by intent (stdlib only); listed for tooling
└── suites/
    ├── small_model_eval_questions.json          — 5Q (5 short tasks, format-checked)
    ├── python_task_suite_questions.json         — 20Q (20 realistic Python prompts)
    ├── parallel_qwen_same_model_20q_suite.json  — same-model parallel-lane stress
    ├── parallel_qwen_mixed_model_20q_suite.json — mixed-model parallel-lane stress
    ├── python_context_edge_append_questions.json — long-context append behavior
    └── python_context_edge_suite_only.json      — long-context whole-suite reasoning

Quick reference

# Smoke (one hello call, end-to-end runtime check)
python3 harness/run_benchmark.py --smoke \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac:ollama \
    --submitter-handle alice --device-tag mac-m1-8gb

# Default phases (hello + 5q + 20q)
python3 harness/run_benchmark.py \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac:ollama \
    --submitter-handle alice --device-tag mac-m1-8gb

# Full sweep (all six suites)
python3 harness/run_benchmark.py --phases all \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac:ollama \
    --submitter-handle alice --device-tag mac-m1-8gb

# Probe (list models + one hello, no ledger written)
python3 harness/run_benchmark.py --probe \
    --target-url http://127.0.0.1:11434 \
    --cell-id-prefix mac:ollama

Output layout

The runner writes to submissions/<submitter-handle>/<device-tag>/run-<uuid>/:

submissions/alice/mac-m1-8gb/run-<uuid>/
├── run.jsonl       — event ledger; one JSON object per line
├── manifest.json   — automatic; written at run start
├── hardware.json   — agent fills from hardware probe (see CLAUDE.md §2)
├── metadata.json   — agent fills from aggregates (see CLAUDE.md §6)
└── run.md          — agent fills from template (see CLAUDE.md §6)

Knobs

CLI flags (see --help):

--target-url — OpenAI-compat base URL (default http://127.0.0.1:11434)
--models — comma-separated, or auto for /v1/models discovery
--cell-id-prefix — <node-tag>:<engine> for the JSONL cell_id field
--phases — subset of hello, frozen, 5q, 20q, parallel_same, parallel_mixed, edge_append, edge_suite, or all
--timeout — per-call wall-clock cap (default 360 s)
--temperature / --num-ctx / --num-predict — override canonical knobs
--probe / --smoke — health-check shortcuts
--run-id / --out-dir — resume / custom output

Canonical defaults (in code):

CANONICAL_OPTIONS = {
    "temperature": 0.1,
    "num_ctx":     4096,
    "num_predict": 2048,
}

Any deviation is recorded automatically in manifest.json.canonical_options_overrides.

Dependencies

None beyond Python 3 stdlib. urllib.request does the HTTP, json does serde, uuid makes the run-id. The empty requirements.txt exists so tools like pip-tools and reproducibility scripts have a hook; if a future version adds dependencies they'll land there with pinned versions.

Tested against Python 3.10, 3.11, 3.12. Earlier 3.x may work but isn't tested.

Suite shapes

All six suites/*.json follow the same top-level shape:

{
  "suite_name": "...",
  "version": "1",
  "purpose": "...",
  "models": ["..."],          // advisory; runner uses --models flag
  "questions": [
    {
      "id": "Q01",
      "prompt": "...",
      "required_markers": ["..."],   // optional; lower-cased substring matches
      "format_rule": "..."           // optional; one of: bash_code, python_code, shell_lines, four_numbered_steps, five_bullets, json_dict, pytest_code
    }
  ]
}

required_markers and format_rule are heuristic — they exist to flag "obviously wrong shape" answers without claiming semantic correctness. Don't treat them as ground truth; treat them as a sanity check.

The parallel and edge suites add more top-level fields (run_mode, lanes, question_assignment, etc.) for advisory context; the runner reads only questions[] from any suite.

Adding a new suite

For now: don't. The six suites are stable and adding more in this branch breaks comparability with the existing 21 reference runs. If you want a new suite, open an issue on this repo proposing it; we'll discuss whether it warrants a HARNESS_VERSION = public-2 bump (suites would still need to be backwards-compatible — adding new phase keys is fine, redefining existing ones is not).

Why no `pip install -e` / no Python package?

This is a scripts directory, not a library. The runner is one file. The suites are data files. Friends running this from a fresh clone shouldn't have to deal with packaging, virtualenvs (beyond what their agent recommends), or upgrade flows. If/when this grows past one runner, we'll split it.

License

prompts.py, run_benchmark.py, and any future harness/*.py code: MIT (see LICENSE-MIT).

suites/*.json: CC-BY-4.0 (see LICENSE) — same as the bench data they test against.

README.md

harness/ — runner + suites + prompts

Files

Quick reference

Output layout

Knobs

Dependencies

Suite shapes

Adding a new suite

Why no pip install -e / no Python package?

License

`harness/` — runner + suites + prompts

Why no `pip install -e` / no Python package?