slobodanmargetic988/weeyuga-benchmarks-public


Slobodan Margetic 97a9245d9e feat: harness + agent runbook — flip repo from archive-only to crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
  ├── README.md        Technical readme for the harness folder
  ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
  │                    WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
  │                    v3 with the cluster-internal IP defaults
  │                    (10.8.0.x) replaced by 127.0.0.1:11434, the
  │                    cluster /v1/cluster/* endpoints removed, the
  │                    canonical-suite paths under ~/Documents/MyServers
  │                    replaced by harness/suites/ paths, the git-sha
  │                    enforcement on WeeyugaWeb dropped, and the
  │                    output written under submissions/<handle>/<tag>/
  │                    instead of docs/BENCHMARKS/runs/. Supports all
  │                    six suite phases via --phases, plus 'all'.
  ├── prompts.py       Verbatim copy of the canonical 3 frozen prompts
  │                    (P-EASY/P-MEDIUM/P-HARD) from
  │                    WeeyugaWeb/scripts/benchmarks/prompts.py.
  ├── requirements.txt Empty by intent (stdlib-only); placeholder for
  │                    pip-tools / agent auto-install patterns.
  ├── .gitignore       __pycache__/ etc.
  └── suites/          Six bundled JSON suites copied verbatim from
       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
       small_model_eval_questions.json, python_task_suite_questions.json,
       parallel_qwen_same_model_20q_suite.json,
       parallel_qwen_mixed_model_20q_suite.json,
       python_context_edge_append_questions.json,
       python_context_edge_suite_only.json.

submissions/
  README.md            Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
       Synthetic-but-shape-complete contribution template:
       manifest.json, hardware.json, run.jsonl (5 example lines),
       metadata.json, run.md (with privacy attestation, methodology
       deviations, reproducibility command). Marked as synthetic at
       the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
  - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
    paths / tokens. Two prompts contain project names ("MyBoard" auth
    debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
    flagged in chat for Sloba's review. Otherwise clean.
  - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
    IPs leaked).
  - manifest.json captures host_hostname_short via socket.gethostname()
    .split('.')[0] — agent should review before PR if hostname is
    sensitive.
  - CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
  - py_compile run_benchmark.py: OK
  - --help renders cleanly
  - All 6 suite JSON files: valid
  - All 4 example JSON files: valid
  - Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.

2026-05-06 19:05:22 +02:00


EXAMPLE — mac-m1-8gb — qwen3.5:0.8b — 2026-05-12

This is a synthetic example so contributors can see the shape of a submission end-to-end. The numbers are plausible but not from a real run. Don't cite this directory in analysis. Don't copy-paste these numbers. Real submissions live alongside this folder under submissions/<handle>/.

Run ID: 00000000-0000-0000-0000-000000000000
Submitter: EXAMPLE (synthetic)
Hardware: Apple MacBook Air M1, 8 GB unified, macOS 14.5
Runtime: Ollama 0.5.13 (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
Models: qwen3.5:0.8b
Phases run: hello, 5q, 20q
Phases skipped: parallel_same, parallel_mixed, edge_append, edge_suite. The parallel suites need ≥2 warm copies of the model, which did not fit in 8 GB unified memory (RAM constraint); the edge suites were skipped for time budget (they would have added ~30 min).

Headline numbers

Cell	Phase	n_calls	tok/s mean	tok/s p50	duration p50	format_ok rate
mac-m1:ollama:qwen3.5:0.8b	hello	1	22.7	22.7	1.8 s	n/a
mac-m1:ollama:qwen3.5:0.8b	5q	5	21.4	22.3	4.2 s	80%
mac-m1:ollama:qwen3.5:0.8b	20q	20	17.0	20.9	9.6 s	70%
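
The headline columns above can be recomputed from run.jsonl. The following is a minimal sketch, assuming each line is a JSON object with `phase`, `tok_per_s`, `duration_s`, and `format_ok` fields. Those field names are hypothetical; the real schema is whatever harness/run_benchmark.py writes.

```python
import json
import statistics
from collections import defaultdict


def summarize(lines):
    """Group run.jsonl records by phase and compute the headline columns.

    Field names (phase, tok_per_s, duration_s, format_ok) are assumptions
    made for illustration, not the harness's confirmed schema.
    """
    by_phase = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            by_phase[record["phase"]].append(record)

    summary = {}
    for phase, records in by_phase.items():
        rates = [r["tok_per_s"] for r in records]
        durations = [r["duration_s"] for r in records]
        # format_ok may be null for phases without a format check (e.g. hello).
        checked = [r["format_ok"] for r in records if r.get("format_ok") is not None]
        summary[phase] = {
            "n_calls": len(records),
            "tok_s_mean": round(statistics.mean(rates), 1),
            "tok_s_p50": round(statistics.median(rates), 1),
            "duration_p50": round(statistics.median(durations), 1),
            "format_ok_rate": (sum(checked) / len(checked)) if checked else None,
        }
    return summary
```

Passing an open file handle works as well as a list of strings, for example `summarize(open("run.jsonl"))`.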

What I observed (qualitative)

Hello-call cold-start was fast — 1.8 s including initial model load. Ollama reports the 0.8B GGUF as ~600 MB; on Apple Silicon unified memory this loads in well under 2 s.
5Q tasks were mostly handled well: four of the five formats (bash, python, shell, four-numbered-steps, json) parsed correctly. The exception was Q3 ("shell_lines"), where the model started with a 1. numbered list instead of a raw shell command.
20Q tasks bifurcated — the simple ones (Q01-Q08) ran at full ~20 tok/s with high format-correctness; the longer ones (Q09+ with multi-paragraph context) saw throughput drop to ~12-15 tok/s, with format_ok dropping to ~60%. p95 duration of 41 s was Q14 (the MyBoard triage prompt — long context, mixed format).
No errors, no timeouts. The run was on AC power, and there was no visible sign of thermal throttling (the M1 MacBook Air is fanless, so there is no fan noise to go by).
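
The Q3 failure mode noted above (a numbered list where raw shell lines were expected) can be detected mechanically. This is a minimal sketch of such a check; the function name and the rule are hypothetical, and the harness's actual format_ok logic may differ.

```python
import re

# Matches list-style prefixes such as "1. ", "2) ", "- " or "* " at line start.
_LIST_PREFIX = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+")


def looks_like_shell_lines(output: str) -> bool:
    """Return True if the output is non-empty and no line has a list-style prefix."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if not lines:
        return False
    return not any(_LIST_PREFIX.match(ln) for ln in lines)
```

A check this simple only rejects list markers; it does not verify that the lines are valid shell.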

Methodology

Followed the canonical Pavilion methodology with these deviations:

NUM_PARALLEL=1 instead of canonical 3 — laptop, not server; one slot is enough for sequential per-model-block execution.
KEEP_ALIVE=5m instead of canonical 2400h — laptop, no need to pin.
Phases parallel_same, parallel_mixed, edge_append, edge_suite skipped — see top of file. Run not eligible for flagship grade, intended as standard.
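
Deviations like the ones listed above can be derived rather than written by hand. The sketch below assumes the canonical knob values quoted in this file (NUM_PARALLEL=3, KEEP_ALIVE=2400h); the function name and the output shape are hypothetical and are not taken from the metadata.json template.

```python
# Canonical values as quoted in this run.md; the full knob list lives in the runbook.
CANONICAL = {"NUM_PARALLEL": "3", "KEEP_ALIVE": "2400h"}


def knob_deviations(actual: dict, canonical: dict = CANONICAL) -> dict:
    """Return {knob: {"canonical": ..., "actual": ...}} for every knob that differs."""
    return {
        knob: {"canonical": expected, "actual": actual.get(knob)}
        for knob, expected in canonical.items()
        if actual.get(knob) != expected
    }
```

For this run, `knob_deviations({"NUM_PARALLEL": "1", "KEEP_ALIVE": "5m"})` reports both knobs as deviating.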

Caveats

8 GB unified RAM is below the comfort floor for parallel suites with this model; results above are NOT a refutation of the canonical parallel numbers — they're from a different shape of run.
macOS Spotlight indexing was disabled before the run started. If you rerun without disabling, expect ~5-10% additional variance from background I/O.
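A format_ok rate of 70% on 20Q (14 of 20) is consistent, within measurement noise, with Sloba's flagship 20Q numbers for qwen3.5:0.8b on Pavilion (~74-78% in the v1 baseline): with n=20 each question is worth 5 percentage points, so the gap is roughly one to two questions.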
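
To put a rough number on "measurement noise" for a 20-question pass rate, one option is a Wilson score interval. This is an illustrative sketch, not part of the harness or the Pavilion methodology; the function name is hypothetical.

```python
import math


def format_ok_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a pass rate; returns (low, high) as fractions."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


low, high = format_ok_interval(14, 20)
print(f"{low:.2f} to {high:.2f}")
```

For 14 of 20 the interval spans roughly 48% to 85%, so a baseline of ~74-78% sits well inside it. With so few questions, differences of a few points between runs should not be over-read.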

Reproducibility

ollama serve  # in a separate terminal; skip if Ollama is already running
ollama pull qwen3.5:0.8b  # needs the server above to be reachable

# Replace "alice" with your own handle; real submissions live under submissions/<handle>/.
python3 harness/run_benchmark.py \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac-m1:ollama \
    --phases hello,5q,20q \
    --submitter-handle alice \
    --device-tag mac-m1-8gb

Took ~16 minutes wall-clock on this hardware.
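
Before starting a run of this length it can be worth confirming that the Ollama server answers on the target URL. The sketch below is a stdlib-only preflight check that queries Ollama's /api/tags endpoint; the function name is hypothetical, and whether run_benchmark.py performs a similar check itself is not stated here.

```python
import urllib.error
import urllib.request


def server_reachable(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on <base_url>/api/tags answers with status 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a malformed URL.
        return False
```

Calling `server_reachable()` with no arguments checks the same default target the reproducibility command uses.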

Privacy attestation

I scanned run.jsonl for personal paths, API tokens, SSH keys, and home-directory leakage:

grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" \
    submissions/EXAMPLE/mac-m1-8gb/run-00000000-0000-0000-0000-000000000000/*

No matches outside the SSH-troubleshooting prompt in 5Q (Q3) which is intentional curriculum. Safe to ship.
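
For contributors without grep available, the same scan can be approximated in stdlib Python. This is a sketch that reuses the alternation from the grep command above; the function names are hypothetical, and unlike a shell glob it also reads dotfiles in the directory.

```python
import re
from pathlib import Path

# Same alternation as the grep command above.
PATTERN = re.compile(r"Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519")


def scan_text(name: str, text: str):
    """Yield (name, line_number, line) for every line matching PATTERN."""
    for number, line in enumerate(text.splitlines(), start=1):
        if PATTERN.search(line):
            yield name, number, line


def scan_dir(run_dir: str):
    """Scan every regular file directly inside run_dir and collect the hits."""
    hits = []
    for path in sorted(Path(run_dir).iterdir()):
        if path.is_file():
            hits.extend(scan_text(path.name, path.read_text(errors="replace")))
    return hits
```

An empty list from `scan_dir(...)` corresponds to grep printing nothing; any hit still needs a human decision on whether it is curriculum text or a real leak.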

— EXAMPLE (synthetic; not a real contributor)
