feat: harness + agent runbook - flip repo from archive-only to crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to, license,
  code-of-conduct short version.

harness/
├── README.md          Technical readme for the harness folder
├── run_benchmark.py   ~520 LOC runner. Stdlib-only. Adapted from
│                      WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│                      v3 with the cluster-internal IP defaults
│                      (10.8.0.x) replaced by 127.0.0.1:11434, the
│                      cluster /v1/cluster/* endpoints removed, the
│                      canonical-suite paths under ~/Documents/MyServers
│                      replaced by harness/suites/ paths, the git-sha
│                      enforcement on WeeyugaWeb dropped, and the
│                      output written under submissions/<handle>/<tag>/
│                      instead of docs/BENCHMARKS/runs/. Supports all
│                      six suite phases via --phases, plus 'all'.
├── prompts.py         Verbatim copy of the canonical 3 frozen prompts
│                      (P-EASY/P-MEDIUM/P-HARD) from
│                      WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt   Empty by intent (stdlib-only); placeholder for
│                      pip-tools / agent auto-install patterns.
├── .gitignore         __pycache__/ etc.
└── suites/            Six bundled JSON suites copied verbatim from
                       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
                       small_model_eval_questions.json,
                       python_task_suite_questions.json,
                       parallel_qwen_same_model_20q_suite.json,
                       parallel_qwen_mixed_model_20q_suite.json,
                       python_context_edge_append_questions.json,
                       python_context_edge_suite_only.json.

submissions/
  README.md
    Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
    Synthetic-but-shape-complete contribution template: manifest.json,
    hardware.json, run.jsonl (5 example lines), metadata.json, run.md
    (with privacy attestation, methodology deviations, reproducibility
    command). Marked as synthetic at the top so future analysis doesn't
    accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated. Maintainer
  credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
  paths / tokens. Two prompts contain project names ("MyBoard" auth
  debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03); flagged
  in chat for Sloba's review. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
  IPs leaked).
- manifest.json captures host_hostname_short via
  socket.gethostname().split('.')[0]; agent should review before PR if
  hostname is sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
2026-05-06 11:07:55 +02:00
parent ddc9626136
commit 97a9245d9e
22 changed files with 4400 additions and 47 deletions
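The privacy posture in the commit message mentions two mechanical steps: shortening the hostname with `socket.gethostname().split('.')[0]` and scanning files for cluster IPs and paths before pushing. A minimal sketch of both follows. The regex patterns and the names `short_hostname` and `find_leaks` are hypothetical illustrations; the actual privacy-grep in CLAUDE.md §8 is not shown in this commit view and may differ.

```python
import re
import socket

# Hypothetical patterns for illustration only; the real privacy-grep
# described in CLAUDE.md section 8 is not reproduced here.
PRIVATE_IPV4 = re.compile(
    r"\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}"
    r"|192\.168\.\d{1,3}\.\d{1,3}"
    r"|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3})\b"
)
HOME_PATH = re.compile(r"(?:/Users|/home)/[\w.-]+")


def short_hostname(fqdn: str) -> str:
    """Apply the split('.')[0] expression quoted in the commit message."""
    return fqdn.split(".")[0]


def find_leaks(text: str) -> list[str]:
    """Return substrings that look like a private IPv4 address or a home path."""
    return PRIVATE_IPV4.findall(text) + HOME_PATH.findall(text)


if __name__ == "__main__":
    # Value that would land in host_hostname_short on this machine.
    print(short_hostname(socket.gethostname()))
    # A loopback target passes; a 10.8.0.x address would be flagged.
    print(find_leaks('{"target_url": "http://127.0.0.1:11434"}'))
    print(find_leaks('{"target_url": "http://10.8.0.5:11434"}'))
```

A scan like this only catches the patterns it is given, so it complements rather than replaces the manual review the commit message asks the agent to perform.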
--- a/submissions/EXAMPLE/mac-m1-8gb/run-00000000-0000-0000-0000-000000000000/run.jsonl
+++ b/submissions/EXAMPLE/mac-m1-8gb/run-00000000-0000-0000-0000-000000000000/run.jsonl
@@ -0,0 +1,6 @@
+{"type":"meta","benchmark_run_id":"00000000-0000-0000-0000-000000000000","harness_version":"public-1","started_at_utc":"2026-05-12T14:32:11Z","host_hostname_short":"alices-mbp","load_avg_start":[1.2,1.4,1.6],"target_url":"http://127.0.0.1:11434","cell_id_prefix":"mac-m1:ollama","submitter_handle":"EXAMPLE","device_tag":"mac-m1-8gb","execution_shape":"per-model-block","phases_planned":["hello","5q","20q"],"models_planned":["qwen3.5:0.8b"],"canonical_options":{"temperature":0.1,"num_ctx":4096,"num_predict":2048},"canonical_options_effective":{"temperature":0.1,"num_ctx":4096,"num_predict":2048},"timeout_seconds":360,"platform_system":"Darwin","platform_release":"23.5.0","python_version":"3.12.4"}
+{"type":"call","ts_utc":"2026-05-12T14:32:13Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"hello","question_id":"hello_check","run_idx":0,"duration_seconds":1.847,"prompt_tokens":17,"completion_tokens":42,"tokens_per_second":22.74,"finish_reason":"stop","status_code":200,"response_chars":167,"response_preview":"Hello! Of course, I'd be happy to help. What can I assist you with today? Whether it's a question, a task, or just a chat, I'm here to help.","required_markers":[],"markers_hit":[],"marker_hit_rate":null,"format_rule":"","format_ok":null,"usable_answer":true,"error":null}
+{"type":"call","ts_utc":"2026-05-12T14:32:21Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"5q","question_id":"Q1","run_idx":0,"duration_seconds":4.21,"prompt_tokens":58,"completion_tokens":94,"tokens_per_second":22.32,"finish_reason":"stop","status_code":200,"response_chars":312,"response_preview":"#!/usr/bin/env bash\nset -euo pipefail\n\nif [[ ! -d \"$1\" ]]; then\n  echo \"err: $1 not a directory\" >&2\n  exit 1\nfi\n\nfor f in \"$1\"/*.log; do\n  [[ -e \"$f\" ]] || continue\n  gzip -k \"$f\"\ndone","required_markers":["gzip","#!/usr/bin/env bash"],"markers_hit":["gzip","#!/usr/bin/env bash"],"marker_hit_rate":1.0,"format_rule":"bash_code","format_ok":true,"usable_answer":true,"error":null}
+{"type":"call","ts_utc":"2026-05-12T14:33:48Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"20q","question_id":"Q01","run_idx":0,"duration_seconds":9.612,"prompt_tokens":142,"completion_tokens":201,"tokens_per_second":20.91,"finish_reason":"stop","status_code":200,"response_chars":784,"response_preview":"def is_valid_ipv4(addr: str) -> bool:\n    parts = addr.split('.')\n    if len(parts) != 4:\n        return False\n    for p in parts:\n        if not p.isdigit():\n            return False\n        n = int(p)","required_markers":["is_valid_ipv4","def test_"],"markers_hit":["is_valid_ipv4","def test_"],"marker_hit_rate":1.0,"format_rule":"python_code","format_ok":true,"usable_answer":true,"error":null}
+{"type":"call","ts_utc":"2026-05-12T14:36:42Z","cell_id":"mac-m1:ollama:qwen3.5:0.8b","model":"qwen3.5:0.8b","phase":"20q","question_id":"Q14","run_idx":13,"duration_seconds":42.118,"prompt_tokens":189,"completion_tokens":312,"tokens_per_second":7.41,"finish_reason":"stop","status_code":200,"response_chars":1240,"response_preview":"To debug this MyBoard auth issue, the triage should focus on…","required_markers":["/auth/login","/auth/me","myboard:post-login-redirect","tenant-missing"],"markers_hit":["/auth/login","/auth/me","tenant-missing"],"marker_hit_rate":0.75,"format_rule":"json_dict","format_ok":false,"usable_answer":true,"error":null}
+{"type":"footer","ts_utc":"2026-05-12T14:48:03Z","finished_at_utc":"2026-05-12T14:48:03Z","load_avg_end":[1.6,1.5,1.6]}
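The example ledger above follows a visible shape: one `meta` record, a run of `call` records, then one `footer`. A reviewer could sanity-check a submitted `run.jsonl` along these lines. This is a sketch, not the harness's own validator: the function name `check_run_jsonl` is hypothetical, and the two arithmetic relations it checks (`tokens_per_second` as `completion_tokens / duration_seconds`, and `marker_hit_rate` as hits over required markers) are assumptions inferred from the sample values, not confirmed by the source.

```python
import json


def check_run_jsonl(lines: list[str], tol: float = 0.02) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks fine."""
    problems = []
    records = [json.loads(line) for line in lines if line.strip()]
    if not records or records[0].get("type") != "meta":
        problems.append("first record is not type=meta")
    if not records or records[-1].get("type") != "footer":
        problems.append("last record is not type=footer")
    # Everything between the first and last record should be a call.
    for i, rec in enumerate(records[1:-1], start=2):
        if rec.get("type") != "call":
            problems.append(f"line {i}: expected type=call")
            continue
        # Assumed relation: tokens_per_second = completion_tokens / duration_seconds.
        dur = rec.get("duration_seconds")
        toks = rec.get("completion_tokens")
        tps = rec.get("tokens_per_second")
        if dur and toks is not None and tps is not None:
            if abs(toks / dur - tps) > tol:
                problems.append(f"line {i}: tokens_per_second off")
        # Assumed relation: marker_hit_rate = hits / required, null when none required.
        required = rec.get("required_markers") or []
        rate = rec.get("marker_hit_rate")
        if required:
            expected = len(rec.get("markers_hit") or []) / len(required)
            if rate is None or abs(expected - rate) > 1e-9:
                problems.append(f"line {i}: marker_hit_rate off")
        elif rate is not None:
            problems.append(f"line {i}: marker_hit_rate should be null")
    return problems


if __name__ == "__main__":
    # Replace the path with a real submission to try it out.
    with open("run.jsonl", encoding="utf-8") as fh:
        for problem in check_run_jsonl(fh.readlines()):
            print(problem)
```

The tolerance on `tokens_per_second` is deliberately loose because the sample values appear to be stored with two decimal places.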