feat: harness + agent runbook - flip repo from archive-only to crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run ledgers,
shipped 2026-05-05) stays intact. This commit ADDS a second purpose: a
portable harness + agent runbook so a friend's coding agent can clone,
read CLAUDE.md, run the same suite on the friend's hardware, and submit
results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
├── README.md          Technical readme for the harness folder
├── run_benchmark.py   ~520 LOC runner. Stdlib-only. Adapted from
│                      WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│                      v3 with the cluster-internal IP defaults
│                      (10.8.0.x) replaced by 127.0.0.1:11434, the
│                      cluster /v1/cluster/* endpoints removed, the
│                      canonical-suite paths under ~/Documents/MyServers
│                      replaced by harness/suites/ paths, the git-sha
│                      enforcement on WeeyugaWeb dropped, and the
│                      output written under submissions/<handle>/<tag>/
│                      instead of docs/BENCHMARKS/runs/. Supports all
│                      six suite phases via --phases, plus 'all'.
├── prompts.py         Verbatim copy of the canonical 3 frozen prompts
│                      (P-EASY/P-MEDIUM/P-HARD) from
│                      WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt   Empty by intent (stdlib-only); placeholder for
│                      pip-tools / agent auto-install patterns.
├── .gitignore         __pycache__/ etc.
└── suites/            Six bundled JSON suites copied verbatim from
                       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
                       small_model_eval_questions.json,
                       python_task_suite_questions.json,
                       parallel_qwen_same_model_20q_suite.json,
                       parallel_qwen_mixed_model_20q_suite.json,
                       python_context_edge_append_questions.json,
                       python_context_edge_suite_only.json.

submissions/
  README.md
    Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
    Synthetic-but-shape-complete contribution template: manifest.json,
    hardware.json, run.jsonl (5 example lines), metadata.json, run.md
    (with privacy attestation, methodology deviations, reproducibility
    command). Marked as synthetic at the top so future analysis doesn't
    accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated. Maintainer
  credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
  paths / tokens. Two prompts contain project names ("MyBoard" auth
  debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03); flagged
  in chat for Sloba's review. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
  IPs leaked).
- manifest.json captures host_hostname_short via
  socket.gethostname().split('.')[0] - agent should review before PR if
  hostname is sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
2026-05-06 11:07:55 +02:00
parent ddc9626136
commit 97a9245d9e
22 changed files with 4400 additions and 47 deletions
--- a/harness/suites/parallel_qwen_mixed_model_20q_suite.json
+++ b/harness/suites/parallel_qwen_mixed_model_20q_suite.json
@@ -0,0 +1,363 @@
+{
+  "generated_at": "2026-04-11T19:00:02Z",
+  "suite_name": "parallel-qwen-mixed-model-20q-v1",
+  "version": "1.0",
+  "purpose": "Run the shared 20-question Python benchmark in two-question batches against qwen size pairs. Within each batch the first model answers the odd-numbered question and the second model answers the even-numbered question, while Ollama keeps two models loaded with two parallel request slots and a 32K request context.",
+  "run_mode": "mixed_model_pairs",
+  "question_batch_size": 2,
+  "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
+  "ollama_runtime": {
+    "max_loaded_models": 2,
+    "num_parallel": 2,
+    "num_ctx": 32768,
+    "keep_alive": "24h"
+  },
+  "lanes": [
+    {
+      "lane_id": "qwen2_5_coder_0_5b__qwen2_5_coder_3b",
+      "display_name": "Qwen2.5 Coder 0.5B plus Qwen2.5 Coder 3B",
+      "kind": "mixed_model",
+      "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
+      "models": [
+        {
+          "role": "odd_questions",
+          "model": "qwen2.5-coder:0.5b",
+          "display_name": "Qwen2.5 Coder 0.5B",
+          "family": "qwen2.5-coder",
+          "size_label": "0.5B",
+          "max_context_tokens": 32768,
+          "requested_num_ctx": 32768,
+          "mode": "mixed_model"
+        },
+        {
+          "role": "even_questions",
+          "model": "qwen2.5-coder:3b",
+          "display_name": "Qwen2.5 Coder 3B",
+          "family": "qwen2.5-coder",
+          "size_label": "3B",
+          "max_context_tokens": 32768,
+          "requested_num_ctx": 32768,
+          "mode": "mixed_model"
+        }
+      ]
+    },
+    {
+      "lane_id": "qwen2_5_coder_0_5b__qwen2_5_coder_1_5b",
+      "display_name": "Qwen2.5 Coder 0.5B plus Qwen2.5 Coder 1.5B",
+      "kind": "mixed_model",
+      "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
+      "models": [
+        {
+          "role": "odd_questions",
+          "model": "qwen2.5-coder:0.5b",
+          "display_name": "Qwen2.5 Coder 0.5B",
+          "family": "qwen2.5-coder",
+          "size_label": "0.5B",
+          "max_context_tokens": 32768,
+          "requested_num_ctx": 32768,
+          "mode": "mixed_model"
+        },
+        {
+          "role": "even_questions",
+          "model": "qwen2.5-coder:1.5b",
+          "display_name": "Qwen2.5 Coder 1.5B",
+          "family": "qwen2.5-coder",
+          "size_label": "1.5B",
+          "max_context_tokens": 32768,
+          "requested_num_ctx": 32768,
+          "mode": "mixed_model"
+        }
+      ]
+    },
+    {
+      "lane_id": "qwen2_5_coder_1_5b__qwen2_5_coder_3b",
+      "display_name": "Qwen2.5 Coder 1.5B plus Qwen2.5 Coder 3B",
+      "kind": "mixed_model",
+      "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
+      "models": [
+        {
+          "role": "odd_questions",
+          "model": "qwen2.5-coder:1.5b",
+          "display_name": "Qwen2.5 Coder 1.5B",
+          "family": "qwen2.5-coder",
+          "size_label": "1.5B",
+          "max_context_tokens": 32768,
+          "requested_num_ctx": 32768,
+          "mode": "mixed_model"
+        },
+        {
+          "role": "even_questions",
+          "model": "qwen2.5-coder:3b",
+          "display_name": "Qwen2.5 Coder 3B",
+          "family": "qwen2.5-coder",
+          "size_label": "3B",
+          "max_context_tokens": 32768,
+          "requested_num_ctx": 32768,
+          "mode": "mixed_model"
+        }
+      ]
+    }
+  ],
+  "questions": [
+    {
+      "id": "py_csv_parse",
+      "title": "CSV Parser",
+      "category": "parsing",
+      "prompt": "Return only Python code. Write a function that reads CSV text, skips blank lines, and returns a list of dicts keyed by the header row.",
+      "required_markers": [
+        "csv",
+        "Dict",
+        "reader",
+        "header"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_file_scan",
+      "title": "File Scanner",
+      "category": "file_io",
+      "prompt": "Return only Python code. Write a script that walks a directory tree and prints the paths of files larger than 5 MB.",
+      "required_markers": [
+        "os.walk",
+        "5 * 1024 * 1024",
+        "print",
+        "path"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_cli_args",
+      "title": "CLI Arguments",
+      "category": "cli",
+      "prompt": "Return only Python code. Build a small argparse CLI with one required path argument and one optional verbose flag.",
+      "required_markers": [
+        "argparse",
+        "--verbose",
+        "ArgumentParser",
+        "path"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_typing_dataclass",
+      "title": "Typed Dataclass",
+      "category": "typing",
+      "prompt": "Return only Python code. Define a typed dataclass for a job record with id, name, created_at, and is_active fields.",
+      "required_markers": [
+        "@dataclass",
+        "created_at",
+        "is_active",
+        "str"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_pytest_fixture",
+      "title": "Pytest Fixture",
+      "category": "tests",
+      "prompt": "Return only Python code. Write a pytest fixture and one test that uses it to verify a function converting Celsius to Fahrenheit.",
+      "required_markers": [
+        "@pytest.fixture",
+        "def test_",
+        "assert",
+        "fahrenheit"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_async_fetch",
+      "title": "Async Fetch",
+      "category": "async",
+      "prompt": "Return only Python code. Write an async function that fetches two URLs concurrently with asyncio.gather and returns both bodies.",
+      "required_markers": [
+        "async def",
+        "asyncio.gather",
+        "await",
+        "aiohttp"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_http_retry",
+      "title": "HTTP Retry",
+      "category": "http",
+      "prompt": "Return only Python code. Write a requests wrapper that retries HTTP 429 with exponential backoff and a maximum attempt count.",
+      "required_markers": [
+        "requests",
+        "429",
+        "backoff",
+        "max_attempts"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_json_validate",
+      "title": "JSON Validation",
+      "category": "validation",
+      "prompt": "Return only Python code. Validate a JSON object against a schema and raise ValueError when required keys are missing.",
+      "required_markers": [
+        "jsonschema",
+        "ValueError",
+        "required",
+        "schema"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_sqlite_store",
+      "title": "SQLite Store",
+      "category": "sqlite",
+      "prompt": "Return only Python code. Create a SQLite table for events and write a function that inserts one event row safely.",
+      "required_markers": [
+        "sqlite3",
+        "CREATE TABLE",
+        "INSERT INTO",
+        "commit"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_fastapi_handler",
+      "title": "FastAPI Handler",
+      "category": "web",
+      "prompt": "Return only Python code. Write a FastAPI route that returns a JSON health response with status and version fields.",
+      "required_markers": [
+        "FastAPI",
+        "@app.get",
+        "status",
+        "version"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_config_dataclass",
+      "title": "Config Dataclass",
+      "category": "config",
+      "prompt": "Return only Python code. Build a dataclass-based config loader that reads environment variables and supplies defaults.",
+      "required_markers": [
+        "dataclass",
+        "os.environ",
+        "default",
+        "load"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_logging_setup",
+      "title": "Logging Setup",
+      "category": "logging",
+      "prompt": "Return only Python code. Configure structured logging with a timestamped formatter and a reusable setup function.",
+      "required_markers": [
+        "logging",
+        "Formatter",
+        "timestamp",
+        "basicConfig"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_thread_pool",
+      "title": "Thread Pool",
+      "category": "concurrency",
+      "prompt": "Return only Python code. Use concurrent.futures to run a small CPU-bound function across a list of inputs and collect results.",
+      "required_markers": [
+        "concurrent.futures",
+        "ThreadPoolExecutor",
+        "map",
+        "results"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_package_layout",
+      "title": "Package Layout",
+      "category": "package",
+      "prompt": "Return only Python code. Show a minimal package layout with __init__.py and a helper module that can be imported from tests.",
+      "required_markers": [
+        "__init__.py",
+        "import",
+        "helper",
+        "tests"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_debug_stacktrace",
+      "title": "Debug Stacktrace",
+      "category": "debugging",
+      "prompt": "Return only Python code. Fix a function that crashes on None input by adding an early return and a clear exception message.",
+      "required_markers": [
+        "None",
+        "return",
+        "raise",
+        "message"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_refactor_split",
+      "title": "Refactor Split",
+      "category": "refactor",
+      "prompt": "Return only Python code. Refactor a large function into two smaller helpers while preserving behavior.",
+      "required_markers": [
+        "def",
+        "helper",
+        "return",
+        "preserve"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_csv_summary",
+      "title": "CSV Summary",
+      "category": "analysis",
+      "prompt": "Return only Python code. Read a CSV file and produce a summary with row count and a count of unique values in one column.",
+      "required_markers": [
+        "csv",
+        "row_count",
+        "unique",
+        "Counter"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_pathlib_clean",
+      "title": "Pathlib Cleaner",
+      "category": "filesystem",
+      "prompt": "Return only Python code. Use pathlib to remove empty files from a directory tree and print each deleted path.",
+      "required_markers": [
+        "pathlib",
+        "rglob",
+        "unlink",
+        "print"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_pydantic_model",
+      "title": "Pydantic Model",
+      "category": "validation",
+      "prompt": "Return only Python code. Define a Pydantic model for a user profile with email validation and an age field.",
+      "required_markers": [
+        "BaseModel",
+        "EmailStr",
+        "age",
+        "validation"
+      ],
+      "format_rule": "python_code"
+    },
+    {
+      "id": "py_regex_log_parser",
+      "title": "Regex Log Parser",
+      "category": "parsing",
+      "prompt": "Return only Python code. Parse web server log lines with regex and return a list of status codes and request paths.",
+      "required_markers": [
+        "re",
+        "status",
+        "path",
+        "findall"
+      ],
+      "format_rule": "python_code"
+    }
+  ]
+}
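The suite's "purpose" and "question_assignment" fields describe two-question batches in which the first model of a lane takes the odd-numbered question and the second model takes the even-numbered one. The sketch below shows one way that rule could be read, assuming questions are numbered from 1 in file order and that the "role" field of each model entry selects its half of the batch. The function name `plan_lane` and the trimmed `lane` / `questions` data are illustrative only; this is not the logic in run_benchmark.py, which is not shown in this diff.

```python
from typing import Dict, List, Tuple


def plan_lane(lane: Dict, questions: List[Dict],
              batch_size: int = 2) -> List[Tuple[int, str, str]]:
    """Return (batch_number, question_id, model_tag) tuples for one lane.

    Assumption: questions are numbered from 1 in file order; odd-numbered
    questions go to the model with role "odd_questions", even-numbered
    questions to the model with role "even_questions".
    """
    by_role = {m["role"]: m["model"] for m in lane["models"]}
    plan = []
    for number, question in enumerate(questions, start=1):
        role = "odd_questions" if number % 2 == 1 else "even_questions"
        batch = (number - 1) // batch_size + 1  # batches are 1-based
        plan.append((batch, question["id"], by_role[role]))
    return plan


# Trimmed stand-in for the suite file: one lane and the first four question ids.
lane = {
    "lane_id": "qwen2_5_coder_0_5b__qwen2_5_coder_3b",
    "models": [
        {"role": "odd_questions", "model": "qwen2.5-coder:0.5b"},
        {"role": "even_questions", "model": "qwen2.5-coder:3b"},
    ],
}
questions = [
    {"id": "py_csv_parse"},
    {"id": "py_file_scan"},
    {"id": "py_cli_args"},
    {"id": "py_typing_dataclass"},
]

plan = plan_lane(lane, questions)
for batch, question_id, model in plan:
    print(batch, question_id, model)
```

With the full file, the same function would be called once per entry in "lanes" against the 20 entries in "questions", giving 10 batches per lane when "question_batch_size" is 2.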