feat: harness + agent runbook — flip repo from archive-only to
crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
├── README.md          Technical readme for the harness folder
├── run_benchmark.py   ~520 LOC runner. Stdlib-only. Adapted from
│                      WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│                      v3 with the cluster-internal IP defaults
│                      (10.8.0.x) replaced by 127.0.0.1:11434, the
│                      cluster /v1/cluster/* endpoints removed, the
│                      canonical-suite paths under ~/Documents/MyServers
│                      replaced by harness/suites/ paths, the git-sha
│                      enforcement on WeeyugaWeb dropped, and the
│                      output written under submissions/<handle>/<tag>/
│                      instead of docs/BENCHMARKS/runs/. Supports all
│                      six suite phases via --phases, plus 'all'.
├── prompts.py         Verbatim copy of the canonical 3 frozen prompts
│                      (P-EASY/P-MEDIUM/P-HARD) from
│                      WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt   Empty by intent (stdlib-only); placeholder for
│                      pip-tools / agent auto-install patterns.
├── .gitignore         __pycache__/ etc.
└── suites/            Six bundled JSON suites copied verbatim from
                       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
                         small_model_eval_questions.json,
                         python_task_suite_questions.json,
                         parallel_qwen_same_model_20q_suite.json,
                         parallel_qwen_mixed_model_20q_suite.json,
                         python_context_edge_append_questions.json,
                         python_context_edge_suite_only.json.
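
As a rough illustration of how a bundled suite might be inspected before a run, here is a minimal stdlib-only sketch. The function name `summarize_suite` is hypothetical and not part of run_benchmark.py; the keys it reads (`suite_name`, `lanes`, `questions`) are the ones visible in parallel_qwen_mixed_model_20q_suite.json, and the other five suites may be shaped differently.

```python
import json
from pathlib import Path


def summarize_suite(path):
    """Load one suite file and return a small summary dict.

    Hypothetical helper: assumes the top-level keys seen in the
    mixed-model 20Q suite; missing keys fall back to empty values.
    """
    suite = json.loads(Path(path).read_text(encoding="utf-8"))
    return {
        "suite_name": suite.get("suite_name"),
        "lane_ids": [lane.get("lane_id") for lane in suite.get("lanes", [])],
        "question_count": len(suite.get("questions", [])),
    }
```

Calling it on harness/suites/parallel_qwen_mixed_model_20q_suite.json should report three lanes and twenty questions, which is a quick sanity check that the copy is intact.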

submissions/
  README.md   Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
              Synthetic-but-shape-complete contribution template:
              manifest.json, hardware.json, run.jsonl (5 example lines),
              metadata.json, run.md (with privacy attestation, methodology
              deviations, reproducibility command). Marked as synthetic at
              the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
  paths / tokens. Two prompts were flagged in chat for Sloba's review:
  one names a project ("MyBoard" auth debugging in 20Q-Q14), the other
  is generic SSH troubleshooting (5Q-Q03). Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
  IPs leaked).
- manifest.json captures host_hostname_short via
  socket.gethostname().split('.')[0]; the agent should review it before
  the PR if the hostname is sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.
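
CLAUDE.md §8 is not reproduced in this commit message, so the following is only a sketch of what a pre-push privacy scan could look like. The pattern set and the names `PRIVATE_PATTERNS`, `scan_text`, and `short_hostname` are assumptions for illustration; only the hostname expression is taken from the text above.

```python
import re
import socket

# Hypothetical patterns; the real checklist in CLAUDE.md §8 may differ.
PRIVATE_PATTERNS = {
    # RFC 1918 ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
    "private_ipv4": re.compile(
        r"\b(?:10\.\d{1,3}|192\.168|172\.(?:1[6-9]|2\d|3[01]))\.\d{1,3}\.\d{1,3}\b"
    ),
    # Absolute home-directory paths on Linux and macOS
    "home_path": re.compile(r"(?:/home/|/Users/)[^\s\"']+"),
}


def scan_text(text):
    """Return (line_number, pattern_name, matched_text) for each hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PRIVATE_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((lineno, name, match.group(0)))
    return hits


def short_hostname():
    """Same expression the commit message gives for host_hostname_short."""
    return socket.gethostname().split(".")[0]
```

A loopback address such as 127.0.0.1 is deliberately not flagged, since it is the harness default and reveals nothing about the contributor's network.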

Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid
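
The JSON and JSONL checks can be reproduced with the standard library alone. This is a sketch under two assumptions: files are UTF-8, and "valid" means "parses as JSON" rather than "matches a schema". Both function names are hypothetical.

```python
import json
from pathlib import Path


def json_file_is_valid(path):
    """True if the whole file parses as a single JSON document."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):  # JSONDecodeError is a ValueError
        return False
    return True


def jsonl_invalid_lines(path):
    """Return the 1-based numbers of non-blank lines that fail to parse."""
    bad = []
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines between records
            try:
                json.loads(line)
            except ValueError:
                bad.append(lineno)
    return bad
```

An empty list from `jsonl_invalid_lines` corresponds to the "run.jsonl: valid" line in the checklist.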

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
363
harness/suites/parallel_qwen_mixed_model_20q_suite.json
Normal file
@@ -0,0 +1,363 @@
{
  "generated_at": "2026-04-11T19:00:02Z",
  "suite_name": "parallel-qwen-mixed-model-20q-v1",
  "version": "1.0",
  "purpose": "Run the shared 20-question Python benchmark in two-question batches against qwen size pairs. Within each batch the first model answers the odd-numbered question and the second model answers the even-numbered question, while Ollama keeps two models loaded with two parallel request slots and a 32K request context.",
  "run_mode": "mixed_model_pairs",
  "question_batch_size": 2,
  "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
  "ollama_runtime": {
    "max_loaded_models": 2,
    "num_parallel": 2,
    "num_ctx": 32768,
    "keep_alive": "24h"
  },
  "lanes": [
    {
      "lane_id": "qwen2_5_coder_0_5b__qwen2_5_coder_3b",
      "display_name": "Qwen2.5 Coder 0.5B plus Qwen2.5 Coder 3B",
      "kind": "mixed_model",
      "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
      "models": [
        {
          "role": "odd_questions",
          "model": "qwen2.5-coder:0.5b",
          "display_name": "Qwen2.5 Coder 0.5B",
          "family": "qwen2.5-coder",
          "size_label": "0.5B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        },
        {
          "role": "even_questions",
          "model": "qwen2.5-coder:3b",
          "display_name": "Qwen2.5 Coder 3B",
          "family": "qwen2.5-coder",
          "size_label": "3B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        }
      ]
    },
    {
      "lane_id": "qwen2_5_coder_0_5b__qwen2_5_coder_1_5b",
      "display_name": "Qwen2.5 Coder 0.5B plus Qwen2.5 Coder 1.5B",
      "kind": "mixed_model",
      "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
      "models": [
        {
          "role": "odd_questions",
          "model": "qwen2.5-coder:0.5b",
          "display_name": "Qwen2.5 Coder 0.5B",
          "family": "qwen2.5-coder",
          "size_label": "0.5B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        },
        {
          "role": "even_questions",
          "model": "qwen2.5-coder:1.5b",
          "display_name": "Qwen2.5 Coder 1.5B",
          "family": "qwen2.5-coder",
          "size_label": "1.5B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        }
      ]
    },
    {
      "lane_id": "qwen2_5_coder_1_5b__qwen2_5_coder_3b",
      "display_name": "Qwen2.5 Coder 1.5B plus Qwen2.5 Coder 3B",
      "kind": "mixed_model",
      "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
      "models": [
        {
          "role": "odd_questions",
          "model": "qwen2.5-coder:1.5b",
          "display_name": "Qwen2.5 Coder 1.5B",
          "family": "qwen2.5-coder",
          "size_label": "1.5B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        },
        {
          "role": "even_questions",
          "model": "qwen2.5-coder:3b",
          "display_name": "Qwen2.5 Coder 3B",
          "family": "qwen2.5-coder",
          "size_label": "3B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        }
      ]
    }
  ],
  "questions": [
    {
      "id": "py_csv_parse",
      "title": "CSV Parser",
      "category": "parsing",
      "prompt": "Return only Python code. Write a function that reads CSV text, skips blank lines, and returns a list of dicts keyed by the header row.",
      "required_markers": [
        "csv",
        "Dict",
        "reader",
        "header"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_file_scan",
      "title": "File Scanner",
      "category": "file_io",
      "prompt": "Return only Python code. Write a script that walks a directory tree and prints the paths of files larger than 5 MB.",
      "required_markers": [
        "os.walk",
        "5 * 1024 * 1024",
        "print",
        "path"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_cli_args",
      "title": "CLI Arguments",
      "category": "cli",
      "prompt": "Return only Python code. Build a small argparse CLI with one required path argument and one optional verbose flag.",
      "required_markers": [
        "argparse",
        "--verbose",
        "ArgumentParser",
        "path"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_typing_dataclass",
      "title": "Typed Dataclass",
      "category": "typing",
      "prompt": "Return only Python code. Define a typed dataclass for a job record with id, name, created_at, and is_active fields.",
      "required_markers": [
        "@dataclass",
        "created_at",
        "is_active",
        "str"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_pytest_fixture",
      "title": "Pytest Fixture",
      "category": "tests",
      "prompt": "Return only Python code. Write a pytest fixture and one test that uses it to verify a function converting Celsius to Fahrenheit.",
      "required_markers": [
        "@pytest.fixture",
        "def test_",
        "assert",
        "fahrenheit"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_async_fetch",
      "title": "Async Fetch",
      "category": "async",
      "prompt": "Return only Python code. Write an async function that fetches two URLs concurrently with asyncio.gather and returns both bodies.",
      "required_markers": [
        "async def",
        "asyncio.gather",
        "await",
        "aiohttp"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_http_retry",
      "title": "HTTP Retry",
      "category": "http",
      "prompt": "Return only Python code. Write a requests wrapper that retries HTTP 429 with exponential backoff and a maximum attempt count.",
      "required_markers": [
        "requests",
        "429",
        "backoff",
        "max_attempts"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_json_validate",
      "title": "JSON Validation",
      "category": "validation",
      "prompt": "Return only Python code. Validate a JSON object against a schema and raise ValueError when required keys are missing.",
      "required_markers": [
        "jsonschema",
        "ValueError",
        "required",
        "schema"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_sqlite_store",
      "title": "SQLite Store",
      "category": "sqlite",
      "prompt": "Return only Python code. Create a SQLite table for events and write a function that inserts one event row safely.",
      "required_markers": [
        "sqlite3",
        "CREATE TABLE",
        "INSERT INTO",
        "commit"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_fastapi_handler",
      "title": "FastAPI Handler",
      "category": "web",
      "prompt": "Return only Python code. Write a FastAPI route that returns a JSON health response with status and version fields.",
      "required_markers": [
        "FastAPI",
        "@app.get",
        "status",
        "version"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_config_dataclass",
      "title": "Config Dataclass",
      "category": "config",
      "prompt": "Return only Python code. Build a dataclass-based config loader that reads environment variables and supplies defaults.",
      "required_markers": [
        "dataclass",
        "os.environ",
        "default",
        "load"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_logging_setup",
      "title": "Logging Setup",
      "category": "logging",
      "prompt": "Return only Python code. Configure structured logging with a timestamped formatter and a reusable setup function.",
      "required_markers": [
        "logging",
        "Formatter",
        "timestamp",
        "basicConfig"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_thread_pool",
      "title": "Thread Pool",
      "category": "concurrency",
      "prompt": "Return only Python code. Use concurrent.futures to run a small CPU-bound function across a list of inputs and collect results.",
      "required_markers": [
        "concurrent.futures",
        "ThreadPoolExecutor",
        "map",
        "results"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_package_layout",
      "title": "Package Layout",
      "category": "package",
      "prompt": "Return only Python code. Show a minimal package layout with __init__.py and a helper module that can be imported from tests.",
      "required_markers": [
        "__init__.py",
        "import",
        "helper",
        "tests"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_debug_stacktrace",
      "title": "Debug Stacktrace",
      "category": "debugging",
      "prompt": "Return only Python code. Fix a function that crashes on None input by adding an early return and a clear exception message.",
      "required_markers": [
        "None",
        "return",
        "raise",
        "message"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_refactor_split",
      "title": "Refactor Split",
      "category": "refactor",
      "prompt": "Return only Python code. Refactor a large function into two smaller helpers while preserving behavior.",
      "required_markers": [
        "def",
        "helper",
        "return",
        "preserve"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_csv_summary",
      "title": "CSV Summary",
      "category": "analysis",
      "prompt": "Return only Python code. Read a CSV file and produce a summary with row count and a count of unique values in one column.",
      "required_markers": [
        "csv",
        "row_count",
        "unique",
        "Counter"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_pathlib_clean",
      "title": "Pathlib Cleaner",
      "category": "filesystem",
      "prompt": "Return only Python code. Use pathlib to remove empty files from a directory tree and print each deleted path.",
      "required_markers": [
        "pathlib",
        "rglob",
        "unlink",
        "print"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_pydantic_model",
      "title": "Pydantic Model",
      "category": "validation",
      "prompt": "Return only Python code. Define a Pydantic model for a user profile with email validation and an age field.",
      "required_markers": [
        "BaseModel",
        "EmailStr",
        "age",
        "validation"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_regex_log_parser",
      "title": "Regex Log Parser",
      "category": "parsing",
      "prompt": "Return only Python code. Parse web server log lines with regex and return a list of status codes and request paths.",
      "required_markers": [
        "re",
        "status",
        "path",
        "findall"
      ],
      "format_rule": "python_code"
    }
  ]
}
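
To make the suite's `question_assignment` and `required_markers` fields concrete, here is a minimal sketch of how a runner might interpret them. It is not taken from run_benchmark.py: the function names are hypothetical, questions are assumed to be numbered from 1 in file order, and a marker is assumed to count when it appears verbatim (case-sensitive) in the answer. The real harness may score differently.

```python
def assign_questions(question_ids, models, batch_size=2):
    """Group questions into batches of (question_id, model) pairs.

    Assumed reading of "odd_questions_to_first_model_even_questions_to_
    second_model": with 1-based numbering, odd-numbered questions go to
    models[0] and even-numbered questions go to models[1].
    """
    first, second = models
    batches = []
    for start in range(0, len(question_ids), batch_size):
        batch = []
        for offset, qid in enumerate(question_ids[start:start + batch_size]):
            number = start + offset + 1  # 1-based question number
            batch.append((qid, first if number % 2 == 1 else second))
        batches.append(batch)
    return batches


def missing_markers(answer, required_markers):
    """Assumed scoring rule: plain case-sensitive substring match."""
    return [marker for marker in required_markers if marker not in answer]
```

With the default batch size of 2, each batch holds one question per model, which lines up with the two parallel request slots and two loaded models in `ollama_runtime`.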