feat: harness + agent runbook — flip repo from archive-only to
crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.
What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
    Full agent runbook: hardware probe, runtime + model selection,
    canonical knob reference (Sloba's Pavilion methodology values),
    hardware-adaptation decision rules, run-instructions, output-schema
    templates for hardware.json + metadata.json + run.md, PR submission
    flow (fork → branch → push → PR; nothing auto-merges), privacy
    guardrails, methodology lineage. Per Sloba's Q3 directive: the
    runbook explicitly tells the friend's agent to ADAPT to hardware
    reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
    Human-readable companion for the friend (not the agent). What you
    need, how it works, what we ask, what maintainers commit to,
    license, code-of-conduct short version.

harness/
├── README.md         Technical readme for the harness folder
├── run_benchmark.py  ~520 LOC runner. Stdlib-only. Adapted from
│                     WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
│                     v3 with the cluster-internal IP defaults
│                     (10.8.0.x) replaced by 127.0.0.1:11434, the
│                     cluster /v1/cluster/* endpoints removed, the
│                     canonical-suite paths under ~/Documents/MyServers
│                     replaced by harness/suites/ paths, the git-sha
│                     enforcement on WeeyugaWeb dropped, and the
│                     output written under submissions/<handle>/<tag>/
│                     instead of docs/BENCHMARKS/runs/. Supports all
│                     six suite phases via --phases, plus 'all'.
├── prompts.py        Verbatim copy of the canonical 3 frozen prompts
│                     (P-EASY/P-MEDIUM/P-HARD) from
│                     WeeyugaWeb/scripts/benchmarks/prompts.py.
├── requirements.txt  Empty by intent (stdlib-only); placeholder for
│                     pip-tools / agent auto-install patterns.
├── .gitignore        __pycache__/ etc.
└── suites/           Six bundled JSON suites copied verbatim from
                      Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
                      small_model_eval_questions.json,
                      python_task_suite_questions.json,
                      parallel_qwen_same_model_20q_suite.json,
                      parallel_qwen_mixed_model_20q_suite.json,
                      python_context_edge_append_questions.json,
                      python_context_edge_suite_only.json.

submissions/
    README.md  Folder convention + naming + reviewability rules
    EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
        Synthetic-but-shape-complete contribution template:
        manifest.json, hardware.json, run.jsonl (5 example lines),
        metadata.json, run.md (with privacy attestation, methodology
        deviations, reproducibility command). Marked as synthetic at
        the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
    MIT for harness/*.py and future helper code. Existing LICENSE
    (CC-BY-4.0) covers data files.

README.md (modified)
    Updated to reflect dual purpose. Layout diagram updated.
    Maintainer credits: Ben for catalogue/methodology + Bane for harness.
    Contributor quick-start added. Status table extended.
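
For orientation, here is a minimal stdlib-only sketch of how a runner could address the default local target named above. It is not taken from run_benchmark.py: the function names are hypothetical, and the use of Ollama's /api/generate endpoint is an assumption. The num_ctx and keep_alive defaults mirror the ollama_runtime values in the bundled parallel suites.

```python
import json
import urllib.request

# Default target from the commit message; the rest of this sketch is illustrative.
DEFAULT_TARGET_URL = "http://127.0.0.1:11434"


def build_generate_request(model, prompt, num_ctx=32768, keep_alive="24h",
                           target_url=DEFAULT_TARGET_URL):
    """Build (but do not send) a non-streaming Ollama /api/generate request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,            # ask for one JSON body instead of a stream
        "keep_alive": keep_alive,   # how long Ollama keeps the model loaded
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        f"{target_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send a prepared request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending means the payload can be inspected or logged without a running Ollama instance.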
Privacy posture:
- All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
  paths / tokens. Two prompts contain project names ("MyBoard" auth
  debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
  flagged in chat for Sloba's review. Otherwise clean.
- run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
  IPs leaked).
- manifest.json captures host_hostname_short via
  socket.gethostname().split('.')[0]; the agent should review it before
  opening the PR if the hostname is sensitive.
- CLAUDE.md §8 spells out the privacy-grep before push.
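
A pre-push privacy scan along these lines could look roughly like the sketch below. The patterns and names are hypothetical; the actual privacy-grep is the one described in CLAUDE.md §8, which is not reproduced here.

```python
import re

# Hypothetical patterns for illustration only; extend to match your own setup.
PATTERNS = {
    "private_ipv4_10": re.compile(r"\b10\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"),
    "private_ipv4_192": re.compile(r"\b192\.168\.\d{1,3}\.\d{1,3}\b"),
    "home_path": re.compile(r"(?:/home/|/Users/)[^\s\"']+"),
}


def scan_text(text):
    """Return a list of (pattern_name, matched_text) for every hit in text."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits
```

An empty result is only as reassuring as the pattern list, so a manual read of manifest.json and run.md before pushing is still worthwhile.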
Verification:
- py_compile run_benchmark.py: OK
- --help renders cleanly
- All 6 suite JSON files: valid
- All 4 example JSON files: valid
- Example run.jsonl (5 lines): valid
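
The JSON and JSONL validity checks above can be reproduced with a few lines of stdlib Python. This is a sketch under the assumption that "valid" means "parses with json.loads"; the helper names are illustrative and not part of the harness.

```python
import json


def validate_json_text(text):
    """Return None if text parses as JSON, otherwise the parser's error message."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


def validate_jsonl_text(text):
    """Return (line_number, error) for each non-blank line that fails to parse."""
    errors = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped rather than reported
        error = validate_json_text(line)
        if error is not None:
            errors.append((number, error))
    return errors
```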
This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
363
harness/suites/parallel_qwen_mixed_model_20q_suite.json
Normal file
@@ -0,0 +1,363 @@
{
  "generated_at": "2026-04-11T19:00:02Z",
  "suite_name": "parallel-qwen-mixed-model-20q-v1",
  "version": "1.0",
  "purpose": "Run the shared 20-question Python benchmark in two-question batches against qwen size pairs. Within each batch the first model answers the odd-numbered question and the second model answers the even-numbered question, while Ollama keeps two models loaded with two parallel request slots and a 32K request context.",
  "run_mode": "mixed_model_pairs",
  "question_batch_size": 2,
  "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
  "ollama_runtime": {
    "max_loaded_models": 2,
    "num_parallel": 2,
    "num_ctx": 32768,
    "keep_alive": "24h"
  },
  "lanes": [
    {
      "lane_id": "qwen2_5_coder_0_5b__qwen2_5_coder_3b",
      "display_name": "Qwen2.5 Coder 0.5B plus Qwen2.5 Coder 3B",
      "kind": "mixed_model",
      "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
      "models": [
        {
          "role": "odd_questions",
          "model": "qwen2.5-coder:0.5b",
          "display_name": "Qwen2.5 Coder 0.5B",
          "family": "qwen2.5-coder",
          "size_label": "0.5B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        },
        {
          "role": "even_questions",
          "model": "qwen2.5-coder:3b",
          "display_name": "Qwen2.5 Coder 3B",
          "family": "qwen2.5-coder",
          "size_label": "3B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        }
      ]
    },
    {
      "lane_id": "qwen2_5_coder_0_5b__qwen2_5_coder_1_5b",
      "display_name": "Qwen2.5 Coder 0.5B plus Qwen2.5 Coder 1.5B",
      "kind": "mixed_model",
      "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
      "models": [
        {
          "role": "odd_questions",
          "model": "qwen2.5-coder:0.5b",
          "display_name": "Qwen2.5 Coder 0.5B",
          "family": "qwen2.5-coder",
          "size_label": "0.5B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        },
        {
          "role": "even_questions",
          "model": "qwen2.5-coder:1.5b",
          "display_name": "Qwen2.5 Coder 1.5B",
          "family": "qwen2.5-coder",
          "size_label": "1.5B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        }
      ]
    },
    {
      "lane_id": "qwen2_5_coder_1_5b__qwen2_5_coder_3b",
      "display_name": "Qwen2.5 Coder 1.5B plus Qwen2.5 Coder 3B",
      "kind": "mixed_model",
      "question_assignment": "odd_questions_to_first_model_even_questions_to_second_model",
      "models": [
        {
          "role": "odd_questions",
          "model": "qwen2.5-coder:1.5b",
          "display_name": "Qwen2.5 Coder 1.5B",
          "family": "qwen2.5-coder",
          "size_label": "1.5B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        },
        {
          "role": "even_questions",
          "model": "qwen2.5-coder:3b",
          "display_name": "Qwen2.5 Coder 3B",
          "family": "qwen2.5-coder",
          "size_label": "3B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "mixed_model"
        }
      ]
    }
  ],
  "questions": [
    {
      "id": "py_csv_parse",
      "title": "CSV Parser",
      "category": "parsing",
      "prompt": "Return only Python code. Write a function that reads CSV text, skips blank lines, and returns a list of dicts keyed by the header row.",
      "required_markers": [
        "csv",
        "Dict",
        "reader",
        "header"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_file_scan",
      "title": "File Scanner",
      "category": "file_io",
      "prompt": "Return only Python code. Write a script that walks a directory tree and prints the paths of files larger than 5 MB.",
      "required_markers": [
        "os.walk",
        "5 * 1024 * 1024",
        "print",
        "path"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_cli_args",
      "title": "CLI Arguments",
      "category": "cli",
      "prompt": "Return only Python code. Build a small argparse CLI with one required path argument and one optional verbose flag.",
      "required_markers": [
        "argparse",
        "--verbose",
        "ArgumentParser",
        "path"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_typing_dataclass",
      "title": "Typed Dataclass",
      "category": "typing",
      "prompt": "Return only Python code. Define a typed dataclass for a job record with id, name, created_at, and is_active fields.",
      "required_markers": [
        "@dataclass",
        "created_at",
        "is_active",
        "str"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_pytest_fixture",
      "title": "Pytest Fixture",
      "category": "tests",
      "prompt": "Return only Python code. Write a pytest fixture and one test that uses it to verify a function converting Celsius to Fahrenheit.",
      "required_markers": [
        "@pytest.fixture",
        "def test_",
        "assert",
        "fahrenheit"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_async_fetch",
      "title": "Async Fetch",
      "category": "async",
      "prompt": "Return only Python code. Write an async function that fetches two URLs concurrently with asyncio.gather and returns both bodies.",
      "required_markers": [
        "async def",
        "asyncio.gather",
        "await",
        "aiohttp"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_http_retry",
      "title": "HTTP Retry",
      "category": "http",
      "prompt": "Return only Python code. Write a requests wrapper that retries HTTP 429 with exponential backoff and a maximum attempt count.",
      "required_markers": [
        "requests",
        "429",
        "backoff",
        "max_attempts"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_json_validate",
      "title": "JSON Validation",
      "category": "validation",
      "prompt": "Return only Python code. Validate a JSON object against a schema and raise ValueError when required keys are missing.",
      "required_markers": [
        "jsonschema",
        "ValueError",
        "required",
        "schema"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_sqlite_store",
      "title": "SQLite Store",
      "category": "sqlite",
      "prompt": "Return only Python code. Create a SQLite table for events and write a function that inserts one event row safely.",
      "required_markers": [
        "sqlite3",
        "CREATE TABLE",
        "INSERT INTO",
        "commit"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_fastapi_handler",
      "title": "FastAPI Handler",
      "category": "web",
      "prompt": "Return only Python code. Write a FastAPI route that returns a JSON health response with status and version fields.",
      "required_markers": [
        "FastAPI",
        "@app.get",
        "status",
        "version"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_config_dataclass",
      "title": "Config Dataclass",
      "category": "config",
      "prompt": "Return only Python code. Build a dataclass-based config loader that reads environment variables and supplies defaults.",
      "required_markers": [
        "dataclass",
        "os.environ",
        "default",
        "load"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_logging_setup",
      "title": "Logging Setup",
      "category": "logging",
      "prompt": "Return only Python code. Configure structured logging with a timestamped formatter and a reusable setup function.",
      "required_markers": [
        "logging",
        "Formatter",
        "timestamp",
        "basicConfig"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_thread_pool",
      "title": "Thread Pool",
      "category": "concurrency",
      "prompt": "Return only Python code. Use concurrent.futures to run a small CPU-bound function across a list of inputs and collect results.",
      "required_markers": [
        "concurrent.futures",
        "ThreadPoolExecutor",
        "map",
        "results"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_package_layout",
      "title": "Package Layout",
      "category": "package",
      "prompt": "Return only Python code. Show a minimal package layout with __init__.py and a helper module that can be imported from tests.",
      "required_markers": [
        "__init__.py",
        "import",
        "helper",
        "tests"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_debug_stacktrace",
      "title": "Debug Stacktrace",
      "category": "debugging",
      "prompt": "Return only Python code. Fix a function that crashes on None input by adding an early return and a clear exception message.",
      "required_markers": [
        "None",
        "return",
        "raise",
        "message"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_refactor_split",
      "title": "Refactor Split",
      "category": "refactor",
      "prompt": "Return only Python code. Refactor a large function into two smaller helpers while preserving behavior.",
      "required_markers": [
        "def",
        "helper",
        "return",
        "preserve"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_csv_summary",
      "title": "CSV Summary",
      "category": "analysis",
      "prompt": "Return only Python code. Read a CSV file and produce a summary with row count and a count of unique values in one column.",
      "required_markers": [
        "csv",
        "row_count",
        "unique",
        "Counter"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_pathlib_clean",
      "title": "Pathlib Cleaner",
      "category": "filesystem",
      "prompt": "Return only Python code. Use pathlib to remove empty files from a directory tree and print each deleted path.",
      "required_markers": [
        "pathlib",
        "rglob",
        "unlink",
        "print"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_pydantic_model",
      "title": "Pydantic Model",
      "category": "validation",
      "prompt": "Return only Python code. Define a Pydantic model for a user profile with email validation and an age field.",
      "required_markers": [
        "BaseModel",
        "EmailStr",
        "age",
        "validation"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_regex_log_parser",
      "title": "Regex Log Parser",
      "category": "parsing",
      "prompt": "Return only Python code. Parse web server log lines with regex and return a list of status codes and request paths.",
      "required_markers": [
        "re",
        "status",
        "path",
        "findall"
      ],
      "format_rule": "python_code"
    }
  ]
}
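The question_batch_size and question_assignment fields in the suite above describe a simple scheduling rule. A minimal sketch of that rule follows, assuming 1-based question numbering in file order; the function name is hypothetical and the real runner may differ.

```python
def assign_mixed_model_batches(question_ids, first_model, second_model, batch_size=2):
    """Group questions into batches and pair each with a model.

    Odd-numbered questions (1-based) go to first_model and even-numbered
    questions go to second_model, mirroring the suite's
    "odd_questions_to_first_model_even_questions_to_second_model" rule.
    """
    batches = []
    for start in range(0, len(question_ids), batch_size):
        batch = []
        for offset, question_id in enumerate(question_ids[start:start + batch_size]):
            number = start + offset + 1  # 1-based position in the suite
            model = first_model if number % 2 == 1 else second_model
            batch.append((question_id, model))
        batches.append(batch)
    return batches
```

With a batch size of 2, every batch holds one request per model, which is consistent with the suite keeping two models loaded and two parallel request slots open.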
347
harness/suites/parallel_qwen_same_model_20q_suite.json
Normal file
@@ -0,0 +1,347 @@
{
  "generated_at": "2026-04-11T19:00:02Z",
  "suite_name": "parallel-qwen-same-model-20q-v1",
  "version": "1.0",
  "purpose": "Run the shared 20-question Python benchmark in two-question batches against one model at a time. Questions 1+2 run together, then 3+4, and so on, while Ollama stays on one loaded model with two parallel request slots and a 32K request context.",
  "run_mode": "same_model_pairs",
  "question_batch_size": 2,
  "question_assignment": "same_model_receives_both_questions_in_each_batch",
  "ollama_runtime": {
    "max_loaded_models": 1,
    "num_parallel": 2,
    "num_ctx": 32768,
    "keep_alive": "24h"
  },
  "lanes": [
    {
      "lane_id": "qwen2_5_coder_0_5b_same_model",
      "display_name": "Qwen2.5 Coder 0.5B same-model 2-up",
      "kind": "same_model",
      "models": [
        {
          "role": "shared",
          "model": "qwen2.5-coder:0.5b",
          "display_name": "Qwen2.5 Coder 0.5B",
          "family": "qwen2.5-coder",
          "size_label": "0.5B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "same_model"
        }
      ]
    },
    {
      "lane_id": "qwen2_5_coder_1_5b_same_model",
      "display_name": "Qwen2.5 Coder 1.5B same-model 2-up",
      "kind": "same_model",
      "models": [
        {
          "role": "shared",
          "model": "qwen2.5-coder:1.5b",
          "display_name": "Qwen2.5 Coder 1.5B",
          "family": "qwen2.5-coder",
          "size_label": "1.5B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "same_model"
        }
      ]
    },
    {
      "lane_id": "qwen2_5_coder_3b_same_model",
      "display_name": "Qwen2.5 Coder 3B same-model 2-up",
      "kind": "same_model",
      "models": [
        {
          "role": "shared",
          "model": "qwen2.5-coder:3b",
          "display_name": "Qwen2.5 Coder 3B",
          "family": "qwen2.5-coder",
          "size_label": "3B",
          "max_context_tokens": 32768,
          "requested_num_ctx": 32768,
          "mode": "same_model"
        }
      ]
    },
    {
      "lane_id": "qwen3_5_0_8b_same_model",
      "display_name": "Qwen3.5 0.8B same-model 2-up",
      "kind": "same_model",
      "models": [
        {
          "role": "shared",
          "model": "qwen3.5:0.8b",
          "display_name": "Qwen3.5 0.8B",
          "family": "qwen3.5",
          "size_label": "0.8B",
          "max_context_tokens": 262144,
          "requested_num_ctx": 32768,
          "mode": "same_model"
        }
      ]
    }
  ],
  "questions": [
    {
      "id": "py_csv_parse",
      "title": "CSV Parser",
      "category": "parsing",
      "prompt": "Return only Python code. Write a function that reads CSV text, skips blank lines, and returns a list of dicts keyed by the header row.",
      "required_markers": [
        "csv",
        "Dict",
        "reader",
        "header"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_file_scan",
      "title": "File Scanner",
      "category": "file_io",
      "prompt": "Return only Python code. Write a script that walks a directory tree and prints the paths of files larger than 5 MB.",
      "required_markers": [
        "os.walk",
        "5 * 1024 * 1024",
        "print",
        "path"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_cli_args",
      "title": "CLI Arguments",
      "category": "cli",
      "prompt": "Return only Python code. Build a small argparse CLI with one required path argument and one optional verbose flag.",
      "required_markers": [
        "argparse",
        "--verbose",
        "ArgumentParser",
        "path"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_typing_dataclass",
      "title": "Typed Dataclass",
      "category": "typing",
      "prompt": "Return only Python code. Define a typed dataclass for a job record with id, name, created_at, and is_active fields.",
      "required_markers": [
        "@dataclass",
        "created_at",
        "is_active",
        "str"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_pytest_fixture",
      "title": "Pytest Fixture",
      "category": "tests",
      "prompt": "Return only Python code. Write a pytest fixture and one test that uses it to verify a function converting Celsius to Fahrenheit.",
      "required_markers": [
        "@pytest.fixture",
        "def test_",
        "assert",
        "fahrenheit"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_async_fetch",
      "title": "Async Fetch",
      "category": "async",
      "prompt": "Return only Python code. Write an async function that fetches two URLs concurrently with asyncio.gather and returns both bodies.",
      "required_markers": [
        "async def",
        "asyncio.gather",
        "await",
        "aiohttp"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_http_retry",
      "title": "HTTP Retry",
      "category": "http",
      "prompt": "Return only Python code. Write a requests wrapper that retries HTTP 429 with exponential backoff and a maximum attempt count.",
      "required_markers": [
        "requests",
        "429",
        "backoff",
        "max_attempts"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_json_validate",
      "title": "JSON Validation",
      "category": "validation",
      "prompt": "Return only Python code. Validate a JSON object against a schema and raise ValueError when required keys are missing.",
      "required_markers": [
        "jsonschema",
        "ValueError",
        "required",
        "schema"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_sqlite_store",
      "title": "SQLite Store",
      "category": "sqlite",
      "prompt": "Return only Python code. Create a SQLite table for events and write a function that inserts one event row safely.",
      "required_markers": [
        "sqlite3",
        "CREATE TABLE",
        "INSERT INTO",
        "commit"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_fastapi_handler",
      "title": "FastAPI Handler",
      "category": "web",
      "prompt": "Return only Python code. Write a FastAPI route that returns a JSON health response with status and version fields.",
      "required_markers": [
        "FastAPI",
        "@app.get",
        "status",
        "version"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_config_dataclass",
      "title": "Config Dataclass",
      "category": "config",
      "prompt": "Return only Python code. Build a dataclass-based config loader that reads environment variables and supplies defaults.",
      "required_markers": [
        "dataclass",
        "os.environ",
        "default",
        "load"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_logging_setup",
      "title": "Logging Setup",
      "category": "logging",
      "prompt": "Return only Python code. Configure structured logging with a timestamped formatter and a reusable setup function.",
      "required_markers": [
        "logging",
        "Formatter",
        "timestamp",
        "basicConfig"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_thread_pool",
      "title": "Thread Pool",
      "category": "concurrency",
      "prompt": "Return only Python code. Use concurrent.futures to run a small CPU-bound function across a list of inputs and collect results.",
      "required_markers": [
        "concurrent.futures",
        "ThreadPoolExecutor",
        "map",
        "results"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_package_layout",
      "title": "Package Layout",
      "category": "package",
      "prompt": "Return only Python code. Show a minimal package layout with __init__.py and a helper module that can be imported from tests.",
      "required_markers": [
        "__init__.py",
        "import",
        "helper",
        "tests"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_debug_stacktrace",
      "title": "Debug Stacktrace",
      "category": "debugging",
      "prompt": "Return only Python code. Fix a function that crashes on None input by adding an early return and a clear exception message.",
      "required_markers": [
        "None",
        "return",
        "raise",
        "message"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_refactor_split",
      "title": "Refactor Split",
      "category": "refactor",
      "prompt": "Return only Python code. Refactor a large function into two smaller helpers while preserving behavior.",
      "required_markers": [
        "def",
        "helper",
        "return",
        "preserve"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_csv_summary",
      "title": "CSV Summary",
      "category": "analysis",
      "prompt": "Return only Python code. Read a CSV file and produce a summary with row count and a count of unique values in one column.",
      "required_markers": [
        "csv",
        "row_count",
        "unique",
        "Counter"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_pathlib_clean",
      "title": "Pathlib Cleaner",
      "category": "filesystem",
      "prompt": "Return only Python code. Use pathlib to remove empty files from a directory tree and print each deleted path.",
      "required_markers": [
        "pathlib",
        "rglob",
        "unlink",
        "print"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_pydantic_model",
      "title": "Pydantic Model",
      "category": "validation",
      "prompt": "Return only Python code. Define a Pydantic model for a user profile with email validation and an age field.",
      "required_markers": [
        "BaseModel",
        "EmailStr",
        "age",
        "validation"
      ],
      "format_rule": "python_code"
    },
    {
      "id": "py_regex_log_parser",
      "title": "Regex Log Parser",
      "category": "parsing",
      "prompt": "Return only Python code. Parse web server log lines with regex and return a list of status codes and request paths.",
      "required_markers": [
        "re",
        "status",
        "path",
        "findall"
      ],
      "format_rule": "python_code"
    }
  ]
}
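Each question in the two parallel suites above carries a required_markers list. How run_benchmark.py scores those markers is not shown in this diff, so the sketch below assumes plain substring matching; the function name and the case_sensitive switch are hypothetical.

```python
def missing_markers(response_text, required_markers, case_sensitive=True):
    """Return the required markers that do not appear in response_text.

    Assumption: a marker counts as present if it occurs anywhere in the
    response as a substring. The real scoring rule may be stricter.
    """
    haystack = response_text if case_sensitive else response_text.lower()
    missing = []
    for marker in required_markers:
        needle = marker if case_sensitive else marker.lower()
        if needle not in haystack:
            missing.append(marker)
    return missing
```

Case sensitivity matters for markers such as "Dict": a response that only uses the builtin dict would fail a case-sensitive check even if the code is otherwise correct.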
505
harness/suites/python_context_edge_append_questions.json
Normal file
@@ -0,0 +1,505 @@
|
||||
{
|
||||
"suite_name": "python-context-edge-append-v1",
|
||||
"version": "1.0",
|
||||
"append_mode": "questions_only",
|
||||
"purpose": "Append-only long-context stress questions for the overnight Python suite. The runner expands context bands and renders model-specific packets near the configured benchmark context caps.",
|
||||
"questions": [
|
||||
{
|
||||
"id": "context_edge_release_wave_planner",
|
||||
"title": "Context Edge Release Wave Planner",
|
||||
"category": "orchestration",
|
||||
"format_rule": "json_dict",
|
||||
"num_predict": 650,
|
||||
"required_markers": [
|
||||
"auth/session.py",
|
||||
"contracts/user-profile.json",
|
||||
"FLAG_REQUIRE_NEW_TOKEN_CACHE",
|
||||
"db_migrate --lock-timeout 120",
|
||||
"billing-webhook",
|
||||
"search-reindex",
|
||||
"09:30"
|
||||
],
|
||||
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your orchestration answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
|
||||
"followup_required_markers": [
|
||||
"- Early anchor:",
|
||||
"- Middle anchor:",
|
||||
"- Late anchor:",
|
||||
"WG-02",
|
||||
"CHK-27",
|
||||
"BUS-03"
|
||||
],
|
||||
"followup_format_rule": "three_bullets",
|
||||
"prompt": "Return only JSON. You are the release orchestrator for a multi-service Python deployment train. Read the full packet carefully because the decisive blockers are spread across the early, middle, and late parts of the context.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a deployment packet with output keys exactly: objective, blocking_dependencies, execution_waves, owner_handoffs, validation_gates, rollback_triggers. Constraints: mention auth/session.py, contracts/user-profile.json, FLAG_REQUIRE_NEW_TOKEN_CACHE, db_migrate --lock-timeout 120, billing-webhook, search-reindex, and 09:30 customer demo.",
|
||||
"context_stress": {
|
||||
"bands": [
|
||||
0.5,
|
||||
0.75,
|
||||
0.9
|
||||
],
|
||||
"reserved_output_tokens": 1100,
|
||||
"minimum_context_tokens": 2048,
|
||||
"record_prefix": "Release packet item",
|
||||
"context_intro": "Release train packet for the April wave. Every line came from a planning note, test summary, operator handoff, or business constraint. Treat the packet as authoritative and do not invent hidden systems.",
|
||||
"anchors": {
|
||||
"early": [
|
||||
"WG-02: identity-api and admin-web are both changing auth/session.py and contracts/user-profile.json. If those branches merge out of order, cookie version mismatches break post-login redirects for tenant-scoped routes.",
|
||||
"OPS-14: db_migrate --lock-timeout 120 must run while FLAG_REQUIRE_NEW_TOKEN_CACHE is disabled. The flag flips cache key shape and makes rollback harder once migration starts."
|
||||
],
|
||||
"middle": [
|
||||
"CHK-27: search-reindex can lag by 35 minutes after the API deploy. That lag is acceptable for customer search results but should not block the release acceptance gate.",
|
||||
"SEC-09: deploy key rotation already happened. Do not roll back to images older than 2026.04.07-3 because those images still reference the retired package registry key."
|
||||
],
|
||||
"late": [
|
||||
"BUS-03: the billing-webhook queue must keep draining during the 09:30 customer demo. A pause longer than 90 seconds will surface stale invoice state in the live walkthrough.",
|
||||
"QA-41: mobile login smoke is only meaningful after edge-proxy and identity-api are both serving the same cookie version. Running it earlier produces false failures."
|
||||
]
|
||||
},
|
||||
"records": [
|
||||
"Service identity-api is green on unit tests but still has one open canary note about tenant header normalization. The branch owner says the change only touches cookie parsing and the response contract for auth bootstrap.",
|
||||
"Service edge-proxy passes lint and integration tests. The remaining note says cookie-version forwarding was renamed from cookie_build to cookie_version to match the new auth contract.",
|
||||
"Service admin-web updated the post-login redirect helper and now reads project context after auth bootstrap. QA notes that login, logout, and tenant-missing flows all need one shared smoke pass after deploy.",
|
||||
"Worker queue-scheduler has no code changes in this train but its cron definitions were regenerated yesterday. Operators want to avoid overlapping scheduler restarts with the migration step.",
|
||||
"Billing service is not changing code in this train. The operational risk is backlog accumulation in the billing-webhook consumer if the identity rollout accidentally stalls shared Redis access.",
|
||||
"Search service is receiving a schema-compatible event rename. The reindex job can backfill eventually, and product already accepted a temporary lag in search freshness during the train.",
|
||||
"QA note: the fastest critical path is identity-api, then edge-proxy, then admin-web, then mobile smoke, then billing observation, then search verification. They do not want optional checks in front of auth safety checks.",
"Rollback note: if cookie validation fails after the proxy deploy, revert edge-proxy first and hold admin-web. Reverting admin-web alone leaves the browser storing the wrong redirect metadata.",
"Observability note: dashboard ORCH-REL-12 tracks tenant-scoped login success, billing-webhook lag, and search event age in one board. Release managers prefer those metrics over raw pod restart counts.",
"Dependency note: the deploy tool can stage identity-api and edge-proxy in separate waves, but shared contract changes mean contracts/user-profile.json must land before admin-web is exposed to users.",
"Comms note: support has a saved macro for minor search delay, but no macro for failed billing state during a customer demo. Business risk is therefore asymmetric toward queue health.",
"Infra note: the release train uses one database migration transaction and one feature-flag flip. Operators only want one irreversible step, and they want it late enough that rollback still exists before then."
]
}
},
{
"id": "context_edge_worker_dispatch_matrix",
"title": "Context Edge Worker Dispatch Matrix",
"category": "worker_coordination",
"format_rule": "json_dict",
"num_predict": 650,
"required_markers": [
"resolve_context.go",
"20260408_add_job_owner.sql",
"toolset_registry.py",
"status.json",
"rebase",
"ops-2"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your worker-coordination answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"orch-3",
"worker-8",
"ops-2"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are coordinating a mixed worker wave across Earth, TruthGraph, and MyServers. The packet is intentionally long because the real risk is file overlap and sequencing, not raw task count.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a dispatch packet with output keys exactly: stalled_workstreams, safe_parallel_groups, files_with_conflict_risk, required_rebases, first_messages_to_send, done_definition. Constraints: mention resolve_context.go, 20260408_add_job_owner.sql, toolset_registry.py, status.json, rebase order, and the ops-2 bottleneck.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 1000,
"minimum_context_tokens": 2048,
"record_prefix": "Dispatch packet item",
"context_intro": "This packet merges worker handoffs, dirty-file reports, and operator availability notes. Every worker owns a different slice, but shared files and sequencing make or break the wave.",
"anchors": {
"early": [
"orch-3: branch tg-query-cleanup and orch-7: branch tg-doc-ingest both touch TruthGraph/internal/query/resolve_context.go. They cannot land independently without a reconciliation pass.",
"worker-5: scheduler ownership cleanup depends on migration 20260408_add_job_owner.sql. Any code merge before the migration lands will leave mixed owner semantics in runtime views."
],
"middle": [
"worker-2: tests are green, but docs/contracts/status.json still reflects the old rollout states. Snapshot tests downstream will churn if the contract file is not refreshed before merge.",
"worker-8: already cherry-picked part of worker-1. Rebase order matters now because tool names were renamed in one branch and only documented in the other."
],
"late": [
"ops-2: the only human with production shell access before 08:00. Anything needing live verification or cron edits must line up behind that window.",
"worker-4: can unblock three others by landing toolset_registry.py first. Until that file stabilizes, downstream command manifests will keep conflicting."
]
},
"records": [
"worker-1 is updating telemetry/build_python_overnight_mini_report.py and the JSON summary contract consumed by the manual page. Their branch also renames one latency field used by dashboards.",
"worker-2 is on TruthGraph/docs plus a small code touch in cmd/truthgraph/status.go. The branch is mostly docs but accidentally edits one shared enum name in the CLI output helper.",
"worker-3 is improving the MyServers cron installer for one-time jobs. Their changes are isolated except for touching a shared helper that prints UTC timestamps for wrapper scripts.",
"worker-4 is consolidating tool declarations in toolset_registry.py. Multiple downstream branches imported old names directly instead of using the registry.",
"worker-5 is adding explicit owner fields to scheduler jobs and matching database rows. The migration is written but has not been reviewed against existing null rows.",
"worker-6 is editing operator docs and runbooks. They do not block code merges directly, but they own the wording that gets copied into incident channels during rollout.",
"worker-7 is adjusting model-routing defaults for Hermes and Discord. Their branch changes both config defaults and one reconnect warning string in gateway/run.py.",
"worker-8 is on lightweight dashboard polish but already cherry-picked worker-1's field rename to unblock local screenshots. Their branch now contains an older copy of the report schema.",
"orch-1 wants the final wave to preserve linear, reviewable commits. They explicitly do not want one mega-merge that hides ordering mistakes.",
"orch-2 notes that MyServers and Earth can merge independently unless the status contract is changed. If status.json shifts shape, the report builder and dashboards need to move together.",
"test note: the riskiest shared files are resolve_context.go, toolset_registry.py, status.json, and the migration plus scheduler read path. Everything else is secondary.",
"communications note: developers are online all morning, but only ops-2 can approve production crontab edits before the normal business day starts."
]
}
},
{
"id": "context_edge_scheduler_incident_forensics",
"title": "Context Edge Scheduler Incident Forensics",
"category": "debugging",
"format_rule": "json_dict",
"num_predict": 700,
"required_markers": [
"ACK_AFTER_WRITE",
"deadline_seconds=45",
"clock_skew_ms",
"retry_id",
"ack",
"duplicate",
"2026.04.08-rc3"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your incident answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"INC-104",
"TRACE-22",
"DB-19"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are reading a long incident packet for a Python scheduler service that produced duplicate downstream outputs. Several clues are noisy; only some of them matter.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce an incident packet with output keys exactly: primary_failure, evidence_chain, misleading_signals, immediate_mitigation, durable_fix, verification_sequence. Constraints: mention ACK_AFTER_WRITE, deadline_seconds=45, clock_skew_ms, retry_id, duplicate ack, and deploy 2026.04.08-rc3.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 1100,
"minimum_context_tokens": 2048,
"record_prefix": "Incident packet item",
"context_intro": "This packet combines deploy notes, logs, traces, metrics, and operator comments from a duplicate-output incident in a Python scheduler pipeline. Not every warning is causal.",
"anchors": {
"early": [
"INC-104: the first duplicate outputs started only after deploy 2026.04.08-rc3 changed ACK_AFTER_WRITE from false to true in the scheduler worker configuration.",
"LOG-13: clock_skew_ms spiked to 1820 on one host, but duplicates had already begun before NTP finished correcting the clock warning."
],
"middle": [
"TRACE-22: retry_id increments before the original worker calls ack(), so two workers can believe they own the same logical job when the deadline expires.",
"CFG-07: deadline_seconds=45 replaced the previous value of 90 in the same deploy, shrinking the time between write completion and retry pickup."
],
"late": [
"DB-19: the database write commits successfully, then a second worker acks the same logical job after the retry path already re-issued it.",
"OPS-31: restarting Redis reduced queue noise and warning volume, but duplicate downstream outputs continued afterward."
]
},
"records": [
"The scheduler processes one logical job per payload_id and writes a completion row before acking the queue lease. Prior to rc3, ack happened first and the write path was shorter.",
"Metrics packet: queue lag rose mildly during the incident, but CPU and memory stayed within normal range. The most visible symptom to customers was duplicate email and webhook fan-out.",
"Operator note: one host emitted noisy clock warnings, which pulled attention toward NTP first. A later cross-host trace showed duplicate ownership on hosts without clock issues.",
"Deploy note: rc3 also changed retry logging verbosity and added one trace span around downstream fan-out. That made the incident look larger in logs but was not itself causal.",
"Trace note: original worker wrote success, paused in a post-write hook, then attempted ack. Retry worker acquired the lease after deadline expiration and re-issued fan-out with a new retry_id.",
"Safety note: the downstream consumer is idempotent for storage writes but not for customer notifications, which is why duplicates surfaced in email and webhook channels first.",
"Redis note: one operator restart reduced pending command backlog and made queue metrics calmer. No code paths changed and the duplicate symptom persisted.",
"Config note: deadline_seconds and ACK_AFTER_WRITE were rolled out together. There is no experiment isolating one from the other in production.",
"Postmortem draft: the service lacks a single ownership fence between write completion and lease acknowledgment. Retry semantics assume that ack or durable ownership happens first.",
"Verification note: operators want a fix that can be tested under a fake clock and a delayed post-write hook so the race becomes deterministic in CI.",
"Rollback note: reverting only the retry logging changes would be meaningless. The risky part of rc3 is the ordering change plus the tighter deadline.",
"Customer note: the biggest harm was duplicate human-facing notifications, not raw queue delay. Mitigation must stop duplicate fan-out quickly even if throughput drops."
]
}
},
{
"id": "context_edge_ingest_requirements_contract",
"title": "Context Edge Ingest Requirements Contract",
"category": "structured_extraction",
"format_rule": "json_dict",
"num_predict": 600,
"required_markers": [
"ingestion_mode",
"retry_budget",
"quarantine_rule",
"required_artifacts",
"owner_escalation",
"privacy_constraint",
"rollout_gate",
"kill_switch"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your contract extraction answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"REQ-03",
"POL-11",
"OPS-28"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are extracting a deployment contract for a Python ingestion pipeline from a long mixed packet of requirements, policy notes, rollout notes, and operator reminders.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: return a JSON object with exactly these keys and no extras: ingestion_mode, retry_budget, quarantine_rule, required_artifacts, owner_escalation, privacy_constraint, rollout_gate, kill_switch. Constraints: stay literal, prefer exact values over paraphrase, and do not invent unstated defaults.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 900,
"minimum_context_tokens": 2048,
"record_prefix": "Contract packet item",
"context_intro": "Mixed contract packet for a Python ingestion system. Some details are binding requirements, some are background. The task is to extract only the binding contract values.",
"anchors": {
"early": [
"REQ-03: retry budget is exactly 2 automatic retries after the first failed attempt. A third retry is forbidden because it can duplicate external side effects.",
"REQ-05: ingestion_mode is shadow_then_stream. The pipeline must begin in shadow mode, prove parity, and only then flip to streamed writes."
],
"middle": [
"POL-11: any payload containing raw customer email addresses goes to quarantine bucket pii-review and must not be summarized into human-readable incident reports.",
"ART-07: required_artifacts are manifest.json, validation_report.json, and trace.txt for every promoted ingest run."
],
"late": [
"OPS-28: the kill switch is environment variable INGEST_STOP_AFTER_DOWNLOAD=1 and it must stop promotion before parsing or persistence.",
"OWN-04: owner escalation goes to platform-oncall first, then data-infra lead only if the incident lasts more than 30 minutes."
]
},
"records": [
"Design note: the system ingests partner dumps, validates rows, stages transformed objects, and only later publishes promoted records. Teams want one contract that product and ops can both read.",
"Rollout note: shadow mode exists because partner dumps are often messy. The team wants hard evidence that counts and hashes line up before streamed writes go live.",
"Validation note: row-level errors may be sampled for debugging, but privacy guidance forbids copying raw customer email into broad operator summaries or public incident channels.",
"Ops note: when promotion is blocked, the run still needs complete artifacts so responders can debug without rerunning the partner dump immediately.",
"Noise note: one historical document recommended three retries for network flaps, but that advice pre-dates the current side-effect model and is no longer authoritative.",
"Escalation note: data-infra lead helps on sustained incidents, but the first operational owner is always the platform-oncall rotation because they control the promotion switch.",
"Runbook note: the kill switch exists to stop damage after download if a bad dump arrives. It should preserve downloaded evidence while preventing parse and write phases.",
"Compliance note: the quarantine bucket name is fixed because downstream cleanup tooling keys off pii-review and nothing else.",
"Artifact note: analysts depend on manifest.json, validation_report.json, and trace.txt when they compare shadow and stream runs. Missing any one of them blocks promotion approval.",
"Product note: streamed writes are the end state, but leadership explicitly wants a visible shadow phase first, not an immediate cutover."
]
}
},
{
"id": "context_edge_ollama_runbook_migration_brief",
"title": "Context Edge Ollama Runbook Migration Brief",
"category": "documentation",
"format_rule": "numbered_plan_4",
"num_predict": 480,
"required_markers": [
"OLLAMA_HOST=0.0.0.0:11434",
"OLLAMA_MAX_LOADED_MODELS=3",
"curl http://SERVER_IP:11434/api/tags",
"hermes gateway restart"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your migration brief. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"CFG-21",
"NET-08",
"BOT-14"
],
"followup_format_rule": "three_bullets",
"prompt": "Return exactly four numbered lines and nothing else. Each line must be one migration step for an operator moving from local-only Ollama to a remote Ollama plus Discord/Hermes setup.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nRequirements: Step 1 must be a precheck, step 2 must be the server cutover, step 3 must be verification from the developer machine, and step 4 must be the bot/client reconnection step. Mention OLLAMA_HOST=0.0.0.0:11434, OLLAMA_MAX_LOADED_MODELS=3, curl http://SERVER_IP:11434/api/tags, and hermes gateway restart.",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 700,
"minimum_context_tokens": 2048,
"record_prefix": "Runbook packet item",
"context_intro": "Migration packet for exposing a previously local-only Ollama server to remote clients while keeping the setup supportable for Aider and Hermes.",
"anchors": {
"early": [
"CFG-21: systemd override must set OLLAMA_HOST=0.0.0.0:11434, OLLAMA_NUM_PARALLEL=1 or another deliberate value, OLLAMA_MAX_LOADED_MODELS=3, and OLLAMA_KEEP_ALIVE=24h.",
"CFG-24: after editing override.conf, operators must run systemctl daemon-reload and restart ollama before testing anything else."
],
"middle": [
"NET-08: open port 11434 only from the developer machine IP when possible. A wide-open firewall rule is simpler but explicitly less safe.",
"NET-11: curl http://localhost:11434/api/tags on the server is not enough; the runbook must also include curl http://SERVER_IP:11434/api/tags from the developer machine."
],
"late": [
"BOT-14: Hermes should not be restarted until the remote tags endpoint works. Otherwise Discord symptoms look like bot errors when the real issue is Ollama reachability.",
"BOT-19: after the endpoint is healthy, hermes gateway restart is the final reconnect step so Discord and custom endpoint settings are refreshed."
]
},
"records": [
"Server baseline: Ollama is installed and running, but historically bound only to localhost. The operator wants to serve remote Aider and Hermes without turning the box into an open relay.",
"Model baseline: the desired operating set is a small router, a medium orchestrator, and one heavier coding worker. OLLAMA_MAX_LOADED_MODELS=3 exists to keep the three hottest models around without pretending all can stay resident.",
"Firewall note: UFW may be inactive on a fresh VPS, in which case adding the rule alone changes nothing until UFW is enabled or provider-side firewall rules are also correct.",
"Developer-machine note: direct curl smoke tests are faster and less ambiguous than jumping straight into Hermes, because they isolate network reachability from agent wrapper behavior.",
"Aider note: ~/.aider.conf.yml should point at http://SERVER_IP:11434/v1 with API key set to ollama. That config proves the remote OpenAI-compatible surface is working before complex agents are blamed.",
"Hermes note: custom endpoint setup requires the same base URL and a model string. Discord is only useful after the base endpoint already responds from the laptop.",
"Rollback note: if remote access fails, revert the systemd override and firewall rule before touching client configs. Otherwise client debugging starts from a broken server assumption.",
"Verification note: the strongest smoke test order is server-local tags, laptop-visible tags, one short chat completion, then Aider or Hermes reconnect."
]
}
},
{
"id": "context_edge_python_context_budget_module",
"title": "Context Edge Python Context Budget Module",
"category": "coding",
"format_rule": "python_module",
"num_predict": 900,
"required_markers": [
"def utc_now",
"def estimate_token_count",
"def target_prompt_tokens",
"def assemble_context_packet",
"def prompt_sha256",
"hashlib",
"typing"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your code answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"REQ-CTX-01",
"FAIL-CTX-07",
"OPS-CTX-12"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only Python code. Write one self-contained module named context_budget.py. The module must expose utc_now(), estimate_token_count(text), target_prompt_tokens(max_context_tokens, band_fraction, reserved_output_tokens), assemble_context_packet(intro, early, middle, late, records, target_tokens, record_prefix='Packet item'), and prompt_sha256(text). Use only the standard library, include type hints, keep the behavior deterministic, and do not emit markdown fences.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 1400,
"minimum_context_tokens": 2048,
"record_prefix": "Design packet item",
"context_intro": "Design packet for a reusable context-budget helper module intended for benchmark runners and agent wrappers that need deterministic long-prompt assembly plus debuggable metadata.",
"anchors": {
"early": [
"REQ-CTX-01: the module must expose target_prompt_tokens(max_context_tokens, band_fraction, reserved_output_tokens) so context bands are reproducible instead of hand-tuned.",
"REQ-CTX-03: estimate_token_count can be approximate but must be deterministic, cheap, and based only on the input text."
],
"middle": [
"FAIL-CTX-07: a previous Hermes replay consumed a huge prompt, returned finish_reason stop, and produced empty content. Debugging required a prompt hash plus preview and tail slices.",
"FAIL-CTX-09: repeated records are acceptable when stretching a packet, but their ordering must be deterministic or telemetry comparisons become meaningless."
],
"late": [
"OPS-CTX-12: helper output and timestamps must stay human-facing and UTC-friendly because operators debug these suites from terminal logs, not notebooks.",
"OPS-CTX-14: no third-party tokenizer dependency is allowed on the server path because benchmark scripts must run on a clean VPS without pip installs."
]
},
"records": [
"Implementation note: assemble_context_packet should accept intro plus early, middle, late anchor lists and a pool of repeating records. The output should grow until it roughly hits a target token budget.",
"Debug note: prompt_sha256 exists because storing every rendered prompt verbatim can waste disk. A hash plus preview and tail slices gives traceability without keeping giant files by default.",
"Operator note: utc_now should be a tiny helper returning one stable UTC format so benchmark logs across scripts line up naturally.",
"Reliability note: target_prompt_tokens should guard against impossible inputs such as negative reserved output tokens or a band fraction outside the open interval from 0 to 1.",
"Performance note: estimate_token_count should be good enough for shaping packets but not so clever that it becomes the slowest part of the run.",
"Code style note: type hints matter because downstream scripts may import this helper. A small dataclass is fine, but the interface should remain simple and standard-library only.",
"Telemetry note: deterministic packet assembly makes it possible to compare models honestly because the prompt content is the same for every model once the cap and band are fixed.",
"Failure note: previous runs showed that long prompts can fail in clean-looking ways, including empty assistant text. The module therefore needs affordances for reproducible reconstruction."
]
}
},
{
"id": "context_edge_pytest_scheduler_retry_regression",
"title": "Context Edge Pytest Scheduler Retry Regression",
"category": "tests",
"format_rule": "pytest_code",
"num_predict": 1000,
"required_markers": [
"def test_",
"monkeypatch",
"retry_id",
"ACK_AFTER_WRITE",
"deadline_seconds",
"assert"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your test answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"BUG-ACK-01",
"TRACE-ACK-09",
"VERIFY-ACK-22"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only Python code. Write one focused pytest module for the duplicate-ack scheduler regression described in the packet. Requirements: include one deterministic test with monkeypatch or fakes, model the retry_id race, assert that only one logical job commit wins, and make the failure impossible to miss in CI. Use only standard pytest patterns and do not wrap the answer in markdown fences.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 1500,
"minimum_context_tokens": 2048,
"record_prefix": "Regression packet item",
"context_intro": "Regression packet for a Python scheduler service where retry timing and ack ordering can duplicate downstream side effects. The target is one surgical pytest module, not a whole test suite.",
"anchors": {
"early": [
"BUG-ACK-01: duplicate downstream outputs started after ACK_AFTER_WRITE=true shipped in rc3. The regression test must exercise that ordering change directly.",
"BUG-ACK-03: deadline_seconds tightened from 90 to 45 in the same release, making the retry pickup easier to trigger."
],
"middle": [
"TRACE-ACK-09: the retry worker increments retry_id before the original worker calls ack(), so the test needs two ownership paths and one delayed ack.",
"TRACE-ACK-14: the original write succeeds before the retry starts fan-out, which is why the bug is duplicate side effects rather than missing persistence."
],
"late": [
"VERIFY-ACK-22: the regression test must prove that only one logical job commit and one notify path are treated as authoritative after the fix.",
"VERIFY-ACK-24: a fake clock or explicit delay hook is required so the race is deterministic instead of relying on sleeping threads."
]
},
"records": [
"Service behavior: one worker writes job completion and a downstream notify record, then acknowledges the queue lease. Retry logic watches deadline expiry and can spawn a second worker for the same logical payload.",
"Historical assumption: ack happened before write, so retry pickup rarely overlapped a durable write. After rc3 that assumption no longer holds.",
"Testing note: a good regression harness can stub the notifier and collect emitted payload_ids. Duplicate notification is easier to assert than raw queue internals.",
"Operator note: Redis restarts and clock warnings were noisy but non-causal. The test should focus on ordering and ownership, not infrastructure flakiness.",
"Implementation note: a fake clock or injectable now() hook is preferred over thread sleeps because CI latency is too variable for a race test.",
"Acceptance note: if the fix works, either the retry worker or the original worker should stand down cleanly, but never both proceed to external notify.",
"CI note: the test should fail loudly with a short diff if duplicate notification happens. Silent counting helpers are harder to trust in review.",
"Code review note: a focused module with one strong regression test is worth more than many weak permutations for this specific benchmark."
]
}
},
{
"id": "context_edge_change_review_packet",
"title": "Context Edge Change Review Packet",
"category": "review",
"format_rule": "json_dict",
"num_predict": 700,
"required_markers": [
"gateway/run.py",
"run_python_task_suite.py",
"status.json",
"20260408_add_job_owner.sql",
"discord",
"migration"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your review answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"REV-03",
"REV-17",
"REV-29"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are reviewing a large mixed diff packet that spans Python services, telemetry tooling, Discord gateway behavior, documentation, and one database migration.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a review packet with output keys exactly: likely_regressions, riskiest_files, missing_tests, rollout_risk, safe_merge_condition. Constraints: mention gateway/run.py, run_python_task_suite.py, status.json, 20260408_add_job_owner.sql, and Discord or gateway behavior where relevant.",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 1100,
"minimum_context_tokens": 2048,
"record_prefix": "Review packet item",
"context_intro": "Mixed diff packet assembled from review summaries, file notes, test output snippets, and rollout comments. The challenge is to surface concrete regressions instead of repeating generic code-review advice.",
"anchors": {
"early": [
"REV-03: migration 20260408_add_job_owner.sql adds a non-null job_owner column without a documented backfill for existing rows. That can fail immediately on populated databases.",
"REV-06: scheduler read paths were updated in code, but one admin query still selects the old nullable shape in a report view."
],
"middle": [
"REV-17: gateway/run.py changed reconnect behavior for stale provider responses, but no test proves how Discord handles empty assistant content or one clean timeout followed by a retry.",
"REV-21: run_python_task_suite.py now records prompt hashes and context metadata, yet no report-level check verifies the new keys are preserved."
],
"late": [
"REV-29: the report builder renamed one status field, but dashboards and status.json examples in docs still use the old key. That will silently break HTML rendering if merged together.",
"REV-34: rollout notes assume the migration and report schema can deploy independently, but the dashboard pull path still reads both in the same morning workflow."
]
},
"records": [
"Diff summary: gateway/run.py now waits longer before declaring the provider stale, and the Discord adapter emits one new reconnect warning line. No new fixture captures an empty successful response body.",
"Telemetry summary: run_python_task_suite.py expanded to support context-stress prompts, custom follow-up prompts, and prompt metadata files. The status markdown and report builder were only partially updated.",
"Migration summary: 20260408_add_job_owner.sql introduces explicit ownership on scheduled jobs so the UI can attribute work cleanly. The migration note mentions new writes, not legacy rows.",
"Dashboard summary: one HTML manual still expects the previous status key name from status.json and has no assertion on unknown-field fallback.",
"Docs summary: operator docs were refreshed for the new Discord reconnect wording, but one screenshots guide still references the prior warning text verbatim.",
"Review note: the riskiest interactions are schema plus runtime reads, and telemetry JSON plus dashboard consumption. Pure doc edits are comparatively safe.",
"Testing note: there are unit tests around scheduler ownership writes and a separate smoke test for Discord login, but nothing that exercises both the new reconnect path and empty assistant content.",
"Rollout note: support wants the dashboard alive on the same morning the migration lands. That makes silent telemetry-key drift more expensive than a normal internal-only contract change."
]
}
}
]
}
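Note on the context_stress objects in the suite file above: the following is a minimal sketch of how a runner might expand one such object into the text that replaces {{CONTEXT_STRESS_BLOCK}}. It is an illustration under stated assumptions, not the code in harness/run_benchmark.py. The function names are hypothetical, the 4-characters-per-token estimate is an assumption, placing the middle anchors halfway through the filler is an assumption, and so is reading each band as a fraction of the model's context cap with reserved_output_tokens subtracted. minimum_context_tokens is left out because the suite file does not say how the runner applies it.

```python
from typing import Any, Dict, List


def estimate_token_count(text: str) -> int:
    """Rough, deterministic estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def target_prompt_tokens(max_context_tokens: int, band: float,
                         reserved_output_tokens: int) -> int:
    """Prompt budget at one band of the context cap (assumed interpretation)."""
    if not 0.0 < band < 1.0:
        raise ValueError("band must be strictly between 0 and 1")
    if reserved_output_tokens < 0:
        raise ValueError("reserved_output_tokens must be non-negative")
    return max(0, int(max_context_tokens * band) - reserved_output_tokens)


def build_stress_block(stress: Dict[str, Any], max_context_tokens: int,
                       band: float) -> str:
    """Render intro, early anchors, filler, middle anchors, filler, late anchors."""
    budget = target_prompt_tokens(
        max_context_tokens, band, stress["reserved_output_tokens"]
    )
    anchors = stress["anchors"]
    records = stress["records"]
    prefix = stress["record_prefix"]

    # Lines that are always present, whatever the budget.
    fixed = [stress["context_intro"], *anchors["early"],
             *anchors["middle"], *anchors["late"]]
    used = sum(estimate_token_count(line) for line in fixed)

    # Cycle through the records in a fixed order until the budget is reached,
    # so the same inputs always produce the same packet.
    filler: List[str] = []
    i = 0
    while records:
        line = f"{prefix} {i + 1}: {records[i % len(records)]}"
        cost = estimate_token_count(line)
        if used + cost > budget:
            break
        filler.append(line)
        used += cost
        i += 1

    half = len(filler) // 2
    parts = [stress["context_intro"], *anchors["early"], *filler[:half],
             *anchors["middle"], *filler[half:], *anchors["late"]]
    return "\n".join(parts)
```

With the values in the last question above (band 0.9, reserved_output_tokens 1100) and an assumed 32768-token cap, target_prompt_tokens returns int(32768 * 0.9) - 1100 = 28391. Because the records are repeated in a fixed order, two runs with the same cap and band produce identical text, which is what makes cross-model comparison meaningful.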
561
harness/suites/python_context_edge_suite_only.json
Normal file
@@ -0,0 +1,561 @@
{
"suite_name": "python-context-edge-append-v1",
"version": "1.0",
"purpose": "Append-only long-context stress questions for the overnight Python suite. The runner expands context bands and renders model-specific packets near the configured benchmark context caps.",
"models": [
{
"model": "qwen32-coder-32k",
"display_name": "Qwen32 Coder 32k",
"size_label": "32b"
},
{
"model": "qwen14-coder-32k",
"display_name": "Qwen14 Coder 32k",
"size_label": "14b"
},
{
"model": "codestral-32k",
"display_name": "Codestral 32k",
"size_label": "22b"
},
{
"model": "codellama34-16k",
"display_name": "CodeLlama 34 16k",
"size_label": "34b"
},
{
"model": "phind34-16k",
"display_name": "Phind 34 16k",
"size_label": "34b"
},
{
"model": "qwen14-general-32k",
"display_name": "Qwen14 General 32k",
"size_label": "14b"
},
{
"model": "qwen2.5-coder:3b",
"display_name": "Qwen2.5 Coder 3B",
"size_label": "3b"
},
{
"model": "qwen2.5-coder:1.5b",
"display_name": "Qwen2.5 Coder 1.5B",
"size_label": "1.5b"
},
{
"model": "qwen2.5:3b",
"display_name": "Qwen2.5 3B",
"size_label": "3b"
},
{
"model": "llama3.2:3b",
"display_name": "Llama 3.2 3B",
"size_label": "3b"
},
{
"model": "phi3",
"display_name": "Phi-3 Mini",
"size_label": "3.8b"
}
],
"questions": [
{
"id": "context_edge_release_wave_planner",
"title": "Context Edge Release Wave Planner",
"category": "orchestration",
"format_rule": "json_dict",
"num_predict": 650,
"required_markers": [
"auth/session.py",
"contracts/user-profile.json",
"FLAG_REQUIRE_NEW_TOKEN_CACHE",
"db_migrate --lock-timeout 120",
"billing-webhook",
"search-reindex",
"09:30"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your orchestration answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"WG-02",
"CHK-27",
"BUS-03"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are the release orchestrator for a multi-service Python deployment train. Read the full packet carefully because the decisive blockers are spread across the early, middle, and late parts of the context.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a deployment packet with output keys exactly: objective, blocking_dependencies, execution_waves, owner_handoffs, validation_gates, rollback_triggers. Constraints: mention auth/session.py, contracts/user-profile.json, FLAG_REQUIRE_NEW_TOKEN_CACHE, db_migrate --lock-timeout 120, billing-webhook, search-reindex, and 09:30 customer demo.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 1100,
"minimum_context_tokens": 2048,
"record_prefix": "Release packet item",
"context_intro": "Release train packet for the April wave. Every line came from a planning note, test summary, operator handoff, or business constraint. Treat the packet as authoritative and do not invent hidden systems.",
"anchors": {
"early": [
"WG-02: identity-api and admin-web are both changing auth/session.py and contracts/user-profile.json. If those branches merge out of order, cookie version mismatches break post-login redirects for tenant-scoped routes.",
"OPS-14: db_migrate --lock-timeout 120 must run while FLAG_REQUIRE_NEW_TOKEN_CACHE is disabled. The flag flips cache key shape and makes rollback harder once migration starts."
],
"middle": [
"CHK-27: search-reindex can lag by 35 minutes after the API deploy. That lag is acceptable for customer search results but should not block the release acceptance gate.",
"SEC-09: deploy key rotation already happened. Do not roll back to images older than 2026.04.07-3 because those images still reference the retired package registry key."
],
"late": [
"BUS-03: the billing-webhook queue must keep draining during the 09:30 customer demo. A pause longer than 90 seconds will surface stale invoice state in the live walkthrough.",
"QA-41: mobile login smoke is only meaningful after edge-proxy and identity-api are both serving the same cookie version. Running it earlier produces false failures."
]
},
"records": [
"Service identity-api is green on unit tests but still has one open canary note about tenant header normalization. The branch owner says the change only touches cookie parsing and the response contract for auth bootstrap.",
"Service edge-proxy passes lint and integration tests. The remaining note says cookie-version forwarding was renamed from cookie_build to cookie_version to match the new auth contract.",
"Service admin-web updated the post-login redirect helper and now reads project context after auth bootstrap. QA notes that login, logout, and tenant-missing flows all need one shared smoke pass after deploy.",
"Worker queue-scheduler has no code changes in this train but its cron definitions were regenerated yesterday. Operators want to avoid overlapping scheduler restarts with the migration step.",
"Billing service is not changing code in this train. The operational risk is backlog accumulation in the billing-webhook consumer if the identity rollout accidentally stalls shared Redis access.",
"Search service is receiving a schema-compatible event rename. The reindex job can backfill eventually, and product already accepted a temporary lag in search freshness during the train.",
"QA note: the fastest critical path is identity-api, then edge-proxy, then admin-web, then mobile smoke, then billing observation, then search verification. They do not want optional checks in front of auth safety checks.",
"Rollback note: if cookie validation fails after the proxy deploy, revert edge-proxy first and hold admin-web. Reverting admin-web alone leaves the browser storing the wrong redirect metadata.",
"Observability note: dashboard ORCH-REL-12 tracks tenant-scoped login success, billing-webhook lag, and search event age in one board. Release managers prefer those metrics over raw pod restart counts.",
"Dependency note: the deploy tool can stage identity-api and edge-proxy in separate waves, but shared contract changes mean contracts/user-profile.json must land before admin-web is exposed to users.",
"Comms note: support has a saved macro for minor search delay, but no macro for failed billing state during a customer demo. Business risk is therefore asymmetric toward queue health.",
"Infra note: the release train uses one database migration transaction and one feature-flag flip. Operators only want one irreversible step, and they want it late enough that rollback still exists before then."
]
}
},
{
"id": "context_edge_worker_dispatch_matrix",
"title": "Context Edge Worker Dispatch Matrix",
"category": "worker_coordination",
"format_rule": "json_dict",
"num_predict": 650,
"required_markers": [
"resolve_context.go",
"20260408_add_job_owner.sql",
"toolset_registry.py",
"status.json",
"rebase",
"ops-2"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your worker-coordination answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"orch-3",
"worker-8",
"ops-2"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are coordinating a mixed worker wave across Earth, TruthGraph, and MyServers. The packet is intentionally long because the real risk is file overlap and sequencing, not raw task count.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a dispatch packet with output keys exactly: stalled_workstreams, safe_parallel_groups, files_with_conflict_risk, required_rebases, first_messages_to_send, done_definition. Constraints: mention resolve_context.go, 20260408_add_job_owner.sql, toolset_registry.py, status.json, rebase order, and the ops-2 bottleneck.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 1000,
"minimum_context_tokens": 2048,
"record_prefix": "Dispatch packet item",
"context_intro": "This packet merges worker handoffs, dirty-file reports, and operator availability notes. Every worker owns a different slice, but shared files and sequencing make or break the wave.",
"anchors": {
"early": [
"orch-3: branch tg-query-cleanup and orch-7: branch tg-doc-ingest both touch TruthGraph/internal/query/resolve_context.go. They cannot land independently without a reconciliation pass.",
"worker-5: scheduler ownership cleanup depends on migration 20260408_add_job_owner.sql. Any code merge before the migration lands will leave mixed owner semantics in runtime views."
],
"middle": [
"worker-2: tests are green, but docs/contracts/status.json still reflects the old rollout states. Snapshot tests downstream will churn if the contract file is not refreshed before merge.",
"worker-8: already cherry-picked part of worker-1. Rebase order matters now because tool names were renamed in one branch and only documented in the other."
],
"late": [
"ops-2: the only human with production shell access before 08:00. Anything needing live verification or cron edits must line up behind that window.",
"worker-4: can unblock three others by landing toolset_registry.py first. Until that file stabilizes, downstream command manifests will keep conflicting."
]
},
"records": [
"worker-1 is updating telemetry/build_python_overnight_mini_report.py and the JSON summary contract consumed by the manual page. Their branch also renames one latency field used by dashboards.",
"worker-2 is on TruthGraph/docs plus a small code touch in cmd/truthgraph/status.go. The branch is mostly docs but accidentally edits one shared enum name in the CLI output helper.",
"worker-3 is improving the MyServers cron installer for one-time jobs. Their changes are isolated except for touching a shared helper that prints UTC timestamps for wrapper scripts.",
"worker-4 is consolidating tool declarations in toolset_registry.py. Multiple downstream branches imported old names directly instead of using the registry.",
"worker-5 is adding explicit owner fields to scheduler jobs and matching database rows. The migration is written but has not been reviewed against existing null rows.",
"worker-6 is editing operator docs and runbooks. They do not block code merges directly, but they own the wording that gets copied into incident channels during rollout.",
"worker-7 is adjusting model-routing defaults for Hermes and Discord. Their branch changes both config defaults and one reconnect warning string in gateway/run.py.",
"worker-8 is on lightweight dashboard polish but already cherry-picked worker-1's field rename to unblock local screenshots. Their branch now contains an older copy of the report schema.",
"orch-1 wants the final wave to preserve linear, reviewable commits. They explicitly do not want one mega-merge that hides ordering mistakes.",
"orch-2 notes that MyServers and Earth can merge independently unless the status contract is changed. If status.json shifts shape, the report builder and dashboards need to move together.",
"test note: the riskiest shared files are resolve_context.go, toolset_registry.py, status.json, and the migration plus scheduler read path. Everything else is secondary.",
"communications note: developers are online all morning, but only ops-2 can approve production crontab edits before the normal business day starts."
]
}
},
{
"id": "context_edge_scheduler_incident_forensics",
"title": "Context Edge Scheduler Incident Forensics",
"category": "debugging",
"format_rule": "json_dict",
"num_predict": 700,
"required_markers": [
"ACK_AFTER_WRITE",
"deadline_seconds=45",
"clock_skew_ms",
"retry_id",
"ack",
"duplicate",
"2026.04.08-rc3"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your incident answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"INC-104",
"TRACE-22",
"DB-19"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are reading a long incident packet for a Python scheduler service that produced duplicate downstream outputs. Several clues are noisy; only some of them matter.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce an incident packet with output keys exactly: primary_failure, evidence_chain, misleading_signals, immediate_mitigation, durable_fix, verification_sequence. Constraints: mention ACK_AFTER_WRITE, deadline_seconds=45, clock_skew_ms, retry_id, duplicate ack, and deploy 2026.04.08-rc3.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 1100,
"minimum_context_tokens": 2048,
"record_prefix": "Incident packet item",
"context_intro": "This packet combines deploy notes, logs, traces, metrics, and operator comments from a duplicate-output incident in a Python scheduler pipeline. Not every warning is causal.",
"anchors": {
"early": [
"INC-104: the first duplicate outputs started only after deploy 2026.04.08-rc3 changed ACK_AFTER_WRITE from false to true in the scheduler worker configuration.",
"LOG-13: clock_skew_ms spiked to 1820 on one host, but duplicates had already begun before NTP finished correcting the clock warning."
],
"middle": [
"TRACE-22: retry_id increments before the original worker calls ack(), so two workers can believe they own the same logical job when the deadline expires.",
"CFG-07: deadline_seconds=45 replaced the previous value of 90 in the same deploy, shrinking the time between write completion and retry pickup."
],
"late": [
"DB-19: the database write commits successfully, then a second worker acks the same logical job after the retry path already re-issued it.",
"OPS-31: restarting Redis reduced queue noise and warning volume, but duplicate downstream outputs continued afterward."
]
},
"records": [
"The scheduler processes one logical job per payload_id and writes a completion row before acking the queue lease. Prior to rc3, ack happened first and the write path was shorter.",
"Metrics packet: queue lag rose mildly during the incident, but CPU and memory stayed within normal range. The most visible symptom to customers was duplicate email and webhook fan-out.",
"Operator note: one host emitted noisy clock warnings, which pulled attention toward NTP first. A later cross-host trace showed duplicate ownership on hosts without clock issues.",
"Deploy note: rc3 also changed retry logging verbosity and added one trace span around downstream fan-out. That made the incident look larger in logs but was not itself causal.",
"Trace note: original worker wrote success, paused in a post-write hook, then attempted ack. Retry worker acquired the lease after deadline expiration and re-issued fan-out with a new retry_id.",
"Safety note: the downstream consumer is idempotent for storage writes but not for customer notifications, which is why duplicates surfaced in email and webhook channels first.",
"Redis note: one operator restart reduced pending command backlog and made queue metrics calmer. No code paths changed and the duplicate symptom persisted.",
"Config note: deadline_seconds and ACK_AFTER_WRITE were rolled out together. There is no experiment isolating one from the other in production.",
"Postmortem draft: the service lacks a single ownership fence between write completion and lease acknowledgment. Retry semantics assume that ack or durable ownership happens first.",
"Verification note: operators want a fix that can be tested under a fake clock and a delayed post-write hook so the race becomes deterministic in CI.",
"Rollback note: reverting only the retry logging changes would be meaningless. The risky part of rc3 is the ordering change plus the tighter deadline.",
"Customer note: the biggest harm was duplicate human-facing notifications, not raw queue delay. Mitigation must stop duplicate fan-out quickly even if throughput drops."
]
}
},
{
"id": "context_edge_ingest_requirements_contract",
"title": "Context Edge Ingest Requirements Contract",
"category": "structured_extraction",
"format_rule": "json_dict",
"num_predict": 600,
"required_markers": [
"ingestion_mode",
"retry_budget",
"quarantine_rule",
"required_artifacts",
"owner_escalation",
"privacy_constraint",
"rollout_gate",
"kill_switch"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your contract extraction answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"REQ-03",
"POL-11",
"OPS-28"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only JSON. You are extracting a deployment contract for a Python ingestion pipeline from a long mixed packet of requirements, policy notes, rollout notes, and operator reminders.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: return a JSON object with exactly these keys and no extras: ingestion_mode, retry_budget, quarantine_rule, required_artifacts, owner_escalation, privacy_constraint, rollout_gate, kill_switch. Constraints: stay literal, prefer exact values over paraphrase, and do not invent unstated defaults.",
"context_stress": {
"bands": [
0.5,
0.75,
0.9
],
"reserved_output_tokens": 900,
"minimum_context_tokens": 2048,
"record_prefix": "Contract packet item",
"context_intro": "Mixed contract packet for a Python ingestion system. Some details are binding requirements, some are background. The task is to extract only the binding contract values.",
"anchors": {
"early": [
"REQ-03: retry budget is exactly 2 automatic retries after the first failed attempt. A third retry is forbidden because it can duplicate external side effects.",
"REQ-05: ingestion_mode is shadow_then_stream. The pipeline must begin in shadow mode, prove parity, and only then flip to streamed writes."
],
"middle": [
"POL-11: any payload containing raw customer email addresses goes to quarantine bucket pii-review and must not be summarized into human-readable incident reports.",
"ART-07: required_artifacts are manifest.json, validation_report.json, and trace.txt for every promoted ingest run."
],
"late": [
"OPS-28: the kill switch is environment variable INGEST_STOP_AFTER_DOWNLOAD=1 and it must stop promotion before parsing or persistence.",
"OWN-04: owner escalation goes to platform-oncall first, then data-infra lead only if the incident lasts more than 30 minutes."
]
},
"records": [
"Design note: the system ingests partner dumps, validates rows, stages transformed objects, and only later publishes promoted records. Teams want one contract that product and ops can both read.",
"Rollout note: shadow mode exists because partner dumps are often messy. The team wants hard evidence that counts and hashes line up before streamed writes go live.",
"Validation note: row-level errors may be sampled for debugging, but privacy guidance forbids copying raw customer email into broad operator summaries or public incident channels.",
"Ops note: when promotion is blocked, the run still needs complete artifacts so responders can debug without rerunning the partner dump immediately.",
"Noise note: one historical document recommended three retries for network flaps, but that advice pre-dates the current side-effect model and is no longer authoritative.",
"Escalation note: data-infra lead helps on sustained incidents, but the first operational owner is always the platform-oncall rotation because they control the promotion switch.",
"Runbook note: the kill switch exists to stop damage after download if a bad dump arrives. It should preserve downloaded evidence while preventing parse and write phases.",
"Compliance note: the quarantine bucket name is fixed because downstream cleanup tooling keys off pii-review and nothing else.",
"Artifact note: analysts depend on manifest.json, validation_report.json, and trace.txt when they compare shadow and stream runs. Missing any one of them blocks promotion approval.",
"Product note: streamed writes are the end state, but leadership explicitly wants a visible shadow phase first, not an immediate cutover."
]
}
},
{
"id": "context_edge_ollama_runbook_migration_brief",
"title": "Context Edge Ollama Runbook Migration Brief",
"category": "documentation",
"format_rule": "numbered_plan_4",
"num_predict": 480,
"required_markers": [
"OLLAMA_HOST=0.0.0.0:11434",
"OLLAMA_MAX_LOADED_MODELS=3",
"curl http://SERVER_IP:11434/api/tags",
"hermes gateway restart"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your migration brief. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"CFG-21",
"NET-08",
"BOT-14"
],
"followup_format_rule": "three_bullets",
"prompt": "Return exactly four numbered lines and nothing else. Each line must be one migration step for an operator moving from local-only Ollama to a remote Ollama plus Discord/Hermes setup.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nRequirements: Step 1 must be a precheck, step 2 must be the server cutover, step 3 must be verification from the developer machine, and step 4 must be the bot/client reconnection step. Mention OLLAMA_HOST=0.0.0.0:11434, OLLAMA_MAX_LOADED_MODELS=3, curl http://SERVER_IP:11434/api/tags, and hermes gateway restart.",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 700,
"minimum_context_tokens": 2048,
"record_prefix": "Runbook packet item",
"context_intro": "Migration packet for exposing a previously local-only Ollama server to remote clients while keeping the setup supportable for Aider and Hermes.",
"anchors": {
"early": [
"CFG-21: systemd override must set OLLAMA_HOST=0.0.0.0:11434, OLLAMA_NUM_PARALLEL=1 or another deliberate value, OLLAMA_MAX_LOADED_MODELS=3, and OLLAMA_KEEP_ALIVE=24h.",
"CFG-24: after editing override.conf, operators must run systemctl daemon-reload and restart ollama before testing anything else."
],
"middle": [
"NET-08: open port 11434 only from the developer machine IP when possible. A wide-open firewall rule is simpler but explicitly less safe.",
"NET-11: curl http://localhost:11434/api/tags on the server is not enough; the runbook must also include curl http://SERVER_IP:11434/api/tags from the developer machine."
],
"late": [
"BOT-14: Hermes should not be restarted until the remote tags endpoint works. Otherwise Discord symptoms look like bot errors when the real issue is Ollama reachability.",
"BOT-19: after the endpoint is healthy, hermes gateway restart is the final reconnect step so Discord and custom endpoint settings are refreshed."
]
},
"records": [
"Server baseline: Ollama is installed and running, but historically bound only to localhost. The operator wants to serve remote Aider and Hermes without turning the box into an open relay.",
"Model baseline: the desired operating set is a small router, a medium orchestrator, and one heavier coding worker. OLLAMA_MAX_LOADED_MODELS=3 exists to keep the three hottest models around without pretending all can stay resident.",
"Firewall note: UFW may be inactive on a fresh VPS, in which case adding the rule alone changes nothing until UFW is enabled or provider-side firewall rules are also correct.",
"Developer-machine note: direct curl smoke tests are faster and less ambiguous than jumping straight into Hermes, because they isolate network reachability from agent wrapper behavior.",
"Aider note: ~/.aider.conf.yml should point at http://SERVER_IP:11434/v1 with API key set to ollama. That config proves the remote OpenAI-compatible surface is working before complex agents are blamed.",
"Hermes note: custom endpoint setup requires the same base URL and a model string. Discord is only useful after the base endpoint already responds from the laptop.",
"Rollback note: if remote access fails, revert the systemd override and firewall rule before touching client configs. Otherwise client debugging starts from a broken server assumption.",
"Verification note: the strongest smoke test order is server-local tags, laptop-visible tags, one short chat completion, then Aider or Hermes reconnect."
]
}
},
{
"id": "context_edge_python_context_budget_module",
"title": "Context Edge Python Context Budget Module",
"category": "coding",
"format_rule": "python_module",
"num_predict": 900,
"required_markers": [
"def utc_now",
"def estimate_token_count",
"def target_prompt_tokens",
"def assemble_context_packet",
"def prompt_sha256",
"hashlib",
"typing"
],
"followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your code answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
"followup_required_markers": [
"- Early anchor:",
"- Middle anchor:",
"- Late anchor:",
"REQ-CTX-01",
"FAIL-CTX-07",
"OPS-CTX-12"
],
"followup_format_rule": "three_bullets",
"prompt": "Return only Python code. Write one self-contained module named context_budget.py. The module must expose utc_now(), estimate_token_count(text), target_prompt_tokens(max_context_tokens, band_fraction, reserved_output_tokens), assemble_context_packet(intro, early, middle, late, records, target_tokens, record_prefix='Packet item'), and prompt_sha256(text). Use only the standard library, include type hints, keep the behavior deterministic, and do not emit markdown fences.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}",
"context_stress": {
"bands": [
0.9
],
"reserved_output_tokens": 1400,
"minimum_context_tokens": 2048,
"record_prefix": "Design packet item",
"context_intro": "Design packet for a reusable context-budget helper module intended for benchmark runners and agent wrappers that need deterministic long-prompt assembly plus debuggable metadata.",
"anchors": {
"early": [
"REQ-CTX-01: the module must expose target_prompt_tokens(max_context_tokens, band_fraction, reserved_output_tokens) so context bands are reproducible instead of hand-tuned.",
"REQ-CTX-03: estimate_token_count can be approximate but must be deterministic, cheap, and based only on the input text."
],
"middle": [
"FAIL-CTX-07: a previous Hermes replay consumed a huge prompt, returned finish_reason stop, and produced empty content. Debugging required a prompt hash plus preview and tail slices.",
|
||||
"FAIL-CTX-09: repeated records are acceptable when stretching a packet, but their ordering must be deterministic or telemetry comparisons become meaningless."
|
||||
],
|
||||
"late": [
|
||||
"OPS-CTX-12: helper output and timestamps must stay human-facing and UTC-friendly because operators debug these suites from terminal logs, not notebooks.",
|
||||
"OPS-CTX-14: no third-party tokenizer dependency is allowed on the server path because benchmark scripts must run on a clean VPS without pip installs."
|
||||
]
|
||||
},
|
||||
"records": [
|
||||
"Implementation note: assemble_context_packet should accept intro plus early, middle, late anchor lists and a pool of repeating records. The output should grow until it roughly hits a target token budget.",
|
||||
"Debug note: prompt_sha256 exists because storing every rendered prompt verbatim can waste disk. A hash plus preview and tail slices gives traceability without keeping giant files by default.",
|
||||
"Operator note: utc_now should be a tiny helper returning one stable UTC format so benchmark logs across scripts line up naturally.",
|
||||
"Reliability note: target_prompt_tokens should guard against impossible inputs such as negative reserved output tokens or a band fraction outside the open interval from 0 to 1.",
|
||||
"Performance note: estimate_token_count should be good enough for shaping packets but not so clever that it becomes the slowest part of the run.",
|
||||
"Code style note: type hints matter because downstream scripts may import this helper. A small dataclass is fine, but the interface should remain simple and standard-library only.",
|
||||
"Telemetry note: deterministic packet assembly makes it possible to compare models honestly because the prompt content is the same for every model once the cap and band are fixed.",
|
||||
"Failure note: previous runs showed that long prompts can fail in clean-looking ways, including empty assistant text. The module therefore needs affordances for reproducible reconstruction."
|
||||
]
|
||||
}
|
||||
},
|
||||
    {
      "id": "context_edge_pytest_scheduler_retry_regression",
      "title": "Context Edge Pytest Scheduler Retry Regression",
      "category": "tests",
      "format_rule": "pytest_code",
      "num_predict": 1000,
      "required_markers": [
        "def test_",
        "monkeypatch",
        "retry_id",
        "ACK_AFTER_WRITE",
        "deadline_seconds",
        "assert"
      ],
      "followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your test answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
      "followup_required_markers": [
        "- Early anchor:",
        "- Middle anchor:",
        "- Late anchor:",
        "BUG-ACK-01",
        "TRACE-ACK-09",
        "VERIFY-ACK-22"
      ],
      "followup_format_rule": "three_bullets",
      "prompt": "Return only Python code. Write one focused pytest module for the duplicate-ack scheduler regression described in the packet. Requirements: include one deterministic test with monkeypatch or fakes, model the retry_id race, assert that only one logical job commit wins, and make the failure impossible to miss in CI. Use only standard pytest patterns and do not wrap the answer in markdown fences.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}",
      "context_stress": {
        "bands": [
          0.9
        ],
        "reserved_output_tokens": 1500,
        "minimum_context_tokens": 2048,
        "record_prefix": "Regression packet item",
        "context_intro": "Regression packet for a Python scheduler service where retry timing and ack ordering can duplicate downstream side effects. The target is one surgical pytest module, not a whole test suite.",
        "anchors": {
          "early": [
            "BUG-ACK-01: duplicate downstream outputs started after ACK_AFTER_WRITE=true shipped in rc3. The regression test must exercise that ordering change directly.",
            "BUG-ACK-03: deadline_seconds tightened from 90 to 45 in the same release, making the retry pickup easier to trigger."
          ],
          "middle": [
            "TRACE-ACK-09: the retry worker increments retry_id before the original worker calls ack(), so the test needs two ownership paths and one delayed ack.",
            "TRACE-ACK-14: the original write succeeds before the retry starts fan-out, which is why the bug is duplicate side effects rather than missing persistence."
          ],
          "late": [
            "VERIFY-ACK-22: the regression test must prove that only one logical job commit and one notify path are treated as authoritative after the fix.",
            "VERIFY-ACK-24: a fake clock or explicit delay hook is required so the race is deterministic instead of relying on sleeping threads."
          ]
        },
        "records": [
          "Service behavior: one worker writes job completion and a downstream notify record, then acknowledges the queue lease. Retry logic watches deadline expiry and can spawn a second worker for the same logical payload.",
          "Historical assumption: ack happened before write, so retry pickup rarely overlapped a durable write. After rc3 that assumption no longer holds.",
          "Testing note: a good regression harness can stub the notifier and collect emitted payload_ids. Duplicate notification is easier to assert than raw queue internals.",
          "Operator note: Redis restarts and clock warnings were noisy but non-causal. The test should focus on ordering and ownership, not infrastructure flakiness.",
          "Implementation note: a fake clock or injectable now() hook is preferred over thread sleeps because CI latency is too variable for a race test.",
          "Acceptance note: if the fix works, either the retry worker or the original worker should stand down cleanly, but never both proceed to external notify.",
          "CI note: the test should fail loudly with a short diff if duplicate notification happens. Silent counting helpers are harder to trust in review.",
          "Code review note: a focused module with one strong regression test is worth more than many weak permutations for this specific benchmark."
        ]
      }
    },
    {
      "id": "context_edge_change_review_packet",
      "title": "Context Edge Change Review Packet",
      "category": "review",
      "format_rule": "json_dict",
      "num_predict": 700,
      "required_markers": [
        "gateway/run.py",
        "run_python_task_suite.py",
        "status.json",
        "20260408_add_job_owner.sql",
        "discord",
        "migration"
      ],
      "followup_prompt": "In exactly three bullet lines, recall the specific early, middle, and late packet facts that most changed your review answer. Use these exact labels:\\n- Early anchor:\\n- Middle anchor:\\n- Late anchor:\\nEach line must include the relevant packet ID if present.",
      "followup_required_markers": [
        "- Early anchor:",
        "- Middle anchor:",
        "- Late anchor:",
        "REV-03",
        "REV-17",
        "REV-29"
      ],
      "followup_format_rule": "three_bullets",
      "prompt": "Return only JSON. You are reviewing a large mixed diff packet that spans Python services, telemetry tooling, Discord gateway behavior, documentation, and one database migration.\\n\\nLong packet:\\n{{CONTEXT_STRESS_BLOCK}}\\n\\nTask: produce a review packet with output keys exactly: likely_regressions, riskiest_files, missing_tests, rollout_risk, safe_merge_condition. Constraints: mention gateway/run.py, run_python_task_suite.py, status.json, 20260408_add_job_owner.sql, and Discord or gateway behavior where relevant.",
      "context_stress": {
        "bands": [
          0.9
        ],
        "reserved_output_tokens": 1100,
        "minimum_context_tokens": 2048,
        "record_prefix": "Review packet item",
        "context_intro": "Mixed diff packet assembled from review summaries, file notes, test output snippets, and rollout comments. The challenge is to surface concrete regressions instead of repeating generic code-review advice.",
        "anchors": {
          "early": [
            "REV-03: migration 20260408_add_job_owner.sql adds a non-null job_owner column without a documented backfill for existing rows. That can fail immediately on populated databases.",
            "REV-06: scheduler read paths were updated in code, but one admin query still selects the old nullable shape in a report view."
          ],
          "middle": [
            "REV-17: gateway/run.py changed reconnect behavior for stale provider responses, but no test proves how Discord handles empty assistant content or one clean timeout followed by a retry.",
            "REV-21: run_python_task_suite.py now records prompt hashes and context metadata, yet no report-level check verifies the new keys are preserved."
          ],
          "late": [
            "REV-29: the report builder renamed one status field, but dashboards and status.json examples in docs still use the old key. That will silently break HTML rendering if merged together.",
            "REV-34: rollout notes assume the migration and report schema can deploy independently, but the dashboard pull path still reads both in the same morning workflow."
          ]
        },
        "records": [
          "Diff summary: gateway/run.py now waits longer before declaring the provider stale, and the Discord adapter emits one new reconnect warning line. No new fixture captures an empty successful response body.",
          "Telemetry summary: run_python_task_suite.py expanded to support context-stress prompts, custom follow-up prompts, and prompt metadata files. The status markdown and report builder were only partially updated.",
          "Migration summary: 20260408_add_job_owner.sql introduces explicit ownership on scheduled jobs so the UI can attribute work cleanly. The migration note mentions new writes, not legacy rows.",
          "Dashboard summary: one HTML manual still expects the previous status key name from status.json and has no assertion on unknown-field fallback.",
          "Docs summary: operator docs were refreshed for the new Discord reconnect wording, but one screenshots guide still references the prior warning text verbatim.",
          "Review note: the riskiest interactions are schema plus runtime reads, and telemetry JSON plus dashboard consumption. Pure doc edits are comparatively safe.",
          "Testing note: there are unit tests around scheduler ownership writes and a separate smoke test for Discord login, but nothing that exercises both the new reconnect path and empty assistant content.",
          "Rollout note: support wants the dashboard alive on the same morning the migration lands. That makes silent telemetry-key drift more expensive than a normal internal-only contract change."
        ]
      }
    }
  ]
}
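Reviewer sketch (not part of the committed files): each `context_stress` block above carries `bands`, `reserved_output_tokens`, `minimum_context_tokens`, `record_prefix`, a `context_intro`, three anchor lists and a pool of `records`, and each prompt contains a `{{CONTEXT_STRESS_BLOCK}}` placeholder. This diff does not show how the harness turns those fields into a rendered prompt, so the following is only one plausible reading. The budget formula, the 4-characters-per-token estimate, the placement of anchors between filler records, and the helper names `estimate_tokens`, `build_stress_block` and `render_prompt` are all assumptions made for illustration.

```python
import math
from typing import Optional

PLACEHOLDER = "{{CONTEXT_STRESS_BLOCK}}"  # placeholder text taken from the suite prompts


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token; deterministic and stdlib-only.
    return max(1, math.ceil(len(text) / 4))


def target_prompt_tokens(max_context_tokens: int, band_fraction: float,
                         reserved_output_tokens: int) -> int:
    # Assumed formula: take the band's share of the window, then leave room for output.
    if not 0.0 < band_fraction < 1.0:
        raise ValueError("band_fraction must lie strictly between 0 and 1")
    if reserved_output_tokens < 0:
        raise ValueError("reserved_output_tokens must not be negative")
    return max(0, int(max_context_tokens * band_fraction) - reserved_output_tokens)


def build_stress_block(cfg: dict, target_tokens: int) -> str:
    # Anchors are always included; numbered records are cycled in a fixed order
    # until the estimated budget is used up, so the result is reproducible.
    anchors = cfg["anchors"]
    fixed = [cfg["context_intro"], *anchors["early"], *anchors["middle"], *anchors["late"]]
    budget = target_tokens - estimate_tokens("\n".join(fixed))
    records = cfg["records"]
    filler = []
    i = 0
    while records:
        line = f'{cfg["record_prefix"]} {i + 1}: {records[i % len(records)]}'
        cost = estimate_tokens(line)
        if cost > budget:
            break
        filler.append(line)
        budget -= cost
        i += 1
    half = len(filler) // 2
    return "\n".join([cfg["context_intro"], *anchors["early"], *filler[:half],
                      *anchors["middle"], *filler[half:], *anchors["late"]])


def render_prompt(prompt: str, cfg: dict, max_context_tokens: int,
                  band: float) -> Optional[str]:
    # Assumption: models whose window is below minimum_context_tokens are skipped.
    if max_context_tokens < cfg["minimum_context_tokens"]:
        return None
    target = target_prompt_tokens(max_context_tokens, band, cfg["reserved_output_tokens"])
    return prompt.replace(PLACEHOLDER, build_stress_block(cfg, target))
```

Under this assumed formula, a 32768-token window at band 0.9 with 1400 reserved output tokens gives a prompt budget of 28091 tokens. The real runner may compute the budget differently.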
310
harness/suites/python_task_suite_questions.json
Normal file
@@ -0,0 +1,310 @@
{
  "suite_name": "overnight-python-telemetry-v2-real-context",
  "version": "2.0",
  "purpose": "A deterministic overnight suite for evaluating big and small Ollama models on vps50 with harder multi-file prompts shaped after Slobodan's real implementation, review, debugging, and orchestration asks.",
  "models": [
    {
      "model": "qwen32-coder-32k",
      "display_name": "Qwen32 Coder 32k",
      "size_label": "32b"
    },
    {
      "model": "qwen14-coder-32k",
      "display_name": "Qwen14 Coder 32k",
      "size_label": "14b"
    },
    {
      "model": "codestral-32k",
      "display_name": "Codestral 32k",
      "size_label": "22b"
    },
    {
      "model": "codellama34-16k",
      "display_name": "CodeLlama 34 16k",
      "size_label": "34b"
    },
    {
      "model": "phind34-16k",
      "display_name": "Phind 34 16k",
      "size_label": "34b"
    },
    {
      "model": "qwen14-general-32k",
      "display_name": "Qwen14 General 32k",
      "size_label": "14b"
    },
    {
      "model": "qwen2.5-coder:3b",
      "display_name": "Qwen2.5 Coder 3B",
      "size_label": "3b"
    },
    {
      "model": "qwen2.5-coder:1.5b",
      "display_name": "Qwen2.5 Coder 1.5B",
      "size_label": "1.5b"
    },
    {
      "model": "qwen2.5:3b",
      "display_name": "Qwen2.5 3B",
      "size_label": "3b"
    },
    {
      "model": "llama3.2:3b",
      "display_name": "Llama 3.2 3B",
      "size_label": "3b"
    },
    {
      "model": "phi3",
      "display_name": "Phi-3 Mini",
      "size_label": "3.8b"
    }
  ],
  "questions": [
    {
      "id": "myboard_auth_redirect_triage",
      "title": "MyBoard Auth Redirect Triage",
      "category": "debugging",
      "format_rule": "json_dict",
      "num_predict": 700,
      "context_files": [
        "MyBoard/app/api.py",
        "MyBoard/web/nuxt-app/app/composables/useSession.ts",
        "MyBoard/web/nuxt-app/app/middleware/auth.global.ts",
        "MyBoard/web/nuxt-app/app/pages/login.vue",
        "MyBoard/tests/browser/flow-coverage-manifest.json",
        "MyBoard/docs/user flows/access-onboarding-and-account-flows.md"
      ],
      "prompt": "Return only JSON. You are debugging a real MyBoard issue where password login can succeed but the user still lands on the wrong route or loses tenant context.\nTask: produce a repo-grounded triage packet.\nContext files:\n1. MyBoard/app/api.py: exposes /auth/login and /auth/me.\n2. MyBoard/web/nuxt-app/app/composables/useSession.ts: loginWithPassword(), setSession(), fetchContext(), applyTenant(), applyProject().\n3. MyBoard/web/nuxt-app/app/middleware/auth.global.ts: redirects to /login, /tenant-missing, /403 and reads myboard:post-login-redirect.\n4. MyBoard/web/nuxt-app/app/pages/login.vue: saveRedirectTarget(), redirectAfterLogin(), resolveOidcErrorMessage().\n5. MyBoard/tests/browser/flow-coverage-manifest.json: login and tenant-missing flows are acceptance coverage.\n6. MyBoard/docs/user flows/access-onboarding-and-account-flows.md: expected post-login workspace behavior.\nOutput keys exactly: issue_summary, likely_root_causes, backend_touchpoints, frontend_touchpoints, tests_to_run, safe_fix_plan.\nConstraints: mention /auth/login, /auth/me, myboard:post-login-redirect, tenant-missing, X-Tenant, and do not invent files outside the context list.",
      "required_markers": [
        "/auth/login",
        "/auth/me",
        "useSession.ts",
        "auth.global.ts",
        "login.vue",
        "tenant-missing",
        "myboard:post-login-redirect",
        "X-Tenant"
      ]
    },
    {
      "id": "myboard_board_snapshot_regression_test",
      "title": "Board Snapshot Regression Test",
      "category": "tests",
      "format_rule": "pytest_code",
      "num_predict": 900,
      "context_files": [
        "MyBoard/app/api.py",
        "MyBoard/app/models.py",
        "MyBoard/tests/api/test_board_snapshot.py",
        "MyBoard/tests/api/test_task_bulk_jobs.py",
        "MyBoard/web/nuxt-app/app/composables/queries/boards.ts",
        "MyBoard/web/nuxt-app/app/pages/board/[[projectSlug]].vue"
      ],
      "prompt": "Return only Python code. Write one focused pytest module for a real MyBoard regression around /board/snapshot after bulk task assignment and lane movement.\nContext files:\n1. MyBoard/app/api.py: exposes /board/snapshot, /tasks/{task_id}, and bulk job endpoints.\n2. MyBoard/app/models.py: workflow statuses and lane ordering are defined here.\n3. MyBoard/tests/api/test_board_snapshot.py: current lane-count and project-scope coverage.\n4. MyBoard/tests/api/test_task_bulk_jobs.py: helper patterns for creating stories/tasks and checking snapshot sync after assign/move.\n5. MyBoard/web/nuxt-app/app/composables/queries/boards.ts: frontend expects board data to stay lane-consistent.\n6. MyBoard/web/nuxt-app/app/pages/board/[[projectSlug]].vue: board page consumes the snapshot.\nRequirements: include async auth helper, create at least one story and one task, move the task to session, verify /board/snapshot returns the task in the session lane, and assert assignee_ids survive the move. Do not invent endpoints outside the context list.",
      "required_markers": [
        "def test_",
        "/board/snapshot",
        "/tasks/",
        "session",
        "assignee_ids",
        "story_id",
        "api_client"
      ]
    },
    {
      "id": "myboard_lane_config_patch_plan",
      "title": "Lane Config Patch Plan",
      "category": "planning",
      "format_rule": "json_dict",
      "num_predict": 700,
      "context_files": [
        "MyBoard/app/models.py",
        "MyBoard/app/api.py",
        "MyBoard/tests/api/test_lane_config.py",
        "MyBoard/web/nuxt-app/app/composables/queries/lane-config.ts",
        "MyBoard/web/nuxt-app/app/lib/workflow.ts",
        "MyBoard/web/nuxt-app/app/pages/board/[[projectSlug]].vue"
      ],
      "prompt": "Return only JSON. A new regression report says project lane overrides can drift from the canonical workflow and confuse the board page.\nTask: prepare a concrete patch plan.\nContext files:\n1. MyBoard/app/models.py: default_lane_sequence() and workflow enums are canonical.\n2. MyBoard/app/api.py: organization and project lane-config endpoints live here.\n3. MyBoard/tests/api/test_lane_config.py: round-trip and inheritance tests already exist.\n4. MyBoard/web/nuxt-app/app/composables/queries/lane-config.ts: frontend query behavior.\n5. MyBoard/web/nuxt-app/app/lib/workflow.ts: frontend lane semantics.\n6. MyBoard/web/nuxt-app/app/pages/board/[[projectSlug]].vue: board rendering depends on effective lanes.\nOutput keys exactly: regression_summary, invariants_to_protect, backend_changes, frontend_changes, tests_to_add, rollout_checks.\nConstraints: mention default_lane_sequence, use_organization_default, effective_lanes, /organizations/{organization_id}/lane-config, /projects/{project_id}/lane-config, and do not invent new persistence layers.",
      "required_markers": [
        "default_lane_sequence",
        "use_organization_default",
        "effective_lanes",
        "/organizations/{organization_id}/lane-config",
        "/projects/{project_id}/lane-config",
        "test_lane_config.py"
      ]
    },
    {
      "id": "myboard_api_token_audit_regression_test",
      "title": "API Token Audit Regression Test",
      "category": "tests",
      "format_rule": "pytest_code",
      "num_predict": 900,
      "context_files": [
        "MyBoard/app/api.py",
        "MyBoard/app/models.py",
        "MyBoard/tests/api/test_api_tokens.py",
        "MyBoard/web/nuxt-app/app/composables/queries/api-tokens.ts",
        "MyBoard/web/nuxt-app/app/pages/settings/api-tokens.vue",
        "MyBoard/contracts/myboard-api.openapi.json"
      ],
      "prompt": "Return only Python code. Write one pytest module that hardens the API token lifecycle against an audit-ordering regression.\nContext files:\n1. MyBoard/app/api.py: /api-tokens endpoints, regenerate, revoke, and audits.\n2. MyBoard/app/models.py: APIToken, APITokenAudit, APITokenAction.\n3. MyBoard/tests/api/test_api_tokens.py: existing lifecycle coverage and auth header pattern.\n4. MyBoard/web/nuxt-app/app/composables/queries/api-tokens.ts: frontend sorts audits descending by created timestamp.\n5. MyBoard/web/nuxt-app/app/pages/settings/api-tokens.vue: UI expects regenerated and revoked tokens to refresh correctly.\n6. MyBoard/contracts/myboard-api.openapi.json: contract surface must stay aligned.\nRequirements: include create, machine-use, regenerate, revoke, audit fetch, and assertions that CREATED, REGENERATED, and REVOKED are all present in audit history and the revoked token is inactive. Do not use code fences.",
      "required_markers": [
        "def test_",
        "/api-tokens",
        "/auth/me",
        "APITokenAudit",
        "REGENERATED",
        "REVOKED",
        "CREATED"
      ]
    },
    {
      "id": "myboard_announcements_state_sync_review",
      "title": "Announcements State Sync Review",
      "category": "review",
      "format_rule": "json_dict",
      "num_predict": 700,
      "context_files": [
        "MyBoard/app/api.py",
        "MyBoard/app/models.py",
        "MyBoard/tests/api/test_announcements.py",
        "MyBoard/web/nuxt-app/app/composables/queries/announcements.ts",
        "MyBoard/web/nuxt-app/app/pages/announcements.vue",
        "MyBoard/docs/app-feature-inventory.md"
      ],
      "prompt": "Return only JSON. Review a suspected frontend/backend sync bug where announcement read and dismiss state can diverge after mark-all-read and list refresh.\nContext files:\n1. MyBoard/app/api.py: CRUD, read, unread, dismiss, undismiss, and mark-all-read endpoints.\n2. MyBoard/app/models.py: announcement read/dismiss persistence objects live here.\n3. MyBoard/tests/api/test_announcements.py: current API coverage.\n4. MyBoard/web/nuxt-app/app/composables/queries/announcements.ts: mergeAnnouncement(), mark-all-read cache behavior, include_dismissed handling.\n5. MyBoard/web/nuxt-app/app/pages/announcements.vue: UI depends on query cache correctness.\n6. MyBoard/docs/app-feature-inventory.md: announcements are a user-visible feature surface.\nOutput keys exactly: failure_modes, most_suspicious_cache_paths, backend_contract_checks, frontend_fix_options, regression_tests, rollout_risk.\nConstraints: mention mergeAnnouncement, include_dismissed, /announcements/mark-all-read, /announcements/{announcement_id}/dismiss, /announcements/{announcement_id}/read, and dismissed.",
      "required_markers": [
        "mergeAnnouncement",
        "include_dismissed",
        "/announcements/mark-all-read",
        "/announcements/{announcement_id}/dismiss",
        "/announcements/{announcement_id}/read",
        "dismissed"
      ]
    },
    {
      "id": "myboard_feature_flag_lifecycle_test",
      "title": "Feature Flag Lifecycle Test",
      "category": "tests",
      "format_rule": "pytest_code",
      "num_predict": 900,
      "context_files": [
        "MyBoard/app/api.py",
        "MyBoard/app/models.py",
        "MyBoard/tests/api/test_feature_flags.py",
        "MyBoard/web/nuxt-app/app/composables/queries/feature-flags.ts",
        "MyBoard/web/nuxt-app/app/pages/admin/index.vue",
        "MyBoard/contracts/myboard-api.openapi.json"
      ],
      "prompt": "Return only Python code. Write one pytest module for a real MyBoard feature-flag regression around environment toggles and history.\nContext files:\n1. MyBoard/app/api.py: feature-flag and feature-flag-environment endpoints.\n2. MyBoard/app/models.py: FeatureFlag, FeatureFlagEnvironment, FeatureFlagHistory, FeatureFlagState.\n3. MyBoard/tests/api/test_feature_flags.py: current lifecycle coverage.\n4. MyBoard/web/nuxt-app/app/composables/queries/feature-flags.ts: frontend expects detail and history cache to stay aligned.\n5. MyBoard/web/nuxt-app/app/pages/admin/index.vue: admin console consumes this data.\n6. MyBoard/contracts/myboard-api.openapi.json: response shapes must remain stable.\nRequirements: create two environments, create one flag, toggle dev to enabled with rollout percentage, fetch history, verify latest history action is toggle, verify non-admin toggle is rejected with 403, and verify delete cleanup. Do not invent helper libraries.",
      "required_markers": [
        "def test_",
        "/feature-flags",
        "/feature-flag-environments",
        "/history",
        "rollout_percentage",
        "403",
        "toggle"
      ]
    },
    {
      "id": "myboard_task_bulk_job_debug_packet",
      "title": "Task Bulk Job Debug Packet",
      "category": "debugging",
      "format_rule": "json_dict",
      "num_predict": 750,
      "context_files": [
        "MyBoard/app/api.py",
        "MyBoard/app/models.py",
        "MyBoard/tests/api/test_task_bulk_jobs.py",
        "MyBoard/web/nuxt-app/app/composables/queries/task-bulk.ts",
        "MyBoard/web/nuxt-app/app/composables/queries/boards.ts",
        "MyBoard/web/nuxt-app/app/components/ui/productivity/BulkActionToolbar.vue"
      ],
      "prompt": "Return only JSON. A production-like report says bulk assign jobs complete, but some task detail panels and board lanes stay stale until a hard refresh.\nTask: produce a debug packet grounded in the repo.\nContext files:\n1. MyBoard/app/api.py: /tasks/bulk/jobs, /tasks/bulk/preview, /tasks/{task_id}, /board/snapshot.\n2. MyBoard/app/models.py: TaskBulkJob, TaskBulkJobEntry, task assignment fields.\n3. MyBoard/tests/api/test_task_bulk_jobs.py: happy-path completion and board sync tests.\n4. MyBoard/web/nuxt-app/app/composables/queries/task-bulk.ts: invalidateBulkAffectedTaskCaches() and polling behavior.\n5. MyBoard/web/nuxt-app/app/composables/queries/boards.ts: board query cache consumers.\n6. MyBoard/web/nuxt-app/app/components/ui/productivity/BulkActionToolbar.vue: user trigger surface.\nOutput keys exactly: suspected_root_causes, cache_invalidation_gaps, backend_checks, frontend_checks, additional_tests, smallest_safe_fix.\nConstraints: mention invalidateBulkAffectedTaskCaches, queryKeys.bulkJobs.detail, queryKeys.boards.lanes, /tasks/bulk/jobs/{job_id}, /board/snapshot, and assignee_ids.",
      "required_markers": [
        "invalidateBulkAffectedTaskCaches",
        "queryKeys.bulkJobs.detail",
        "queryKeys.boards.lanes",
        "/tasks/bulk/jobs/{job_id}",
        "/board/snapshot",
        "assignee_ids"
      ]
    },
    {
      "id": "myboard_user_preferences_contract_test",
      "title": "User Preferences Contract Test",
      "category": "tests",
      "format_rule": "pytest_code",
      "num_predict": 950,
      "context_files": [
        "MyBoard/app/api.py",
        "MyBoard/app/models.py",
        "MyBoard/app/services/preferences.py",
        "MyBoard/tests/api/test_user_preferences.py",
        "MyBoard/web/nuxt-app/app/stores/preferences.ts",
        "MyBoard/web/nuxt-app/app/plugins/00-preferences-bootstrap.ts"
      ],
      "prompt": "Return only Python code. Write one pytest module that strengthens the user-preferences contract around nested payload updates and theme preview.\nContext files:\n1. MyBoard/app/api.py: /user/preferences and /user/preferences/theme-preview endpoints.\n2. MyBoard/app/models.py: ThemeMode, ThemePreset, BoardViewPreference, UserPreferences.\n3. MyBoard/app/services/preferences.py: normalization and validation live here.\n4. MyBoard/tests/api/test_user_preferences.py: existing nested payload coverage.\n5. MyBoard/web/nuxt-app/app/stores/preferences.ts: frontend consumes the persisted shape.\n6. MyBoard/web/nuxt-app/app/plugins/00-preferences-bootstrap.ts: bootstrap path depends on stable defaults.\nRequirements: include auth helper, one successful nested update assertion, one invalid timezone assertion, one theme-preview non-persistence assertion, and direct checks for locale, theme preset, and board default lane. Do not emit markdown fences.",
      "required_markers": [
        "def test_",
        "/user/preferences",
        "/user/preferences/theme-preview",
        "ThemePreset",
        "locale",
        "timezone",
        "default_lane"
      ]
    },
    {
      "id": "myboard_orchestration_timeline_forensics",
      "title": "Orchestration Timeline Forensics",
      "category": "forensics",
      "format_rule": "json_dict",
      "num_predict": 800,
      "context_files": [
        "MyBoard/app/api.py",
        "MyBoard/app/models.py",
        "MyBoard/tests/api/test_orchestration_events.py",
        "MyBoard/docs/user flows/orchestration-and-dependency-api-flows.md",
        "MyBoard/web/nuxt-app/app/pages/admin/index.vue",
        "MyBoard/web/nuxt-app/app/composables/queries/meta.ts"
      ],
      "prompt": "Return only JSON. You are investigating a real operator complaint: run history exists, but retry chains and handoff evidence are hard to explain from the admin surface.\nTask: produce a forensics packet.\nContext files:\n1. MyBoard/app/api.py: orchestration event, run, dependency, failure, and timeline endpoints.\n2. MyBoard/app/models.py: OrchestrationRun and related enums and evidence structures.\n3. MyBoard/tests/api/test_orchestration_events.py: canonical event ingestion, retry, and handoff timeline expectations.\n4. MyBoard/docs/user flows/orchestration-and-dependency-api-flows.md: user-visible operator flows.\n5. MyBoard/web/nuxt-app/app/pages/admin/index.vue: admin console surface.\n6. MyBoard/web/nuxt-app/app/composables/queries/meta.ts: operator metadata fetch patterns.\nOutput keys exactly: operator_problem_statement, timeline_questions_to_answer, endpoints_to_query, evidence_fields_that_matter, missing_tests, recommended_ui_improvements.\nConstraints: mention /orchestration/events, /orchestration/runs, /orchestration/dependencies, handoff_requested, run_failed, and retry chain.",
      "required_markers": [
        "/orchestration/events",
        "/orchestration/runs",
        "/orchestration/dependencies",
        "handoff_requested",
        "run_failed",
        "retry"
      ]
    },
    {
      "id": "truthgraph_ingest_log_triage",
      "title": "TruthGraph Ingest Log Triage",
      "category": "cross_repo_debugging",
      "format_rule": "json_dict",
      "num_predict": 800,
      "context_files": [
        "Earth/PHASE2_PROMPT_COMPLEXITY_METRIC_V1.md",
        "TruthGraph/docs/TRUTHGRAPH_DOC_INGESTION_CONTRACT.md",
        "TruthGraph/contracts/doc_ingest_manifest.schema.json",
        "TruthGraph/internal/query/resolve_context.go",
        "TruthGraph/internal/truthgraph/ingest/preflight/preflight.go",
        "TruthGraph/cmd/truthgraph/status.go"
      ],
      "prompt": "Return only JSON. This task mirrors Slobodan's real cross-repo debugging asks. Given a TruthGraph ingest run that discovers repositories but later produces stale or incomplete query answers, produce a triage packet.\nContext files:\n1. Earth/PHASE2_PROMPT_COMPLEXITY_METRIC_V1.md: prompts above threshold should be decomposed.\n2. TruthGraph/docs/TRUTHGRAPH_DOC_INGESTION_CONTRACT.md: intended doc-ingest behavior.\n3. TruthGraph/contracts/doc_ingest_manifest.schema.json: manifest contract surface.\n4. TruthGraph/internal/query/resolve_context.go: context resolution path.\n5. TruthGraph/internal/truthgraph/ingest/preflight/preflight.go: ingest preflight checks.\n6. TruthGraph/cmd/truthgraph/status.go: operator-visible status reporting.\nOutput keys exactly: observed_symptoms, likely_failure_surfaces, preflight_checks, status_gaps, code_paths_to_review, follow_up_commands.\nConstraints: mention doc_ingest_manifest.schema.json, resolve_context, preflight, status, stale index, and prompt complexity. Do not invent files outside the context list.",
      "required_markers": [
        "doc_ingest_manifest.schema.json",
        "resolve_context",
        "preflight",
        "status",
        "stale",
        "prompt complexity"
      ]
    }
  ]
}
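Reviewer sketch (not part of the committed files): every question in the suite above pairs a `format_rule` with a list of `required_markers`. The scoring code is not shown in this diff, so the following is only a guess at the simplest check that fits those fields. The helper names `missing_markers` and `passes_json_dict`, the plain substring match, and the `ignore_case` switch are assumptions made for illustration; the real runner may normalise case, strip fences, or weight markers differently.

```python
import json
from typing import List


def missing_markers(response: str, required_markers: List[str],
                    ignore_case: bool = False) -> List[str]:
    # Assumption: a marker counts as present when it occurs as a plain substring.
    haystack = response.lower() if ignore_case else response
    missing = []
    for marker in required_markers:
        needle = marker.lower() if ignore_case else marker
        if needle not in haystack:
            missing.append(marker)
    return missing


def passes_json_dict(response: str) -> bool:
    # Assumed meaning of format_rule "json_dict": the whole reply parses as a JSON object.
    try:
        return isinstance(json.loads(response), dict)
    except json.JSONDecodeError:
        return False
```

Case handling matters for suites like these: a lowercase marker such as `discord` would be reported missing from a reply that only says "Discord" unless `ignore_case=True` is passed.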
74
harness/suites/small_model_eval_questions.json
Normal file
@@ -0,0 +1,74 @@
{
  "suite_name": "small-model-coding-eval-v1",
  "version": "1.0",
  "purpose": "A deterministic five-question coding and DevOps comparison suite for smaller Ollama models on vps50.",
  "models": [
    {
      "model": "qwen2.5-coder:3b",
      "display_name": "Qwen2.5 Coder 3B",
      "size_label": "3b"
    },
    {
      "model": "qwen2.5-coder:1.5b",
      "display_name": "Qwen2.5 Coder 1.5B",
      "size_label": "1.5b"
    },
    {
      "model": "qwen2.5:3b",
      "display_name": "Qwen2.5 3B",
      "size_label": "3b"
    },
    {
      "model": "llama3.2:3b",
      "display_name": "Llama 3.2 3B",
      "size_label": "3b"
    },
    {
      "model": "phi3",
      "display_name": "Phi-3 Mini",
      "size_label": "3.8b"
    }
  ],
  "questions": [
{
|
||||
"id": "disk_guard_bash",
|
||||
"title": "Disk Guard Script",
|
||||
"category": "shell",
|
||||
"prompt": "Return only Bash code. Write a script that checks disk usage for /, prints a human-readable warning, and exits with status 1 when usage is above 85 percent. Requirements: include a shebang, use df -P /, parse the numeric percentage, and keep the script production-safe.",
|
||||
"required_markers": ["#!/usr/bin/env bash", "df -P /", "85", "exit 1"],
|
||||
"format_rule": "bash_code"
|
||||
},
|
||||
{
|
||||
"id": "ipv4_python_tests",
|
||||
"title": "IPv4 Validator",
|
||||
"category": "python",
|
||||
"prompt": "Return only Python code. Write a function named is_valid_ipv4(value: str) -> bool and include exactly three pytest tests that cover a valid address, an out-of-range octet, and a non-numeric input.",
|
||||
"required_markers": ["def is_valid_ipv4", "def test_", "assert", "split('.')"],
|
||||
"format_rule": "python_code"
|
||||
},
|
||||
{
|
||||
"id": "nginx_safe_reload",
|
||||
"title": "Nginx Safe Reload",
|
||||
"category": "ops",
|
||||
"prompt": "Return only Bash commands, one per line. Back up /etc/nginx/nginx.conf, validate nginx config, and reload nginx only if validation passes.",
|
||||
"required_markers": ["cp /etc/nginx/nginx.conf", "nginx -t", "systemctl reload nginx", "&&"],
|
||||
"format_rule": "shell_lines"
|
||||
},
|
||||
{
|
||||
"id": "yaml_cli_plan",
|
||||
"title": "YAML Validator Plan",
|
||||
"category": "planning",
|
||||
"prompt": "Return exactly four numbered steps. Plan a Python CLI that scans a git repo for changed YAML files, validates them against a JSON schema, and exits nonzero on failure.",
|
||||
"required_markers": ["1.", "2.", "3.", "4.", "JSON schema", "git"],
|
||||
"format_rule": "four_numbered_steps"
|
||||
},
|
||||
{
|
||||
"id": "ssh_lockout_triage",
|
||||
"title": "SSH Lockout Triage",
|
||||
"category": "debugging",
|
||||
"prompt": "Return exactly five bullet points. After hardening, SSH started returning Permission denied (publickey,password). List the safest first checks before changing config. Mention sshd_config, authorized_keys, journalctl, rollback, and PasswordAuthentication.",
|
||||
"required_markers": ["sshd_config", "authorized_keys", "journalctl", "rollback", "PasswordAuthentication"],
|
||||
"format_rule": "five_bullets"
|
||||
}
|
||||
]
|
||||
}
|
||||
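Each question in the suite above pairs a prompt with a required_markers list. How run_benchmark.py scores those markers is not shown in this diff, so the following is only a minimal sketch under one assumption: that a marker counts as present when it appears as a literal, case-sensitive substring of the model response. The helper name missing_markers and the two sample responses are hypothetical and exist only for illustration.

```python
import json

# Hypothetical helper, not taken from run_benchmark.py.
# Assumption: a marker "passes" when it is a literal, case-sensitive
# substring of the response text.
def missing_markers(question: dict, response: str) -> list[str]:
    """Return the required markers that do not appear in the response."""
    return [m for m in question.get("required_markers", []) if m not in response]

# Trimmed copy of one question from small_model_eval_questions.json.
SUITE_JSON = """
{
  "questions": [
    {
      "id": "nginx_safe_reload",
      "required_markers": ["cp /etc/nginx/nginx.conf", "nginx -t", "systemctl reload nginx", "&&"]
    }
  ]
}
"""

suite = json.loads(SUITE_JSON)
question = suite["questions"][0]

# Illustrative responses, made up for this sketch (not real model output).
good = "cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.bak\nnginx -t && systemctl reload nginx"
bad = "nginx -t; systemctl reload nginx"

print(missing_markers(question, good))  # []
print(missing_markers(question, bad))   # ['cp /etc/nginx/nginx.conf', '&&']
```

Under the same literal-substring assumption, a response to ipv4_python_tests that calls value.split(".") with double quotes would miss the split('.') marker, so quote style alone could change a result.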