Files
weeyuga-benchmarks-public/harness/run_benchmark.py
Slobodan Margetic 97a9245d9e feat: harness + agent runbook — flip repo from archive-only to
crowdsourced runner

Sloba's chat directive 2026-05-06: "this project is preparation for
going public ... ship the harness along so others can join in."

The repo's original purpose (Ben's catalogue + 21 reference run
ledgers, shipped 2026-05-05) stays intact. This commit ADDS a second
purpose: a portable harness + agent runbook so a friend's coding agent
can clone, read CLAUDE.md, run the same suite on the friend's hardware,
and submit results back as a PR.

What landed:

CLAUDE.md + AGENTS.md (byte-identical, ~520 lines)
  Full agent runbook: hardware probe, runtime + model selection,
  canonical knob reference (Sloba's Pavilion methodology values),
  hardware-adaptation decision rules, run-instructions, output-schema
  templates for hardware.json + metadata.json + run.md, PR submission
  flow (fork → branch → push → PR; nothing auto-merges), privacy
  guardrails, methodology lineage. Per Sloba's Q3 directive: the
  runbook explicitly tells the friend's agent to ADAPT to hardware
  reality and document deviations rather than blindly run defaults.

CONTRIBUTING.md (~110 lines)
  Human-readable companion for the friend (not the agent). What you
  need, how it works, what we ask, what maintainers commit to,
  license, code-of-conduct short version.

harness/
  ├── README.md        Technical readme for the harness folder
  ├── run_benchmark.py ~520 LOC runner. Stdlib-only. Adapted from
  │                    WeeyugaWeb/scripts/benchmarks/run_pavilion_weeyuga.py
  │                    v3 with the cluster-internal IP defaults
  │                    (10.8.0.x) replaced by 127.0.0.1:11434, the
  │                    cluster /v1/cluster/* endpoints removed, the
  │                    canonical-suite paths under ~/Documents/MyServers
  │                    replaced by harness/suites/ paths, the git-sha
  │                    enforcement on WeeyugaWeb dropped, and the
  │                    output written under submissions/<handle>/<tag>/
  │                    instead of docs/BENCHMARKS/runs/. Supports all
  │                    six suite phases via --phases, plus 'all'.
  ├── prompts.py       Verbatim copy of the canonical 3 frozen prompts
  │                    (P-EASY/P-MEDIUM/P-HARD) from
  │                    WeeyugaWeb/scripts/benchmarks/prompts.py.
  ├── requirements.txt Empty by intent (stdlib-only); placeholder for
  │                    pip-tools / agent auto-install patterns.
  ├── .gitignore       __pycache__/ etc.
  └── suites/          Six bundled JSON suites copied verbatim from
       Sloba's MyServers/instances/vps-81-17-99-14/telemetry/:
       small_model_eval_questions.json, python_task_suite_questions.json,
       parallel_qwen_same_model_20q_suite.json,
       parallel_qwen_mixed_model_20q_suite.json,
       python_context_edge_append_questions.json,
       python_context_edge_suite_only.json.

submissions/
  README.md            Folder convention + naming + reviewability rules
  EXAMPLE/mac-m1-8gb/run-00000000-...-000000000000/
       Synthetic-but-shape-complete contribution template:
       manifest.json, hardware.json, run.jsonl (5 example lines),
       metadata.json, run.md (with privacy attestation, methodology
       deviations, reproducibility command). Marked as synthetic at
       the top so future analysis doesn't accidentally cite it.

LICENSE-MIT
  MIT for harness/*.py and future helper code. Existing LICENSE
  (CC-BY-4.0) covers data files.

README.md (modified)
  Updated to reflect dual purpose. Layout diagram updated.
  Maintainer credits: Ben for catalogue/methodology + Bane for harness.
  Contributor quick-start added. Status table extended.

Privacy posture:
  - All 6 suite JSON files privacy-scanned for cluster IPs / hostnames /
    paths / tokens. Two prompts contain project names ("MyBoard" auth
    debugging in 20Q-Q14, generic SSH troubleshooting in 5Q-Q03);
    flagged in chat for Sloba's review. Otherwise clean.
  - run_benchmark.py default target_url is 127.0.0.1:11434 (no internal
    IPs leaked).
  - manifest.json captures host_hostname_short via socket.gethostname()
    .split('.')[0] — agent should review before PR if hostname is
    sensitive.
  - CLAUDE.md §8 spells out the privacy-grep before push.

Verification:
  - py_compile run_benchmark.py: OK
  - --help renders cleanly
  - All 6 suite JSON files: valid
  - All 4 example JSON files: valid
  - Example run.jsonl (5 lines): valid

This commit lands on branch feature/runner-and-agent-instructions.
NOT pushed to main; staying on the feature branch until Sloba reviews
on Gitea and merges. Bus dispatch to Ben + Sam announcing the
architectural pivot lives in the WeeyugaWeb coordination repo.
2026-05-06 19:05:22 +02:00

646 lines
25 KiB
Python
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
#!/usr/bin/env python3
"""weeyuga-benchmarks-public — generic runner.
A portable, agent-driven version of the Weeyuga benchmark harness. Mirrors
the canonical Pavilion methodology (temperature=0.1, num_ctx=4096,
num_predict=2048, single-loaded-model single-parallel-slot) and runs every
suite in `harness/suites/` against every model the agent declares, in
sequence. One JSONL ledger captures every call.
This script is a TEMPLATE, not a one-click button.
The friend running this is expected to be working with a coding agent (Claude
Code, Codex, Aider, etc.). The agent reads CLAUDE.md / AGENTS.md to learn the
methodology, probes the friend's hardware, picks a target model + an
OpenAI-compatible runtime (Ollama / llama.cpp / vLLM / MLX / etc.), then
adapts this script to the friend's reality before running.
Read CLAUDE.md before reading this code.
The runner does ONE thing well: drive an OpenAI-compat /v1/chat/completions
endpoint with the canonical prompts, write a JSONL ledger of every call, and
emit a manifest. Everything else (hardware probe, model pull, runtime tuning,
output curation, PR submission) is the agent's job.
Usage examples:
# smoke (one cell × hello call only — verify the runtime is reachable)
python3 harness/run_benchmark.py --smoke \\
--target-url http://127.0.0.1:11434 \\
--models qwen3.5:0.8b \\
--cell-id-prefix mac:ollama \\
--submitter-handle alice \\
--device-tag mac-m1-8gb
# the canonical full suite (all phases × all models)
python3 harness/run_benchmark.py \\
--target-url http://127.0.0.1:11434 \\
--models qwen3.5:0.8b,qwen3.5:4b \\
--cell-id-prefix mac:ollama \\
--submitter-handle alice \\
--device-tag mac-m1-8gb
Required output goes under:
submissions/<submitter-handle>/<device-tag>/run-<benchmark-run-id>/
run.jsonl — canonical event stream (one JSON object per line)
manifest.json — who, what, when, where (no PII)
hardware.json — what device the run happened on (filled by agent)
metadata.json — computed aggregates (filled by --post-process or by hand)
run.md — human-readable summary (filled by agent or by
`harness/render_summary.py`)
"""
from __future__ import annotations
import argparse
import datetime as dt
import json
import os
import platform
import re
import socket
import sys
import time
import urllib.error
import urllib.request
import uuid
from pathlib import Path
from typing import Any
HARNESS_VERSION = "public-1"
REPO_ROOT = Path(__file__).resolve().parents[1]
SUITES_DIR = REPO_ROOT / "harness" / "suites"
SUBMISSIONS_DIR = REPO_ROOT / "submissions"
DEFAULT_TARGET_URL = os.environ.get(
"WEEYUGA_TARGET_URL", "http://127.0.0.1:11434"
)
DEFAULT_TIMEOUT_S = 360 # 6-minute hard wall-clock per call
HELLO_PROMPT = "hi can you help me?"
# Canonical knobs (Sloba's reference values from the Pavilion methodology).
# An agent running this on different hardware MAY override via flags or by
# editing this file — but the override has to be recorded in manifest.json
# `canonical_options_overrides` so the run is honestly comparable.
CANONICAL_OPTIONS = {
"temperature": 0.1,
"num_ctx": 4096,
"num_predict": 2048,
}
# Suite files in run order. ALL of these run, sequentially, per model, per
# this run's --phases setting. Don't reorder — comparability across runs
# depends on stable ordering. To skip a suite, use --phases.
SUITES = {
"5q": "small_model_eval_questions.json",
"20q": "python_task_suite_questions.json",
"parallel_same": "parallel_qwen_same_model_20q_suite.json",
"parallel_mixed": "parallel_qwen_mixed_model_20q_suite.json",
"edge_append": "python_context_edge_append_questions.json",
"edge_suite": "python_context_edge_suite_only.json",
}
DEFAULT_PHASES = "hello,5q,20q" # the safest "first run" set
# ── tiny utilities ──────────────────────────────────────────────────
def utc_now() -> str:
return dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
def utc_filename_stamp() -> str:
return dt.datetime.now(dt.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
def load_avg() -> tuple[float, float, float]:
try:
return os.getloadavg()
except (OSError, AttributeError):
return (0.0, 0.0, 0.0)
def log(message: str) -> None:
print(f"{utc_now()} {message}", flush=True)
def write_jsonl(handle, record: dict[str, Any]) -> None:
handle.write(json.dumps(record, ensure_ascii=False) + "\n")
handle.flush()
try:
os.fsync(handle.fileno())
except (OSError, ValueError):
pass
# ── HTTP layer ──────────────────────────────────────────────────────
def list_models(target_url: str, timeout: int = 15) -> list[str]:
"""Read /v1/models from the target. Standard OpenAI-compat endpoint."""
req = urllib.request.Request(f"{target_url.rstrip('/')}/v1/models")
try:
with urllib.request.urlopen(req, timeout=timeout) as resp:
body = json.loads(resp.read().decode("utf-8"))
except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError, OSError) as exc:
log(f"WARN: could not list models on {target_url} ({exc!r}); "
"agent must pass --models explicitly.")
return []
out: list[str] = []
for item in body.get("data", []):
if item.get("status") == "placeholder":
continue
mid = item.get("id")
if mid:
out.append(mid)
return out
def chat_completion(
target_url: str,
model: str,
user_prompt: str,
timeout: int = DEFAULT_TIMEOUT_S,
max_tokens_override: int | None = None,
canonical_options: dict | None = None,
) -> dict[str, Any]:
"""One non-streaming call. Returns a dict with timing + content + error."""
opts = dict(canonical_options or CANONICAL_OPTIONS)
body = {
"model": model,
"messages": [{"role": "user", "content": user_prompt}],
"stream": False,
"max_tokens": max_tokens_override or opts["num_predict"],
"temperature": opts["temperature"],
# Ollama-flavored extras; harmless on llama.cpp / vLLM (ignored).
"extra_body": {"options": opts},
}
req = urllib.request.Request(
f"{target_url.rstrip('/')}/v1/chat/completions",
data=json.dumps(body).encode("utf-8"),
headers={"Content-Type": "application/json"},
)
started = time.perf_counter()
out: dict[str, Any] = {
"duration_seconds": None,
"response_text": "",
"response_chars": 0,
"prompt_tokens": None,
"completion_tokens": None,
"total_tokens": None,
"tokens_per_second": None,
"finish_reason": None,
"status_code": None,
"error": None,
}
try:
with urllib.request.urlopen(req, timeout=timeout) as resp:
body_bytes = resp.read()
out["status_code"] = resp.status
elapsed = time.perf_counter() - started
out["duration_seconds"] = round(elapsed, 3)
payload = json.loads(body_bytes.decode("utf-8"))
choices = payload.get("choices") or []
if choices:
msg = choices[0].get("message") or {}
out["response_text"] = msg.get("content") or ""
out["finish_reason"] = choices[0].get("finish_reason")
out["response_chars"] = len(out["response_text"])
usage = payload.get("usage") or {}
out["prompt_tokens"] = usage.get("prompt_tokens")
out["completion_tokens"] = usage.get("completion_tokens")
out["total_tokens"] = usage.get("total_tokens")
if (
out["completion_tokens"]
and out["duration_seconds"]
and out["duration_seconds"] > 0
):
out["tokens_per_second"] = round(
out["completion_tokens"] / out["duration_seconds"], 2
)
except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError, OSError) as exc:
out["duration_seconds"] = round(time.perf_counter() - started, 3)
out["error"] = repr(exc)
except json.JSONDecodeError as exc:
out["duration_seconds"] = round(time.perf_counter() - started, 3)
out["error"] = f"JSONDecodeError: {exc}"
return out
# ── eval helpers (mirrored from canonical Pavilion harness) ─────────
def marker_hits(required: list[str], text: str) -> list[str]:
lowered = (text or "").lower()
return [m for m in required if m.lower() in lowered]
def format_ok(rule: str, text: str) -> bool:
"""Lightweight heuristic format-checker. Mirrors capture_small_model_eval.py."""
stripped = (text or "").strip()
lines = [line.strip() for line in stripped.splitlines() if line.strip()]
if rule == "bash_code":
return stripped.startswith("#!/usr/bin/env bash")
if rule == "python_code":
return (
"def is_valid_ipv4" in stripped
and sum(1 for line in lines if line.startswith("def test_")) >= 3
)
if rule == "shell_lines":
return (
all(not line.startswith(("1.", "- ", "* ")) for line in lines[:3])
and "nginx -t" in stripped
)
if rule == "four_numbered_steps":
return sum(1 for line in lines if re.match(r"^[1-4]\. ", line)) >= 4
if rule == "five_bullets":
return sum(1 for line in lines if line.startswith(("- ", "* "))) >= 5
if rule == "json_dict":
try:
parsed = json.loads(stripped)
return isinstance(parsed, dict)
except (json.JSONDecodeError, ValueError):
return False
if rule == "pytest_code":
return (
"def test_" in stripped
and ("import pytest" in stripped or "pytest" in stripped)
)
return False
# ── per-call record builder ─────────────────────────────────────────
def make_call_record(
cell_id: str,
model: str,
phase: str,
question_id: str,
prompt: str,
run_idx: int,
required_markers: list[str],
format_rule: str,
target_url: str,
timeout: int,
canonical_options: dict | None = None,
) -> dict[str, Any]:
result = chat_completion(
target_url, model, prompt, timeout=timeout,
canonical_options=canonical_options,
)
text = result["response_text"]
hits = marker_hits(required_markers, text)
return {
"type": "call",
"ts_utc": utc_now(),
"cell_id": cell_id,
"model": model,
"phase": phase,
"question_id": question_id,
"run_idx": run_idx,
"duration_seconds": result["duration_seconds"],
"prompt_tokens": result["prompt_tokens"],
"completion_tokens": result["completion_tokens"],
"tokens_per_second": result["tokens_per_second"],
"finish_reason": result["finish_reason"],
"status_code": result["status_code"],
"response_chars": result["response_chars"],
"response_preview": (text or "")[:240],
"required_markers": required_markers,
"markers_hit": hits,
"marker_hit_rate": (
round(len(hits) / len(required_markers), 3)
if required_markers
else None
),
"format_rule": format_rule,
"format_ok": format_ok(format_rule, text) if format_rule else None,
"usable_answer": bool((text or "").strip()),
"error": result["error"],
}
# ── phase runners ───────────────────────────────────────────────────
def run_hello(handle, model, target_url, timeout, cell_id_prefix, options):
cell_id = f"{cell_id_prefix}:{model}"
log(f"[hello] {model}")
rec = make_call_record(
cell_id=cell_id, model=model, phase="hello",
question_id="hello_check", prompt=HELLO_PROMPT, run_idx=0,
required_markers=[], format_rule="",
target_url=target_url, timeout=timeout,
canonical_options=options,
)
write_jsonl(handle, rec)
log(f"{rec['duration_seconds']}s "
f"completion_tokens={rec['completion_tokens']} "
f"finish={rec['finish_reason']} err={rec['error']}")
def run_frozen_prompts(handle, model, target_url, timeout, cell_id_prefix, options):
"""3 frozen prompts (P-EASY/P-MEDIUM/P-HARD) from harness/prompts.py."""
sys.path.insert(0, str(REPO_ROOT / "harness"))
from prompts import PROMPTS # type: ignore[import-not-found]
cell_id = f"{cell_id_prefix}:{model}"
for q_idx, (pid, p) in enumerate(PROMPTS.items()):
log(f"[frozen] {model} {pid}")
rec = make_call_record(
cell_id=cell_id, model=model, phase="frozen",
question_id=pid, prompt=p["prompt"], run_idx=q_idx,
required_markers=[], format_rule="",
target_url=target_url, timeout=timeout,
canonical_options=options,
)
# Honor per-prompt max_tokens floor
rec["max_tokens_used"] = options.get("num_predict") if options else CANONICAL_OPTIONS["num_predict"]
write_jsonl(handle, rec)
log(f"{rec['duration_seconds']}s "
f"completion_tokens={rec['completion_tokens']} err={rec['error']}")
def run_suite(handle, model, target_url, timeout, suite_path, phase, cell_id_prefix, options):
"""Drive any of the 6 .json suites (5q / 20q / parallel_* / edge_*)."""
suite = json.loads(suite_path.read_text(encoding="utf-8"))
questions = suite.get("questions") or []
if not questions:
log(f"WARN: suite {suite_path.name} has no questions; skipping")
return
cell_id = f"{cell_id_prefix}:{model}"
for q_idx, q in enumerate(questions):
log(f"[{phase}] {model} {q.get('id', f'q{q_idx}')} "
f"({q_idx + 1}/{len(questions)})")
rec = make_call_record(
cell_id=cell_id, model=model, phase=phase,
question_id=q.get("id", f"q{q_idx}"), prompt=q["prompt"],
run_idx=q_idx,
required_markers=q.get("required_markers") or [],
format_rule=q.get("format_rule") or "",
target_url=target_url, timeout=timeout,
canonical_options=options,
)
write_jsonl(handle, rec)
log(f"{rec['duration_seconds']}s "
f"completion_tokens={rec['completion_tokens']} "
f"format_ok={rec['format_ok']} markers={rec['marker_hit_rate']} "
f"err={rec['error']}")
# ── CLI ─────────────────────────────────────────────────────────────
def parse_args() -> argparse.Namespace:
p = argparse.ArgumentParser(
description="Run the Weeyuga benchmark suite against a local model.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=(
"Read CLAUDE.md / AGENTS.md before running. The friend's coding\n"
"agent is expected to adapt this script to the friend's hardware\n"
"(GPU vs CPU, VRAM constraints, cold-start tuning) and record any\n"
"deviations in manifest.json."
),
)
p.add_argument("--target-url", default=DEFAULT_TARGET_URL,
help="OpenAI-compat base URL (default http://127.0.0.1:11434).")
p.add_argument("--models", default="auto",
help="Comma-separated model list, or 'auto' to pull from /v1/models.")
p.add_argument("--cell-id-prefix", required=False, default="local:ollama",
help="Prefix for cell_id. Pattern: <node-tag>:<engine>. "
"Examples: mac-m1:ollama, predator:llamacpp, vps:cpu.")
p.add_argument("--phases", default=DEFAULT_PHASES,
help=("Comma-separated subset of: hello, frozen, "
+ ", ".join(SUITES.keys())
+ ". Default: hello,5q,20q. Pass 'all' for the full suite."))
p.add_argument("--timeout", type=int, default=DEFAULT_TIMEOUT_S,
help="Per-call wall-clock timeout (default 360).")
p.add_argument("--probe", action="store_true",
help="Health-check only: list models, do one hello call, exit.")
p.add_argument("--smoke", action="store_true",
help="One model × hello-only. End-to-end runtime validation.")
p.add_argument("--submitter-handle", required=False, default=None,
help="Your Gitea (or any) handle. Used in submissions/<handle>/...")
p.add_argument("--device-tag", required=False, default=None,
help="Short device tag. Examples: mac-m1-8gb, rtx-4090-pc, "
"predator-gtx-1060-6gb. Used in submissions/<handle>/<tag>/...")
p.add_argument("--run-id", default=None,
help="Override auto-generated run UUID (for resuming).")
p.add_argument("--out-dir", default=None,
help="Override the submissions/<handle>/<tag>/<run-id>/ output dir.")
p.add_argument("--temperature", type=float, default=None,
help="Override canonical temperature 0.1.")
p.add_argument("--num-ctx", type=int, default=None,
help="Override canonical num_ctx 4096.")
p.add_argument("--num-predict", type=int, default=None,
help="Override canonical num_predict 2048.")
return p.parse_args()
def resolve_phases(phases_arg: str) -> list[str]:
if phases_arg.strip().lower() == "all":
return ["hello", "frozen"] + list(SUITES.keys())
raw = [p.strip() for p in phases_arg.split(",") if p.strip()]
out = []
for r in raw:
if r in ("hello", "frozen") or r in SUITES:
out.append(r)
else:
raise SystemExit(
f"unknown phase {r!r}; valid: hello, frozen, "
+ ", ".join(SUITES.keys()) + ", or 'all'"
)
return out
def resolve_models(models_arg: str, target_url: str) -> list[str]:
available = list_models(target_url)
log(f" /v1/models reports {len(available)}: {available}")
if models_arg == "auto":
if not available:
raise SystemExit(
"auto-list found no models on target; pass --models explicitly "
"(e.g. --models qwen3.5:0.8b,qwen3.5:4b)."
)
return available
wanted = [m.strip() for m in models_arg.split(",") if m.strip()]
if available:
missing = [m for m in wanted if m not in available]
if missing:
log(f"WARN: requested but not on target: {missing}. "
"The runner will try anyway — your runtime may auto-pull.")
return wanted
def write_manifest(out_dir: Path, args, run_id: str, phases: list[str], models: list[str], options: dict) -> None:
manifest = {
"schema_version": "manifest-1.0",
"run_id": run_id,
"harness_version": HARNESS_VERSION,
"submitter_handle": args.submitter_handle,
"device_tag": args.device_tag,
"cell_id_prefix": args.cell_id_prefix,
"target_url": args.target_url,
"phases_run": phases,
"models_run": models,
"canonical_options": dict(CANONICAL_OPTIONS),
"canonical_options_overrides": {
k: v for k, v in {
"temperature": args.temperature,
"num_ctx": args.num_ctx,
"num_predict": args.num_predict,
}.items() if v is not None
},
"timeout_seconds": args.timeout,
"started_at_utc": utc_now(),
"host_hostname_short": socket.gethostname().split(".")[0],
"platform_system": platform.system(),
"platform_release": platform.release(),
"python_version": platform.python_version(),
}
(out_dir / "manifest.json").write_text(
json.dumps(manifest, indent=2, ensure_ascii=False) + "\n",
encoding="utf-8",
)
def main() -> int:
args = parse_args()
target_url = args.target_url.rstrip("/")
log(f"target_url = {target_url}")
options = dict(CANONICAL_OPTIONS)
if args.temperature is not None: options["temperature"] = args.temperature
if args.num_ctx is not None: options["num_ctx"] = args.num_ctx
if args.num_predict is not None: options["num_predict"] = args.num_predict
log("listing /v1/models on target …")
models = resolve_models(args.models, target_url)
if args.probe:
if not models:
log("no models available; abort")
return 1
smallest = sorted(models, key=lambda m: (
0 if "0.5b" in m or "0.6b" in m
else 1 if "0.8b" in m
else 2 if "1.5b" in m or "1b" in m
else 3 if "2b" in m
else 4
))[0]
log(f"probe: hello call against smallest = {smallest}")
rec = make_call_record(
cell_id=f"{args.cell_id_prefix}:{smallest}",
model=smallest, phase="probe",
question_id="hello_check", prompt=HELLO_PROMPT, run_idx=0,
required_markers=[], format_rule="",
target_url=target_url, timeout=min(args.timeout, 60),
canonical_options=options,
)
log(json.dumps(rec, indent=2, ensure_ascii=False))
return 0 if not rec["error"] else 3
if args.smoke:
models = [models[0]] if models else []
if not models:
log("no models available for smoke; abort")
return 1
log(f"smoke: hello-only against {models[0]}")
phases = ["hello"]
else:
phases = resolve_phases(args.phases)
if not models:
log("no models to run; abort")
return 1
if not args.submitter_handle and not args.out_dir:
raise SystemExit(
"--submitter-handle is required (or pass --out-dir for ad-hoc runs)."
)
if not args.device_tag and not args.out_dir:
raise SystemExit(
"--device-tag is required (or pass --out-dir for ad-hoc runs)."
)
run_id = args.run_id or str(uuid.uuid4())
out_dir = (
Path(args.out_dir) if args.out_dir
else SUBMISSIONS_DIR / args.submitter_handle / args.device_tag / f"run-{run_id}"
)
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "run.jsonl"
log(f"writing JSONL ledger to {out_path}")
write_manifest(out_dir, args, run_id, phases, models, options)
started_at = utc_now()
started_load = load_avg()
meta = {
"type": "meta",
"benchmark_run_id": run_id,
"harness_version": HARNESS_VERSION,
"started_at_utc": started_at,
"host_hostname_short": socket.gethostname().split(".")[0],
"load_avg_start": started_load,
"target_url": target_url,
"cell_id_prefix": args.cell_id_prefix,
"submitter_handle": args.submitter_handle,
"device_tag": args.device_tag,
"execution_shape": "per-model-block",
"phases_planned": phases,
"models_planned": models,
"canonical_options": dict(CANONICAL_OPTIONS),
"canonical_options_effective": options,
"timeout_seconds": args.timeout,
"platform_system": platform.system(),
"platform_release": platform.release(),
"python_version": platform.python_version(),
}
with out_path.open("w", encoding="utf-8") as fh:
write_jsonl(fh, meta)
try:
for model in models:
log(f"=== model block: {model} ===")
for phase in phases:
if phase == "hello":
run_hello(fh, model, target_url, args.timeout,
args.cell_id_prefix, options)
elif phase == "frozen":
run_frozen_prompts(fh, model, target_url, args.timeout,
args.cell_id_prefix, options)
else:
suite_path = SUITES_DIR / SUITES[phase]
if not suite_path.exists():
log(f"WARN: suite missing at {suite_path}; skip")
continue
run_suite(fh, model, target_url, args.timeout,
suite_path, phase, args.cell_id_prefix, options)
except KeyboardInterrupt:
log("interrupted — partial ledger written")
write_jsonl(fh, {
"type": "interrupted",
"ts_utc": utc_now(),
"reason": "KeyboardInterrupt",
})
return 130
finally:
write_jsonl(fh, {
"type": "footer",
"ts_utc": utc_now(),
"finished_at_utc": utc_now(),
"load_avg_end": load_avg(),
})
log(f"done — ledger at {out_path}")
log(f"next: agent fills hardware.json + run.md, then opens a PR.")
return 0
if __name__ == "__main__":
raise SystemExit(main())