weeyuga-benchmarks-public/submissions/EXAMPLE/mac-m1-8gb/run-00000000-0000-0000-0000-000000000000/run.md

# EXAMPLE — mac-m1-8gb — qwen3.5:0.8b — 2026-05-12

> **This is a synthetic example so contributors can see the shape of a
> submission end-to-end. The numbers are plausible but not from a real run.
> Don't cite this directory in analysis. Don't copy-paste these numbers.
> Real submissions live alongside this folder under `submissions/<handle>/`.**

**Run ID:** `00000000-0000-0000-0000-000000000000`
**Submitter:** EXAMPLE (synthetic)
**Hardware:** Apple MacBook Air M1, 8 GB unified, macOS 14.5
**Runtime:** Ollama 0.5.13 (default settings; NUM_PARALLEL=1, KEEP_ALIVE=5m)
**Models:** qwen3.5:0.8b
**Phases run:** hello, 5q, 20q
**Phases skipped:** parallel_same, parallel_mixed, edge_append, edge_suite — RAM constraint, parallel suites need ≥2 warm copies of the model and 8 GB unified didn't fit; edge suites time-budget skipped (would have been ~30 min more)

## Headline numbers

| Cell | Phase | n_calls | tok/s mean | tok/s p50 | duration p50 | format_ok rate |
|---|---|---|---|---|---|---|
| mac-m1:ollama:qwen3.5:0.8b | hello | 1 | 22.7 | 22.7 | 1.8 s | n/a |
| mac-m1:ollama:qwen3.5:0.8b | 5q | 5 | 21.4 | 22.3 | 4.2 s | 80% |
| mac-m1:ollama:qwen3.5:0.8b | 20q | 20 | 17.0 | 20.9 | 9.6 s | 70% |

## What I observed (qualitative)

- **Hello-call cold-start was fast** — 1.8 s including initial model load.
  Ollama reports the 0.8B GGUF as ~600 MB; on Apple Silicon unified memory
  this loads in well under 2 s.
- **5Q tasks were uniformly handled** — all five formats (bash, python,
  shell, four-numbered-steps, json) parsed correctly except one
  (Q3, "shell_lines" — model started with `1.` numbered list instead of
  raw shell command).
- **20Q tasks bifurcated** — the simple ones (Q01-Q08) ran at full
  ~20 tok/s with high format-correctness; the longer ones (Q09+ with
  multi-paragraph context) saw throughput drop to ~12-15 tok/s, with
  format_ok dropping to ~60%. p95 duration of 41 s was Q14 (the MyBoard
  triage prompt — long context, mixed format).
- **No errors, no timeouts.** Cleanest run was on AC power; the laptop
  fan never spun up.

## Methodology

Followed the canonical Pavilion methodology with these deviations:

- **NUM_PARALLEL=1** instead of canonical 3 — laptop, not server; one slot
  is enough for sequential per-model-block execution.
- **KEEP_ALIVE=5m** instead of canonical 2400h — laptop, no need to pin.
- **Phases `parallel_same`, `parallel_mixed`, `edge_append`, `edge_suite`
  skipped** — see top of file. Run not eligible for `flagship` grade,
  intended as `standard`.

## Caveats

- 8 GB unified RAM is below the comfort floor for parallel suites with this
  model; results above are NOT a refutation of the canonical parallel
  numbers — they're from a different shape of run.
- macOS Spotlight indexing was disabled before the run started. If you
  rerun without disabling, expect ~5-10% additional variance from
  background I/O.
- `format_ok` rate of 70% on 20Q is consistent with Sloba's flagship 20Q
  numbers for qwen3.5:0.8b on Pavilion (~74-78% in the v1 baseline) within
  measurement noise.

## Reproducibility

```
ollama pull qwen3.5:0.8b
ollama serve  # in a separate terminal

python3 harness/run_benchmark.py \
    --target-url http://127.0.0.1:11434 \
    --models qwen3.5:0.8b \
    --cell-id-prefix mac-m1:ollama \
    --phases hello,5q,20q \
    --submitter-handle alice \
    --device-tag mac-m1-8gb
```

Took ~16 minutes wall-clock on this hardware.

## Privacy attestation

I scanned `run.jsonl` for personal paths, API tokens, SSH keys, and
home-directory leakage:
```
grep -nE "Bearer |sk-|api_key|/Users/|/home/|password|ssh-rsa|ssh-ed25519" \
    submissions/EXAMPLE/mac-m1-8gb/run-00000000-0000-0000-0000-000000000000/*
```
No matches outside the SSH-troubleshooting prompt in 5Q (Q3) which is
intentional curriculum. Safe to ship.

— EXAMPLE (synthetic; not a real contributor)