weeyuga-benchmarks-public/runs/09d8fbde-0008-49bb-99da-03eeaca72be1/run.md

# Predator trio bench — 09d8fbde-0008-49bb-99da-03eeaca72be1

- Started: 2026-05-04T16:01:52Z
- Host (driver): Slobodans-MacBook-Air.local
- Node: predator (http://10.8.0.7:11436)
- Engine: llamacpp
- Harness: predator-trio-1

## Models

| Key | Cell | GGUF |
|---|---|---|
| granite | `predator:llamacpp:granite-4.1:8b-q4km` | `granite-4.1-8b-Q4_K_M.gguf` |
| gemma | `predator:llamacpp:gemma-4:e4b-it-q4km` | `gemma-4-E4B-it-Q4_K_M.gguf` |
| qwen | `predator:llamacpp:qwen3.5:9b-q4km` | `Qwen3.5-9B-Q4_K_M.gguf` |

## llama-bench synthetic throughput (pp512 + tg128)

| Model | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| granite | 142.89 | 12.19 |
| gemma | 498.70 | 24.53 |
| qwen | 63.64 | 7.43 |

## VRAM snapshot (used / free / total MB, post-load)

| Model | Used (MB) | Free (MB) | Total (MB) |
|---|---|---|---|
| granite | 5990 | 40 | 6144 |
| gemma | 4408 | 1622 | 6144 |
| qwen | 5954 | 76 | 6144 |
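
The snapshot is easier to compare across models as headroom. The sketch below is a minimal illustration, not part of the harness: the `VRAM_MB` dictionary and the `headroom` helper are hypothetical names, and the values are copied by hand from the table above.

```python
# VRAM figures (MB) copied from the snapshot table: (used, free, total).
VRAM_MB = {
    "granite": (5990, 40, 6144),
    "gemma": (4408, 1622, 6144),
    "qwen": (5954, 76, 6144),
}


def headroom(used_mb: int, free_mb: int, total_mb: int) -> dict:
    """Return free VRAM as a percentage of total, plus MB counted as neither used nor free."""
    return {
        "free_pct": round(100 * free_mb / total_mb, 1),
        "unaccounted_mb": total_mb - used_mb - free_mb,
    }


if __name__ == "__main__":
    for key, (used, free, total) in VRAM_MB.items():
        print(key, headroom(used, free, total))
```

For these rows the free share is 0.7% (granite), 26.4% (gemma), and 1.2% (qwen). In every row, used plus free falls 114 MB short of the total. This is presumably memory the driver reserves, though the run data does not say so.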

## Per-prompt latency + tokens/sec

| Model | Prompt | Cold dur (ms) | Cold tok/s | Warm tok/s mean | Warm dur p50 (ms) | Errors |
|---|---|---|---|---|---|---|
| granite | hello | 1782 | 5.6 | 11.1 | 917 | 0 |
| granite | P-MEDIUM | 6095 | 14.4 | 15.3 | 6602 | 0 |
| granite | P-HARD | 20275 | 15.6 | 15.7 | 18761 | 0 |
| gemma | hello | 3519 | 18.2 | 19.5 | 2850 | 0 |
| gemma | P-MEDIUM | 3858 | 22.3 | 22.9 | 12258 | 0 |
| gemma | P-HARD | 28440 | 23.6 | 23.5 | 15838 | 0 |
| qwen | hello | 5514 | 11.6 | 13.8 | 4633 | 0 |
| qwen | P-MEDIUM | 35678 | 14.4 | 14.5 | 35311 | 0 |
| qwen | P-HARD | 73737 | 13.9 | 14.6 | 70291 | 0 |

## Notes

- `phase=cold` is the FIRST request after llama-server boot; subsequent runs are `warm`.
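- `tokens_per_sec` here is `completion_tokens / (total_duration_ms / 1000)`. Because
  `total_duration_ms` includes prefill time, the true generation rate is slightly higher. Use
  llama-bench's `tg128` for prefill-free generation throughput. In this run `tg128` is lower
  than the warm tok/s for granite and qwen, so treat the two figures as separate measurements
  rather than as bounds on each other.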
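- For Qwen3.5-9B Q4_K_M (~5.4 GB) on the GTX 1060 6 GB: model weights plus KV cache exceed
  VRAM at ctx 4096, so partial CPU offload is expected. Compare against granite-4.1-8B (5.0 GB),
  which is expected to fit more comfortably, and gemma-4-E4B (4.6 GB), which fits with room to
  spare. In the VRAM snapshot, however, granite has only 40 MB free post-load, about as tight
  as qwen (76 MB free), while gemma has 1622 MB free.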
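
The `tokens_per_sec` definition in the notes can be written out as a short sketch. Only the formula comes from the notes. The function names are hypothetical, and `prefill_ms` is an assumed input that this report does not record; it is there only to show why a prefill-free rate comes out higher.

```python
def tokens_per_sec(completion_tokens: int, total_duration_ms: float) -> float:
    """Rate as defined in the notes: completion tokens over total wall time (prefill included)."""
    if total_duration_ms <= 0:
        raise ValueError("total_duration_ms must be positive")
    return completion_tokens / (total_duration_ms / 1000)


def generation_only_rate(completion_tokens: int, total_duration_ms: float,
                         prefill_ms: float) -> float:
    """Rate with prefill time removed. prefill_ms is an assumed, separately measured value."""
    if not 0 <= prefill_ms < total_duration_ms:
        raise ValueError("prefill_ms must be in [0, total_duration_ms)")
    return completion_tokens / ((total_duration_ms - prefill_ms) / 1000)


def approx_completion_tokens(tok_per_sec: float, duration_ms: float) -> int:
    """Invert the formula to estimate token count from a table row (inputs are rounded)."""
    return round(tok_per_sec * duration_ms / 1000)


if __name__ == "__main__":
    # Illustrative numbers, not taken from the run: 100 tokens in 8 s, 2 s of it prefill.
    print(tokens_per_sec(100, 8000))
    print(generation_only_rate(100, 8000, 2000))
    # granite / hello / cold row from the latency table: 5.6 tok/s over 1782 ms.
    print(approx_completion_tokens(5.6, 1782))
```

Inverting the formula on the granite `hello` cold row (5.6 tok/s over 1782 ms) suggests a completion of about 10 tokens. That figure is approximate because the table values are rounded.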