# Contributing benchmarks from your hardware

Welcome, and thank you for adding a data point. This file is the human-readable companion to `CLAUDE.md` / `AGENTS.md`. The agent files have the full mechanical detail; this one has the humans-in-the-loop story.

## What you're contributing

You're running the same Weeyuga benchmark suite that powers [benchmarks.weeyuga.com](https://benchmarks.weeyuga.com), on your hardware, and submitting the raw output as a PR. Your numbers join Sloba's cluster numbers as a comparison point. More devices = more honest ladder.

## What you need

- A device with ≥3 GB free disk space (model files are 0.5–10 GB depending on size)
- An OpenAI-compatible LLM runtime: Ollama (easiest), llama.cpp, vLLM, or MLX
- A coding agent (Claude Code, Codex, Aider, Cursor, etc.) to read the runbook and adapt the harness to your hardware
- A Gitea account on `git.weeyuga.com` (free; you'll need it to fork + PR)
- Maybe 1–4 hours wall-clock, mostly idle while the benchmark runs

## How it works

1. **You clone** this repo:

   ```bash
   git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
   cd weeyuga-benchmarks-public
   ```

2. **You point your agent at it** with one sentence:

   > "Read `CLAUDE.md` (or `AGENTS.md`), then run the Weeyuga benchmark on
   > my hardware. Pick a model that fits, document everything, and prepare
   > a PR. I'll review before pushing."

3. **The agent does the work**: it probes your hardware, picks a model size, adapts the runner's parameters, runs the suite, and generates a JSONL ledger plus a human-readable summary.

4. **You review** what the agent produced, especially `run.md` for honesty and `run.jsonl` for any accidentally-leaked secrets.

5. **You fork → push → open PR** (the agent can do this too; you click the merge button on Gitea after).

6. **Sloba reviews the PR** and merges. He may have a question or two; the agent can address them on the same branch.

## Why crowdsource benchmarks?

Hardware variance is huge.
Sloba's published numbers come from a Mac M1, a laptop with a GTX 1060, and three VPSes; that's a thin slice of the world. Your RTX 4090, your Snapdragon X Elite, your Ryzen 7950X3D, your old Xeon all have stories worth recording.

Same prompts. Same suites. Different hardware. The numbers compose.

## What we ask

- Run the canonical suite as-is (5q + 20q minimum; the rest if you have time)
- **Document deviations honestly.** If you had to skip parallel suites because of RAM, say so. If you tweaked NGL because the default OOM'd, say so. The point is comparable runs, not perfect runs.
- **Privacy-scan before pushing.** `run.jsonl` stores response previews. If the model echoed your home directory or an API key from your shell history, redact before PR.
- **One PR per device per session.** Don't bundle "my laptop AND my desktop AND my friend's PC"; separate PRs are easier to review.

## What the maintainers (Sloba + team) commit to

- We respond to PRs within ~3 days
- We don't merge without reading; if your `run.md` has clear caveats we'll usually merge
- We credit you by handle in `catalogue.json` if/when your run becomes a flagship
- We never expose anything from your `run.md` or `manifest.json` beyond what you submitted; if you used a pseudonym, that's the name that ships
- If we ask for a re-run with different parameters, that's a separate dispatch; we don't silently reinterpret your run

## License of your contribution

By PR-ing data into this repo, you license it under [CC-BY-4.0](LICENSE) (data) and the harness/runner code under [MIT](LICENSE-MIT). Attribution stays with you (your handle becomes part of the run record).

## What this repo is NOT

- A leaderboard with prizes
- A way to "win" against other devices (the point is honest measurement, not bragging rights)
- A vehicle for marketing claims (vendor PR runs need a separate flow we haven't designed yet; please don't astroturf the catalogue)

## Found a bug or methodology gap?

Open an issue.
We'd rather hear about a flawed prompt or a misleading metric than ship more data using it.

## Code of conduct, the short version

Be kind, be honest about your data, don't try to game the catalogue, and don't dox other contributors. Sloba reserves the right to remove submissions that violate spirit-of-the-thing, but we'll say why.

The Weeyuga team