# Contributing benchmarks from your hardware

Welcome — and thank you for adding a data point. This file is the
human-readable companion to `CLAUDE.md` / `AGENTS.md`. The agent files have
the full mechanical detail; this one has the humans-in-the-loop story.

## What you're contributing

You're running the same Weeyuga benchmark suite that powers
[benchmarks.weeyuga.com](https://benchmarks.weeyuga.com), on your hardware,
and submitting the raw output as a PR. Your numbers join the numbers from
Sloba's cluster as a comparison point. The more devices the ladder covers,
the more honest it is.

## What you need

- A device with ≥3 GB free disk space, and more if you pick a larger model (model files are 0.5–10 GB depending on size)
- An OpenAI-compatible LLM runtime — Ollama (easiest), llama.cpp, vLLM, or MLX
- A coding agent — Claude Code, Codex, Aider, Cursor, etc. — to read the runbook and adapt the harness to your hardware
- A Gitea account on `git.weeyuga.com` (free; you'll need it to fork + PR)
- Roughly 1–4 hours of wall-clock time, mostly idle while the benchmark runs
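
If you want to sanity-check the first two requirements by hand before involving an agent, something like the following minimal sketch may help. It is not part of the harness. The function names are made up for illustration, and the base URL in the example is an assumption (Ollama's default local OpenAI-compatible address); other runtimes listen elsewhere.

```python
import shutil
import urllib.error
import urllib.request


def free_disk_gb(path="."):
    """Return free disk space at `path` in GB (decimal, 10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def runtime_reachable(base_url, timeout=3.0):
    """Return True if GET {base_url}/models answers with HTTP 200.

    `/models` is the model-listing route of the OpenAI-style API; any
    connection error or malformed URL is reported as "not reachable".
    """
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        return False


if __name__ == "__main__":
    print(f"free disk: {free_disk_gb():.1f} GB (this guide asks for at least 3)")
    # Assumption: Ollama's default base URL; adjust for llama.cpp, vLLM, or MLX.
    print("runtime reachable:", runtime_reachable("http://localhost:11434/v1"))
```

A `False` from `runtime_reachable` only means nothing answered at that address; it does not tell you which runtime is installed.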

## How it works

1. **You clone** this repo:
   ```bash
   git clone https://git.weeyuga.com/slobodanmargetic988/weeyuga-benchmarks-public.git
   cd weeyuga-benchmarks-public
   ```
2. **You point your agent at it** with one sentence:
   > "Read `CLAUDE.md` (or `AGENTS.md`), then run the Weeyuga benchmark on
   > my hardware. Pick a model that fits, document everything, and prepare
   > a PR. I'll review before pushing."
3. **The agent does the work** — probes your hardware, picks a model size,
   adapts the runner's parameters, runs the suite, generates a JSONL ledger
   plus a human-readable summary.
4. **You review** what the agent produced — especially `run.md` for honesty
   and `run.jsonl` for any accidentally-leaked secrets.
5. **You fork → push → open PR** (the agent can prepare this too; you click
   the button to open the PR on Gitea afterward).
6. **Sloba reviews the PR** and merges. He may have a question or two; the
   agent can address them on the same branch.
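
For the review in step 4, it can help to get a structural overview of the ledger before reading it line by line. The sketch below assumes only that `run.jsonl` holds one JSON value per line. It makes no assumption about field names, since the schema lives in the agent files, and `summarize_ledger` is a hypothetical helper, not part of the runner.

```python
import json
from collections import Counter


def summarize_ledger(lines):
    """Summarize JSONL lines without assuming any particular schema.

    Returns (record_count, key_counts, bad_line_numbers), where key_counts
    says how many records carry each top-level key and bad_line_numbers
    lists the 1-based lines that failed to parse as JSON.
    """
    records = 0
    key_counts = Counter()
    bad_lines = []
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            bad_lines.append(lineno)
            continue
        records += 1
        if isinstance(obj, dict):
            key_counts.update(obj.keys())
    return records, key_counts, bad_lines
```

Calling it as `summarize_ledger(open("run.jsonl", encoding="utf-8"))` gives a quick view of which keys appear in every record and which appear only sometimes, which is often where a truncated or half-failed run shows up.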

## Why crowdsource benchmarks?

Hardware variance is huge. Sloba's published numbers come from a Mac M1, a
laptop with a GTX 1060, and three VPSes — that's a thin slice of the world.
Your RTX 4090, your Snapdragon X Elite, your Ryzen 7950X3D, your old Xeon
all have stories worth recording.

Same prompts. Same suites. Different hardware. The numbers compose.

## What we ask

- Run the canonical suite as-is (5q + 20q minimum; the rest if you have time)
- **Document deviations honestly.** If you had to skip parallel suites because
  of RAM, say so. If you tweaked NGL (the number of layers offloaded to the
  GPU) because the default OOM'd, say so. The point is comparable runs, not
  perfect runs.
- **Privacy-scan before pushing.** `run.jsonl` stores response previews — if
  the model echoed your home directory or an API key from your shell history,
  redact before PR.
- **One PR per device per session.** Don't bundle "my laptop AND my desktop
  AND my friend's PC" — separate PRs are easier to review.
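
The privacy scan asked for above can be partly automated. The following is a rough sketch, not an official tool: the pattern list is an illustrative assumption covering home-directory paths and a few well-known key prefixes (`sk-`, `ghp_`, `AKIA`), and the helper names are hypothetical. A clean result does not prove the file is safe, so still read the previews yourself.

```python
import re

# Illustrative patterns only; extend them for whatever lives on your machine.
PATTERNS = {
    # /home/<user>, /Users/<user>, or C:\Users\<user> (backslashes may be
    # doubled because the ledger is JSON-escaped text).
    "home directory path": re.compile(
        r"(?:/home/|/Users/|[A-Za-z]:\\+Users\\+)[^\s/\\]+"
    ),
    # A few common token shapes: OpenAI-style, GitHub classic PAT, AWS key ID.
    "possible API key": re.compile(
        r"\b(?:sk-[A-Za-z0-9_-]{16,}|ghp_[A-Za-z0-9]{36}|AKIA[0-9A-Z]{16})\b"
    ),
}


def scan_text(text):
    """Return the sorted labels of every pattern that matches `text`."""
    return sorted(label for label, rx in PATTERNS.items() if rx.search(text))


def scan_ledger(lines):
    """Yield (line_number, labels) for each line with at least one finding."""
    for lineno, line in enumerate(lines, start=1):
        labels = scan_text(line)
        if labels:
            yield lineno, labels
```

For example, `for n, labels in scan_ledger(open("run.jsonl", encoding="utf-8")): print(n, labels)` prints the line numbers worth a closer look before you push.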

## What the maintainers (Sloba + team) commit to

- We respond to PRs within ~3 days
- We don't merge without reading; if your `run.md` states its caveats clearly, we'll usually merge
- We credit you by handle in `catalogue.json` if/when your run becomes a flagship
- We never expose anything from your `run.md` or `manifest.json` beyond what
  you submitted; if you used a pseudonym, that's the name that ships
- If we ask for a re-run with different parameters, that's a separate dispatch — we don't silently reinterpret your run

## License of your contribution

By opening a PR against this repo, you license the data you contribute under
[CC-BY-4.0](LICENSE) and any harness/runner code you contribute under
[MIT](LICENSE-MIT). Attribution stays with you (your handle becomes part of
the run record).

## What this repo is NOT

- A leaderboard with prizes
- A way to "win" against other devices (the point is honest measurement, not bragging rights)
- A vehicle for marketing claims (vendor PR runs need a separate flow we
  haven't designed yet — please don't astroturf the catalogue)

## Found a bug or methodology gap?

Open an issue. We'd rather hear about a flawed prompt or a misleading metric
than ship more data using it.

## Code of conduct, the short version

Be kind, be honest about your data, don't try to game the catalogue, and
don't dox other contributors. Sloba reserves the right to remove submissions
that violate the spirit of the thing, but we'll say why.

— The Weeyuga team