How we measure.
This document is the editorial charter for Agent Almanac. It covers task selection, scoring, statistical handling, the dimensions any issue may vary, how issues are published, and how readers can challenge specific numbers. Per-issue methodologies are published with their issue and cite this document as the governance baseline.¹
¹ Issue I (Code Uniformity) ships its own per-issue methodology at v0.1.0, covering function-level structural uniformity measurement under this document. Future issues lock and publish their own per-issue methodology documents the same way.
What we measure
A deployment decision needs more than a single accuracy number. It needs cost per successful task, latency at p95, variance across seeds, the cost-quality frontier across hardware, and the framework-level effects that get hidden when only the model name is reported.
Each Agent Almanac report fixes a domain and a task suite, then varies the dimensions below. Numbers in the report are always paired with the configuration that produced them.
Dimensions varied per report
- Capability. Pass@1 on the task suite. Pass@k where applicable.
- Performance. Latency p50 and p95, tokens consumed, steps to completion.
- Cost. $/successful task at public list pricing for API models. Measured electricity plus amortized hardware for open-weight runs.
- Reliability. Variance across seeds. Reported alongside every aggregate.
- Hardware. Apple M-series, RTX 4090 24GB, A6000 48GB, single H100 80GB, multi-GPU H100, AMD MI300X 192GB, frontier API.
- Cross-vendor hardware. NVIDIA against AMD on identical workloads, where the inference engine permits parity.
- Agentic frameworks. LangGraph, CrewAI, OpenAI Agents SDK, Claude Skills, AutoGen, aider, OpenHands, custom harnesses. The framework is a variable, not an assumption.
- Models behind the agent. Frontier APIs and open-weight models on the same harness.
- Modalities. Text, code, vision, audio, computer-use. Selected per report based on the task domain.
Task selection
A task suite is eligible when four conditions hold:
- It reflects work practitioners actually do. Coding fixes, browser flows, customer-workflow tool use, document QA. Not synthetic puzzles.
- Each task has a deterministic correctness signal, or a rubric-graded answer with inter-rater agreement ≥0.7.
- The suite has a held-out test partition. We never run on training data.
- The suite is actively maintained.
Contamination of canonical benchmarks (most frontier models have been trained on them) is the primary reason no task suite is committed-to outside the issue it ships in. Each issue names its task suite in its own per-issue methodology, alongside the locked SHAs of any freshly-sampled real-world tasks (e.g. PRs merged after every evaluated model's training cutoff) used to resist contamination.
Build-in-the-open task sources
Reports may also draw tasks from public build-in-the-open initiatives. Possible sources include Karpathy autoresearch traces, public agent traces from Cursor, Claude, and Codex blog posts, issue trackers, and PR sequences from notable OSS projects. Tasks from these sources carry the build-in-public provenance tag and report in a separate column from academic-suite tasks. The split lets readers see the gap between real-world signal and academic-suite signal.
Provenance axes
Each task in any report is tagged:
- Academic — peer-reviewed or established benchmark.
- Build-in-public — real engineering work surfaced publicly.
- Synthesized — Agent Almanac-generated, with rationale published.
Models under test
Frontier API models
Released within the past 12 months. Generally available (no closed previews). Public pricing. Context window ≥1M tokens.
Open-weight models
Released within the past 12 months. Weights publicly downloadable. Runs on at least one hardware tier we test on (consumer 24GB, A6000, single H100, multi-GPU H100, MI300X). Context window ≥32K tokens.
Multimodal models
Same criteria as above where the task domain warrants. Vision-language and audio-language models test on modality-appropriate suites, with rubrics that adapt where pass@1 doesn't apply.
Snapshot locking
Each report locks specific snapshots. API models name a snapshot date and exact version string. Open-weight models name a HuggingFace repo, commit SHA, and quantization (e.g. Qwen/Qwen3-Coder-32B-Instruct@a1b2c3d, AWQ-4bit). If a model updates between reports, the next report re-runs the new version and labels both.
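As an illustration only, a locked open-weight snapshot can be thought of as an immutable record of the three fields named above. The class name `OpenWeightSnapshot` and its `label` method are hypothetical, not part of any published Agent Almanac package; the example values are the placeholder ones from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a locked snapshot should not be mutated
class OpenWeightSnapshot:
    """Hypothetical record of a locked open-weight model snapshot."""
    repo: str          # HuggingFace repo, e.g. "Qwen/Qwen3-Coder-32B-Instruct"
    commit_sha: str    # commit SHA the report was run against
    quantization: str  # e.g. "AWQ-4bit"

    def label(self) -> str:
        # Render in the "repo@sha, quantization" form used in the text.
        return f"{self.repo}@{self.commit_sha}, {self.quantization}"


snapshot = OpenWeightSnapshot(
    repo="Qwen/Qwen3-Coder-32B-Instruct",
    commit_sha="a1b2c3d",
    quantization="AWQ-4bit",
)
print(snapshot.label())
```

Two snapshots compare equal only when all three fields match, which is one simple way to decide whether a model "updated between reports" and needs a re-run.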
Agentic frameworks
The agent framework is a variable. A model that wins under one framework can lose under another. Planning depth, retry policy, tool-call conventions, and memory live in the framework code, not the model weights.
Reports come in two shapes:
- Cross-model, framework-fixed. Framework locked (e.g. aider, OpenHands, or a custom harness, named in the per-issue methodology). Models vary. Isolates model capability.
- Cross-framework, model-fixed. Model locked (e.g. Claude Sonnet 4.6). Frameworks vary across LangGraph, CrewAI, OpenAI Agents SDK, Claude Skills, AutoGen, aider, OpenHands, and reference harnesses. Isolates framework effects.
Each report names the shape upfront. Frameworks are open-source and version-locked by name, repo, and commit SHA. Default settings unless noted; deviations disclosed with rationale.
Modalities
Agents operate in different modalities depending on the task — text, code, vision (browser screenshots, document images), audio (transcription, speech tool-use), and computer-use (screen + keyboard + mouse). The benchmark fits the modality.
When a report covers a modality beyond text/code, the methodology addendum names the modality-specific scoring approach (e.g., grounding accuracy for visual element selection, transcription accuracy for audio chains, frame-by-frame action correctness for computer-use). Cross-modality reports compare agents that handle the same task across different modal inputs (e.g., text-only HTML vs vision-grounded screenshot for the same browsing task).
Scoring
Primary metric: pass@1. Secondary: pass@k where applicable, steps-to-success, tokens-to-success, cost-to-success, latency-to-success.
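This document does not say how pass@k is estimated. One common choice, shown here as an assumption rather than as the Agent Almanac method, is the unbiased estimator 1 − C(n−c, k) / C(n, k), where n is the number of attempts on a task and c is the number that passed. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which passed.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures: any draw of k attempts contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k=1 the estimator reduces to the plain pass rate c / n.
print(pass_at_k(n=5, c=2, k=1))
```

A suite-level score would then be the mean of the per-task values.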
LLM-as-judge only where deterministic scoring is unavailable. Judge model differs from any model under test. Judge prompts publish verbatim in the per-report reproduction kit. 10% of judgements get a human spot-check. If human-judge agreement falls below 85%, the rubric is rewritten and the affected runs replayed.
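The spot-check rule can be sketched as follows. Two things here are assumptions: that the 10% sample is drawn uniformly at random, and that "agreement" means the raw fraction of sampled items where the human and the judge give the same verdict. All function names are hypothetical.

```python
import random


def sample_for_spot_check(judgement_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of judgements for human review."""
    rng = random.Random(seed)
    k = max(1, round(len(judgement_ids) * fraction))
    return rng.sample(list(judgement_ids), k)


def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of items where judge and human verdicts match."""
    if len(judge_verdicts) != len(human_verdicts) or not judge_verdicts:
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


def rubric_needs_rewrite(judge_verdicts, human_verdicts, threshold=0.85):
    """True when agreement falls below the threshold (85% in the text)."""
    return agreement_rate(judge_verdicts, human_verdicts) < threshold


picked = sample_for_spot_check(range(100))
print(len(picked))
```

A chance-corrected statistic could be substituted for the raw match rate; the text does not specify which is used.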
No partial credit unless the upstream suite defines it. No preference comparisons between models on the same task. Preference adds judge bias without measuring outcome.
Statistical handling
n ≥ 3 seeds per task per model for stochastic tasks. n=1 for deterministic tasks (temperature=0 with explicit seeding, noted explicitly). Bootstrap 95% CI (10,000 resamples) reported alongside every aggregate score. When two models' CIs overlap, we report "no significant difference" rather than ranking them.
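A minimal sketch of the interval and the overlap rule, using only the standard library. It assumes a percentile bootstrap over a flat list of 0/1 outcomes; the text does not say whether resampling is done over tasks, seeds, or both, so the resampling unit here is an assumption, as are the function names.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def intervals_overlap(a, b):
    """True when closed intervals a=(lo, hi) and b=(lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def verdict(ci_a, ci_b):
    """Apply the reporting rule: overlapping CIs are not ranked."""
    if intervals_overlap(ci_a, ci_b):
        return "no significant difference"
    return "A higher" if ci_a[0] > ci_b[1] else "B higher"


ci = bootstrap_ci([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])
print(ci)
```

Fixing the seed makes the interval itself reproducible, which matters if readers are expected to regenerate published numbers exactly.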
API errors and timeouts retried up to 3 times. Persistent failures count as task failures, not exclusions. If a model fails >10% of tasks due to non-task errors, the run is repeated; if reproducible, disclosed.
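The retry policy might look like the sketch below. It reads "retried up to 3 times" as one initial attempt plus three retries, which is an interpretation, and it treats `TimeoutError` and `ConnectionError` as stand-ins for whatever non-task errors a real harness would catch. `RunResult`, `run_with_retries`, and `needs_rerun` are hypothetical names.

```python
from dataclasses import dataclass

# Stand-ins for non-task (infrastructure) errors; a real harness would
# list the exceptions its API client actually raises.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


@dataclass
class RunResult:
    passed: bool
    infra_failure: bool  # True when every attempt hit a non-task error
    attempts: int


def run_with_retries(task_fn, max_retries=3):
    """Run task_fn, retrying on transient errors up to max_retries times."""
    for attempt in range(1, max_retries + 2):  # initial attempt + retries
        try:
            passed = bool(task_fn())
            return RunResult(passed, infra_failure=False, attempts=attempt)
        except TRANSIENT_ERRORS:
            continue
    # Persistent failure: counted as a task failure, not excluded.
    return RunResult(False, infra_failure=True, attempts=max_retries + 1)


def needs_rerun(results, threshold=0.10):
    """True when more than `threshold` of tasks failed on non-task errors."""
    if not results:
        return False
    infra = sum(r.infra_failure for r in results)
    return infra / len(results) > threshold
```

Keeping the `infra_failure` flag separate from `passed` lets the failure still count against the score while making the >10% check possible.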
Cost, latency & hardware
Cost. API: (input_tokens × input_price + output_tokens × output_price), summed across all turns. The pricing snapshot URL is recorded at the report-lock date. Open-weight on consumer GPU: (GPU_amortization_per_hour × wall_clock_hours) + (electricity_kWh × $0.12), disclosed as an estimate.
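The two cost formulas translate directly into code. In this sketch, prices are per token, the `Turn` class and function names are hypothetical, and dividing total cost by the number of successes to get $/successful task is an assumption about how that figure is derived.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    input_tokens: int
    output_tokens: int


def api_cost(turns, input_price, output_price):
    """Sum (input_tokens * input_price + output_tokens * output_price) over turns."""
    return sum(
        t.input_tokens * input_price + t.output_tokens * output_price
        for t in turns
    )


def open_weight_cost(gpu_amortization_per_hour, wall_clock_hours,
                     electricity_kwh, price_per_kwh=0.12):
    """Amortized hardware plus electricity at $0.12/kWh (an estimate)."""
    return (gpu_amortization_per_hour * wall_clock_hours
            + electricity_kwh * price_per_kwh)


def cost_per_successful_task(total_cost, successes):
    """Assumed definition: total spend divided by tasks that passed."""
    if successes == 0:
        return float("inf")  # no successes: cost per success is unbounded
    return total_cost / successes


# Illustrative numbers only, not real vendor pricing.
turns = [Turn(1000, 500), Turn(2000, 100)]
print(api_cost(turns, input_price=3e-6, output_price=15e-6))
```

Note that failed tasks still contribute to `total_cost`, so a model that is cheap per call but fails often can end up expensive per success.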
Latency. Time-to-first-action and time-to-completion, p50 and p95 reported.
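For the percentiles, a nearest-rank calculation is one simple option. The text does not specify an interpolation method, so nearest-rank is an assumption here, and `percentile` is an illustrative helper rather than a published API.

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it.

    p is an integer from 1 to 100; integer arithmetic avoids float rounding.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be an integer from 1 to 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100)
    return ordered[rank - 1]


# Hypothetical time-to-completion samples, in seconds.
samples = list(range(1, 21))
print(percentile(samples, 50), percentile(samples, 95))
```

With few samples the p95 sits at or near the maximum, so it should be read alongside the run count.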
Hardware tiers we test on
- Edge — Apple M-series (8/16/32GB)
- Consumer 24GB — RTX 4090 (owned baseline)
- Pro single-GPU — RTX 6000 Ada / A6000 48GB
- Datacenter NVIDIA single — H100 80GB / H200 141GB (rented per-report)
- Datacenter AMD — MI300X 192GB (rented per-report)
- Multi-GPU NVIDIA — 4× / 8× H100
- Multi-GPU AMD — 4× / 8× MI300X
- Frontier API — vendor infra; public pricing accounted
NVIDIA vs AMD protocol. Same model weights and quantization (BF16 preferred for parity). Inference engine matched (vLLM on both, or llama.cpp ROCm vs CUDA, or TGI on both). Driver / ROCm / CUDA versions disclosed. Parity-check sanity test on a small fixed prompt set before main benchmarks. Pricing comparison uses public spot/on-demand rates from at least two providers per accelerator.
Open vs closed
Open: this methodology document; the per-issue methodology document published with each issue; the per-issue reference Python package that produced the numbers (e.g. agent-uniformity for Issue I — installable from PyPI, MIT licensed); aggregated result CSVs and per-task raw outputs published with each issue on HuggingFace.
Closed: the generic benchmark orchestration harness that runs every Agent Almanac issue (parallel runners, partial-write persistence, HuggingFace publish pipeline). The per-issue analysis code is always public so any reader can reproduce any single number from scratch.
The split is: methodology + analysis is the public contribution; the runner that orchestrates production runs of many benchmarks is the internal productized asset.
Versioning & updates
Reports are dated and snapshotted. Once published, an issue does not update mid-cycle even if new models ship. New models or frameworks released between issues get an interim post (free, on LinkedIn and X) that runs the same harness and slots into the most recent report's tables. Interim entries are labeled as such.
Methodology version increments when scoring rules, statistical handling, or task-selection rules change. Each report cites the methodology version that produced it. Corrections ship as dated errata appended to the original report. Reports are never silently rewritten.
Known limitations
- Public pricing only. Enterprise volume discounts and committed-use rates are not modeled.
- Snapshot-time results. Reports represent a specific snapshot; do not extrapolate in perpetuity.
- No human user-study layer. Pass@1 measures task correctness, not user satisfaction. We will not publish satisfaction scores without a properly designed user study.
- Rented-hardware variance. H100 / MI300X results are run on rented capacity. Provider, region, and noisy-neighbor conditions are disclosed; results may differ marginally from on-prem.
- Per-report scope. Not every report covers every hardware tier or every language — scope is stated upfront in each report's addendum.
Challenge process
If you believe a specific number in a report is wrong:
- Open an issue on the report's GitHub repo with the finding and your reasoning. For Issue I, that's agent-uniformity-q2-2026.
- Or, install the per-issue reference package (pip install agent-uniformity for Issue I) and re-run the locked tasks yourself. The methodology document for each issue specifies how.
- We respond within 14 days with either (a) a step-by-step reproduction you can replay, (b) an erratum if the number is wrong, or (c) a methodology rationale.
- All challenges and resolutions are public.