Agent Almanac

Agent Almanac runs independent benchmarks of AI agents on real-world tasks. Each issue covers one agent domain: coding agents, browsing agents, computer use, tool use, multi-hop research, or agentic RAG. Every issue publishes four metrics: capability, cost, latency, and reliability.

What each report varies

An issue fixes a domain and a task suite, then varies the dimensions that change a deployment decision.

Hardware. Apple M-series, RTX 4090, A6000, single H100, multi-GPU H100, AMD MI300X, frontier API. Cost-quality tradeoffs depend on the full stack, not the model alone.
Agentic frameworks. LangGraph, CrewAI, OpenAI Agents SDK, Claude Skills, AutoGen, aider, OpenHands, and reference harnesses. The framework is a variable, with its own report shape that holds the model fixed and varies the harness.
Models. Frontier APIs and open-weight models on the same agent harness. Cost per successful task is reported alongside score.
Modalities. Text, code, vision, audio, computer-use. Selected per issue based on the task domain.
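
As a rough illustration of the "cost per successful task" figure mentioned under Models, here is a minimal sketch. It assumes the metric is total spend across all attempts, failed ones included, divided by the number of successful tasks. That definition, the TaskRun record, and the function name are assumptions made for this example; the versioned methodology document is the authority on how the published numbers are computed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TaskRun:
    """One attempt at one task (hypothetical record shape)."""
    task_id: str
    succeeded: bool
    cost_usd: float


def cost_per_successful_task(runs: Iterable[TaskRun]) -> Optional[float]:
    """Total spend divided by the number of successful runs.

    Failed runs still count toward spend, so an agent that is cheap
    per attempt but rarely succeeds does not look artificially cheap.
    Returns None when nothing succeeded, since the ratio is undefined.
    """
    runs = list(runs)
    total_cost = sum(r.cost_usd for r in runs)
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        return None
    return total_cost / successes


example_runs = [
    TaskRun("task-a", True, 0.25),
    TaskRun("task-b", False, 0.50),
    TaskRun("task-c", True, 0.25),
]
example_cost = cost_per_successful_task(example_runs)
```

In this example the three attempts cost 1.00 in total and two of them succeed, so the figure is 0.50 per successful task, twice the 0.25 that each successful attempt cost on its own.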

How issues are published

Reports are dated and snapshotted. Once published, an issue is not silently revised. Result corrections are appended as dated errata. New models that ship between issues get an interim post against the same harness, labeled as interim, and slot into the most recent issue's tables.

The methodology document is versioned. The version increments whenever scoring rules, statistical handling, or task-selection rules change. Each issue cites the methodology version that produced it.

Open and closed

The methodology document is open. The per-issue reference Python package that produced the numbers is public and installable (e.g. agent-uniformity for Issue I, MIT licensed). Aggregated result CSVs and per-task raw outputs ship with each issue on HuggingFace.

The orchestration harness is closed. Per-task traces, prompts, and reasoning chains stay in the private vault. The private-vault rotation is what keeps published numbers uncontaminated across issues.

Editor

Yash Datta (saucam). Writes the reports. Maintains the methodology. Runs the errata process.

Contact

Public challenges to specific numbers: open an issue on the per-report repo (e.g. agent-uniformity-q2-2026).
Editorial and general correspondence: via the author socials below.
Author socials: twitter / ydatta, github / saucam, linkedin / yash datta