Per-issue methodology · Issue I · Code Uniformity · Q2 2026 · v0.1.0 · Frozen 2026-05-08 · Yash Datta · saucam

How we measured.

The pre-registered measurement plan for Code Uniformity · Q2 2026. Locked at v0.1.0 on 2026-05-08, before any sampling. Six hypotheses, eight measurement axes, the stratified sample design, and the statistical handling that turned 12,254 functions across 48 public OSS repositories into the published findings. The publication’s editorial charter is upstream of this document.

version: 0.1.0
frozen: 2026-05-08 · before any sampling or measurement
authors: Yash Datta · saucam
license: CC BY 4.0 (this document) · see agent-uniformity for code license (MIT)

01

What this report measures

The structural impact of AI authorship on production code. Specifically: when AI agents (Claude Code, Cursor, Aider, etc.) author code that ends up merged into public repositories, how does that code differ from human-written code in the same repo, within the same language, and across languages?

The report is descriptive and empirical, not prescriptive. We measure observed properties; we do not claim AI code is “better” or “worse” without an explicit quality proxy attached.

02

The wedge

Most takes about AI-generated code are speculative or anecdotal. We measure what AI authorship does to the code that humans (and other agents) read, debug, and modify next month: uniformity, repetition, cohesion, isolation, naming consistency, complexity. These are the levers of downstream readability and maintenance cost.

03

Pre-registered hypotheses

Locked at methodology-freeze time, before sampling. We will report results whether or not these hold.

| # | Hypothesis | Direction | Rationale |
|---|---|---|---|
| H1 | AI-authored functions have higher mean top-K similarity to other functions in the same repo than human-authored functions | AI > human | AI tools converge on shared patterns from training data |
| H2 | The AI-vs-human similarity gap (H1 magnitude) grows with the repo's overall AI authorship ratio | Positive correlation | More AI code = AI patterns dominate the repo |
| H3 | DRY-cluster density is higher in AI-heavy repos | AI-heavy > human-heavy | AI generates lookalike functions instead of factoring shared helpers |
| H4 | The most-isolated functions in AI-heavy repos are disproportionately human-authored | Yes | Hand-coded edges (provider integrations, rare utilities) survive as the unique parts |
| H5 | AI code has lower cyclomatic complexity per function than human code at matched function size | AI < human | AI tools default to simple control flow |
| H6 | Across languages, the AI uniformity gap (H1) varies; some languages produce more native-feeling AI output than others | Variation present | Different training-data densities, idiom complexity |

We additionally pre-register exploratory questions where we have no directional prediction:

  • E1: Does cross-language semantic mirroring scale with AI ratio in multilingual repos?
  • E2: Is there a U-curve between repo uniformity and project popularity (stars × commit cadence)?
  • E3: Does comment density correlate with AI authorship?
04

Definitions

4.1 AI authorship — ground truth

A commit is classified as AI-authored if any of the following hold:

  1. Commit message contains literal text Co-Authored-By: Claude (the Claude Code default footer)
  2. Commit message contains literal text Generated with [Claude Code]
  3. Author name contains a parenthetical AI tag: (aider), (claude-...), (claude-sonnet-...), (claude-opus-...), (gpt-...)
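The three rules above can be sketched as a small classifier. This is a minimal illustration, not the reference package's code: the function name `is_ai_commit` and the exact regular expression for the parenthetical tags are assumptions, since the methodology lists the tags only by example.

```python
import re

# Literal commit-message markers from rules 1 and 2.
MESSAGE_MARKERS = ("Co-Authored-By: Claude", "Generated with [Claude Code]")

# Parenthetical author tags from rule 3: (aider), (claude-...), (gpt-...).
# ASSUMPTION: the exact pattern is illustrative; matching is case-sensitive here.
AUTHOR_TAG = re.compile(r"\((aider|claude-[^)]*|gpt-[^)]*)\)")


def is_ai_commit(message: str, author_name: str) -> bool:
    """Return True if any one of the three markers is present."""
    if any(marker in message for marker in MESSAGE_MARKERS):
        return True
    return AUTHOR_TAG.search(author_name) is not None
```

A commit with neither a marker in its message nor a tag in its author name is treated as human-authored, which is what makes the signal conservative.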

A line of code is AI-authored if its git blame SHA is in the AI-authored commit set.

A function’s AI ratio = (count of AI-authored lines within the function’s line range) / (total lines in function range).
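As a worked illustration of the ratio, the sketch below assumes the blame output has already been reduced to one commit SHA per source line. The function name `function_ai_ratio` and the list-based input shape are assumptions made for this example.

```python
def function_ai_ratio(blame_shas, start_line, end_line, ai_commits):
    """AI ratio for one function.

    blame_shas: list where blame_shas[i] is the blame SHA of 1-indexed line i + 1
    start_line, end_line: inclusive 1-indexed line range of the function
    ai_commits: set of SHAs classified as AI-authored
    """
    lines = blame_shas[start_line - 1:end_line]
    if not lines:
        return 0.0  # ASSUMPTION: an empty range counts as 0 rather than raising
    ai_lines = sum(1 for sha in lines if sha in ai_commits)
    return ai_lines / len(lines)
```

For a four-line function where two lines blame to an AI-authored commit, the ratio is 2 / 4 = 0.5.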

We acknowledge this signal is conservative: AI commits without these markers are not detected. False positive rate is near zero (the markers are explicit). False negative rate is unknown but likely substantial.

4.2 Function selection

We extract top-level functions and class methods using language-specific AST / tree-sitter parsers:

  • Python: ast module
  • JavaScript / TypeScript: tree-sitter-javascript, tree-sitter-typescript
  • Go: tree-sitter-go
  • Rust: tree-sitter-rust

Inclusion criteria:

  • Function body ≥ 4 source lines (excludes one-line stubs)
  • Not in test directories (any path containing /test, /tests, /spec, __tests__)
  • Not in vendored code (/vendor, /node_modules, /dist, /build, /__pycache__, /venv)
  • Not in migrations (/migrations for Django / Rails-shaped repos)
  • Public name (not starting with _ for Python, except dunders); language-appropriate visibility for others
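The inclusion criteria above can be sketched as a single predicate. This is an illustration under assumptions: `is_included` is a hypothetical name, path markers are matched as plain substrings (following the "any path containing" wording), and only the Python visibility rule is implemented because the other languages' rules are not spelled out.

```python
# Path markers copied from the criteria above.
EXCLUDED_MARKERS = (
    "/test", "/tests", "/spec", "__tests__",                               # tests
    "/vendor", "/node_modules", "/dist", "/build", "/__pycache__", "/venv",  # vendored
    "/migrations",                                                          # migrations
)


def is_included(path, name, body_lines, language="python"):
    """Return True if a function passes the inclusion criteria."""
    # Normalize to a leading-slash, forward-slash path so markers match at the root too.
    normalized = "/" + path.replace("\\", "/").lstrip("/")
    # ASSUMPTION: substring match, so "/test" also excludes e.g. "/testing/".
    if any(marker in normalized for marker in EXCLUDED_MARKERS):
        return False
    if body_lines < 4:
        return False
    if language == "python":
        is_dunder = len(name) > 4 and name.startswith("__") and name.endswith("__")
        if name.startswith("_") and not is_dunder:
            return False
    # Visibility rules for other languages are not specified here; they pass by default.
    return True
```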

4.3 Similarity

Computed by semble (MinishLab/semble, version locked per run). Semble’s hybrid retrieval combines:

  • Static embeddings via Model2Vec (potion-code-16M model)
  • Lexical matching via BM25
  • Reciprocal Rank Fusion to combine

Semble version is captured per run (semble.__version__). We use semble’s defaults for indexing; no manual hyperparameter tuning. This locks results to a known, reproducible configuration.
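For readers unfamiliar with Reciprocal Rank Fusion, here is a generic sketch of the technique. It is not semble's implementation: the function name and the constant `k = 60` (a conventional default for RRF) are assumptions, and semble's actual constant and weighting may differ.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one scored list.

    Each id receives sum(1 / (k + rank)) over the rankings it appears in,
    with rank starting at 1. ASSUMPTION: k = 60 is illustrative only.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example: one embedding ranking and one BM25 ranking.
embedding_ranking = ["a", "b", "c"]
bm25_ranking = ["a", "c", "d"]
fused = reciprocal_rank_fusion([embedding_ranking, bm25_ranking])
```

An id ranked well by both retrievers ("a", then "c") ends up ahead of one that only a single retriever found ("b", "d").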

For each function we query semble with the function body (truncated to 1500 chars) and read the top-15 nearest chunks. We exclude self-matches (same file with overlapping line range) and keep the 10 highest-scoring chunks that remain as the function’s siblings.
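The sibling selection can be sketched as follows. The `Chunk` type and `select_siblings` name are hypothetical, the hits are assumed to arrive sorted by descending score, and the sketch assumes the 10 highest-scoring non-self hits are the ones kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    path: str
    start: int  # first line, inclusive
    end: int    # last line, inclusive
    score: float = 0.0


def select_siblings(query, hits, k=10):
    """Drop self-matches, then keep the first k hits (hits sorted by score, descending)."""

    def is_self_match(hit):
        # Same file and overlapping line range.
        return hit.path == query.path and hit.start <= query.end and query.start <= hit.end

    return [hit for hit in hits if not is_self_match(hit)][:k]
```

Retrieving 15 hits leaves headroom: even if several of them overlap the query function, 10 siblings usually remain.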

4.4 The metrics

Axis 1 — Pattern Uniformity

| Metric | Formula | Range | Interpretation |
|---|---|---|---|
| mean_topk_sim | mean(score for top-10 siblings excluding self) | 0..1 (typically 0.005–0.05 in semble) | Function fits a common pattern (high) vs unusual (low). Raw signal, context-dependent. |
| top1_sim | score of nearest sibling | 0..1 | If high with cross-file pair → DRY merge candidate |
| repo_uniformity_index | mean(mean_topk_sim) across all functions | 0..1 | Codebase-level convergence |
| pattern_cluster_density | (DRY pairs above per-language threshold) / (function count) | ≥ 0 | Lower is better for maintainability (less duplicated structure) |
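The Axis 1 formulas reduce to a few lines of arithmetic. The sketch below uses hypothetical function names and plain lists of scores; "above threshold" is read as a strict comparison, which is an assumption.

```python
from statistics import mean


def mean_topk_sim(sibling_scores):
    """Mean score over a function's siblings (self already excluded)."""
    return mean(sibling_scores) if sibling_scores else 0.0


def top1_sim(sibling_scores):
    """Score of the nearest sibling."""
    return max(sibling_scores) if sibling_scores else 0.0


def repo_uniformity_index(per_function_means):
    """Mean of mean_topk_sim across all functions in the repo."""
    return mean(per_function_means)


def pattern_cluster_density(pair_scores, threshold, function_count):
    """DRY pairs strictly above the per-language threshold, per function."""
    dry_pairs = sum(1 for score in pair_scores if score > threshold)
    return dry_pairs / function_count
```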

Axis 2 — Cohesion vs Coupling

| Metric | Formula | Interpretation |
|---|---|---|
| same_file_cohesion | mean(score) for sibling pairs in same file | Higher is generally better: related code stays together |
| cross_file_coupling | mean(score) for sibling pairs in different files | Lower is better (less structural duplication across the codebase) |
| cohesion_coupling_ratio | same_file_cohesion / cross_file_coupling | Higher is better: code organized by responsibility |
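A minimal sketch of the Axis 2 computation, assuming sibling pairs are available as `(path_a, path_b, score)` tuples. The function name and the choice to return `None` when a group is empty are assumptions.

```python
from statistics import mean


def cohesion_coupling(pairs):
    """Return (same_file_cohesion, cross_file_coupling, cohesion_coupling_ratio).

    pairs: iterable of (path_a, path_b, score) sibling pairs.
    A component is None when it cannot be computed (no pairs, or zero coupling).
    """
    pairs = list(pairs)
    same = [score for path_a, path_b, score in pairs if path_a == path_b]
    cross = [score for path_a, path_b, score in pairs if path_a != path_b]
    cohesion = mean(same) if same else None
    coupling = mean(cross) if cross else None
    ratio = None
    if cohesion is not None and coupling:
        ratio = cohesion / coupling
    return cohesion, coupling, ratio
```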

Axis 3 — Isolation

| Metric | Formula | Interpretation |
|---|---|---|
| isolation_rate | % of functions with mean_topk_sim below 1st percentile of repo | Raw signal; both extremes interesting |
| most_isolated_top_n | bottom-N functions by mean_topk_sim | The repo's hand-curated edges, protect during refactors |
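The isolation metrics can be sketched with the standard library. The percentile interpolation method is not specified in this document, so the use of `statistics.quantiles` with `method="inclusive"` is an assumption, as are the function names and the dict input shape.

```python
from statistics import quantiles


def isolation_rate(mean_topk_sims, percentile=1):
    """Fraction of functions whose mean_topk_sim is strictly below the given percentile.

    ASSUMPTION: inclusive interpolation; needs at least two values.
    """
    values = list(mean_topk_sims)
    cutoff = quantiles(values, n=100, method="inclusive")[percentile - 1]
    return sum(1 for value in values if value < cutoff) / len(values)


def most_isolated_top_n(sims_by_function, n=10):
    """Bottom-n functions by mean_topk_sim, most isolated first.

    sims_by_function: dict mapping a function id to its mean_topk_sim.
    """
    return sorted(sims_by_function.items(), key=lambda item: item[1])[:n]
```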

Axis 4 — AI Authorship Correlation

| Metric | Formula | Interpretation |
|---|---|---|
| ai_ratio | (AI-authored lines) / (total lines) at repo level | Raw signal, used to stratify all other metrics |
| ai_uniformity_gap | mean(mean_topk_sim \| AI ratio ≥ 0.7) − mean(mean_topk_sim \| AI ratio ≤ 0.1) | H1 magnitude. Positive = AI code is more uniform |
| ai_cluster_contribution | % of DRY pairs where ≥ 1 function has AI ratio ≥ 0.7 | Higher = AI is the source of duplication |
| ai_isolation_rate | % of high-AI functions that are isolated | Lower expected; surprisingly high values would be interesting |
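The ai_uniformity_gap formula can be illustrated as below. The function name and the `(ai_ratio, mean_topk_sim)` tuple input are assumptions; returning `None` when either group is empty is also an assumption, since the document does not say how that case is handled.

```python
from statistics import mean


def ai_uniformity_gap(functions, high=0.7, low=0.1):
    """Mean similarity of high-AI functions minus mean similarity of low-AI functions.

    functions: iterable of (ai_ratio, mean_topk_sim), one entry per function.
    Functions with a ratio strictly between low and high are ignored.
    """
    functions = list(functions)
    high_sims = [sim for ratio, sim in functions if ratio >= high]
    low_sims = [sim for ratio, sim in functions if ratio <= low]
    if not high_sims or not low_sims:
        return None
    return mean(high_sims) - mean(low_sims)
```

A positive value is the H1 direction: functions that are mostly AI-authored sit closer to their siblings than functions that are mostly human-authored.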

Axis 5 — Cross-Language (multilingual repos only)

| Metric | Formula | Interpretation |
|---|---|---|
| cross_lang_pair_count | # of DRY pairs above threshold where source.language ≠ target.language | High → shared logic mirrored across languages |
| cross_lang_pair_score | mean score of cross-language pairs | Tightness of the mirroring |

Axis 6 — Per-Language Comparison (across all 50+ repos)

| Metric | Formula | Interpretation |
|---|---|---|
| language_uniformity_index | mean(repo_uniformity_index) across repos in this language | Raw signal, language tendency |
| language_ai_gap | mean(ai_uniformity_gap) across repos in this language | Lower = AI writes natively in this language; higher = AI imposes alien pattern |
| ai_friendliness_rank | composite ranking (see Axis 7) | Headline leaderboard for the report |

Axis 7 — Composite

| Score | Components | Use |
|---|---|---|
| ai_code_health_score | (cohesion - coupling) × (1 - cluster_density) × (1 - ai_uniformity_gap), normalized (per repo) | One-number summary for how cleanly AI is integrated |
| ai_friendliness_rank | mean(ai_code_health_score) across repos in language | The leaderboard chart |
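A sketch of the composite arithmetic follows. It computes only the raw product and deliberately leaves out the normalization step, because the normalization scheme is not specified here. Both function names are hypothetical.

```python
from statistics import mean


def raw_health_score(cohesion, coupling, cluster_density, ai_uniformity_gap):
    """Un-normalized composite: (cohesion - coupling) x (1 - density) x (1 - gap)."""
    return (cohesion - coupling) * (1 - cluster_density) * (1 - ai_uniformity_gap)


def language_leaderboard(repo_scores):
    """Mean score per language, best first.

    repo_scores: iterable of (language, score), one entry per repo.
    """
    by_language = {}
    for language, score in repo_scores:
        by_language.setdefault(language, []).append(score)
    means = {language: mean(scores) for language, scores in by_language.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)
```

For example, cohesion 0.05, coupling 0.02, cluster density 0.1, and a gap of 0.01 give 0.03 × 0.9 × 0.99 = 0.02673 before normalization.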

Axis 8 — Quality Proxies

These come from independent tooling and let us cross-reference uniformity findings with traditional quality signals.

| Metric | Tool | Interpretation |
|---|---|---|
| cyclomatic_complexity_per_function | radon (Python), lizard (multi-lang) | Lower generally easier to read |
| lines_per_function | AST | Distribution shape; outliers are interesting |
| comment_density | line counter | Lines starting with # / // / /* ÷ total non-blank lines |
| project_popularity | GitHub API | log(stars) + log(commits last 90d) + log(PRs last 90d) |
| test_ratio | path-based count | (lines under /tests/ or /test/) / (total lines) |
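The comment_density definition is simple enough to show directly. This sketch is an assumed line counter, not the tool used for the report; note that a bare prefix test also counts lines such as Rust attributes (`#[derive(...)]`) or a shebang as comments.

```python
COMMENT_PREFIXES = ("#", "//", "/*")


def comment_density(source):
    """Lines starting with a comment prefix divided by total non-blank lines."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comment_lines = sum(1 for line in lines if line.startswith(COMMENT_PREFIXES))
    return comment_lines / len(lines)
```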

4.5 Per-language similarity threshold

Different languages have different inherent similarity densities (Python has dense embeddings; Rust has more lexical variance). We calibrate the DRY threshold per language at the 95th percentile of the within-repo pair-similarity distribution for repos at that language’s bottom AI bucket (low-AI baseline). This bucket establishes natural similarity for human code in that language. The threshold is then applied uniformly to all repos in that language.
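The calibration step can be sketched as a percentile over the low-AI baseline scores. The function name is hypothetical and the interpolation method (`method="inclusive"`) is an assumption; the document fixes only the percentile.

```python
from statistics import quantiles


def dry_threshold(baseline_pair_scores, percentile=95):
    """Per-language DRY threshold: a percentile of pair similarities
    pooled from that language's low-AI repos.

    ASSUMPTION: inclusive interpolation; needs at least two scores.
    """
    cut_points = quantiles(baseline_pair_scores, n=100, method="inclusive")
    return cut_points[percentile - 1]


# Hypothetical baseline: 101 evenly spaced similarity scores from 0.00 to 1.00.
baseline = [i / 100 for i in range(101)]
threshold = dry_threshold(baseline)
```

Once computed, the same threshold is applied to every repo in that language, so a pair counts as a DRY pair only if it is more similar than 95% of pairs in the human-dominated baseline.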

Thresholds are reported in the methodology output and can be inspected per-language.

05

Sampling

5.1 Universe

Public GitHub repositories where the AI ratio can be computed (i.e., the commit history either carries detectable AI authorship signals or clearly lacks them). Excludes:

  • Private or archived repos
  • Repos whose primary language is not in the v1 set (Python, JavaScript, TypeScript, Go, Rust)
  • Repos under 100 functions (per inclusion criteria above)
  • Repos over 2 GB on disk
  • Forks (we use the canonical upstream)
  • Repos where commit history is unavailable or squashed-only (≤ 50 visible commits)

5.2 Stratification

5 single-language buckets × 3 AI-ratio strata × ~3.3 repos per cell = 50 single-language repos. Plus ~5 multilingual repos for Axis 5 = ~55 total. Actual shipped sample was 48 repos; the gap is documented in the report’s sample section.

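| Bucket | Low AI (<30%) | Mid AI (30–70%) | High AI (>70%) | Total |
|---|---|---|---|---|
| Python | ~3 | ~3–4 | ~3 | 10 |
| TypeScript | ~3 | ~3–4 | ~3 | 10 |
| JavaScript | ~3 | ~3–4 | ~3 | 10 |
| Go | ~3 | ~3–4 | ~3 | 10 |
| Rust | ~3 | ~3–4 | ~3 | 10 |
| Multilingual | ~2 | ~2 | ~1 | ~5 |
| Total | ~17 | ~19 | ~16 | ~55 |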
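Assigning a repo to a stratum from the table above is a two-comparison check. The function name and the string labels are assumptions; the boundaries follow the column headers, with exactly 30% and exactly 70% falling in the mid stratum.

```python
def ai_stratum(ai_ratio):
    """Map a repo-level AI ratio (0..1) to its sampling stratum."""
    if ai_ratio < 0.30:
        return "low"
    if ai_ratio <= 0.70:
        return "mid"
    return "high"
```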

Within each cell:

  • Half by GitHub stars (popularity-weighted)
  • Half by recent commit activity (last 90 days)

This avoids over-sampling famous-but-stale repos and no-name-but-AI-heavy repos. The selected repo list is published in sampling.md after discovery; the SHA of each repo at clone time is captured.

5.3 Reproducibility per repo

For each repo, we capture and publish:

  • Full URL
  • git rev-parse HEAD (commit SHA at time of analysis)
  • git describe --tags --always (nearest tag)
  • Repo size in MB
  • Function count (after inclusion filtering)
  • AI ratio (computed)
  • Language(s) detected
  • Run timestamp
  • semble version
  • methodology version

These let any reader fully reproduce the run. The reference package (agent-uniformity) is public and MIT-licensed; install it and re-run any single repo from the locked SHAs to reproduce within ~1%.

06

Statistical handling

  • Confidence intervals: bootstrap (1000 iterations) for any reported mean
  • Multiple-comparison correction: Benjamini-Hochberg at α = 0.05 across all hypothesis tests reported
  • Outliers: never silently removed; reported separately if dropped from a chart
  • Tied scores: reported as ties; ranks use mean-rank handling
  • Per-language thresholds: published, not buried
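The first two items can be sketched with the standard library. The bootstrap variant is not named in this document, so the percentile bootstrap used here is an assumption, as are the function names and the fixed seed. The second function is the standard Benjamini-Hochberg step-up procedure.

```python
import random
from statistics import mean


def bootstrap_ci(values, iterations=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean (ASSUMED variant)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(iterations))
    lower = means[round((alpha / 2) * iterations)]
    upper = means[round((1 - alpha / 2) * iterations) - 1]
    return lower, upper


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value is at or below alpha * rank / m.
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= alpha * rank / m:
            largest_rank = rank
    # Reject every hypothesis up to and including that rank.
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected
```

The step-up behaviour matters: a p-value that misses its own rank threshold is still rejected if a larger p-value further down the sorted list meets its threshold.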
07

What this methodology does not cover (v1 scope)

  • Functional correctness (does the code work?)
  • Type-error or bug density (would need static analysis or runtime tests)
  • Security (would need SAST tools)
  • Performance (would need profiling)
  • Long-tail languages beyond top 5

These are out of scope for v1. v2 may add complexity-density combined metrics (Axis 8) and security proxies.

08

Bias and caveats

  • AI detection is conservative. Our ground truth catches only commits that carry the explicit markers in section 4.1 (Claude Code footers and parenthetical author tags). Cursor, Copilot, Aider-without-tag, GPT-via-curl, and Claude-via-API-without-footer are missed. Real AI ratios are likely higher than reported.
  • Repo selection bias. Repos that publicly use AI footers self-select for “willing to advertise AI use.” This may skew the high-AI bucket toward indie, hobby, or agentic-tool projects. Documented as a caveat.
  • Semble similarity is not true semantic similarity. It is hybrid lexical + embedding retrieval. Functions with shared variable names will score higher even if logically distinct.
  • Language coverage. Top 5 languages cover most modern web / infra / ML, but exclude C / C++ / Java / PHP / Ruby / Swift / Kotlin. v2 expansion planned.
  • Static analysis. semble does not see runtime behavior, dynamic dispatch, reflection, or metaprogramming. In languages with heavy runtime polymorphism (Ruby, JavaScript), the scores may underestimate true similarity.
09

Versioning

Any change to:

  • Hypothesis set
  • Metric definition
  • Threshold formula
  • Sampling stratification
  • AI detection signal set

… requires a methodology version bump (0.1.0 → 0.2.0). Old reports remain published with their methodology version stamped; new reports cite the new methodology.

Patch versions (0.1.0 → 0.1.1) are reserved for non-substantive corrections (typos, broken links, clarifications).

10

Open / closed policy

The shipped report’s posture is more open than the v0.1.0 freeze contemplated. Current state, in effect on 2026-05-10:

| Component | Access |
|---|---|
| This methodology document | CC BY 4.0: public, citable |
| Per-issue reference package (agent-uniformity) | MIT: public, installable, version-pinned per issue |
| Per-issue analysis package (agent-uniformity-q2-2026) | MIT: public; methodology + analysis CSVs + locked SHAs |
| Raw per-task outputs (HuggingFace dataset) | Public: partial results per repo, replayable |
| Generic orchestration harness (multi-issue parallel runner) | Closed: the internal productized asset; reproducibility does not require it |

The split: methodology + analysis is the public contribution; the runner that orchestrates production runs across many benchmarks is the internal productized asset. The shipped report does not gate any number behind a paywall; any reader can reproduce any single number from the public package.

11

Changelog

| Version | Date | Change |
|---|---|---|
| 0.1.0 | 2026-05-08 | Initial methodology. Pre-registered hypotheses. 50+5 repo stratified sample target across Python / TypeScript / JavaScript / Go / Rust + multilingual. Per-language calibrated DRY thresholds. Bootstrap CIs, Benjamini-Hochberg correction. |

Methodology questions, corrections, or replication notes: open an issue at github.com/saucam/agent-uniformity-q2-2026.