🔬 Methodology

Everything we compare, how we rank, and where our data comes from. Full transparency: no number we can't back with an official source.

📋 What we compare

Each model is documented across about fifteen dimensions, grouped into five families. All are visible in the side-by-side comparator.

Performance

Arena Elo
Human-preference ranking (LMArena).
MMLU
General knowledge, 57 subjects.
SWE-Bench
Solving real GitHub bugs.
HumanEval
Functional code generation.

Technical

Context window
Tokens handled at once (32k → 2M).
Speed
Generation throughput in tokens/second.
Release date
Model age.
Specialty
Primary domain of strength.

Economic

Input/output price
Cost in $ per million tokens.
Free tier
Whether a free tier exists.

Sovereignty & compliance

GDPR compliance
Adherence to EU regulation.
Open source
Weights available, self-hostable.
Origin
Country / publishing company.

Perceived quality

French quality
Fluency and nuance in French (editorial assessment).
Reasoning depth
Ability to chain complex logical steps (editorial assessment).
Creativity
Originality and richness of output (editorial assessment).
Factual reliability
Tendency to avoid hallucinations (editorial assessment).

🧮 How we rank (Podium)

The podium is a subset of the comparison: one weighted score per category. The weights below are the ones the algorithm actually uses; they vary with what matters in each domain.

Generalists

MMLU
30%
GPQA
20%
HellaSwag
10%
Price
20%
Context
10%
Freshness
10%

Code

SWE-bench
50%
HumanEval
30%
Price
10%
Freshness
10%

Vision

MMBench
45%
MMMU
25%
Price
15%
Freshness
15%

Multilingual

MMLU-multi
40%
FLORES
25%
Cultural
15%
Price
10%
Freshness
10%

Open Source

License
30%
Benchmarks
30%
Community
20%
Self-host
20%
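
To make the weighting concrete, here is a minimal sketch of how a category score could be computed from the weights above. It assumes the score is a plain weighted sum of criteria already normalized to a 0-100 scale, which the page does not state explicitly. The dictionary keys and the function name `podium_score` are hypothetical, chosen for illustration only.

```python
# Weights in percent, copied from the tables above.
# Key names are hypothetical; only the numbers come from the page.
WEIGHTS = {
    "generalists": {"mmlu": 30, "gpqa": 20, "hellaswag": 10,
                    "price": 20, "context": 10, "freshness": 10},
    "code": {"swe_bench": 50, "humaneval": 30, "price": 10, "freshness": 10},
    "vision": {"mmbench": 45, "mmmu": 25, "price": 15, "freshness": 15},
    "multilingual": {"mmlu_multi": 40, "flores": 25, "cultural": 15,
                     "price": 10, "freshness": 10},
    "open_source": {"license": 30, "benchmarks": 30,
                    "community": 20, "self_host": 20},
}

# Sanity check: every category's weights add up to 100%.
for name, weights in WEIGHTS.items():
    assert sum(weights.values()) == 100, name


def podium_score(category: str, normalized: dict) -> float:
    """Weighted sum of 0-100 criterion scores (assumed aggregation rule)."""
    weights = WEIGHTS[category]
    missing = sorted(set(weights) - set(normalized))
    if missing:
        raise KeyError(f"missing criteria for {category}: {missing}")
    return sum(w * normalized[k] for k, w in weights.items()) / 100


# Illustrative values only, not real model data.
example = podium_score(
    "code",
    {"swe_bench": 60, "humaneval": 90, "price": 85, "freshness": 75},
)
```

With these made-up inputs, the Code score is 0.5 × 60 + 0.3 × 90 + 0.1 × 85 + 0.1 × 75 = 73. Note that 20 of the 100 points come from price and freshness rather than from benchmarks.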

How values are normalized

Price (inverted): cheaper = better. Free = 100, <$1/M = 95, <$5 = 85, <$20 = 70, <$50 = 50, <$100 = 30, beyond = 15.

Context (tiered): ≥1M = 100, ≥500k = 90, ≥200k = 80, ≥128k = 70, ≥32k = 50.

Freshness: <1 month = 100, <3 months = 90, <6 months = 75, <1 year = 55, then decreases.

License: Apache/MIT = 100, BSD = 95, GPL = 85, Llama (restrictions) = 60.

Self-host (by size): ≤8B = 100 (runs on a Mac), ≤30B = 85, ≤70B = 70; beyond that, a cluster is needed.
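
The price and context tiers above can be read as simple lookup tables. The sketch below is one way to express them, under stated assumptions: a single price figure in $ per million tokens is used (the page does not say whether input or output price applies), "k" and "M" are taken as 1,000 and 1,000,000 tokens, and contexts under 32k return `None` because no score is given for them. The function names are hypothetical.

```python
from typing import Optional

# (exclusive upper bound in $ per million tokens, score)
PRICE_TIERS = [(1, 95), (5, 85), (20, 70), (50, 50), (100, 30)]

# (inclusive lower bound in tokens, score); assumes k = 1,000 and M = 1,000,000
CONTEXT_TIERS = [
    (1_000_000, 100),
    (500_000, 90),
    (200_000, 80),
    (128_000, 70),
    (32_000, 50),
]


def price_score(usd_per_million: float) -> int:
    """Inverted price tier: the cheaper the model, the higher the score."""
    if usd_per_million < 0:
        raise ValueError("price cannot be negative")
    if usd_per_million == 0:
        return 100  # free
    for upper, score in PRICE_TIERS:
        if usd_per_million < upper:
            return score
    return 15  # $100/M and beyond


def context_score(tokens: int) -> Optional[int]:
    """Context tier; None below 32k, where the page gives no score."""
    for lower, score in CONTEXT_TIERS:
        if tokens >= lower:
            return score
    return None
```

Because the tiers are step functions, a small price change at a boundary moves the score by a whole tier: under these rules $0.99/M scores 95, while $1.00/M scores 85.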

Reliability & sources

This is what sets this comparator apart from a mere table. Our data commitment:

Official sources only
Prices and benchmarks verified on publishers' own pages (Anthropic, OpenAI, Google, Mistral, Moonshot…), not third-party aggregators.
Per-model traceability
Each price carries a source URL and a verification date. Anything we have not verified is not presented as verified.
Monthly verification
On the 1st of each month, a check is performed. The change history is public.
Sourced or excluded
A model whose benchmarks can't be independently verified (e.g. a non-public model) isn't ranked like the others and is flagged "restricted access".
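
As an illustration of per-model traceability, a price entry could be stored with its source URL and verification date, and flagged for re-checking once a new month begins. This is a sketch under assumptions: the class `PriceRecord`, its field names, and the example URL are hypothetical and do not describe the site's actual data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PriceRecord:
    """One price entry with its provenance (hypothetical structure)."""
    model: str
    usd_per_million_input: float
    usd_per_million_output: float
    source_url: Optional[str]    # publisher's own pricing page
    verified_on: Optional[date]  # date of the last check

    def is_verified(self) -> bool:
        # A price counts as verified only if both provenance fields are set.
        return self.source_url is not None and self.verified_on is not None

    def needs_recheck(self, today: date) -> bool:
        # Monthly cycle: anything last checked before the 1st of the
        # current month is due again.
        if not self.is_verified():
            return True
        return self.verified_on < today.replace(day=1)


# Placeholder values, not real pricing data.
record = PriceRecord(
    model="example-model",
    usd_per_million_input=3.0,
    usd_per_million_output=15.0,
    source_url="https://example.com/pricing",
    verified_on=date(2025, 1, 1),
)
```

Keeping the source and date on the record itself, rather than in a separate log, means an unverified price can never be displayed without its missing provenance being visible.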

⚖️ Acknowledged limits

Price is part of the score: a model can rank well largely because it's cheap. Our ranking reflects value for money, not raw power alone.

Freshness is rewarded: a recent model gains a few points. It's a deliberate choice, since the field moves fast, but it can over-weight novelty.

Benchmarks don't say everything: a model can score well yet disappoint in practice. That's why every ranking is human-reviewed before publication.