🔬 Methodology

What we compare, how we rank, and where our data comes from. Full transparency: we publish no number that we can't back with an official source.

📋 What we compare

Each model is documented across about fifteen dimensions, grouped into five families. All are visible in the side-by-side comparator.

Performance

Arena Elo

Human-preference ranking (LMArena).

MMLU

General knowledge, 57 subjects.

SWE-Bench

Solving real GitHub bugs.

HumanEval

Functional code generation.

Technical

Context window

Tokens handled at once (32k → 2M).

Speed

Generation throughput in tokens/second.

Release date

Model age.

Specialty

Primary domain of strength.

Economic

Input/output price

Cost in $ per million tokens.

Free tier

Whether a free tier exists.

Sovereignty & compliance

GDPR compliance

Adherence to EU regulation.

Open source

Weights available, self-hostable.

Origin

Country / publishing company.

Perceived quality

French quality

Fluency and nuance in French (editorial assessment).

Reasoning depth

Ability to chain complex logical steps (editorial assessment).

Creativity

Originality and richness of output (editorial assessment).

Factual reliability

Tendency to avoid hallucinations (editorial assessment).

🧮 How we rank (Podium)

The podium draws on a subset of the data and reduces it to one weighted score per category. The weights below are the ones the algorithm actually uses. They differ from one category to the next, according to what matters most in each domain.

Generalists

MMLU

30%

GPQA

20%

HellaSwag

10%

Price

20%

Context

10%

Freshness

10%

Code

SWE-bench

50%

HumanEval

30%

Price

10%

Freshness

10%

Vision

MMBench

45%

MMMU

25%

Price

15%

Freshness

15%

Multilingual

MMLU-multi

40%

FLORES

25%

Cultural

15%

Price

10%

Freshness

10%

Open Source

License

30%

Benchmarks

30%

Community

20%

Self-host

20%
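To make the arithmetic concrete, here is a minimal sketch of how a weighted score could be computed from the weights listed above. It assumes every sub-score has already been normalized to a 0-100 scale; the names `WEIGHTS` and `podium_score`, the key spellings, and the sample sub-scores are hypothetical and chosen for illustration, not taken from the actual algorithm.

```python
# Weights copied from the tables above, expressed as fractions of 1.
# Key names are illustrative; the real algorithm may label them differently.
WEIGHTS = {
    "generalists": {"mmlu": 0.30, "gpqa": 0.20, "hellaswag": 0.10,
                    "price": 0.20, "context": 0.10, "freshness": 0.10},
    "code": {"swe_bench": 0.50, "humaneval": 0.30,
             "price": 0.10, "freshness": 0.10},
    "vision": {"mmbench": 0.45, "mmmu": 0.25,
               "price": 0.15, "freshness": 0.15},
    "multilingual": {"mmlu_multi": 0.40, "flores": 0.25, "cultural": 0.15,
                     "price": 0.10, "freshness": 0.10},
    "open_source": {"license": 0.30, "benchmarks": 0.30,
                    "community": 0.20, "self_host": 0.20},
}


def podium_score(category: str, subscores: dict) -> float:
    """Weighted sum of 0-100 sub-scores for one podium category."""
    weights = WEIGHTS[category]
    missing = set(weights) - set(subscores)
    if missing:
        # Assumption: a model lacking a sub-score is rejected, not scored as 0.
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(weight * subscores[name] for name, weight in weights.items())


# Made-up sub-scores for an imaginary model in the Code category.
example = {"swe_bench": 60, "humaneval": 90, "price": 85, "freshness": 75}
print(round(podium_score("code", example), 1))
```

In this example, price and freshness together contribute 16 of the final points, which shows how a cheap, recent model can close part of the gap with a stronger but more expensive one.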

How values are normalized

Price — inverted: cheaper = better. Free = 100, <$1/M = 95, <$5 = 85, <$20 = 70, <$50 = 50, <$100 = 30, beyond = 15.

Context — tiered: ≥1M = 100, ≥500k = 90, ≥200k = 80, ≥128k = 70, ≥32k = 50.

Freshness — <1 month = 100, <3 months = 90, <6 months = 75, <1 year = 55, then decreases.

License — Apache/MIT = 100, BSD = 95, GPL = 85, Llama (restrictions) = 60.

Self-host — by size: ≤8B = 100 (runs on a Mac), ≤30B = 85, ≤70B = 70, beyond needs a cluster.
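The tiers above translate naturally into small lookup functions. The sketch below is one possible reading, not the actual implementation: function names are hypothetical, "k" and "M" are assumed to mean decimal thousands and millions of tokens, age is assumed to be measured in months, and `None` is returned wherever the rules above give no exact value (context under 32k, age of one year or more, models above 70B).

```python
from typing import Optional

# License scores as listed above; lowercase keys are an illustrative convention.
LICENSE_SCORES = {"apache": 100, "mit": 100, "bsd": 95, "gpl": 85, "llama": 60}


def price_score(usd_per_million: float) -> int:
    """Inverted scale: the cheaper the model, the higher the score."""
    if usd_per_million <= 0:
        return 100  # free
    for limit, score in ((1, 95), (5, 85), (20, 70), (50, 50), (100, 30)):
        if usd_per_million < limit:
            return score
    return 15  # $100 per million tokens and beyond


def context_score(tokens: int) -> Optional[int]:
    """Tiered score by context window size."""
    tiers = ((1_000_000, 100), (500_000, 90), (200_000, 80),
             (128_000, 70), (32_000, 50))
    for floor, score in tiers:
        if tokens >= floor:
            return score
    return None  # below 32k: no score given in the rules above


def freshness_score(age_months: float) -> Optional[int]:
    """Tiered score by model age in months."""
    for limit, score in ((1, 100), (3, 90), (6, 75), (12, 55)):
        if age_months < limit:
            return score
    return None  # one year or older: "then decreases", exact curve not given


def self_host_score(params_billion: float) -> Optional[int]:
    """Tiered score by parameter count, in billions."""
    for ceiling, score in ((8, 100), (30, 85), (70, 70)):
        if params_billion <= ceiling:
            return score
    return None  # larger models need a cluster; no score given above


print(price_score(3.0), context_score(200_000), self_host_score(7))
```

Note the boundary behavior: price and freshness use strict upper bounds (a price of exactly $1 falls into the `<$5` tier), while context and self-host use inclusive bounds, matching the `<`, `≥`, and `≤` signs in the rules.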

✅ Reliability & sources

This is what sets this comparator apart from a mere table. Our data commitment:

Official sources only

Prices and benchmarks verified on publishers' own pages (Anthropic, OpenAI, Google, Mistral, Moonshot…), not third-party aggregators.

Per-model traceability

Each price carries a source URL and a verification date. Anything we haven't verified is not presented as verified.

Monthly verification

We run a verification check on the 1st of each month. The change history is public.

Sourced or excluded

A model whose benchmarks can't be independently verified (e.g. a non-public model) isn't ranked like the others and is flagged "restricted access".
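As an illustration of the traceability and monthly-check commitments above, a price entry could be modeled along these lines. This is a sketch under assumptions: the `PriceRecord` type, its field names, the `is_stale` helper, and the sample values (including the placeholder URL) are all hypothetical and do not describe the site's actual data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PriceRecord:
    """One price entry with its provenance (illustrative fields only)."""
    model: str
    usd_per_million_input: float
    usd_per_million_output: float
    source_url: str    # publisher's own pricing page
    verified_on: date  # date of the last manual verification


def last_scheduled_check(today: date) -> date:
    """The most recent 1st of the month, when a check is due."""
    return today.replace(day=1)


def is_stale(record: PriceRecord, today: date) -> bool:
    """True if the record has not been verified since the last scheduled check."""
    return record.verified_on < last_scheduled_check(today)


# Placeholder values, not real pricing data.
record = PriceRecord(
    model="example-model",
    usd_per_million_input=3.0,
    usd_per_million_output=15.0,
    source_url="https://example.com/pricing",
    verified_on=date(2025, 1, 1),
)
print(is_stale(record, date(2025, 1, 20)), is_stale(record, date(2025, 2, 3)))
```

Keeping the source URL and verification date on the record itself means a stale or unsourced price can be detected mechanically instead of relying on memory.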

⚖️ Acknowledged limits

Price is part of the score: a model can rank well largely because it's cheap. Our ranking reflects value for money, not raw power alone.

Freshness is rewarded: a recent model gains a few points. It's a deliberate choice, since the field moves fast — but it can over-weight novelty.

Benchmarks don't say everything: a model can score well yet disappoint in practice. That's why every ranking is human-reviewed before publication.
