SourceBench
Leaderboard for cited source quality in generative engines

Can AI answers cite high-quality web sources?

SourceBench focuses on a different evaluation target from standard answer-quality benchmarks. Instead of only asking whether a model answered well, it asks whether the model cited sources that are relevant, accurate, fresh, transparent, authoritative, and usable. This Space hosts the public-facing leaderboard frontend. Official leaderboard entries are validated and judged by the SourceBench team.

Benchmark target: Quality of cited sources, not just final answer correctness.
Current scope: Generative engines with built-in web search, plus official validation for leaderboard inclusion.
Official policy: Official leaderboard entries are validated and judged by the SourceBench team using fixed hidden evaluation settings.
Current leaderboard snapshot
  • Models in current board: 0
  • Query types (benchmark query slices): 0
  • Top model (highest weighted content score): -
  • Top weighted score (weighted source-quality metric): -
Leaderboard

Ranking Table


SourceBench ranks systems by judged source quality rather than answer fluency alone. The main leaderboard target is the weighted overall score.

Turn on "Show dimension scores" in the Overall view to inspect the eight judged dimensions: semantic relevance, factual accuracy, freshness, objectivity, layout/ad density, accountability, transparency, and authority.
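
As a rough illustration of how such a weighted metric could be computed, the sketch below combines the eight dimension scores with placeholder weights. The equal weighting and the function name are assumptions for illustration, not the official SourceBench computation.

    # Minimal sketch of a weighted source-quality score over the eight judged
    # dimensions. Equal weights are placeholders; the official SourceBench
    # weighting and aggregation are fixed by the team's evaluation settings.
    DIMENSIONS = [
        "semantic_relevance",
        "factual_accuracy",
        "freshness",
        "objectivity",
        "layout_ad_density",
        "accountability",
        "transparency",
        "authority",
    ]

    WEIGHTS = {dim: 1.0 / len(DIMENSIONS) for dim in DIMENSIONS}  # hypothetical equal weights

    def weighted_overall_score(dimension_scores: dict) -> float:
        """Combine per-dimension judge scores (e.g. in [0, 1]) into one weighted score."""
        return sum(WEIGHTS[d] * dimension_scores[d] for d in DIMENSIONS)

    # Example: a source judged at 0.8 on every dimension scores 0.8 overall.
    print(round(weighted_overall_score({d: 0.8 for d in DIMENSIONS}), 3))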

Overall ranking

DeepSeek Tool Study

DeepSeek variants with different search backends

SourceBench also includes a focused comparison of DeepSeek variants paired with different retrieval setups. This is a separate study rather than part of the main model family ranking: the purpose is to isolate how search backend choice and reasoning mode change citation quality, overlap with search results, and the final weighted source score.
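
One of the reported quantities, overlap between cited URLs and the search backend's results, could be computed along the lines of the sketch below. The URL normalization rule and the fraction-of-citations definition are assumptions for illustration, not the official SourceBench metric.

    from urllib.parse import urlparse

    def normalize_url(url: str) -> str:
        """Crude URL normalization (lowercased host, path without trailing slash,
        query and fragment dropped). This rule is an assumption, not SourceBench's."""
        parts = urlparse(url)
        return parts.netloc.lower() + parts.path.rstrip("/")

    def citation_search_overlap(cited_urls, search_result_urls) -> float:
        """Fraction of cited URLs that also appear among the retrieved search results."""
        cited = {normalize_url(u) for u in cited_urls}
        retrieved = {normalize_url(u) for u in search_result_urls}
        return len(cited & retrieved) / len(cited) if cited else 0.0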

Load leaderboard data to see the DeepSeek tool study.
Official Policy

How official leaderboard evaluation works

A local self-check can be run with the public SourceBench benchmark code and the fixed public query split.
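
As a very rough sketch, a local self-check amounts to running your own system over the public query split and then scoring the outputs with the public benchmark code. The file name, record fields, and system call below are hypothetical placeholders, not the repo's actual interface.

    import json

    def run_my_system(query: str):
        """Placeholder for the participant's own generative engine; should return
        (answer_text, cited_urls). Replace with a real call."""
        raise NotImplementedError

    outputs = []
    with open("public_query_split.jsonl") as f:  # hypothetical path to the public split
        for line in f:
            record = json.loads(line)
            answer_text, cited_urls = run_my_system(record["query"])
            outputs.append({
                "query_id": record.get("query_id"),
                "answer_text": answer_text,
                "cited_urls": cited_urls,
            })

    # Scoring of these outputs would then be done with the public SourceBench
    # benchmark code from the repository.
    with open("self_check_outputs.json", "w") as f:
        json.dump(outputs, f, indent=2)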

Official leaderboard entries are not accepted from participant-computed final scores alone. Instead, entries are validated and judged by the SourceBench team.

For official evaluation, SourceBench uses hidden holdout queries, the fixed judging setup, and the fixed metric computation pipeline so that leaderboard rows remain comparable across systems.

Submission

What participants should submit

Preferred submission: endpoint access. Submit the model endpoint, API key, model name, API format, and optional generation settings. The SourceBench team will run hidden queries, source collection, scraping, judging, and metric computation server-side.
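
A sketch of what such an endpoint submission might look like is shown below. The field names are illustrative assumptions; the authoritative schema is leaderboard/examples/endpoint_submission.example.json in the benchmark repo.

    import json

    # Illustrative endpoint submission payload. Field names are assumptions;
    # see leaderboard/examples/endpoint_submission.example.json for the real schema.
    endpoint_submission = {
        "model_name": "my-model-v1",
        "api_format": "openai-chat",          # which request/response shape the endpoint speaks
        "endpoint_url": "https://api.example.com/v1/chat/completions",
        "api_key": "<shared privately with the SourceBench team>",
        "generation_settings": {              # optional
            "temperature": 0.2,
            "max_tokens": 1024,
        },
    }

    with open("endpoint_submission.json", "w") as f:
        json.dump(endpoint_submission, f, indent=2)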

Fallback submission: answer + cited URL bundle. If endpoint access cannot be shared, submit per-query answer text together with cited URLs. The SourceBench team will run scraping, post-processing, judging, and metric computation server-side.
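
Similarly, a per-query answer + cited URL bundle might be structured roughly as below. The field names are assumptions; the authoritative schema is leaderboard/examples/answer_url_bundle.example.json in the benchmark repo.

    import json

    # Illustrative answer + cited URL bundle. Field names are assumptions;
    # see leaderboard/examples/answer_url_bundle.example.json for the real schema.
    answer_url_bundle = {
        "model_name": "my-model-v1",
        "results": [
            {
                "query_id": "q-0001",
                "answer_text": "Final answer text produced for this query...",
                "cited_urls": [
                    "https://example.org/some-report",
                    "https://example.com/background-article",
                ],
            },
            # ...one entry per benchmark query
        ],
    }

    with open("answer_url_bundle.json", "w") as f:
        json.dump(answer_url_bundle, f, indent=2)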

Why these boundaries? They keep the standardized parts of the benchmark (hidden queries, scraping, judging, and metric computation) under SourceBench control, so official leaderboard entries are validated and judged by the team rather than accepted from participant-provided final scores.

Benchmark repository and submission examples:

  • SourceBench benchmark repository
  • leaderboard/examples/endpoint_submission.example.json in the benchmark repo
  • leaderboard/examples/answer_url_bundle.example.json in the benchmark repo