SourceBench
Leaderboard for cited source quality in generative engines

Can AI answers cite high-quality web sources?

SourceBench focuses on a different evaluation target from standard answer-quality benchmarks. Instead of only asking whether a model answered well, it asks whether the model cited sources that are relevant, accurate, fresh, transparent, authoritative, and usable. This Space hosts the public-facing leaderboard frontend. Official leaderboard entries are validated and judged by the SourceBench team.

Benchmark target: Quality of cited sources, not just final answer correctness.
Current scope: Generative engines with built-in web search, plus official validation for leaderboard inclusion.
Official policy: Official leaderboard entries are validated and judged by the SourceBench team using fixed hidden evaluation settings.
Current leaderboard snapshot
  • Models in current board: 0
  • Query types (benchmark query slices): 0
  • Top model (highest weighted content score): -
  • Top weighted score (weighted source-quality metric): -
Leaderboard

Ranking Table


SourceBench ranks systems by judged source quality rather than answer fluency alone. The main leaderboard target is the weighted overall score.

Turn on "Show dimension scores" in the Overall view to inspect the eight judged dimensions: semantic relevance, factual accuracy, freshness, objectivity, layout/ad density, accountability, transparency, and authority.
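
As a rough illustration of how such a weighted metric could be computed, the sketch below combines the eight dimension scores with placeholder weights. The equal weighting and the function name are assumptions for illustration, not the official SourceBench computation.

    # Minimal sketch of a weighted source-quality score over the eight judged
    # dimensions. Equal weights are placeholders; the official SourceBench
    # weighting and aggregation are fixed by the team's evaluation settings.
    DIMENSIONS = [
        "semantic_relevance",
        "factual_accuracy",
        "freshness",
        "objectivity",
        "layout_ad_density",
        "accountability",
        "transparency",
        "authority",
    ]

    WEIGHTS = {dim: 1.0 / len(DIMENSIONS) for dim in DIMENSIONS}  # hypothetical equal weights

    def weighted_overall_score(dimension_scores: dict) -> float:
        """Combine per-dimension judge scores (e.g. in [0, 1]) into one weighted score."""
        return sum(WEIGHTS[d] * dimension_scores[d] for d in DIMENSIONS)

    # Example: a source judged at 0.8 on every dimension scores 0.8 overall.
    print(round(weighted_overall_score({d: 0.8 for d in DIMENSIONS}), 3))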

Overall ranking

DeepSeek Tool Study

DeepSeek variants with different search backends

SourceBench also includes a focused comparison of DeepSeek variants paired with different retrieval setups. This is a separate study rather than part of the main model family ranking: the purpose is to isolate how search backend choice and reasoning mode change citation quality, overlap with search results, and the final weighted source score.
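
One of the reported quantities, overlap between cited URLs and the search backend's results, could be computed along the lines of the sketch below. The URL normalization rule and the fraction-of-citations definition are assumptions for illustration, not the official SourceBench metric.

    from urllib.parse import urlparse

    def normalize_url(url: str) -> str:
        """Crude URL normalization (lowercased host, path without trailing slash,
        query and fragment dropped). This rule is an assumption, not SourceBench's."""
        parts = urlparse(url)
        return parts.netloc.lower() + parts.path.rstrip("/")

    def citation_search_overlap(cited_urls, search_result_urls) -> float:
        """Fraction of cited URLs that also appear among the retrieved search results."""
        cited = {normalize_url(u) for u in cited_urls}
        retrieved = {normalize_url(u) for u in search_result_urls}
        return len(cited & retrieved) / len(cited) if cited else 0.0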

Load leaderboard data to see the DeepSeek tool study.
Official Policy

How official leaderboard evaluation works

A local self-check can be run with the public SourceBench benchmark code and the fixed public query split.
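
As a very rough sketch, a local self-check amounts to running your own system over the public query split and then scoring the outputs with the public benchmark code. The file name, record fields, and system call below are hypothetical placeholders, not the repo's actual interface.

    import json

    def run_my_system(query: str):
        """Placeholder for the participant's own generative engine; should return
        (answer_text, cited_urls). Replace with a real call."""
        raise NotImplementedError

    outputs = []
    with open("public_query_split.jsonl") as f:  # hypothetical path to the public split
        for line in f:
            record = json.loads(line)
            answer_text, cited_urls = run_my_system(record["query"])
            outputs.append({
                "query_id": record.get("query_id"),
                "answer_text": answer_text,
                "cited_urls": cited_urls,
            })

    # Scoring of these outputs would then be done with the public SourceBench
    # benchmark code from the repository.
    with open("self_check_outputs.json", "w") as f:
        json.dump(outputs, f, indent=2)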

Official leaderboard entries are not accepted from participant-computed final scores alone. Instead, entries are validated and judged by the SourceBench team.

For official evaluation, SourceBench uses hidden holdout queries, the fixed judging setup, and the fixed metric computation pipeline so that leaderboard rows remain comparable across systems.

Submission

What participants should submit

Preferred submission: endpoint access. Submit the model endpoint, API key, model name, API format, and optional generation settings. The SourceBench team will run hidden queries, source collection, scraping, judging, and metric computation server-side.
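
A sketch of what such an endpoint submission might look like is shown below. The field names are illustrative assumptions; the authoritative schema is leaderboard/examples/endpoint_submission.example.json in the benchmark repo.

    import json

    # Illustrative endpoint submission payload. Field names are assumptions;
    # see leaderboard/examples/endpoint_submission.example.json for the real schema.
    endpoint_submission = {
        "model_name": "my-model-v1",
        "api_format": "openai-chat",          # which request/response shape the endpoint speaks
        "endpoint_url": "https://api.example.com/v1/chat/completions",
        "api_key": "<shared privately with the SourceBench team>",
        "generation_settings": {              # optional
            "temperature": 0.2,
            "max_tokens": 1024,
        },
    }

    with open("endpoint_submission.json", "w") as f:
        json.dump(endpoint_submission, f, indent=2)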

Fallback submission: answer + cited URL bundle. If endpoint access cannot be shared, submit per-query answer text together with cited URLs. The SourceBench team will run scraping, post-processing, judging, and metric computation server-side.
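
Similarly, a per-query answer + cited URL bundle might be structured roughly as below. The field names are assumptions; the authoritative schema is leaderboard/examples/answer_url_bundle.example.json in the benchmark repo.

    import json

    # Illustrative answer + cited URL bundle. Field names are assumptions;
    # see leaderboard/examples/answer_url_bundle.example.json for the real schema.
    answer_url_bundle = {
        "model_name": "my-model-v1",
        "results": [
            {
                "query_id": "q-0001",
                "answer_text": "Final answer text produced for this query...",
                "cited_urls": [
                    "https://example.org/some-report",
                    "https://example.com/background-article",
                ],
            },
            # ...one entry per benchmark query
        ],
    }

    with open("answer_url_bundle.json", "w") as f:
        json.dump(answer_url_bundle, f, indent=2)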

Why these boundaries? They keep the standardized parts of the benchmark (hidden queries, scraping, judging, and metric computation) under SourceBench control, so official leaderboard entries are validated and judged by the team rather than accepted from participant-provided final scores.

Benchmark repository and submission examples:

  • SourceBench benchmark repository
  • leaderboard/examples/endpoint_submission.example.json in the benchmark repo
  • leaderboard/examples/answer_url_bundle.example.json in the benchmark repo