One Question, One World.

Qworld: Question-specific evaluation criteria for LLMs.

Shanghua Gao1,*, Yuchang Su1,*, Pengwei Sui1, Curtis Ginder1,2, Marinka Zitnik1,3,4,5,‡

1Department of Biomedical Informatics, Harvard Medical School   2Department of Medicine, Brigham and Women's Hospital
3Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University   4Broad Institute of MIT and Harvard   5Harvard Data Science Initiative

*Equal contribution. ‡Correspondence: marinka@hms.harvard.edu

Given a question, Qworld recursively expands it into scenarios, perspectives, and fine-grained binary criteria — building an evaluation world unique to that question.

One question, one world of evaluation criteria.

Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question's context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question.

This work introduces One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question.
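
To make the output concrete, the following Python literal sketches the shape that question-specific criteria take: a question decomposed into scenarios and perspectives, with each leaf a weighted binary condition. The question, statements, and weights below are illustrative inventions, not items from the released dataset.

question = "I have had a mild headache for three days. Should I worry?"

# Each leaf criterion is a binary condition with an importance weight;
# "scenario" and "perspective" record where in the expansion tree it arose.
criteria = [
    {"scenario": "patient self-triage at home",
     "perspective": "safety",
     "statement": "Names red-flag symptoms (e.g., sudden severe onset) that warrant urgent care",
     "weight": 3.0},
    {"scenario": "patient self-triage at home",
     "perspective": "communication",
     "statement": "Avoids unexplained medical jargon",
     "weight": 1.0},
]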

On HealthBench, Qworld covers 89% of expert-authored criteria, and 79% of its generated criteria are novel, as validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those produced by prior methods. When applied to 11 frontier LLMs on HealthBench and Humanity's Last Exam, Qworld reveals capability differences in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish.

By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.

Every question opens a new world.

Different domains demand different lenses. Qworld discovers hundreds of fine-grained evaluation dimensions — each uniquely shaped by the questions it serves.

HealthBench

530+ dimensions from 200k+ question-specific criteria

Humanity's Last Exam

950+ dimensions from 100k+ question-specific criteria

Beyond expert criteria.

Expert-authored criteria on HealthBench span just 5 broad dimensions. Qworld's recursive expansion generates 45+ criteria per question, which together uncover 530+ fine-grained evaluation dimensions, expert-clustered into 24 structured dimensions — exposing capability differences in crucial dimensions such as long-term impact, equity, and privacy that coarse rubrics cannot distinguish.

Expert criteria

5 dimensions

Qworld criteria

200k+ criteria · 530+ dimensions · 24 clusters

Fine-grained evaluation leaderboard.

Model performance scored against Qworld-generated criteria — revealing capability differences across HealthBench and Humanity's Last Exam.
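
A minimal sketch of how a response might be scored against weighted binary criteria. The judge callback is a hypothetical stand-in for an LLM grader that checks one criterion at a time; the leaderboard's actual grading pipeline is described in the paper.

from dataclasses import dataclass

@dataclass
class Criterion:
    statement: str  # binary condition the answer must satisfy
    weight: float   # importance weight assigned during expansion

def score_response(response, criteria, judge):
    """Weighted fraction of criteria the response satisfies.
    judge(response, statement) -> bool is a hypothetical LLM-grader hook."""
    total = sum(c.weight for c in criteria)
    passed = sum(c.weight for c in criteria if judge(response, c.statement))
    return passed / total if total else 0.0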

Per-question criteria explorer.

Each question inhabits its own evaluation world. Select a question to see the tailored criteria Qworld generated through recursive expansion.


Try Qworld.

Generate question-specific evaluation criteria using the Recursive Expansion Tree. Enter any question and watch Qworld decompose it into scenarios, perspectives, and fine-grained criteria.

Configuration

Scenario expansions: 2
Perspective expansions: 2
Criteria expansions: 2
Dedup threshold: 0.7
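
The dedup threshold can be read as a cosine-similarity cutoff over embedded criteria. A minimal sketch, assuming greedy filtering and a hypothetical embed(text) sentence-embedding callback; the demo's actual deduplication procedure may differ.

import numpy as np

def dedup_criteria(criteria, embed, threshold=0.7):
    """Keep a criterion only if its cosine similarity to every
    already-kept criterion stays below the threshold."""
    kept, kept_vecs = [], []
    for text in criteria:
        v = embed(text)
        v = v / np.linalg.norm(v)  # unit-normalize so the dot product is cosine similarity
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept.append(text)
            kept_vecs.append(v)
    return kept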

Output

Ready

Configure a question and click Generate Criteria to start.

How Qworld works.

A Recursive Expansion Tree that turns one question into a complete evaluation world.

Recursive Expansion Tree diagram
1. Scenarios
The distinct real-world contexts a question implies — each with its own intent, audience, and constraints.

2. Perspectives
The evaluation axes along which answer quality should be measured within each scenario.

3. Criteria
Specific, binary conditions with importance weights — 45+ tailored criteria per question.
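
A minimal sketch of the three-level expansion, assuming a hypothetical propose(parent_text, child_kind, n) callback that wraps an LLM call drafting n children at each level; the paper's horizontal expansion, weighting, and deduplication steps are omitted here.

from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                           # "question", "scenario", "perspective", or "criterion"
    text: str
    children: list = field(default_factory=list)

NEXT = {"question": "scenario", "scenario": "perspective", "perspective": "criterion"}

def expand(node, propose, branching=2):
    """Depth-first expansion: question -> scenarios -> perspectives -> criteria.
    branching=2 mirrors the demo's default expansion factors."""
    child_kind = NEXT.get(node.kind)
    if child_kind is None:              # criteria are the leaves
        return node
    for text in propose(node.text, child_kind, branching):
        node.children.append(expand(Node(child_kind, text), propose, branching))
    return node

def collect_criteria(node):
    """Collect the binary criteria at the frontier of the tree."""
    if node.kind == "criterion":
        return [node]
    return [leaf for child in node.children for leaf in collect_criteria(child)]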

Validate the quality of Qworld criteria.

Qworld-generated criteria compared against four state-of-the-art methods on HealthBench, evaluated with both automatic metrics and human expert ratings. For more detailed quantitative results, refer to our paper.

Method        Coverage ↑   Uniqueness ↑   Insight ↑   Granularity ↑
TICK          0.46         0.24           0.20        0.90
RocketEval    0.53         0.26           0.42        0.94
OpenRubrics   0.54         0.37           0.36        0.54
EvalAgent     0.83         0.50           0.43        0.73
Qworld        0.89         0.79           0.83        0.96

Coverage: proportion of expert-authored criteria captured by the generated criteria.
Uniqueness: proportion of generated criteria not in the expert set (novel criteria).
Insight: expert-rated score for surfacing non-obvious requirements (Qworld: 0.40 above prior best).
Granularity: expert-rated score for the specificity and actionability of criteria.
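
Coverage and uniqueness follow from a pairwise match between generated and expert criteria. A sketch, assuming a hypothetical matches(generated, expert) -> bool semantic-equivalence check (e.g., an LLM judge); the paper's exact matching protocol is not reproduced here.

def coverage_and_uniqueness(generated, expert, matches):
    """Coverage: fraction of expert criteria matched by some generated criterion.
    Uniqueness: fraction of generated criteria matching no expert criterion."""
    covered = sum(any(matches(g, e) for g in generated) for e in expert)
    novel = sum(not any(matches(g, e) for e in expert) for g in generated)
    coverage = covered / len(expert) if expert else 0.0
    uniqueness = novel / len(generated) if generated else 0.0
    return coverage, uniqueness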

Citation

BibTeX
@article{gao2025qworld,
  title={Qworld: Question-Specific Evaluation Criteria for LLMs},
  author={Gao, Shanghua and Su, Yuchang and Sui, Pengwei and Ginder, Curtis and Zitnik, Marinka},
  year={2025}
}

Evaluation dimensions

Qworld uncovers 950+ fine-grained evaluation dimensions organized into 23 expert-curated clusters — spanning mathematical rigor, interdisciplinary reasoning, and frontier knowledge assessment.

Click a cluster to explore its fine-grained dimensions.