From the Lab
Research that Ships
Published artifacts from HFC built with Scale research teams and collaborators across the industry's top labs.
Fellowship work intended to be used, cited, and extended.

Published in Nature, January 2026
A benchmark of expert-level academic questions to assess AI capabilities
Humanity's Last Exam is a benchmark designed to measure AI performance against expert-level academic questions across disciplines.
An HFC-guided research artifact built with collaborators across frontier AI and evaluation.
Read the paper here
ASPI: Prompt Injection in LLM Agents
Benchmark of 728 task–attack scenarios testing whether clarification-seeking agents become more vulnerable to prompt injection.

SciPredict
Experimental benchmark evaluating how models move real science forward without breaking the constraints of the physical world.

MoReBench: Moral Reasoning in LLMs
1,000 moral scenarios with expert rubrics for identifying considerations, weighing trade-offs, and recommending action.

Professional Reasoning Benchmark
Benchmark for professional-grade reasoning in high-stakes domains.

Research Rubrics: Evaluating Deep Research Agents
Rubrics for judging research agents on real research behaviors.

Best Practices for Biorisk Evaluations
Method guidance for high-consequence bio evaluation.

Building Autoraters for Expert-Level Reasoning Data
Autoraters that scale expert judgment into consistent signals.

MultiChallenge: Multi-Turn Conversation Evaluation
Benchmark for multi-turn robustness under realistic conversation pressure.
Kinds of Contribution
Most Fellows enter research through applied work. A select group go on to shape studies and, sometimes, co-author full papers.
Contribution I
Co-Authoring Research Papers
For a subset of projects, Fellows become named authors on full papers and long-form reports.
Contribution II
Design Research Projects and Lab Advisory
Join small working groups that scope questions, choose metrics, and advise Scale Research on study design.
Contribution III
Research Style Review, Rubrics, and Evaluation
Help grade model behavior, stress-test benchmarks, and refine rubrics so results stand up to real scrutiny.

Results from the Field
Scale Research turns domain expertise into published benchmarks, leaderboards, and evaluation methods that shape how the industry measures frontier AI.
Selected Collaborators
Frontier Leaderboards
View all leaderboards on Scale SEAL
SciPredict
Forecasting scientific experiment outcomes
Professional Reasoning Benchmark - Finance
Evaluating Professional Reasoning in Finance
Professional Reasoning Benchmark - Legal
Evaluating Professional Reasoning in Legal Practice
Humanity's Last Exam
Challenging LLMs at the frontier of human knowledge
Humanity's Last Exam (Text Only)
Challenging LLMs at the frontier of human knowledge
MCP Atlas
Evaluating real-world tool use through the Model Context Protocol (MCP)
