From the Lab

Research that Ships

Published artifacts from HFC built with Scale research teams and collaborators across the industry's top labs.

Fellowship work intended to be used, cited, and extended.

Published in Nature, January 2026

A benchmark of expert-level academic questions to assess AI capabilities

Humanity's Last Exam is a benchmark designed to measure AI performance against expert-level academic questions across disciplines.

An HFC-guided research artifact built with collaborators across frontier AI and evaluation.

ASPI: Prompt Injection in LLM Agents

Benchmark of 728 task–attack scenarios testing whether clarification-seeking agents become more vulnerable to prompt injection.

Why it matters: Standard execution-time security evals underestimate the attack surface of interactive agents.

SciPredict

Experimental benchmark evaluating how models move real science forward without breaking the constraints of the physical world.

Why it matters: AI lab partners need grounding in actual scientific practice.

MoReBench: Moral Reasoning in LLMs

1,000 moral scenarios with expert rubrics for identifying considerations, weighing trade-offs, and recommending action.

Why it matters: Math and code benchmark scores fail to predict moral reasoning, and models favor specific ethical frameworks.

Professional Reasoning Benchmark

Benchmark for professional-grade reasoning in high-stakes domains.

Why it matters: Defines a bar that models can't clear with surface fluency alone.

Research Rubrics: Evaluating Deep Research Agents

Rubrics for judging research agents on real research behaviors.

Why it matters: Standards make research quality measurable and repeatable.

Best Practices for Biorisk Evaluations

Method guidance for high-consequence bio evaluation.

Why it matters: Risk work needs defensible methods, not vibes.

Building Autoraters for Expert-Level Reasoning Data

Autoraters that scale expert judgment into consistent signals.

Why it matters: Reliable measurement is what lets research compound.

MultiChallenge: Multi-Turn Conversation Evaluation

Benchmark for multi-turn robustness under realistic conversation pressure.

Why it matters: Failures often appear after turn three.

The Research Pathway

Kinds of Contribution

Most Fellows enter research through applied work. A select group goes on to shape studies and, sometimes, to co-author full papers.

Contribution I

Co-Authoring Research Papers

For a subset of projects, Fellows become named authors on full papers and long-form reports.

Contribution II

Design Research Projects and Lab Advisory

Join small working groups that scope questions, choose metrics, and advise Scale Research on study design.

Contribution III

Research Style Review, Rubrics, and Evaluation

Help grade model behavior, stress-test benchmarks, and refine rubrics so results stand up to real scrutiny.

Results from the Field

Scale Research turns domain expertise into published benchmarks, leaderboards, and evaluation methods that shape how the industry measures frontier AI.

Selected Collaborators

Center for AI Safety, Generative AI Labs, Industry Partners, and more

Frontier Leaderboards

View all leaderboards on Scale SEAL

SciPredict

Forecasting scientific experiment outcomes

gemini-3-pro-preview: 25.27 ± 1.92

claude-opus-4-5-20251101: 23.05 ± 0.51

claude-opus-4-1-20250805: 22.22 ± 1.48

Professional Reasoning Benchmark - Finance

Evaluating Professional Reasoning in Finance

Professional Reasoning Benchmark - Legal

Evaluating Professional Reasoning in Legal Practice

Humanity's Last Exam

Challenging LLMs at the frontier of human knowledge

gemini-3-pro-preview (New): 37.52 ± 1.90

gpt-5-pro-2025-10-06: 31.64 ± 1.82

gpt-5.2-2025-12-11 (New): 27.80 ± 1.76

Humanity's Last Exam (Text Only)

Challenging LLMs at the frontier of human knowledge

gemini-3-pro-preview (New): 37.72 ± 2.04

gpt-5-pro-2025-10-06: 33.32 ± 1.99

gpt-5.2-2025-12-11 (New): 28.50 ± 1.90

MCP Atlas

Evaluating real-world tool use through the Model Context Protocol (MCP)

claude-opus-4-5-20251101: 62.30 ± 1.76

gpt-5.2-2025-12-11 (New): 60.57 ± 1.62

gemini-3-flash-preview (New): 57.40 ± 1.48
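When reading leaderboard entries like the MCP Atlas scores above, it can help to check whether two models' ranges overlap before treating a rank difference as meaningful. The sketch below is illustrative only: the page does not say what the "±" value represents, so the code assumes it is the half-width of a symmetric interval around the score. The names `Score` and `intervals_overlap` are hypothetical helpers, not part of any Scale tooling.

```python
from typing import NamedTuple


class Score(NamedTuple):
    model: str
    mean: float
    margin: float  # the "±" value; ASSUMED to be a symmetric interval half-width

    @property
    def low(self) -> float:
        return self.mean - self.margin

    @property
    def high(self) -> float:
        return self.mean + self.margin


def intervals_overlap(a: Score, b: Score) -> bool:
    """Return True when the two [low, high] ranges share at least one point."""
    return a.low <= b.high and b.low <= a.high


# MCP Atlas entries as listed on this page.
mcp_atlas = [
    Score("claude-opus-4-5-20251101", 62.30, 1.76),
    Score("gpt-5.2-2025-12-11", 60.57, 1.62),
    Score("gemini-3-flash-preview", 57.40, 1.48),
]

# Compare each model with the one ranked directly below it.
for upper, lower in zip(mcp_atlas, mcp_atlas[1:]):
    verdict = "overlap" if intervals_overlap(upper, lower) else "are separated"
    print(f"{upper.model} vs {lower.model}: ranges {verdict}")
```

Under that assumption, the first and second entries have overlapping ranges while the second and third do not, so the gap between first and second place should be read with more caution. Overlap is only a rough screen and is not a substitute for a proper significance test.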
