CRUST-bench Leaderboard

CRUST-bench is a benchmark that measures model performance on the C-to-Rust translation task.

Please see our blog post for a more detailed description.

Comparison of build and test success rates across different repair strategies.

Model	Pass@1 Build	Pass@1 Test	Compiler repair Build	Compiler repair Test	Test repair Build	Test repair Test
gpt-5 (high)	48	26	92	43	85	70
gemini-3-pro-preview	48	23	88	41	83	66
claude-4.5-opus	47	26	61	32	59	42
claude-opus-4-20250514	43	22	78	29	65	40
o3-2025-04-16	35	19	68	31	63	48
o1-preview-2024-09-12	32	15	69	28	54	37
claude-3.7-sonnet-20250219	26	13	54	23	49	32
claude-3.5-sonnet-20240620	26	11	49	21	38	24
o1-mini-2024-09-12	19	9	47	16	27	21
gpt-4o	18	7	52	18	42	22
gemini-1.5-pro	11	3	35	11	30	14
arcee-ai/Virtuoso-Medium-v2	2	2	21	6	10	6
Qwen/Qwen-Coder-32B	0	0	0	0	1	0
DeepSeek/DeepSeek-Coder-33B	1	0	2	0	1	0
Qwen/QwQ-32B-Preview	1	0	1	0	1	0
Adapted SWE-agent (claude-3-7-sonnet-20250219)	41	32	–	–	–	–
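
To make the effect of each repair strategy easier to read, the sketch below computes how far the Test success rate moves from Pass@1 under compiler repair and under test repair. The four rows are copied from the table above. The `ROWS` layout and the `repair_gains` helper are illustrative names only and are not part of CRUST-bench. The gains are in the same units as the table.

```python
# Rows copied from the leaderboard above:
# (model, Pass@1 Test, Compiler repair Test, Test repair Test)
ROWS = [
    ("gpt-5 (high)", 26, 43, 70),
    ("gemini-3-pro-preview", 23, 41, 66),
    ("claude-4.5-opus", 26, 32, 42),
    ("o3-2025-04-16", 19, 31, 48),
]


def repair_gains(rows):
    """Map each model to (compiler-repair gain, test-repair gain) over Pass@1.

    Hypothetical helper for illustration; gains are plain differences
    between the Test columns of the table.
    """
    return {
        model: (compiler - base, test - base)
        for model, base, compiler, test in rows
    }


gains = repair_gains(ROWS)
for model, (compiler_gain, test_gain) in gains.items():
    print(f"{model}: +{compiler_gain} (compiler repair), +{test_gain} (test repair)")
```

For these four rows, test repair adds more to the Test success rate than compiler repair does.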

Benchmark Details