// kql benchmark

Which model writes the best KQL?

We score frontier models on 188 real natural-language threat-hunting prompts — measuring detection accuracy against cost and latency. The leaderboard below is the result.

| benchmark.kql · 14 models · 188 questions

Benchmarks
| where task == "natural-language → KQL"
| summarize accuracy, cost, latency by model
| order by accuracy desc

▸ results · ordered by accuracy

// leaderboard

#	model	accuracy		latency	cost
01	o1-low	63.3%	2.60	37.90s	$93.89
02	o1-high	63.3%	2.71	40.35s	$98.50
03	gpt-5-high	63.3%	2.78	141.28s	$28.74
04	gpt-4.1	61.7%	2.74	6.93s	$5.36
05	grok-3-mini-beta	58.5%	2.53	16.55s	$0.75
06	o3-high	54.8%	3.04	51.90s	$11.88
07	o3-mini-low	51.6%	2.85	23.32s	$5.24
08	o3-mini-high	51.6%	2.79	21.35s	$4.92
09	gemini-2.5-flash-preview-04-17	51.1%	2.88	13.56s	$3.82
10	o4-mini-high	51.1%	3.26	39.39s	$6.05
11	grok-3-beta	48.9%	3.03	9.99s	$12.07
12	gpt-5-mini-high	48.4%	3.25	25.04s	$2.82
13	gpt-5-mini-low	46.0%	3.65	28.68s	$2.73
14	gpt-5-mini-medium	45.5%	3.50	25.72s	$2.81
15	o4-mini-low	43.1%	3.46	36.72s	$5.84
16	gpt-4.1-mini	41.5%	3.20	7.29s	$1.08
17	gpt-4-turbo-2024-04-09	39.4%	3.52	7.79s	$32.66
18	gpt-4o	37.8%	3.46	7.15s	$8.15
19	gpt-5-nano-high	30.3%	3.82	23.40s	$1.30
20	gpt-4.1-finetuned	26.1%	4.22	9.90s	$7.78
21	gpt-4.1-nano	24.5%	4.01	3.67s	$0.27
22	gpt-5-nano-medium	23.8%	4.03	20.05s	$1.29
23	gpt-35-turbo	17.0%	4.07	1.24s	$1.75

// accuracy vs. cost

Up and to the left is the sweet spot — high detection accuracy for less spend. Cost uses a log scale.
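
One way to make the "sweet spot" concrete is to compute which models sit on the accuracy/cost Pareto frontier, meaning no other model is both at least as accurate and at least as cheap. The sketch below is only an illustration, not part of the benchmark's own tooling: the `ROWS` list and the `pareto_frontier` helper are hypothetical names, and the figures are copied from the accuracy and cost columns of the leaderboard above.

```python
# (model, accuracy in %, cost in USD), copied from the leaderboard rows above.
ROWS = [
    ("o1-low", 63.3, 93.89),
    ("o1-high", 63.3, 98.50),
    ("gpt-5-high", 63.3, 28.74),
    ("gpt-4.1", 61.7, 5.36),
    ("grok-3-mini-beta", 58.5, 0.75),
    ("o3-high", 54.8, 11.88),
    ("o3-mini-low", 51.6, 5.24),
    ("o3-mini-high", 51.6, 4.92),
    ("gemini-2.5-flash-preview-04-17", 51.1, 3.82),
    ("o4-mini-high", 51.1, 6.05),
    ("grok-3-beta", 48.9, 12.07),
    ("gpt-5-mini-high", 48.4, 2.82),
    ("gpt-5-mini-low", 46.0, 2.73),
    ("gpt-5-mini-medium", 45.5, 2.81),
    ("o4-mini-low", 43.1, 5.84),
    ("gpt-4.1-mini", 41.5, 1.08),
    ("gpt-4-turbo-2024-04-09", 39.4, 32.66),
    ("gpt-4o", 37.8, 8.15),
    ("gpt-5-nano-high", 30.3, 1.30),
    ("gpt-4.1-finetuned", 26.1, 7.78),
    ("gpt-4.1-nano", 24.5, 0.27),
    ("gpt-5-nano-medium", 23.8, 1.29),
    ("gpt-35-turbo", 17.0, 1.75),
]


def pareto_frontier(rows):
    """Return the rows that no other row beats on both accuracy and cost."""
    frontier = []
    best_accuracy = float("-inf")
    # Sweep from cheapest to most expensive; on a cost tie, try higher accuracy first.
    for model, accuracy, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        # A row is kept only if it is more accurate than every cheaper row.
        if accuracy > best_accuracy:
            frontier.append((model, accuracy, cost))
            best_accuracy = accuracy
    return frontier


frontier = pareto_frontier(ROWS)
for model, accuracy, cost in frontier:
    print(f"{model:<20} {accuracy:5.1f}%  ${cost:.2f}")
```

On the rows listed above, this sweep keeps gpt-4.1-nano, grok-3-mini-beta, gpt-4.1, and gpt-5-high. Every other model is matched or beaten on accuracy by something cheaper.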

// accuracy over time

How model accuracy on KQL has tracked with release date.

// go deeper

See how the benchmark is built

Read the methodology behind the scores, or browse the full set of natural-language threat-hunting scenarios models are tested on.
