KQL Benchmark Dashboard

A comprehensive AI evaluation framework that tests large language models' ability to generate cybersecurity detection rules (KQL queries) against real-world attack scenarios.


Model Performance Comparison

| Model | Score | Avg. Attempts | Avg. Time | Cost |
|---|---|---|---|---|
| o1-low | 63.3% | 2.60 | 37.90s | $93.89 |
| o1-high | 63.3% | 2.71 | 40.35s | $98.50 |
| gpt-5-high | 63.3% | 2.78 | 141.28s | $28.74 |
| gpt-4.1 | 61.7% | 2.74 | 6.93s | $5.36 |
| grok-3-mini-beta | 58.5% | 2.53 | 16.55s | $0.75 |
| o3-high | 54.8% | 3.04 | 51.90s | $11.88 |
| o3-mini-low | 51.6% | 2.85 | 23.32s | $5.24 |
| o3-mini-high | 51.6% | 2.79 | 21.35s | $4.92 |
| gemini-2.5-flash-preview-04-17 | 51.1% | 2.88 | 13.56s | $3.82 |
| o4-mini-high | 51.1% | 3.26 | 39.39s | $6.05 |
| grok-3-beta | 48.9% | 3.03 | 9.99s | $12.07 |
| gpt-5-mini-high | 48.4% | 3.25 | 25.04s | $2.82 |
| gpt-5-mini-low | 46.0% | 3.65 | 28.68s | $2.73 |
| gpt-5-mini-medium | 45.5% | 3.50 | 25.72s | $2.81 |
| o4-mini-low | 43.1% | 3.46 | 36.72s | $5.84 |
| gpt-4.1-mini | 41.5% | 3.20 | 7.29s | $1.08 |
| gpt-4-turbo-2024-04-09 | 39.4% | 3.52 | 7.79s | $32.66 |
| gpt-4o | 37.8% | 3.46 | 7.15s | $8.15 |
| gpt-5-nano-high | 30.3% | 3.82 | 23.40s | $1.30 |
| gpt-4.1-finetuned | 26.1% | 4.22 | 9.90s | $7.78 |
| gpt-4.1-nano | 24.5% | 4.01 | 3.67s | $0.27 |
| gpt-5-nano-medium | 23.8% | 4.03 | 20.05s | $1.29 |
| gpt-35-turbo | 17.0% | 4.07 | 1.24s | $1.75 |

Performance vs. Cost Analysis

Performance Over Time

Dive Deeper into the Benchmark

Explore the full methodology, detailed per-model analysis, and the complete dataset of cybersecurity scenarios used to evaluate AI performance.


© 2025 KQLBench. All rights reserved.