KQL Benchmark Dashboard

A comprehensive AI evaluation framework that tests large language models' ability to generate cybersecurity detection rules (KQL queries) against real-world attack scenarios.


Model Performance Comparison

| Model | Score | Avg. Attempts | Avg. Time | Cost |
|---|---|---|---|---|
| o1-low | 63.3% | 2.60 | 37.90s | $93.89 |
| o1-high | 63.3% | 2.71 | 40.35s | $98.50 |
| gpt-5-high | 63.3% | 2.78 | 141.28s | $28.74 |
| gpt-4.1 | 61.7% | 2.74 | 6.93s | $5.36 |
| grok-3-mini-beta | 58.5% | 2.53 | 16.55s | $0.75 |
| o3-high | 54.8% | 3.04 | 51.90s | $11.88 |
| o3-mini-low | 51.6% | 2.85 | 23.32s | $5.24 |
| o3-mini-high | 51.6% | 2.79 | 21.35s | $4.92 |
| gemini-2.5-flash-preview-04-17 | 51.1% | 2.88 | 13.56s | $3.82 |
| o4-mini-high | 51.1% | 3.26 | 39.39s | $6.05 |
| grok-3-beta | 48.9% | 3.03 | 9.99s | $12.07 |
| gpt-5-mini-high | 48.4% | 3.25 | 25.04s | $2.82 |
| gpt-5-mini-low | 46.0% | 3.65 | 28.68s | $2.73 |
| gpt-5-mini-medium | 45.5% | 3.50 | 25.72s | $2.81 |
| o4-mini-low | 43.1% | 3.46 | 36.72s | $5.84 |
| gpt-4.1-mini | 41.5% | 3.20 | 7.29s | $1.08 |
| gpt-4-turbo-2024-04-09 | 39.4% | 3.52 | 7.79s | $32.66 |
| gpt-4o | 37.8% | 3.46 | 7.15s | $8.15 |
| gpt-5-nano-high | 30.3% | 3.82 | 23.40s | $1.30 |
| gpt-4.1-finetuned | 26.1% | 4.22 | 9.90s | $7.78 |
| gpt-4.1-nano | 24.5% | 4.01 | 3.67s | $0.27 |
| gpt-5-nano-medium | 23.8% | 4.03 | 20.05s | $1.29 |
| gpt-35-turbo | 17.0% | 4.07 | 1.24s | $1.75 |

Performance vs. Cost Analysis

Performance Over Time

Dive Deeper into the Benchmark

Explore the full methodology, detailed per-model analysis, and the complete dataset of cybersecurity scenarios used to evaluate AI performance.


© 2025 KQLBench. All rights reserved.