How the KQL Benchmark Works

A comprehensive evaluation framework that tests AI models' ability to translate analyst questions about real-world attack scenarios into effective KQL detection queries

188 Test Scenarios

Curated from Atomic Red Team tests covering real cybersecurity threats

AI-Powered Evaluation

Tests 14+ language models on natural-language-to-KQL translation

Human-Validated

Three-step validation process ensures benchmark quality and reliability

1

Test Selection & Question Generation

Atomic Red Team Foundation

Our benchmark starts with 2,253 Atomic Red Team tests: real-world attack simulations mapped to MITRE ATT&CK techniques. These tests emulate actual cybersecurity threats across different platforms and attack techniques.

Windows Platform · Linux Platform · MITRE ATT&CK
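
The test inventory can be enumerated straight from the public Atomic Red Team repository. The sketch below is a minimal illustration rather than the benchmark's actual tooling: the clone path and platform filter are assumptions, and the field names follow the published atomics/ YAML format.

```python
# Sketch: enumerate Atomic Red Team tests that could feed the benchmark.
# Assumes a local clone of https://github.com/redcanaryco/atomic-red-team;
# the clone path and platform filter are illustrative assumptions.
from pathlib import Path
import yaml

ATOMICS_DIR = Path("atomic-red-team/atomics")   # assumed clone location
PLATFORMS = {"windows", "linux"}                # platforms used in the benchmark

def load_candidate_tests(atomics_dir: Path = ATOMICS_DIR):
    """Yield (technique_id, test_name, platforms) for tests on supported platforms."""
    for yaml_file in sorted(atomics_dir.glob("T*/T*.yaml")):
        doc = yaml.safe_load(yaml_file.read_text(encoding="utf-8"))
        technique = doc.get("attack_technique", yaml_file.stem)
        for test in doc.get("atomic_tests", []):
            platforms = set(test.get("supported_platforms", []))
            if platforms & PLATFORMS:
                yield technique, test["name"], platforms

if __name__ == "__main__":
    tests = list(load_candidate_tests())
    print(f"{len(tests)} Windows/Linux candidate tests")
```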

AI-Generated Questions

Large Language Models automatically generate realistic analyst-level questions based on each test scenario. This creates natural language queries that security professionals would actually ask when investigating threats.

“A reconnaissance tool was executed on a Windows system. Identify the specific function of the tool executed...”
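
A minimal sketch of this question-generation step, assuming an OpenAI-style chat model; the prompt wording, model name, and the generate_question helper are illustrative, not the benchmark's exact implementation.

```python
# Illustrative sketch of generating an analyst-level question from a test
# description. Prompt text and model choice are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a SOC analyst. Given the description of an executed attack "
    "simulation, write one natural-language investigation question that an "
    "analyst would answer by querying the collected logs with KQL. Do not "
    "mention the test framework or reveal the expected answer."
)

def generate_question(technique_id: str, test_name: str, description: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model works for this step
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{technique_id} - {test_name}\n\n{description}"},
        ],
    )
    return response.choices[0].message.content.strip()
```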

2

Log Collection & Environment Setup

Controlled Test Environment

  • Isolated Windows and Linux virtual machines
  • Microsoft Defender for comprehensive logging
  • Real-time protection disabled so simulated attacks are logged rather than blocked

Realistic Data Collection

When tests execute, we collect all logs from the environment, not just those generated by the malicious activity. This creates realistic noise levels that mirror real-world security operations centers.

💡 This “needle in the haystack” approach ensures AI models are tested under realistic conditions
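
One way to see the size of that haystack is to count every record the workspace ingests around a test run. The sketch below assumes the VM and Defender telemetry lands in an Azure Monitor Log Analytics workspace and uses the azure-monitor-query client; the workspace ID is a placeholder.

```python
# Sketch: measure the log "haystack" around a test run by counting records
# per source table in the Log Analytics workspace. Workspace ID is a placeholder.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder
client = LogsQueryClient(DefaultAzureCredential())

# Count every record ingested in the last hour, grouped by source table, so the
# benign background noise is visible alongside the simulated attack activity.
NOISE_QUERY = """
union withsource=SourceTable *
| summarize Events = count() by SourceTable
| order by Events desc
"""

response = client.query_workspace(WORKSPACE_ID, NOISE_QUERY, timespan=timedelta(hours=1))
if response.status == LogsQueryStatus.SUCCESS:
    for table in response.tables:
        for row in table.rows:
            print(f"{row[0]}: {row[1]} events")
```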

3

AI Model Evaluation Process

Query Generation

AI models receive natural language questions and generate KQL queries

Real-time Execution

Generated queries run against actual log data in Azure Log Analytics

Iterative Refinement

Models get up to 5 attempts to self-correct and find the right answer
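
Put together, the evaluation loop for a single question looks roughly like the sketch below. The generate_kql, run_kql, and is_correct callables are hypothetical stand-ins for the model call, the Log Analytics execution (as in the earlier sketch), and the ground-truth comparison.

```python
# Minimal sketch of the per-question evaluation loop: generate a KQL query,
# execute it, and feed errors or wrong results back to the model, up to
# MAX_ATTEMPTS times. The helper callables are hypothetical stand-ins.
from typing import Any, Callable, Tuple

MAX_ATTEMPTS = 5

def evaluate_question(
    question: str,
    expected_answer: str,
    generate_kql: Callable[[str, str], str],     # (question, feedback) -> KQL text
    run_kql: Callable[[str], Tuple[Any, str]],   # KQL -> (result, error_message)
    is_correct: Callable[[Any, str], bool],      # compares result to ground truth
) -> dict:
    feedback = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        kql = generate_kql(question, feedback)
        result, error = run_kql(kql)
        if error:
            feedback = f"The previous query failed with: {error}. Fix the query."
            continue
        if is_correct(result, expected_answer):
            return {"solved": True, "attempts": attempt}
        feedback = (
            "The previous query ran but its results did not answer the question. "
            f"Results were: {result}"
        )
    return {"solved": False, "attempts": MAX_ATTEMPTS}
```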

What We Measure (see the sketch after this list):

  • Success Rate: Percentage of correct answers
  • Average Attempts: How many tries to succeed
  • Execution Time: Speed of query generation
  • Cost Analysis: API usage costs per model
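
A rough sketch of how those per-question records roll up into the reported metrics; field names mirror the evaluation-loop sketch above, and per-call timing and cost are assumed to be recorded alongside each result.

```python
# Aggregate per-question results into the benchmark's headline metrics.
# "seconds" and "cost_usd" are assumed to be captured with each result.
def summarize(results: list[dict]) -> dict:
    solved = [r for r in results if r["solved"]]
    return {
        "success_rate": len(solved) / len(results),
        "avg_attempts": sum(r["attempts"] for r in solved) / max(len(solved), 1),
        "avg_exec_seconds": sum(r.get("seconds", 0.0) for r in results) / len(results),
        "total_cost_usd": sum(r.get("cost_usd", 0.0) for r in results),
    }
```
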
4

Three-Step Human Validation

Step 1: Spot Check

Manual review of 38 representative questions to verify AI-generated queries return correct results

Step 2: Unsolved Review

Examination of all questions no AI model could solve to identify and remove “poisoned” or unsolvable tests

Step 3: Cross-Validation

Comprehensive review using a dashboard to catch ambiguous questions with multiple valid answers

Quality Assurance Results:

📊 Final Dataset: 188 high-quality, validated test cases

🎯 Removed: 48 ambiguous or problematic questions

✅ Verification: 100% accuracy in spot-check validation

🔍 Transparency: All validation steps documented

Benchmark Impact & Results

Our rigorous methodology provides the cybersecurity community with reliable, actionable insights into AI model capabilities for threat detection automation.

  • Top Model Success Rate (O1-high): 63.3%
  • AI Models Evaluated: 14+
  • Total Benchmark Cost: $254
  • Average Execution Time: 6.93s

Ready to Explore the Results?

Dive into our interactive dashboard to compare model performance, analyze cost-effectiveness, and explore detailed results from all 188 benchmark scenarios.

View Benchmark Results