How the KQL Benchmark Works
A comprehensive evaluation framework that tests AI models' ability to generate effective cybersecurity detection rules using real-world attack scenarios
- Curated from Atomic Red Team tests covering real cybersecurity threats
- Tests 14+ language models on natural-language-to-KQL translation
- Three-step validation process ensures benchmark quality and reliability
Test Selection & Question Generation
Atomic Red Team Foundation
Our benchmark starts with 2,253 Atomic Red Team tests: real-world attack simulations mapped to the MITRE ATT&CK framework. These tests emulate genuine adversary behavior across multiple platforms and attack techniques.
AI-Generated Questions
Large language models automatically generate realistic, analyst-level questions from each test scenario, yielding the kind of natural-language questions security professionals would actually ask when investigating threats. For example:
“A reconnaissance tool was executed on a Windows system. Identify the specific function of the tool executed...”
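For illustration, a question like this can be derived automatically from a test's own metadata. The sketch below assumes the public Atomic Red Team YAML schema and uses the OpenAI Python client; the prompt wording and model name are illustrative assumptions, not the benchmark's actual tooling.

```python
# A minimal sketch of the question-generation step (illustrative, not the
# benchmark's real prompts or pipeline).
import yaml                # PyYAML
from openai import OpenAI  # official OpenAI Python client


def build_prompt(test: dict, technique_id: str) -> str:
    """Turn one Atomic Red Team test into an analyst-style question prompt."""
    return (
        "You are a SOC analyst writing an investigation question.\n"
        f"MITRE ATT&CK technique: {technique_id}\n"
        f"Test name: {test['name']}\n"
        f"Description: {test['description']}\n"
        f"Command executed: {test.get('executor', {}).get('command', 'n/a')}\n"
        "Write one natural-language question an analyst would ask to find evidence "
        "of this activity in the collected logs, without naming the test itself."
    )


def generate_question(yaml_path: str) -> str:
    """Generate a single analyst-level question for the first test in an Atomic YAML file."""
    with open(yaml_path) as f:
        atomic = yaml.safe_load(f)
    prompt = build_prompt(atomic["atomic_tests"][0], atomic["attack_technique"])
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```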
Log Collection & Environment Setup
Controlled Test Environment
- Isolated Windows and Linux virtual machines
- Microsoft Defender deployed for comprehensive logging
- Real-time protection disabled to avoid interfering with test execution
Realistic Data Collection
When a test executes, we collect every log the environment produces, not just those generated by the malicious activity. This creates realistic noise levels that mirror a real-world security operations center.
💡 This “needle in the haystack” approach ensures AI models are tested under realistic conditions
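To get a feel for that haystack, a query along these lines can tally how many events each table collected around a test run. The table names are an assumption that the workspace ingests the standard Microsoft Defender for Endpoint tables; the benchmark environment may collect additional or different sources.

```python
# Illustrative only: count collected events per table to gauge the noise
# surrounding each test. Table names assume the standard Microsoft Defender
# for Endpoint tables in Azure Log Analytics.
NOISE_OVERVIEW_KQL = """
union withsource=SourceTable
    DeviceProcessEvents, DeviceNetworkEvents, DeviceFileEvents, DeviceRegistryEvents
| where TimeGenerated > ago(1d)
| summarize Events = count() by SourceTable
| order by Events desc
"""
```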
AI Model Evaluation Process
Query Generation
AI models receive the natural-language questions and generate candidate KQL queries
Real-time Execution
Generated queries run against actual log data in Azure Log Analytics
Iterative Refinement
Models get up to five attempts to self-correct and find the right answer, as sketched in the example below
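Conceptually, the loop looks something like the sketch below. It assumes a hypothetical `ask_model` helper wrapping whichever model is under test, a Log Analytics workspace reachable via `DefaultAzureCredential`, and a deliberately naive substring check against a known expected answer; the benchmark's real prompts, scoring, and retry policy are more involved.

```python
# A minimal sketch of the generate -> execute -> refine loop for one question.
from datetime import timedelta

from azure.core.exceptions import HttpResponseError
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

MAX_ATTEMPTS = 5  # each model gets up to five self-correction attempts


def ask_model(question: str, feedback: str | None = None) -> str:
    """Hypothetical wrapper around the model under test; returns a KQL query string."""
    raise NotImplementedError


def evaluate_question(question: str, expected_answer: str, workspace_id: str) -> dict:
    """Run the generate/execute/refine loop for a single benchmark question."""
    logs = LogsQueryClient(DefaultAzureCredential())
    feedback = None
    kql = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        kql = ask_model(question, feedback)
        try:
            result = logs.query_workspace(workspace_id, kql, timespan=timedelta(days=1))
        except HttpResponseError as err:  # malformed or semantically invalid KQL
            feedback = f"The query failed to execute: {err}"
            continue
        if result.status != LogsQueryStatus.SUCCESS or not result.tables[0].rows:
            feedback = "The query executed but returned no rows."
            continue
        # Naive scoring for illustration: look for the expected answer in the result set.
        rows = result.tables[0].rows
        if any(expected_answer.lower() in str(cell).lower() for row in rows for cell in row):
            return {"solved": True, "attempts": attempt, "query": kql}
        feedback = "The query returned rows, but not the expected evidence."
    return {"solved": False, "attempts": MAX_ATTEMPTS, "query": kql}
```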
What We Measure:
- Success Rate: Percentage of correct answers
- Average Attempts: How many tries to succeed
- Execution Time: Speed of query generation
- Cost Analysis: API usage costs per model (see the sketch after this list)
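As a rough sketch, these metrics can be aggregated from per-question records like those produced by the loop above; field names such as `latency_s` and `cost_usd` are illustrative, not the benchmark's actual schema.

```python
# Illustrative aggregation of per-question evaluation records.
from statistics import mean


def summarize(results: list[dict]) -> dict:
    """Aggregate per-question records into the headline benchmark metrics."""
    solved = [r for r in results if r["solved"]]
    return {
        "success_rate": len(solved) / len(results),                              # share answered correctly
        "avg_attempts": mean(r["attempts"] for r in solved) if solved else None, # tries needed when a model succeeds
        "avg_latency_s": mean(r["latency_s"] for r in results),                  # time spent generating queries
        "total_cost_usd": sum(r["cost_usd"] for r in results),                   # API spend across all questions
    }
```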
Three-Step Human Validation
1. Manual review of 38 representative questions to verify AI-generated queries return correct results
2. Examination of all questions no AI model could solve to identify and remove “poisoned” or unsolvable tests
3. Comprehensive review using a dashboard to catch ambiguous questions with multiple valid answers
Quality Assurance Results:
📊 Final Dataset: 188 high-quality, validated test cases
🎯 Removed: 48 ambiguous or problematic questions
✅ Verification: 100% accuracy in spot-check validation
🔍 Transparency: All validation steps documented
Benchmark Impact & Results
Our rigorous methodology provides the cybersecurity community with reliable, actionable insights into AI model capabilities for threat detection automation.
Ready to Explore the Results?
Dive into our interactive dashboard to compare model performance, analyze cost-effectiveness, and explore detailed results from all 188 benchmark scenarios.
View Benchmark Results