A comprehensive evaluation framework that tests AI models' ability to generate effective cybersecurity detection rules using real-world attack scenarios
Curated from Atomic Red Team tests covering real cybersecurity threats
Tests 14+ language models on natural-language-to-KQL translation
Three-step validation process ensures benchmark quality and reliability
Our benchmark starts with 2,253 Atomic Red Team tests: real-world attack simulations mapped to the MITRE ATT&CK framework. The tests span multiple platforms and attack techniques, emulating the threats security teams actually face.
Large language models automatically generate realistic, analyst-level questions from each test scenario, producing the kind of natural-language questions security professionals would actually ask when investigating a threat.
“A reconnaissance tool was executed on a Windows system. Identify the specific function of the tool executed...”
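A minimal sketch of how this question-generation step might work is shown below. The prompt wording, model choice, and test fields are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch only: the prompt, model name, and test fields are
# assumptions, not the benchmark's real question-generation pipeline.
from openai import OpenAI

client = OpenAI()

def generate_analyst_question(atomic_test: dict) -> str:
    """Turn an Atomic Red Team test definition into an analyst-style question."""
    prompt = (
        "You are a SOC analyst. Based on the following attack simulation, "
        "write a natural-language investigation question that does not reveal "
        "the exact command or tool name.\n\n"
        f"Technique: {atomic_test['attack_technique']}\n"
        f"Name: {atomic_test['name']}\n"
        f"Description: {atomic_test['description']}\n"
        f"Command: {atomic_test['executor']['command']}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; the benchmark may use a different one
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```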
When tests execute, we collect all logs from the environment, not just those generated by the malicious activity. This creates realistic noise levels that mirror real-world security operations centers.
💡 This “needle in the haystack” approach ensures AI models are tested under realistic conditions
AI models receive natural language questions and generate KQL queries
Generated queries run against actual log data in Azure Log Analytics
Models get up to 5 attempts to self-correct and find the right answer (a sketch of this evaluation loop follows below)
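The sketch below shows how such a generate-execute-retry loop could be wired up against Azure Log Analytics using the azure-monitor-query SDK. The model wrapper, feedback format, and scoring details are assumptions rather than the benchmark's exact harness.

```python
# Illustrative sketch of the generate -> execute -> self-correct loop.
# `model.generate_kql` is a hypothetical wrapper around the LLM under test;
# the real harness may differ in prompts, grading, and error handling.
from datetime import timedelta

from azure.core.exceptions import HttpResponseError
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

logs_client = LogsQueryClient(DefaultAzureCredential())

def evaluate_question(model, question: str, workspace_id: str, max_attempts: int = 5):
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        kql = model.generate_kql(question=question, feedback=feedback)
        try:
            result = logs_client.query_workspace(
                workspace_id, kql, timespan=timedelta(days=30)
            )
            if result.status == LogsQueryStatus.SUCCESS and result.tables[0].rows:
                return {"attempts": attempt, "query": kql, "rows": result.tables[0].rows}
            feedback = f"The query ran but returned no usable rows:\n{kql}"
        except HttpResponseError as err:
            # Syntax or semantic errors are fed back so the model can self-correct.
            feedback = f"The query failed with: {err.message}\n{kql}"
    return None  # question unsolved after all attempts
```

A real harness would also grade whether the returned rows actually answer the question; the simple non-empty check above is only a stand-in for that step.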
Manual review of 38 representative questions to verify AI-generated queries return correct results
Examination of every question that no AI model could solve, to identify and remove “poisoned” or unsolvable tests (a sketch of this filter appears after this list)
Comprehensive review using a dashboard to catch ambiguous questions with multiple valid answers
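As a rough illustration of the second step, the snippet below flags questions that no model solved, assuming per-attempt results have been exported to a table; the file name and column names are hypothetical.

```python
# Hypothetical results format: one row per (model, question) attempt with a
# boolean `solved` column; column and file names are illustrative only.
import pandas as pd

results = pd.read_csv("benchmark_results.csv")

# A question is a candidate "poisoned"/unsolvable test if no model ever solved it.
solved_by_any = results.groupby("question_id")["solved"].any()
flagged = solved_by_any[~solved_by_any].index.tolist()

print(f"{len(flagged)} questions flagged for manual review")
```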
📊 Final Dataset: 188 high-quality, validated test cases
🎯 Removed: 48 ambiguous or problematic questions
✅ Verification: 100% accuracy in spot-check validation
🔍 Transparency: All validation steps documented
Our rigorous methodology provides the cybersecurity community with reliable, actionable insights into AI model capabilities for threat detection automation.
Dive into our interactive dashboard to compare model performance, analyze cost-effectiveness, and explore detailed results from all 188 benchmark scenarios.
View Benchmark Results