How the KQL Benchmark Works

A comprehensive evaluation framework that tests AI models' ability to translate analyst questions about real-world attack scenarios into effective KQL detection queries

188 Test Scenarios

Curated from Atomic Red Team tests covering real cybersecurity threats

AI-Powered Evaluation

Tests 14+ language models on natural-language-to-KQL translation

Human-Validated

Three-step validation process ensures benchmark quality and reliability

1

Test Selection & Question Generation

Atomic Red Team Foundation

Our benchmark starts with 2,253 Atomic Red Team tests: real-world attack simulations mapped to MITRE ATT&CK techniques. These tests emulate actual cybersecurity threats across different platforms and attack techniques.

Windows Platform · Linux Platform · MITRE ATT&CK
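
The test inventory can be enumerated straight from the public Atomic Red Team repository. The sketch below is a minimal illustration rather than the benchmark's actual tooling: the clone path and platform filter are assumptions, and the field names follow the published atomics/ YAML format.

```python
# Sketch: enumerate Atomic Red Team tests that could feed the benchmark.
# Assumes a local clone of https://github.com/redcanaryco/atomic-red-team;
# the clone path and platform filter are illustrative assumptions.
from pathlib import Path
import yaml

ATOMICS_DIR = Path("atomic-red-team/atomics")   # assumed clone location
PLATFORMS = {"windows", "linux"}                # platforms used in the benchmark

def load_candidate_tests(atomics_dir: Path = ATOMICS_DIR):
    """Yield (technique_id, test_name, platforms) for tests on supported platforms."""
    for yaml_file in sorted(atomics_dir.glob("T*/T*.yaml")):
        doc = yaml.safe_load(yaml_file.read_text(encoding="utf-8"))
        technique = doc.get("attack_technique", yaml_file.stem)
        for test in doc.get("atomic_tests", []):
            platforms = set(test.get("supported_platforms", []))
            if platforms & PLATFORMS:
                yield technique, test["name"], platforms

if __name__ == "__main__":
    tests = list(load_candidate_tests())
    print(f"{len(tests)} Windows/Linux candidate tests")
```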

AI-Generated Questions

Large Language Models automatically generate realistic analyst-level questions based on each test scenario. This creates natural language queries that security professionals would actually ask when investigating threats.

“A reconnaissance tool was executed on a Windows system. Identify the specific function of the tool executed...”
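
A minimal sketch of this question-generation step, assuming an OpenAI-style chat model; the prompt wording, model name, and the generate_question helper are illustrative, not the benchmark's exact implementation.

```python
# Illustrative sketch of generating an analyst-level question from a test
# description. Prompt text and model choice are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a SOC analyst. Given the description of an executed attack "
    "simulation, write one natural-language investigation question that an "
    "analyst would answer by querying the collected logs with KQL. Do not "
    "mention the test framework or reveal the expected answer."
)

def generate_question(technique_id: str, test_name: str, description: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model works for this step
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{technique_id} - {test_name}\n\n{description}"},
        ],
    )
    return response.choices[0].message.content.strip()
```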

2

Log Collection & Environment Setup

Controlled Test Environment

  • Isolated Windows and Linux virtual machines
  • Microsoft Defender for comprehensive logging
  • Real-time protection disabled so simulated attacks are logged rather than blocked

Realistic Data Collection

When tests execute, we collect all logs from the environment, not just those generated by the malicious activity. This creates realistic noise levels that mirror real-world security operations centers.

💡 This “needle in the haystack” approach ensures AI models are tested under realistic conditions
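
One way to see the size of that haystack is to count every record the workspace ingests around a test run. The sketch below assumes the VM and Defender telemetry lands in an Azure Monitor Log Analytics workspace and uses the azure-monitor-query client; the workspace ID is a placeholder.

```python
# Sketch: measure the log "haystack" around a test run by counting records
# per source table in the Log Analytics workspace. Workspace ID is a placeholder.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder
client = LogsQueryClient(DefaultAzureCredential())

# Count every record ingested in the last hour, grouped by source table, so the
# benign background noise is visible alongside the simulated attack activity.
NOISE_QUERY = """
union withsource=SourceTable *
| summarize Events = count() by SourceTable
| order by Events desc
"""

response = client.query_workspace(WORKSPACE_ID, NOISE_QUERY, timespan=timedelta(hours=1))
if response.status == LogsQueryStatus.SUCCESS:
    for table in response.tables:
        for row in table.rows:
            print(f"{row[0]}: {row[1]} events")
```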

3

AI Model Evaluation Process

Query Generation

AI models receive natural language questions and generate KQL queries

Real-time Execution

Generated queries run against actual log data in Azure Log Analytics

Iterative Refinement

Models get up to 5 attempts to self-correct and find the right answer
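
Put together, the evaluation loop for a single question looks roughly like the sketch below. The generate_kql, run_kql, and is_correct callables are hypothetical stand-ins for the model call, the Log Analytics execution (as in the earlier sketch), and the ground-truth comparison.

```python
# Minimal sketch of the per-question evaluation loop: generate a KQL query,
# execute it, and feed errors or wrong results back to the model, up to
# MAX_ATTEMPTS times. The helper callables are hypothetical stand-ins.
from typing import Any, Callable, Tuple

MAX_ATTEMPTS = 5

def evaluate_question(
    question: str,
    expected_answer: str,
    generate_kql: Callable[[str, str], str],     # (question, feedback) -> KQL text
    run_kql: Callable[[str], Tuple[Any, str]],   # KQL -> (result, error_message)
    is_correct: Callable[[Any, str], bool],      # compares result to ground truth
) -> dict:
    feedback = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        kql = generate_kql(question, feedback)
        result, error = run_kql(kql)
        if error:
            feedback = f"The previous query failed with: {error}. Fix the query."
            continue
        if is_correct(result, expected_answer):
            return {"solved": True, "attempts": attempt}
        feedback = (
            "The previous query ran but its results did not answer the question. "
            f"Results were: {result}"
        )
    return {"solved": False, "attempts": MAX_ATTEMPTS}
```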

What We Measure (see the sketch after this list):

  • Success Rate: Percentage of correct answers
  • Average Attempts: How many tries to succeed
  • Execution Time: Speed of query generation
  • Cost Analysis: API usage costs per model
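
A rough sketch of how those per-question records roll up into the reported metrics; field names mirror the evaluation-loop sketch above, and per-call timing and cost are assumed to be recorded alongside each result.

```python
# Aggregate per-question results into the benchmark's headline metrics.
# "seconds" and "cost_usd" are assumed to be captured with each result.
def summarize(results: list[dict]) -> dict:
    solved = [r for r in results if r["solved"]]
    return {
        "success_rate": len(solved) / len(results),
        "avg_attempts": sum(r["attempts"] for r in solved) / max(len(solved), 1),
        "avg_exec_seconds": sum(r.get("seconds", 0.0) for r in results) / len(results),
        "total_cost_usd": sum(r.get("cost_usd", 0.0) for r in results),
    }
```
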
4

Three-Step Human Validation

Step 1: Spot Check

Manual review of 38 representative questions to verify AI-generated queries return correct results

Step 2: Unsolved Review

Examination of all questions no AI model could solve to identify and remove “poisoned” or unsolvable tests

Step 3: Cross-Validation

Comprehensive review using a dashboard to catch ambiguous questions with multiple valid answers

Quality Assurance Results:

📊 Final Dataset: 188 high-quality, validated test cases

🎯 Removed: 48 ambiguous or problematic questions

✅ Verification: 100% accuracy in spot-check validation

🔍 Transparency: All validation steps documented

Benchmark Impact & Results

Our rigorous methodology provides the cybersecurity community with reliable, actionable insights into AI model capabilities for threat detection automation.

  • Top Model Success Rate (O1-high): 63.3%
  • AI Models Evaluated: 14+
  • Total Benchmark Cost: $254
  • Average Execution Time: 6.93s

Ready to Explore the Results?

Dive into our interactive dashboard to compare model performance, analyze cost-effectiveness, and explore detailed results from all 188 benchmark scenarios.

View Benchmark Results