grok-3-mini-beta vs o1-low KQL Benchmark
o1-low wins by 4.8 percentage points
Compared on 188 shared test questions
Overall Accuracy
grok-3-mini-beta: 58.5% (110 / 188 correct)
o1-low: 63.3% (119 / 188 correct)
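The accuracy figures follow directly from the correct counts over the 188 shared questions. As a minimal sketch (the function and variable names are illustrative, not part of the benchmark), the percentages and the gap between them can be reproduced like this:

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage of questions answered correctly."""
    return 100.0 * correct / total

TOTAL_QUESTIONS = 188  # shared test questions reported above

grok_acc = accuracy_pct(110, TOTAL_QUESTIONS)  # grok-3-mini-beta
o1_acc = accuracy_pct(119, TOTAL_QUESTIONS)    # o1-low

# The headline gap is a difference of two percentages,
# so it is measured in percentage points.
gap_points = o1_acc - grok_acc

print(f"grok-3-mini-beta: {grok_acc:.1f}%")  # 58.5%
print(f"o1-low: {o1_acc:.1f}%")              # 63.3%
print(f"gap: {gap_points:.1f} points")       # 4.8 points
```

In absolute terms the gap is 9 questions out of 188.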
Average Cost per Query
grok-3-mini-beta: $0.0040
o1-low: $0.4994
o1-low costs 12,496.0% more (about 126 times as much)
Average Execution Time
grok-3-mini-beta: 22.38s
o1-low: 50.90s
o1-low takes 127.4% longer
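The "% more" and "% longer" figures are relative increases over the grok-3-mini-beta baseline. The sketch below recomputes them from the rounded values displayed above; the helper name `pct_more` is hypothetical. The time figure reproduces exactly, while the cost figure comes out near 12385% rather than 12496.0%. That difference would be consistent with the page computing from unrounded averages (for example, a grok-3-mini-beta cost slightly below $0.0040), though that is an assumption, not something the page states.

```python
def pct_more(baseline: float, other: float) -> float:
    """Relative increase of `other` over `baseline`, in percent."""
    return 100.0 * (other - baseline) / baseline

# Rounded averages as displayed on the page.
time_more = pct_more(22.38, 50.90)    # seconds per query
cost_more = pct_more(0.0040, 0.4994)  # dollars per query

print(f"time: {time_more:.1f}% longer")  # 127.4% longer
print(f"cost: {cost_more:.1f}% more")    # 12385.0% more (from rounded inputs)
```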
Question-by-Question Comparison
Detailed comparison showing where each model succeeded or failed
Showing 1 to 25 of 188 questions
Page 1 of 8
Technique | Question | grok-3-mini-beta | o1-low | Result |
T1057 | On a Windows device, review the process execution logs to find instances where a built-in listing tool was piped into a string filter. Identify the process name that the attacker was searching for. | ✓ | ✗ | grok-3-mini-beta Wins |
T1057 | On a Windows host, investigate process events to find when Task Manager was launched via cmd with an unusual flag. What was the full command executed? | ✓ | ✗ | grok-3-mini-beta Wins |
T1059.004 | On a Linux system, analyze the process logs for suspicious command line activity that includes a sequence of commands indicating a pipe-to-shell operation. Identify the tool that was used to execute this piped command, paying special attention to its use in downloading and running script content. | ✓ | ✗ | grok-3-mini-beta Wins |
T1070.003 | On a Windows endpoint, review process execution logs to see if any PowerShell sessions were wiped clean. Which command was executed to clear the PowerShell history? | ✓ | ✗ | grok-3-mini-beta Wins |
T1069.001 | Investigate Windows process execution logs for a PowerShell cmdlet used to list group members. Look for entries where a group name is provided after a '-Name' flag and identify which group was queried. | ✓ | ✗ | grok-3-mini-beta Wins |
T1070.003 | On a Linux endpoint, you suspect malicious clearing of the bash history by redirecting from the null device. Explore process or file events to uncover the exact shell command that performed this action. | ✓ | ✗ | grok-3-mini-beta Wins |
T1070.008 | An attacker on Linux used bash to copy all files from /var/spool/mail into a newly created subdirectory before modifying them. What is the name of that subdirectory? | ✓ | ✗ | grok-3-mini-beta Wins |
T1082 | On Windows systems, identify when the built-in Shadow Copy utility is used to enumerate existing snapshots. What was the full command executed? | ✓ | ✗ | grok-3-mini-beta Wins |
T1082 | A Windows system shows a cmd.exe process spawn that appears to have been used for environment discovery. Review the process creation records to identify the exact command the adversary ran to enumerate environment variables. | ✓ | ✗ | grok-3-mini-beta Wins |
T1197 | A suspicious BITS transfer was orchestrated via bitsadmin.exe on Windows, creating a job to download and then execute a payload. Investigate the process event logs to determine what custom job name was specified when the BITS job was created. | ✓ | ✗ | grok-3-mini-beta Wins |
T1497.003 | On a Linux host, identify any processes that used ping with a large count value to introduce a delay before launching another process. What was the command executed immediately after the ping delay? | ✓ | ✗ | grok-3-mini-beta Wins |
T1546.004 | A suspicious file modification on a Linux device targeted the ~/.bash_profile file, apparently adding a new line. What was the full command string that was appended? | ✓ | ✗ | grok-3-mini-beta Wins |
T1547.014 | A Windows endpoint shows an Active Setup entry under Internet Explorer Core Fonts being altered with a StubPath value. Investigate the registry events and identify the payload that was set. | ✓ | ✗ | grok-3-mini-beta Wins |
T1557.001 | On Windows devices, hunt for PowerShell activity where a remote script is fetched and executed to perform LLMNR/NBNS spoofing. Which cmdlet kicked off the listener? | ✓ | ✗ | grok-3-mini-beta Wins |
T1562.004 | On a Windows device, a new inbound firewall rule was created unexpectedly. Review process execution records to identify the command-line utility responsible for adding the rule. | ✓ | ✗ | grok-3-mini-beta Wins |
T1562.004 | Investigate Windows registry modification events to find the name of the registry value that was changed under the WindowsFirewall policy path when someone turned the firewall off. | ✓ | ✗ | grok-3-mini-beta Wins |
T1562 | Review Linux process execution logs to find where the system journal service was stopped. Which utility was invoked to disable journal logging? | ✓ | ✗ | grok-3-mini-beta Wins |
T1562.006 | A .NET tracing environment variable was turned off in a user’s registry on a Windows system. Which built-in command-line tool was used to make this registry change? | ✓ | ✗ | grok-3-mini-beta Wins |
T1622 | On the Windows device, a security check was run to detect debugger processes via PowerShell. Which tool (process) carried out this check? | ✓ | ✗ | grok-3-mini-beta Wins |
T1003.001 | Using Windows process event logs, investigate PowerShell activity around lsass.exe memory capture. What was the name of the script file invoked to perform the dump? | ✗ | ✓ | o1-low Wins |
T1016.001 | On a Linux host, a ping command was executed to test internet connectivity. Determine which IP address was used as the ping target. | ✗ | ✓ | o1-low Wins |
T1016.001 | An analyst notices a PowerShell process on a Windows host that appears to be checking SMB connectivity. Which PowerShell cmdlet was executed to perform this outbound port 445 test? | ✗ | ✓ | o1-low Wins |
T1018 | Review Linux process execution records for any commands that list TCP metric cache entries and filter out loopback interfaces. Which utility was used? | ✗ | ✓ | o1-low Wins |
T1016 | A Linux host’s Syslog shows a shell-based network discovery script ran multiple commands. One of them listed current TCP connections. Which utility was invoked? | ✗ | ✓ | o1-low Wins |
T1036.004 | A threat actor on a Windows system crafted and registered a service named almost identically to the standard time service, but redirecting execution to a custom script. Review the logging data to determine which native command-line tool was used to perform this action. What utility was invoked? | ✗ | ✓ | o1-low Wins |
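Of the 25 rows on this page, 19 go to grok-3-mini-beta and 6 to o1-low. A tally like that can be produced from the pipe-delimited rows with a small parser. This is a sketch under the assumption that each row ends with the two ✓/✗ cells followed by the result cell, as in the rows above; `parse_row` and the dictionary keys are illustrative names. Only two rows from the table are included to keep the example short.

```python
from collections import Counter

def parse_row(line: str) -> dict:
    """Split one table row into its cells.

    The last three cells are read from the right so that a question
    containing a '|' character would not shift the result columns.
    """
    cells = [c.strip() for c in line.strip().rstrip("|").split("|")]
    return {
        "technique": cells[0],
        "question": " | ".join(cells[1:-3]),
        "grok_correct": cells[-3] == "✓",
        "o1_correct": cells[-2] == "✓",
        "result": cells[-1],
    }

rows = [
    "T1622 | On the Windows device, a security check was run to detect "
    "debugger processes via PowerShell. Which tool (process) carried out "
    "this check? | ✓ | ✗ | grok-3-mini-beta Wins |",
    "T1016.001 | On a Linux host, a ping command was executed to test "
    "internet connectivity. Determine which IP address was used as the "
    "ping target. | ✗ | ✓ | o1-low Wins |",
]

parsed = [parse_row(r) for r in rows]
tally = Counter(p["result"] for p in parsed)

for result, count in tally.items():
    print(f"{result}: {count}")
```

Grouping by `technique` instead of `result` would show which ATT&CK technique IDs each model handles better.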