Researchers at RIT create a new benchmark that tests how well artificial intelligence models understand cyber threats

Rochester, New York – At a time when artificial intelligence (AI) tools are rapidly entering high-stakes industries like cybersecurity, a team at Rochester Institute of Technology (RIT) has developed a new way to find out just how smart—and safe—these systems really are. The new tool, known as CTIBench, acts like a test for AI models, scoring them on their ability to perform cybersecurity-related tasks. The tool is already being used by tech giants like Google, Cisco, and Trend Micro to evaluate how reliable their AI models are in detecting and responding to cyber threats.

CTIBench, short for Cyber Threat Intelligence Benchmark, is designed to evaluate large language models (LLMs)—the same kind of AI that powers assistants like ChatGPT—in the context of Cyber Threat Intelligence (CTI). CTI is a key part of modern security operations, helping analysts stay ahead of cybercriminals by identifying, understanding, and countering threats before damage is done.

While AI is being increasingly integrated into security systems—Microsoft Copilot being one of the more prominent examples—the question of trust remains. Can these systems be relied upon when the stakes are high?

“Is the LLM reliable and trustworthy? Can I ask it a question and expect a good answer? Will it hallucinate?” asked Nidhi Rastogi, assistant professor in RIT’s Department of Software Engineering and lead researcher of the CTIBench project.

Until now, there wasn’t a clear way to measure the performance of LLMs in the CTI space. CTIBench changes that. It is the first comprehensive benchmark suite focused solely on evaluating AI tools in cybersecurity intelligence roles.

The tool simulates the kind of work done by analysts at Security Operations Centers (SOCs). The AI is tested through five main benchmarks, completing tasks like mapping the root cause of incidents, evaluating vulnerabilities, and assigning Common Vulnerability Scoring System (CVSS) scores. Each task is designed to see how deeply the AI understands the technical world of cybersecurity.
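To make that concrete, here is a rough sketch of what one such scoring task could look like in code. It is an illustration only: the item fields, prompt wording, and scoring rule are assumptions made for this article, not CTIBench's actual implementation.

```python
# Illustrative sketch of a CVSS-style severity task. The item format,
# prompt, and scoring rule are assumptions, not CTIBench's real design.
from statistics import mean

def ask_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation."""
    raise NotImplementedError

def score_severity_task(items: list[dict]) -> float:
    """Prompt the model with vulnerability descriptions and report the
    mean absolute error of its predicted CVSS base scores."""
    errors = []
    for item in items:
        prompt = (
            "Read the vulnerability description and reply with only "
            "the CVSS v3.1 base score (0.0-10.0).\n\n"
            + item["description"]
        )
        try:
            predicted = float(ask_model(prompt).strip())
        except ValueError:
            predicted = 0.0  # an unparseable answer counts as a miss
        errors.append(abs(predicted - item["cvss_score"]))
    return mean(errors)
```

A lower error would mean the model's severity judgments track those of human analysts more closely.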

“One of the most challenging and rewarding aspects was designing appropriate tasks to quantitatively evaluate the capabilities of LLMs in the domain of Cyber Threat Intelligence,” said Md Tanvirul Alam, a Ph.D. student in computing and information sciences who helped build CTIBench.

To populate the benchmark, the team generated 2,000 cybersecurity questions using ChatGPT. These weren't surface-level prompts; producing realistic questions took extensive trial and error and careful prompt engineering. All of them were later reviewed and validated by cybersecurity professionals and graduate students, and they cover topics ranging from basic terminology to advanced National Institute of Standards and Technology (NIST) guidelines.
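As a rough illustration of that kind of workflow, the snippet below drafts a single question with the OpenAI Python client. The model name, prompts, and topic are hypothetical, not the RIT team's actual pipeline, and, as their process underscores, any generated question would still need expert review.

```python
# A minimal sketch of LLM-assisted question drafting, assuming the
# OpenAI Python client. Model name and prompts are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_mcq(topic: str) -> str:
    """Ask the model for one multiple-choice CTI question. Drafts must
    still be validated by human experts before entering a benchmark."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You write exam questions for cyber threat "
                        "intelligence analysts."},
            {"role": "user",
             "content": f"Write one multiple-choice question about "
                        f"{topic} with four options (A-D), and mark "
                        f"the correct answer."},
        ],
    )
    return response.choices[0].message.content

print(draft_mcq("NIST incident-response guidelines"))
```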

The project began when the RIT team worked on another benchmark called SECURE, which focused on testing AI in industrial control systems. That experience highlighted the need for broader tools. “Since there was no reliable benchmark for CTI, we felt it was the right time to build one,” said Ph.D. student Dipkamal Bhusal.

CTIBench was developed under the umbrella of RIT’s AI4Sec Research Lab, directed by Rastogi. She and her team, including Alam, Bhusal, and fellow Ph.D. student Le Nguyen, spent months developing, refining, and testing CTIBench across five different LLMs. Each model was given a score based on how accurately it handled the cybersecurity tasks.

The resulting paper, "CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence," was published at NeurIPS 2024, the prestigious Conference on Neural Information Processing Systems. Out of thousands of submissions, CTIBench was selected as a spotlight paper, placing it among the top 2 percent of accepted research at the conference.

Since its debut, CTIBench has been made free and open access, available for anyone to use via the Hugging Face API and GitHub. And it’s already attracting serious attention from major companies. Google is using the tool to evaluate its new cybersecurity-focused model, Sec-Gemini v1. Cisco and Trend Micro are also applying the benchmark to their own AI models, testing them for real-world use.
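For readers who want to try it themselves, a benchmark hosted on Hugging Face can typically be pulled down in a few lines with the datasets library. The repository and subset names below are assumptions; the project's GitHub and Hugging Face pages list the exact identifiers.

```python
# Sketch of loading a Hugging Face-hosted benchmark with the `datasets`
# library. The repo id and subset name below are assumptions; consult
# the project's pages for the exact identifiers.
from datasets import load_dataset

dataset = load_dataset("AI4Sec/cti-bench", "cti-mcq", split="test")

for row in dataset.select(range(3)):  # inspect a few benchmark items
    print(row)
```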

Even Yahoo Security’s Chris Madden has taken note, incorporating CTIBench insights into a Common Weakness Enumeration benchmarking effort in partnership with the MITRE Corporation.

“We should embrace using AI, but there should always be a human in the loop,” Rastogi said. “That’s why we are creating benchmarks—to see what these models are good at and what their capabilities are. We’re not blindly following AI but smartly integrating it into our lives.”

As AI continues to evolve, the work done by the RIT team highlights a crucial reality: tools like CTIBench aren’t just academic exercises—they’re essential guides in deciding when and how we can trust machines in critical roles.

“The quick adoption of CTIBench validates our research impact and positions us well in cybersecurity and LLM research,” said Rastogi. “This is opening doors to new collaborations, funding, and real-world industry impact.”

In a world where cyber threats grow more complex and AI systems more powerful, CTIBench serves as a timely and necessary reminder: intelligence may be artificial, but trust has to be earned.

