AI Does Well At Detecting And Defeating Malicious Code...Doesn't It?



I am more optimistic about artificial intelligence (AI) than our other leaders at Pythia Cybersecurity. I'm not going to name names. And I've written about how to employ AI to create white-box cybersecurity.

Given all that, it's fair to say I have an appreciation for AI in cybersecurity.

Thus, let's discuss the DARPA AIxCC competition and consider how well the AI actually did.

DARPA, the Defense Advanced Research Projects Agency, has as its mission the development and implementation of emerging technologies for the US military. This year DARPA ran a competition, the AI Cyber Challenge (AIxCC), in which teams implemented AI in order to (per DARPA) "demonstrate the ability of novel autonomous systems using AI to secure the open-source software that underlies critical infrastructure."

This was a fabulous and high-paying competition, with a $4MM first prize. Bravo to Team Atlanta, the winners!

What were the final stats of this victory?

From DARPA:

"In the Final Competition scored round, teams’ systems attempted to identify and generate patches for synthetic vulnerabilities across 54 million lines of code. Since the competition was based on real-world software, team CRSs [cyber reasoning systems] could discover vulnerabilities not intentionally introduced to the competition. The scoring algorithm prioritized competitors’ performance based on the ability to create patches for vulnerabilities quickly and their analysis of bug reports. The winning team performed best at finding and proving vulnerabilities, generating patches, pairing vulnerabilities and patches, and scoring with the highest rate of accurate and quality submissions.

In total, competitors’ systems discovered 54 unique synthetic vulnerabilities in the Final Competition’s 70 challenges. Of those, they patched 43.

In the Final Competition, teams also discovered 18 real, non-synthetic vulnerabilities that are being responsibly disclosed to open source project maintainers. Of these, six were in C codebases—including one vulnerability that was discovered and patched in parallel by maintainers—and 12 were in Java codebases. Teams also provided 11 patches for real, non-synthetic vulnerabilities."

Moreover:

"Competitor CRSs proved they can create valuable bug reports and patches for a fraction of the cost of traditional methods, with an average cost per competition task of about $152. Bug bounties can range from hundreds to hundreds of thousands of dollars.

AIxCC technology has advanced significantly from the Semifinal Competition held in August 2024. In the Final Competition scored round, teams identified 77% of the competition’s synthetic vulnerabilities, an increase from 37% at semifinals, and patched 61% of the vulnerabilities identified, an increase from 25% at semifinals. In semifinals, teams were most successful in finding and patching vulnerabilities in C codebases. In finals, teams had similar success rates at finding and patching vulnerabilities across C codebases and Java codebases."

That is, competitors collectively found 54 of the competition's 70 synthetic vulnerabilities, a hit rate of 77.1%, and patched 43 of those 54 -- a success rate of 79.6%. A less optimistic way to look at it is that 43 out of 70 vulnerabilities were patched, or 61.4%.
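For anyone who wants to check the arithmetic (or run it against a vendor's own claims), here is a minimal sketch using the figures quoted from DARPA above; the variable names are mine, not DARPA's:

    # Sanity-check of the AIxCC Final Competition figures quoted above.
    synthetic_total = 70  # synthetic vulnerabilities in the final round
    found = 54            # unique synthetic vulnerabilities discovered
    patched = 43          # of those found, how many were patched

    hit_rate = found / synthetic_total              # 54/70 = 77.1%
    patch_rate_of_found = patched / found           # 43/54 = 79.6%
    patch_rate_overall = patched / synthetic_total  # 43/70 = 61.4%

    print(f"Found: {hit_rate:.1%}")
    print(f"Patched (of those found): {patch_rate_of_found:.1%}")
    print(f"Patched (of all 70): {patch_rate_overall:.1%}")

Note that the same three numbers support both the optimistic framing (79.6%) and the pessimistic one (61.4%). That is exactly why, when a vendor quotes you a success rate, you should ask which denominator they are using.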

If your organization decides to go with an AI-based cybersecurity platform, which might well be in your future, you must insist on evaluating vendors carefully. User experience and testimonials are not valid primary criteria for evaluating an AI platform; they are secondary or tertiary factors at best. Instead, you need third-party evidence of performance. Furthermore, no AI platform is going to succeed all the time. Based on the data presented here, a range of 61% to 80% may represent the high end; review the numbers and decide what your bogey is.

To be pessimistic for a moment: the people you are working against are not resting on one set of attacks but are steadily increasing their capabilities. If your AI platform is not adapting, then you're wasting money.

Other Pythia Cybersecurity leaders had a different, more cynical observation: do you really consider 80% to be good? From my perspective (and since I'm writing this and they are not), the human in the loop, armed with the best processes and an AI shield, is the most necessary component of a successful cybersecurity platform.

Ask us how we can help you put the right human in the loop.

