SonarQube Enters the Security Realm and Makes a Good First Showing
For the last year, we’ve been quietly working to add security-related rules in SonarQube’s language plugins. At September’s SonarQube Geneva User Conference we stopped being quiet about it.
About a year ago, we realized that our tools were beginning to reach the maturity levels required to offer not just maintainability rules, but bug and security-related rules too, so we set our sights on providing an all-in-one tool and started an effort to specify and implement security-related rules in all languages. Java has gotten the furthest; it currently has nearly 50 security-related rules. Together, the other languages offer another 50 or so.
That may not sound like a lot, but I’m pleased with our progress, particularly when tested against the OWASP Benchmark project. If you’ve heard of OWASP before, it was probably in the context of the OWASP Top 10, but OWASP is an umbrella organization with multiple projects under it (kinda like the Apache Foundation). The Top 10 is OWASP’s flagship project, and the benchmark is an up-and-comer.
The benchmark offers ~2700 Java servlets that do and do not demonstrate vulnerabilities corresponding to 11 different CWE items. The CWE (Common Weakness Enumeration) contains about 1,000 items, and broadly describes patterns of insecure and weak code.
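To make that concrete, here's a simplified sketch (my own illustration, not an actual benchmark test case) of the kind of vulnerable/safe pairing the benchmark uses for SQL Injection (CWE-89). The class and method names are hypothetical:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SqlInjectionExamples {

    // Vulnerable: user input is concatenated straight into the SQL text
    // (CWE-89). Input like "x' OR '1'='1" rewrites the query's meaning,
    // so a scanner should flag this -- a true positive.
    static String buildUnsafeQuery(String userInput) {
        return "SELECT * FROM users WHERE name = '" + userInput + "'";
    }

    // Safe: the value travels as a bound parameter, never as SQL text.
    // Flagging this would count as a false positive.
    static ResultSet safeQuery(Connection conn, String userInput) throws SQLException {
        PreparedStatement ps =
            conn.prepareStatement("SELECT * FROM users WHERE name = ?");
        ps.setString(1, userInput);
        return ps.executeQuery();
    }
}
```

A tool's score depends on catching the first pattern while leaving the second alone.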
The guys behind the benchmark are testing all the tools they can get their hands on and publishing the results. For commercial tools, they’re only publishing an average score (because the tool licenses don’t allow them to publish individual, named scores). For open source tools, they’re naming names. :-)
When I prepared my slides for my “Security Rules in SonarQube” talk, the SonarQube Java Plugin arguably had the best score, finding 50% of the things we’re supposed to and only flagging 17% of the things we should have ignored, for an overall score of 33% (50-17 = 33). Compare that to the commercial average, which has a 53% True Positive Rate and 28% False Positive Rate for a final score of 26%. Since then, a new version of Find Security Bugs has been released, and its spot on the graph has jumped some, but I’m still quite happy with our score, both in relative and absolute terms. Here’s the summary graph presented on the site:
Notice that the dots are positioned based on the True Positive Rate (y-axis) and False Positive Rate (x-axis). Find Security Bugs is higher on the True Positive axis than SonarQube, which threw me for a minute, but it’s also further out on the False Positive axis. That’s why I graphed the tools’ overall scores:
Looked at this way, it’s probably quite clear why I’m still happy with the SonarQube Java scores. But I’ll give you some detail to show that it isn’t (merely) about one-upmanship:
This graph shows the Java plugin’s performance on each of the 11 CWE code sets individually. I’ll start with the five 0/0 scores in the bottom-left. For B, E, G, and K we don’t yet have any rules implemented (they’re “coming soon”). So… yeah, we’re quite happy to score a 0 there. :-) For F, SQL Injection, we have a rule, but every example of the vulnerability in this benchmark slips through a hole in it. (That should be fixed soon.) On a previous version of the benchmark, we got a better score for SQL Injection, but with the newest iteration, the code has been pared from 21k files to 2.7k, and apparently all the ones we were finding got eliminated. That’s life.
For A and D, it’s interesting to note that while the dots are placed toward the upper-right of the graph, they have scores of -2% and 0% respectively. That’s because the false positives cancelled out the true positives in the scoring. Clearly, we’d rather see a lower false positive rate, but we knew we’d hit some FPs when we decided to write security rules. And with a mindset that security-related issues require human verification, this isn’t so bad. After all, what’s worse: manually eliminating false positives, or missing a vulnerability because of a false negative?
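The scoring scheme above boils down to simple subtraction: a tool’s score is its True Positive Rate minus its False Positive Rate, which is how a category can go negative. A minimal sketch of that arithmetic (percentages as integers; the -2% example uses made-up rates for illustration):

```java
public class BenchmarkScore {

    // Overall benchmark score: True Positive Rate minus False Positive
    // Rate, both expressed as whole percentages.
    static int score(int tprPercent, int fprPercent) {
        return tprPercent - fprPercent;
    }

    public static void main(String[] args) {
        // The SonarQube Java plugin's overall result: 50% TPR, 17% FPR.
        System.out.println(score(50, 17)); // 33

        // A category where FPs outpace TPs goes negative; these rates
        // are hypothetical, chosen only to show how a -2% can happen.
        System.out.println(score(10, 12)); // -2
    }
}
```

So a dot that sits high on the True Positive axis can still carry a low or negative score if it also sits far out on the False Positive axis.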
For ‘I’, we’ve got about the best score we can get. The cases we’re missing are designed to be picked up only by dynamic analysis. Find Security Bugs gets the same score on this one: 68%.
For the rest, C, H, and J, we’ve got perfect scores: a 100% True Positive Rate and a 0% False Positive Rate. Woo hoo!
Of course, saying we’ve got 100% on item C or 33% overall is only a reflection of how we’re doing on those particular examples. We do better on some vulnerabilities and worse on others. Over time, I’m sure the benchmark will grow to cover more CWE items and cover in more depth the items it already touches on. As it does, we’ll continue to test ourselves against it to see what we’ve missed and where our holes are. I’m sure our competitors will too, and we’ll all get gradually better. That’s good for everybody. But you won’t be surprised if I say we’ll stay on top of making sure SonarQube is always the best.