C/C++/Objective-C: Dark past, bright future

We’ve just released version 3.3 of the C/C++/Objective-C plugin, which features increased scope and precision of analysis for C, as well as detection of real bugs in C code, such as null pointer dereferences and type-related bugs. These improvements were made possible by the addition of semantic analysis and symbolic execution, which analyze not the structure of your code, but what the code is actually doing.

Semantic analysis was part of the original goal set for the plugin about three years ago. Of course, the goal was broader than that: develop a static analyzer for C++. The analyzer needed to continuously check your code’s conformance with your coding standards and practices, and more importantly, detect bugs and vulnerabilities to help you keep technical debt under control.

At the time, we didn’t think it would be hard, because many languages were already in our portfolio, including Java, COBOL, and PL/SQL. Our best engineers, Freddy Mallet and Dinesh Bolkensteyn, were already working on C, the natural predecessor of C++. I joined them, and together we started work on C++. With the benefit of hindsight, I can say that we were all blind. Totally blind. We had no idea what a difficult and ambitious task we had set ourselves.

You see, a static analyzer is a program which is able to precisely understand what another program does. And, roughly speaking, a bug is detected when this understanding is different from what the developer really wanted to write. Huh! Already, the task is complex, but it’s doubly so for C++. Why is automatic analysis of C++ so complicated?

First of all, both C and C++ have the concept of preprocessing. For example, consider this code:

struct command commands[] = { cmd(quit), cmd(help) };

One would think that there are two calls to the “cmd” function with the arguments “quit” and “help”. But that might not be the case if just before this line there’s a preprocessing directive:

#define cmd(name) { #name, name ## _command }

That directive completely changes the meaning of the original code, literally turning it into

struct command commands[] = { { "quit", quit_command }, { "help", help_command } };
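To try the expansion yourself, here is a minimal, self-contained sketch. The layout of “struct command”, the handler signature and the “find_command” helper are assumptions made up for illustration; only the macro and the array come from the example above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical handler signature; the article does not show one. */
typedef int (*command_fn)(void);

/* Hypothetical layout, chosen to match the two fields the macro produces. */
struct command {
    const char *name;  /* filled by the stringizing operator: #name */
    command_fn run;    /* filled by token pasting: name ## _command */
};

static int quit_command(void) { return 0; }
static int help_command(void) { return 1; }

/* cmd(quit) expands to { "quit", quit_command } before parsing starts. */
#define cmd(name) { #name, name ## _command }

static struct command commands[] = { cmd(quit), cmd(help) };

/* Look a command up by name; returns NULL when nothing matches. */
static const struct command *find_command(const char *name) {
    size_t count = sizeof commands / sizeof commands[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(commands[i].name, name) == 0) {
            return &commands[i];
        }
    }
    return NULL;
}
```

Nothing in this file ever calls a function named “cmd”. A tool that skips preprocessing would look for one anyway, and it would miss the references to “quit_command” and “help_command” entirely.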

The existence of the preprocessor complicates analysis on many different levels. Most importantly, the correct interpretation of preprocessing directives is crucial for the correctness and precision of an analysis. We rewrote our preprocessor implementation from scratch three times before we were satisfied with it. And it’s worth mentioning that among the static analyzers on the market (both commercial and open-source), you can easily find tools that don’t do preprocessing at all, or do it only imprecisely.

Let’s move to the next difficulty. I’ve mentioned in the past that C and C++ are hard to parse. It’s time to talk a little bit about why. Roughly speaking, parsing is the process of recognizing language constructions – i.e. seeing what’s a statement, what’s an expression, and so on. Let’s take some example code and try to figure out what it is.

T * a

If this were Java code, the answer would be straightforward: most probably this is multiplication, and part of a bigger expression. But the answer isn’t that simple for C/C++. In general, the answer is “it depends…” This could indeed be an expression statement, if both “T” and “a” are variables:

int T, a;
T * a;

But it could also be the declaration of variable “a” with a type of pointer to “T”, if “T” is a type:

typedef int T;
T * a;

In other words, the context can completely change the meaning of code. This is called ambiguity.
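Both readings can even live in one C file, because an inner scope may reuse a typedef name for a variable. The following is a small sketch of that; the function names and values are invented for illustration.

```c
typedef int T;  /* at file scope, T names a type */

/* Here T is a type, so "T * a;" declares a as a pointer to int. */
int as_declaration(void) {
    int value = 42;
    T * a;
    a = &value;
    return *a;
}

/* Here a local variable named T hides the typedef,
   so the same tokens "T * a" are a multiplication. */
int as_expression(void) {
    int T = 6, a = 7;
    return T * a;
}
```

A parser that sees only the tokens “T”, “*” and “a” cannot tell these two functions apart. It has to know what “T” refers to at that exact point in the code.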

Like natural languages, the grammars of programming languages can be ambiguous. While the C language has just a few ambiguous constructions, C++ has tons of them. And as you’ve seen, correct parsing is not possible without information about types. But getting that information is a difficulty in and of itself because it requires semantic analysis of language constructs before you can understand their types and relations. And that’s where it starts to be really complex. To parse we need semantic analysis, and to do semantic analysis we need to parse. Chicken and egg problem.

We had hit a wall, and when we looked around, we realized we weren’t alone. Many tools don’t even try to parse, get information about types or distinguish between ambiguous and unambiguous cases.

And then we found GLL (generalized LL), a relatively new algorithm for generalized parsing. It was first published in 2010, and there still aren’t any ready-to-use, publicly available implementations for Java. Implementing a GLL parser wasn’t easy and took us quite a while, but the ROI was high. This parser is able to preserve information about the ambiguities it encounters without actually resolving them. That allows us to do precise analysis of at least the unambiguous constructions without producing false positives on the ambiguous ones.
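To give a feel for why preserved ambiguities help, here is a deliberately simplified sketch. It is not our parser or its data model: the node kinds, the “struct node” layout and the rule are all hypothetical. The idea is only that a rule can report on nodes the parser is sure about and stay silent on nodes marked as ambiguous.

```c
#include <stddef.h>

/* Hypothetical node kinds; a real grammar has many more. */
enum node_kind { NODE_DECLARATION, NODE_MULTIPLICATION, NODE_AMBIGUOUS };

/* Hypothetical syntax tree node for one statement. */
struct node {
    enum node_kind kind;
    const char *text;                 /* source text covered by the node */
    const struct node *alternatives;  /* candidate parses when kind is NODE_AMBIGUOUS */
    size_t alternative_count;
};

/* Hypothetical rule: count statements that multiply and discard the result.
   Ambiguous statements are skipped, since one of their candidate parses
   may be a perfectly valid declaration. */
size_t count_discarded_multiplications(const struct node *statements, size_t count) {
    size_t issues = 0;
    for (size_t i = 0; i < count; i++) {
        switch (statements[i].kind) {
        case NODE_MULTIPLICATION:
            issues++;          /* unambiguous, safe to report */
            break;
        case NODE_AMBIGUOUS:   /* might be a declaration, stay silent */
        case NODE_DECLARATION:
            break;
        }
    }
    return issues;
}
```

With this shape, a statement such as “2 * y;” is reported because “2” can never be a type, while “T * a;” with an unknown “T” stays an ambiguous node and raises nothing.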

The GLL parser was a win-win and a game changer! About a year ago, after two years of development from the first commit, we released precise preprocessing and parsing in version 2.0 of the C++ Plugin.

With the original goal well on the way to being met, we started to dream again, raised our expectations even higher, and were ready to welcome new developers. Today, I still work on the plugin, but it’s maintained primarily by Massimo Paladin and Samuel Mercier. They solved the analysis configuration problem and added support for Objective-C and Microsoft Component Extensions to the plugin.

Our next goal is to apply semantic analysis and symbolic execution to Objective-C and, of course, after that to C++, and to use them to cover more MISRA rules. So this is probably not the end of the story about the difficulties of developing a static analyzer for C/C++/Objective-C. Who knows what else we will encounter on the way? But now we are not blind as we were before; now we know that this is difficult. However, based on the past, I can say that we at SonarSource are unstoppable, and even the most incredible dreams come true! So keep dreaming! And just never, ever give up!
