A glimpse of undefined behavior in C

Reblogged from “A glimpse of undefined behavior in C” at Christopher Cole’s blog, “Christopher Cole” .

Hey! Today I was practicing my first year CS201 programming language, C, when I came across a puzzling situation. Several lines of the following code gave different output on different compilers. Which also differed from what the output should logically be.

#include
int main()
{
        int i=5;
        printf("%d %d %d\n",i,++i,i++); //7 7 5
        i=5;
        printf("%d %d %d\n",i,i++,++i); //7 6 7
        i=5;
        printf("%d %d %d\n",i,++i,++i); //7 7 7
        i=5;
        printf("%d %d %d\n",i,i++,i++); //7 6 5
        i=5;
        printf("%d %d %d\n",i++,i++,i); //6 5 7
        i=1;
        printf("%d %d %d\n",i++,i++,i); //2 1 3
        return 0;
}

And so I started googling the possible reasons for that happening. Several stack exchange pages told me that this is an “Undefined Behaviour” that happens in C and C++. But why would it be undefined? What really goes on behind the scenes? Asking my friends for help on facebook led me to the aforementioned blog post that breaks the program down to assembly language which helped me understand what was really going on. I’ve reblogged that post here. Hope it helps. Also, I should mention that this didn’t help me understand WHY it was happening. I merely assumed that it was happening because of the way the compiler processes the code.

Here is the reblogged post.

A few weeks ago a coworker of mine stopped by my desk with a coding question. We’d recently been quizzing each other over our C knowledge, so I smiled and braced myself for the hell which was clearly to come.

He wrote a few lines of code up on the whiteboard and asked what the program output:

#include
int main(){
int i = 0;
int a[] = {10,20,30};

int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
printf("%d\n", r);
return 0;
}

It seemed pretty straightforward. I explained the order of operator precedence – postfix binds before multiplication which binds before addition and also that multiplication and addition’s associativity was left to right, so I grabbed the marker and started writing out the arithmetic.

int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
// = a[0] + 2 * a[1] + 3 * a[2];
// = 10 + 40 + 90;
// = 140

After I smugly scribbled down the answer, my coworker replied a simple “nope”. I thought for a few minutes, and was stumped. I couldn’t quite remember the order of the postfix operators’ associativity. Furthermore, I knew that wouldn’t even change the order of evaluation here as associativity rules only apply between operators of the same precedence level, but thought I’d give it a shot and walk through the arithmetic as if the postfix operators were all evaluated right to left.

int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
// = a[2] + 2 * a[1] + 3 * a[0];
// = 30 + 40 + 30;
// = 100

Again, my coworker replied it was still wrong. At this point I threw up my hands and asked what the answer was. It turned out that this small sample was actually reduced from a larger code snippet he’d put together. To verify his question, he’d compiled and run the larger sample but was surprised to see it not behave as he’d thought. After reducing the unexpected behavior to this illuminating example, he’d compiled it with gcc 4.7.3 and it output a startling result: “60″.

At this point I was intrigued. I remembered that, in C, the order of function argument evaluation was undefined, so we thought maybe the postfix operators were just being evaluated in some random non-left-to-right order. Still convinced that postfix had a higher operator precedence than addition and multiplication, we quickly proved to ourselves that there was no order we could evaluate the i++’s such that the three elements of the array were added/multiplied together to get “60″.

Now I was hooked. My first thought was to look at the disassembly of the snippet and try and track down what was actually going on. I compiled the sample with debugging symbols and after a quick objdump we had the source-annotated x86_64 disassembly:

Disassembly of section .text:

0000000000000000 :
#include

int main(){
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 20 sub $0x20,%rsp
int i = 0;
8: c7 45 e8 00 00 00 00 movl $0x0,-0x18(%rbp)
int a[] = {10,20,30};
f: c7 45 f0 0a 00 00 00 movl $0xa,-0x10(%rbp)
16: c7 45 f4 14 00 00 00 movl $0x14,-0xc(%rbp)
1d: c7 45 f8 1e 00 00 00 movl $0x1e,-0x8(%rbp)
int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
24: 8b 45 e8 mov -0x18(%rbp),%eax
27: 48 98 cltq
29: 8b 54 85 f0 mov -0x10(%rbp,%rax,4),%edx
2d: 8b 45 e8 mov -0x18(%rbp),%eax
30: 48 98 cltq
32: 8b 44 85 f0 mov -0x10(%rbp,%rax,4),%eax
36: 01 c0 add %eax,%eax
38: 8d 0c 02 lea (%rdx,%rax,1),%ecx
3b: 8b 45 e8 mov -0x18(%rbp),%eax
3e: 48 98 cltq
40: 8b 54 85 f0 mov -0x10(%rbp,%rax,4),%edx
44: 89 d0 mov %edx,%eax
46: 01 c0 add %eax,%eax
48: 01 d0 add %edx,%eax
4a: 01 c8 add %ecx,%eax
4c: 89 45 ec mov %eax,-0x14(%rbp)
4f: 83 45 e8 01 addl $0x1,-0x18(%rbp)
53: 83 45 e8 01 addl $0x1,-0x18(%rbp)
57: 83 45 e8 01 addl $0x1,-0x18(%rbp)
printf("%d\n", r);
5b: 8b 45 ec mov -0x14(%rbp),%eax
5e: 89 c6 mov %eax,%esi
60: bf 00 00 00 00 mov $0x0,%edi
65: b8 00 00 00 00 mov $0x0,%eax
6a: e8 00 00 00 00 callq 6f 
return 0;
6f: b8 00 00 00 00 mov $0x0,%eax
}
74: c9 leaveq
75: c3 retq

The first few and last few instructions just set up the stack frame, initial variable values, and do the printf call and return from main, so we really only cared about instructions 0×24 through 0×57. That’s where the interesting behavior was. Let’s walk through the instructions a few at a time:

24: 8b 45 e8 mov -0x18(%rbp),%eax
27: 48 98 cltq
29: 8b 54 85 f0 mov -0x10(%rbp,%rax,4),%edx

The first three instructions were as we expected. First, it loads the value of i (0) into register eax, sign-extends it to 64 bits, then loads a[0] into register edx. The 1 * was obviously optimized out by the compiler, but everything looked normal here. The next few started out much the same:

2d: 8b 45 e8 mov -0x18(%rbp),%eax
30: 48 98 cltq
32: 8b 44 85 f0 mov -0x10(%rbp,%rax,4),%eax
36: 01 c0 add %eax,%eax
38: 8d 0c 02 lea (%rdx,%rax,1),%ecx

The first mov loads the value of i (still 0) into eax, sign-extends it to 64-bits, then loads a[0] into eax. That’s intersting – we expected the i++ to have been performed before these three instructions again, but maybe the last two instructions will do some sort of assembly magic to get us to the expected result (2 * a[1]). The two instructions add eax to itself, effectively performing 2 * a[0], then add that result to the previous computation and stores it in ecx. At this point the instructions have evaluated a[0] + 2 * a[0]. Things are starting to look a little strange but again, maybe there’s some compiler magic going on.

3b: 8b 45 e8 mov -0x18(%rbp),%eax
3e: 48 98 cltq
40: 8b 54 85 f0 mov -0x10(%rbp,%rax,4),%edx
44: 89 d0 mov %edx,%eax

These next few instructions are starting to look pretty familiar. They load the value of i (still 0), sign extend it to 64-bits, load a[0] into edx, then copy edx into eax. Hmm, ok let’s take a look at a few more:

46: 01 c0 add %eax,%eax
48: 01 d0 add %edx,%eax
4a: 01 c8 add %ecx,%eax
4c: 89 45 ec mov %eax,-0x14(%rbp)

Here, it adds a[0] to itself 3 times, adds it the previous computation, then stores it into variable ‘r’. Now that’s wild – our variable r now contains a[0] + 2 * a[0] + 3 * a[0]. Sure enough, that’s what the program output: “60″. But what happened to those postfix operations? They’re all there at the end:

4f: 83 45 e8 01 addl $0x1,-0x18(%rbp)
53: 83 45 e8 01 addl $0x1,-0x18(%rbp)
57: 83 45 e8 01 addl $0x1,-0x18(%rbp)

It seemed our compiled version of the code just seemed completely wrong! Why had the postfix operations been thrown down at the end after the assignment had already happened? With my faith in reality dwindling, I decided to go straight to the source. No, not the compiler’s source – that’s just an implementation – I grabbed the C11 language specification.

The issue lay within the details of the postfix operations. In our case, we were thrice performing a postfix increment to our array index in a single expression. When evaluating a postfix operator it returns the initial value of the variable. The assignment of the new value back to the variable is known as a side effect. It turns out that side effects are only defined to have been committed between sequence points. See section 5.1.2.3 of the standard for the specifics of where sequence points are defined, but in our case our expression exhibits undefined behavior. It’s entirely up to the compiler when the side-effect of the assignment of the new value back to the variable is performed with respect to the other parts of the expression.

In the end, we both learned a bit more about the C language. It’s well-known best practice to avoid constructing complex expressions with prefix/postfix operators, and this is a great example of why that is the case.

The sample program and the output on my computer.
The sample program and the output on my computer.