A glimpse of undefined behavior in C
Reblogged from “A glimpse of undefined behavior in C” at Christopher Cole’s blog, “Christopher Cole” .
Hey! Today I was practicing my first year CS201 programming language, C, when I came across a puzzling situation. Several lines of the following code gave different output on different compilers. Which also differed from what the output should logically be.
#includeint main() { int i=5; printf("%d %d %d\n",i,++i,i++); //7 7 5 i=5; printf("%d %d %d\n",i,i++,++i); //7 6 7 i=5; printf("%d %d %d\n",i,++i,++i); //7 7 7 i=5; printf("%d %d %d\n",i,i++,i++); //7 6 5 i=5; printf("%d %d %d\n",i++,i++,i); //6 5 7 i=1; printf("%d %d %d\n",i++,i++,i); //2 1 3 return 0; }
And so I started googling the possible reasons for that happening. Several stack exchange pages told me that this is an “Undefined Behaviour” that happens in C and C++. But why would it be undefined? What really goes on behind the scenes? Asking my friends for help on facebook led me to the aforementioned blog post that breaks the program down to assembly language which helped me understand what was really going on. I’ve reblogged that post here. Hope it helps. Also, I should mention that this didn’t help me understand WHY it was happening. I merely assumed that it was happening because of the way the compiler processes the code.
Here is the reblogged post.
A few weeks ago a coworker of mine stopped by my desk with a coding question. We’d recently been quizzing each other over our C knowledge, so I smiled and braced myself for the hell which was clearly to come.
He wrote a few lines of code up on the whiteboard and asked what the program output:
#includeint main(){ int i = 0; int a[] = {10,20,30}; int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++]; printf("%d\n", r); return 0; }
It seemed pretty straightforward. I explained the order of operator precedence – postfix binds before multiplication which binds before addition and also that multiplication and addition’s associativity was left to right, so I grabbed the marker and started writing out the arithmetic.
int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++]; // = a[0] + 2 * a[1] + 3 * a[2]; // = 10 + 40 + 90; // = 140
After I smugly scribbled down the answer, my coworker replied a simple “nope”. I thought for a few minutes, and was stumped. I couldn’t quite remember the order of the postfix operators’ associativity. Furthermore, I knew that wouldn’t even change the order of evaluation here as associativity rules only apply between operators of the same precedence level, but thought I’d give it a shot and walk through the arithmetic as if the postfix operators were all evaluated right to left.
int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++]; // = a[2] + 2 * a[1] + 3 * a[0]; // = 30 + 40 + 30; // = 100
Again, my coworker replied it was still wrong. At this point I threw up my hands and asked what the answer was. It turned out that this small sample was actually reduced from a larger code snippet he’d put together. To verify his question, he’d compiled and run the larger sample but was surprised to see it not behave as he’d thought. After reducing the unexpected behavior to this illuminating example, he’d compiled it with gcc 4.7.3 and it output a startling result: “60″.
At this point I was intrigued. I remembered that, in C, the order of function argument evaluation was undefined, so we thought maybe the postfix operators were just being evaluated in some random non-left-to-right order. Still convinced that postfix had a higher operator precedence than addition and multiplication, we quickly proved to ourselves that there was no order we could evaluate the i++’s such that the three elements of the array were added/multiplied together to get “60″.
Now I was hooked. My first thought was to look at the disassembly of the snippet and try and track down what was actually going on. I compiled the sample with debugging symbols and after a quick objdump we had the source-annotated x86_64 disassembly:
Disassembly of section .text:
0000000000000000 : #includeint main(){ 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 48 83 ec 20 sub $0x20,%rsp int i = 0; 8: c7 45 e8 00 00 00 00 movl $0x0,-0x18(%rbp) int a[] = {10,20,30}; f: c7 45 f0 0a 00 00 00 movl $0xa,-0x10(%rbp) 16: c7 45 f4 14 00 00 00 movl $0x14,-0xc(%rbp) 1d: c7 45 f8 1e 00 00 00 movl $0x1e,-0x8(%rbp) int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++]; 24: 8b 45 e8 mov -0x18(%rbp),%eax 27: 48 98 cltq 29: 8b 54 85 f0 mov -0x10(%rbp,%rax,4),%edx 2d: 8b 45 e8 mov -0x18(%rbp),%eax 30: 48 98 cltq 32: 8b 44 85 f0 mov -0x10(%rbp,%rax,4),%eax 36: 01 c0 add %eax,%eax 38: 8d 0c 02 lea (%rdx,%rax,1),%ecx 3b: 8b 45 e8 mov -0x18(%rbp),%eax 3e: 48 98 cltq 40: 8b 54 85 f0 mov -0x10(%rbp,%rax,4),%edx 44: 89 d0 mov %edx,%eax 46: 01 c0 add %eax,%eax 48: 01 d0 add %edx,%eax 4a: 01 c8 add %ecx,%eax 4c: 89 45 ec mov %eax,-0x14(%rbp) 4f: 83 45 e8 01 addl $0x1,-0x18(%rbp) 53: 83 45 e8 01 addl $0x1,-0x18(%rbp) 57: 83 45 e8 01 addl $0x1,-0x18(%rbp) printf("%d\n", r); 5b: 8b 45 ec mov -0x14(%rbp),%eax 5e: 89 c6 mov %eax,%esi 60: bf 00 00 00 00 mov $0x0,%edi 65: b8 00 00 00 00 mov $0x0,%eax 6a: e8 00 00 00 00 callq 6f return 0; 6f: b8 00 00 00 00 mov $0x0,%eax } 74: c9 leaveq 75: c3 retq
The first few and last few instructions just set up the stack frame, initial variable values, and do the printf call and return from main, so we really only cared about instructions 0×24 through 0×57. That’s where the interesting behavior was. Let’s walk through the instructions a few at a time:
24: 8b 45 e8 mov -0x18(%rbp),%eax 27: 48 98 cltq 29: 8b 54 85 f0 mov -0x10(%rbp,%rax,4),%edx
The first three instructions were as we expected. First, it loads the value of i (0) into register eax, sign-extends it to 64 bits, then loads a[0] into register edx. The 1 * was obviously optimized out by the compiler, but everything looked normal here. The next few started out much the same:
2d: 8b 45 e8 mov -0x18(%rbp),%eax 30: 48 98 cltq 32: 8b 44 85 f0 mov -0x10(%rbp,%rax,4),%eax 36: 01 c0 add %eax,%eax 38: 8d 0c 02 lea (%rdx,%rax,1),%ecx
The first mov loads the value of i (still 0) into eax, sign-extends it to 64-bits, then loads a[0] into eax. That’s intersting – we expected the i++ to have been performed before these three instructions again, but maybe the last two instructions will do some sort of assembly magic to get us to the expected result (2 * a[1]). The two instructions add eax to itself, effectively performing 2 * a[0], then add that result to the previous computation and stores it in ecx. At this point the instructions have evaluated a[0] + 2 * a[0]. Things are starting to look a little strange but again, maybe there’s some compiler magic going on.
3b: 8b 45 e8 mov -0x18(%rbp),%eax 3e: 48 98 cltq 40: 8b 54 85 f0 mov -0x10(%rbp,%rax,4),%edx 44: 89 d0 mov %edx,%eax
These next few instructions are starting to look pretty familiar. They load the value of i (still 0), sign extend it to 64-bits, load a[0] into edx, then copy edx into eax. Hmm, ok let’s take a look at a few more:
46: 01 c0 add %eax,%eax 48: 01 d0 add %edx,%eax 4a: 01 c8 add %ecx,%eax 4c: 89 45 ec mov %eax,-0x14(%rbp)
Here, it adds a[0] to itself 3 times, adds it the previous computation, then stores it into variable ‘r’. Now that’s wild – our variable r now contains a[0] + 2 * a[0] + 3 * a[0]. Sure enough, that’s what the program output: “60″. But what happened to those postfix operations? They’re all there at the end:
4f: 83 45 e8 01 addl $0x1,-0x18(%rbp) 53: 83 45 e8 01 addl $0x1,-0x18(%rbp) 57: 83 45 e8 01 addl $0x1,-0x18(%rbp)
It seemed our compiled version of the code just seemed completely wrong! Why had the postfix operations been thrown down at the end after the assignment had already happened? With my faith in reality dwindling, I decided to go straight to the source. No, not the compiler’s source – that’s just an implementation – I grabbed the C11 language specification.
The issue lay within the details of the postfix operations. In our case, we were thrice performing a postfix increment to our array index in a single expression. When evaluating a postfix operator it returns the initial value of the variable. The assignment of the new value back to the variable is known as a side effect. It turns out that side effects are only defined to have been committed between sequence points. See section 5.1.2.3 of the standard for the specifics of where sequence points are defined, but in our case our expression exhibits undefined behavior. It’s entirely up to the compiler when the side-effect of the assignment of the new value back to the variable is performed with respect to the other parts of the expression.
In the end, we both learned a bit more about the C language. It’s well-known best practice to avoid constructing complex expressions with prefix/postfix operators, and this is a great example of why that is the case.
Add the text of this. Even I did some research on it afterwards.
http://chatsagnik.quora.com/