CppCon 2017: J. McNellis, J. Mola, K. Sykes “Time Travel Debugging…”

Just another WordPress site

CppCon 2017: J. McNellis, J. Mola, K. Sykes “Time Travel Debugging…”

– [Jordi] Thank you everyone for coming It’s a privilege to be here today with my colleagues Ken – Hello – I’m James – It’s nice to see you all here this morning – My name is Jordi Mola I’m the architect for this time travel debugging tools and we are happy to present them to you today So I got the clue from a great presenter one time to start with questions up front So I’m going to open Notepad and type up a few questions from the audience, if there is any question before we start? Yes, go ahead – I wonder how this differs from RR I mean, aside from the fact that the platforms are mutually exclusive – How does this differ from RR? OK There was another question here up front Sorry? – Windows only – Windows only? – And which Windows? – Oh Which versions of Windows? OK Let’s just start with those and then I’m sure that there will be more questions as we go forward So What is Time Travel Debugging? OK Time Travel Debugging, TTD, not test development, just Time Travel Debugging, for short, TTD, is a reverse debugging solution So we record the execution of the process Think of it as a movie We take a movie of the process execution As a movie, when you have a DVR or a DVD or some of those things, you can go forward, some backwards, to look for things But also you can search through the trace using queries Ken will do a demo on how to do queries to search through the trace When we look at debugging, we set out to solve a few problems Debugging is time consuming I’m sure all of us here have a single step through a function painfully to make sure we didn’t get past the point of the function that was of interest, and then spend an hour just single stepping to not yet be at the point we wanted Debugging is really hard We actually will do a demo today of debugging a garbage collection bug that can be very nasty to debug without some of these tools Debugging is complex So we made sure that we didn’t have to learn new things All of the things that you are used to, when you debug live, break points, single stepping, going, memory break points, work on backwards debugging And finally, this is a pet peeve of mine, the number of many times you have to repo a bug when you are debugging normally You have to go and, oh, I learned something, start over Because it was a little different than I thought Now I learned something else and start over and in time travel debugging, you just simply have to go back Some of us have used a trick of binary searching a bug Which is look for a place the bug happening, not yet, OK, this isn’t the second half of the trace And move on A little bit of background and key features for this, this is based on a Microsoft research project that was internally available about 10 years ago But three, four years ago, we decided to make this publicly available, so we went in and did work to convert the, excuse me, research code into production code, that was an effort, we also feel that we are ready for preview, but it’s a preview, it’s not final, we still have some bugs in several areas to work on In particular, for those that have worked at Microsoft have used research technology, our replay is about two orders of magnitude faster than it was for the old tool Record is still as much overhead as it was before It’s something we will be working on in the upcoming months and years It’s written in C++, but it’s C++ of different kinds A recorder is a piece of code that we inject inside of the process that we are recording, so that uses no exceptions Cannot allocate memory It has a bunch of restrictions that resemble embedded systems On the other hand, the replay code is normal, modern C++ It uses coroutines, vectors, maps, the parallel library, and I mean, this is an enormous C++ piece of code Performance sensitive, but normal OK Our recorder is multicore That’s one of the differences with DRR projects since it was one of the questions We do not single core the execution of the process To actually record it If you have 50 runnable threads, and 50 cores, we will use all 50 cores This is important for us at Microsoft because this is used on things like chain servers, that run on 64 core machines, 128 core machines, so it was an important aspect for us to make sure that we could use significant amount of that machine when we were recording And this is used a lot in escalation cases When you get to escalation engineers at Microsoft that are working with you and your company or your consumer side And you get to people that have to debug your machine, this tool is used often Having quoting numbers above the 50% of the cases So The tool supports shared memory and a synchronous IO Now, different with the RR project

And we have the ability to query the trace And again, Ken will do a demo of that, which we think it’s an interesting approach to debugging If you take one thing out of here is these tools run in three steps You record the execution, you index the execution to debug it as it automatically for you as you open the trace for the first time, and then you replay, you consume the trace in the debugger We have done a few block post We will do more There’s a bunch of documents There’s how to download this, you have here a link to go and play with it How we use this at Microsoft How we use time travel debugging at Microsoft We use it to resolve problems from customers And we use it to resolve problems internally There are teams at Microsoft, in Windows, which is the organization we are in, that will not ask you for a repro of a bug, they will ask you for a trace of a bug A trace is essentially one of our recordings So you don’t give them, do these 15 steps to repro, and then you get bug, it doesn’t repro my bugs Which we all have used as an excuse at one time or two or three, a few times You get to trace them, and that also, bugs that have traces on them have a much higher chance of getting fixed than bugs that don’t have traces with them This is awesome for bugs that are difficult to repro If you have captured the repro on a trace, you have it, you don’t need anything else Corrupted memory, we will see that on an example Race conditions, resource leaks, it works very well at these type of problems How to get it It’s part of the WinDbg Preview tool set And here you have a link on the post that has help install it, a video of how to use it, channel nine, there’s a sticker on one of the sides of the laptop for that But instead of explaining a lot of, you know, what it is and how to do it, let’s go for demos James – Thank you, Jordi So, yes, some things are much easier to show than to talk about, so I’m going to walk through two demos of using time travel debugging In the first example, I’m going to show how you can record an execution of a process, and then take the trace that it’s produced, and actually debug it in the debugger So the first thing that you’re going to need to do is open WinDbg as an administrator So you have to have it open as an administrator in order to record a process If you’ve never used WinDbg before, this is it If you have used WinDbg before, this may look a little different than what you’re used to So this is a few weeks ago, we released the first preview of this new WinDbg UI So it’s, we’re trying to make some major user interface improvements, it should hopefully be a lot easier to use But it uses the same debugger engine under the hood, so if you’re familiar with, you know, a bunch of WinDbg commands, all of that works, you know, just as well as it did in the old one We’re really happy that we made some big steps forward with WinDbg here, and now that we’re adding reverse debugging, we’re looking forward to making a lot of steps backward as well So, what I’m going to do, is I’m going to go to the file menu, and we have two options here That support time travel debugging So we can either launch a new program, or we can attach to a running program So I want to run my own little test program So I’m going to pick launch executable And I’m going to type in the name of my program Which I’ve conveniently named test.exe Then I’m going to check this record process with time travel debugging Now when I click OK, what’s going to happen is the debugger will run this program, and it will be recording everything that it does It won’t, the debugger won’t be attached to the program, it’s just going to be letting it run normally with the recording active So I’ll click OK And we’ll see the program start up, and it’s just a little test program, so it prints out some strings, and then returns I’m gonna open the source file so we can take a look at it And so we can see here just this very simple program in our main function, we have a vector of strings We iterate over the first 10 letters of the alphabet, we create a string for each letter, we print it so we have something interesting to look at while we’re running the program, we add the string to the vector, and then we sleep for a little while so that the program actually takes a little time to run and doesn’t exit immediately The first thing to note is that when you open a trace, it starts off at the very beginning of where you started recording, so here we can see, we’re very early in the execution of the process Now, I want to get us into the main function, so I’m going to add a break point at the end of the loop, and I’m just going to click go And so what the debugger has now done, is it’s replayed execution of the program to the first point that it hits that break point So we can see, in the stack window, we’re inside of the main function, which is where the break point is, and we can see in the locals window, we can actually see all of the local variables, just like we would if we were debugging a live program, so we can see the current characters A, in the first iteration of this loop, and there’s one string inside of the vector So if we go, you know, a couple more times, we can see it advances through the program and it just keeps hitting that debug, or that

break point over and over again So there’s a few things to point out here The first is that so all of the visualizations, so for example, WinDbg uses the same Natvis visualizers that Visual Studio does, so you get all of the rich, you know, ability to inspect STL containers here I mean, this is a vector of strings Looking at that just in raw memory can be rather tedious and painful The other thing is, that all of the debugger commands are used, too, with live process, generally work So for example, we have a G command to go forward one step or one, that will run the program, and you can see that we’ve advanced again to the next time that that break point is hit In general, most of the commands you’re used to using will work, the one set of things you can’t do is you can’t do anything that would change the state of the program because, again, the program’s not executing right now, we’re just replaying what already happened Finally, I wanted to note here that you’ll notice that you don’t actually see the console window So when we ran the program originally, you could see the console window printing out the text The reason for that is when we replay the program, or when we replay the trace, everything is contained within the debugger So we don’t actually do the side effects that the program had So for example, if it did any IO on the disk or on the network, we’re not going to actually go back and try and read those files or do that network IO again, which is very important Additionally, if while you are running your program and tracing it, your computer lit on fire, and you somehow salvaged the trace, don’t worry, if you run it in the debugger, your computer isn’t going to, you know, catch on fire a second time So, all of that is well and good But this is a reverse debugger, so we also want to be able to, well, reverse execution So you can see here on the ribbon, we have the normal flow control buttons to go and step into, and we also have, for time travel debugging traces, reverse flow control buttons So we can go back or step into back So if I click the go back, what it will do is it will be as if it replays the program backwards and it will stop wherever the previous stop was So in this case, it will stop the previous time that this break point was hit So if I click that, you’ll see, we’re still at the break point, we’ve just hit at the previous time that it was hit So in the locals window, you can see instead of the current letter being D, it’s now C There’s only three elements in the vector, whereas there was four before So we have a G command in the command window that would let the program go forward There’s an equivalent G minus that goes backward So if I run that, we’re now, again, in the previous iteration of the loop If you’re familiar with using the P and T commands for stepping, there’s also P minus and T minus commands for stepping in reverse So those are the basics of how time travel debugging works in the debugger But it’s a lot more interesting if we look at how to actually debug a real bug So to that end, I thought since we were the first break out session, I would welcome you all to the conference So I wrote a little program to print out a greeting for you So we can see here it says, “Hello everyone, “and welcome to CPPCon 2017, enjoy the conference.” So consider yourself welcomed Oh It also seems to have crashed That’s kind of unfortunate So, thankfully, I have a crash dump for this that we can take a look at So I’m going to open up that crash dump, and we’ll see if we can figure out what went wrong So I’ll open the crash dump that we had for that And I’ll look at the stack window, because hopefully it will show us where the problem occurred and I see we’re inside a bunch of exception handling and then I see here, that doesn’t look like any of our code And nope, can’t, hmm So there’s not a lot of information here So I also, I saw this bug happen this morning, so I also recorded a trace So I’m gonna go and open the trace and see if we can debug it a little easier So this is the trace I recorded earlier You can open it, there’s that open trace option on the file menu And again, when we open the trace, it starts off at the beginning of the trace So one thing that you can do, if you want to move to another location, we have a bang TT command So I can run bang TT And one option is you can pass it a percentage So I’m going to pass 100%, which will move it to the end of the program And what I wanted to point out here is we’re, you know, this shows you we’re in, almost the exact, we’re basically the same place we were in in the crash dump, so again, the program was crashing, so you know, this is where we would expect the end of the trace to be, as well And I can move back to the beginning using bang TT with zero to go back to zero percent So since the program crashed, I’m going to assume that it’s because an exception occurred, and the exception was not handled in the program I don’t know if it was a C++ exception or something else, but chances are, it was an exception And so I want to find where, during execution of the program, that exception occurred And to do that, I’m going to use a feature of the debugger, a relatively new feature called

the debugger data model So the data model provides a uniform and simple way to get all sorts of information about the thing that you’re debugging In this case, the trace So I’m going to use the DX command, which is used to query the data model And I’m going to look at the current process Which is kind of one of the root-ish things that you can look at So in here, we can see the name of the program while we were tracing it is greeting.exe That was the name of my test program We can see the process ID that it had while it was running And then we can see a bunch of other information that we can expand Now, the information we’re looking for is time travel debugging civic, so I’m going to expand this TTD node here And then we’re looking for an exception event, so I’m going to expand the events list So the events list will have an event in it for every time that a DLL is loaded or unloaded in the process It will have an event for any time a thread was created or terminated And then it will have en event for every exception that was raised So if we see here, there’s one exception in the list, so that’s probably what we’re looking for So I’m going to click that entry, and then click exception to see the exception specific properties of it So we can see here, it was a hardware exception, so it was something that was actually raised by the CPU, it wasn’t a C++ exception or some other kind of software exception We can see the program counter, where the exception occurred We can see the exception code, so in this case, I know this is the exception code for an access violation So it’s likely that our program either tried to read, write, or execute some memory we didn’t have proper access to But most importantly, it also has this position field, which tells us the exact location during execution of the program where the exception occurred So what I can do is I can click this time travel link, which will actually take us directly to that point So that’s moved the debugger, so we’re now looking at the exact point where the exception was raised during execution of the program Now we can see here, in the stack window, unfortunately, it still doesn’t tell us where it happened We can see that instruction pointer, which was that instruction pointer where the exception occurred, but the debugger hasn’t been able to decode it It hasn’t been able to tell us, like, what function it was in or even just what DLL that was in So that’s not particularly useful Additionally, I noticed that the stack pointer here, and the base pointer, are pointing to two completely different locations in memory So between these two things, I’m kind of guessing this was a stack corruption So someone overran a buffer on the stack, or otherwise corrupted something on the stack, and then when they returned from the function, that ended up corrupting the CPU state with an invalid instruction pointer and base address So you know, from our crash dump AA, we wouldn’t have even got to this point Like, it would have been very difficult for us to figure out what was going on here Now, if we had a live repro, it might be inconsistent, it could be very difficult to, you know, to figure out from that We don’t have either of those, we have a trace We know exactly what the program did, and so we can just walk backwards and see what happened So in this case, I want to get back to before the CPU state was corrupted So before that return statement, you know, loaded that invalid instruction pointer so that we can look at kind of get an idea of where the exception happened And I think that happened recently, so I’m just going to step back a few times until we get a reasonable looking context So I’ll step back once, and that didn’t change, so I’m going to step back a second time And you can see now, we’ve stepped backwards The next instruction that the debugger would execute is a return instruction, so we’ve stepped back to before the return And so we see now that the instruction pointer, the debugger now knows where we are We’re inside of the main function The base pointer is still incorrect, though, because we haven’t stepped back to before that was corrupted, so I’m just gonna step back once more And now we can see most of the stack, and we can see we’re on the last line of the main function Now, we still can’t see our caller, and the reason for that is that we’ve stepped back to before the CPU state was corrupted, but we still haven’t found where the stack itself was corrupted, so the return address is still incorrect Now, I’m just curious, so I’m gonna, now that we have our base pointer, I’m curious what’s there on the stack Now we’d expect there to be two pointers there, we’d expect the instruction pointer to be there, and we’d expect the previous frames base pointer So I’m gonna open up the memory window and just copy that address in, take a look at what’s there That does not look like pointers That looks like the end of the string I printed out earlier And if we scroll up a little bit, we can see, yes, there’s the rest of the string, so definitely, we’ve corrupted the stack somewhere So the trick now is to figure out, where did that happen? And so what I’m going to do is I’m going to use a break point and then run the program backwards and have that break point stop on the exact point where we corrupted the stack So to do that, I’m going to use this BA command, which is break on access I’m interested only in rights to the for byte location Starting at that base address So basically, whoever writes to that pointer, this will stop And at that point, all I have to do is click go back,

and we’ve now stopped at the exact instruction that corrupted that pointer And we can see, so it’s inside of the string copy function, which is expected, we are copying, you know some string into this buffer on the stack, I expect the bug isn’t in the string copy function, I used to work on the C runtime, and I’m pretty sure we didn’t have bugs in that particular C runtime function So I’m gonna look in this get CPP greeting function, which is the first function in our program on the stack So we can see in here, this function just takes a wide character buffer, and the size of the buffer It has the welcome message And then it copies that message into the buffer We can see in the locals window, we can see the message points to the complete message We can see that the buffer has only part of the message and the reason is that we’ve stopped part way through the copying of the message into the buffer, and then we can see that the size of the buffer is allegedly 100 characters, or hex 64 So I’m going to guess that that’s not correct, given that we’ve overrun the actual buffer, so I’m going to look at our caller and see what it passed in So we can see here, the buffer actually only has 50 characters Well, there’s our problem And the problem is, well, I accidentally passed size of greeting instead of greeting.size, so I passed the number of bytes instead of the number of characters, so whoops I’m gonna blame my code reviewers But actually here, the, so with this little test program, it’s only 20 lines of code Like we could have easily gone and just looked at the code and diagnosed this and figured out what was going on, but in a large program, you know, debugging this kind of thing, especially if it’s a hard to repro bug, or if it only occurs occasionally, or if you’re just getting crash dumps from the field, this can be extremely tedious and difficult to debug So time travel debugging, you know, works just as well with large programs as it does with small ones And Jordi will be showing an example in a minute of debugging a large program So So in our first demo, I showed how you can actually record process using the debugger I showed how you can step forward and backward, that you can use all of the same commands you’re used to using with debugging live processes With the exception of things that modify the state of the program And the second example I showed you this debugger data model that provides you a very rich way of querying information about the process, and you’ll see more of that later in our upcoming demos I showed you how you can find events easily So we were very quickly able to find the exception and navigate right to that point during the execution of the program And then I finally showed how you can use memory break points to quickly find who last modified something, which is often the thing that we do most with these traces, right? Like, you see some data, and you say, hmm, how did that get there? Well, this makes it very easy to go and find that Right, you don’t have to go hunting around and see all the places that might write and try and divine where they came from You can actually go directly to the location where that right happened and find it directly So with that, Jordi is going to show a much more, I guess, large example – Thank you, James So we wanted to show you an example with production code, code is used normally, so we picked the Chakra engine This is the Javascript engine that is used in Edge on IE And I need an excuse to put the console debugger here since I’m a console debugger user so we are going to demo the check range inside of the debugger The debugger is scriptable in Javascript And we are going to just look at a bug in Chakra that way So what we did was inject a bug in the Chakra engine that is not that different from the bugs that happen normally in production, except it has a nice comment next to it when we find that it will be obvious that we are there Click So With that, I’m going to launch the debugger Conveniently in the debuggers folder And I’m going to run this live First So When the debugger comes up, I’m going to just, here’s the console debugger, the same debugger we were using, in the console, and I’m going to load the Javascript engine that has the bug injected in it, so let me put the backslash here just in case It loads it And then I’m going to run in a script, Javascript, that all it does is allocate some memory, use some memory and enforce some garbage collection So it just puts some pressure on the garbage collector It really doesn’t do much more than that And OK It has loaded the script and it has secured it second time, OK, this third time, it crashed and if we looked at it, it’s an access violation

We see here that it’s trying to read from this awful memory location here which is minus two, probably not the right address, and we see in a stack that is quite intimidating In fact, OK, this is memory, memory recycler Realloc If I have to figure out where I got this minus two from this caller stack, it’s going to be painful In particular, garbage collection bugs have the root cause usually at the distance from the side effect In fact, if you took a dump from this and looked for this minus two pattern, you will actually not find it that much into the trace Into the dump It’s not there So that makes things a little bit harder Let’s repeat this case now with recording a trace So what I’m going to do is do the same thing, I’m just going to record it And when I launch it, what you will see is that there is some overhead with the recording, in fact, the debugger has not yet come up If I have time at the end, I will show you a case where things are a lot of overhead Still usable, but a lot of overhead Let me do the same things as before GSprovider.dll Load the guy that has the bug Let’s run the script that we were running before And you can see, it’s a little bit slower and despite this, as I said, this is used more than half the cases situation engineers, and that was before the performance improvements that we did in the consumption side OK, so It crashed like the other one It’s indexing, so for expediency, I’m going to debug the same thing that I have recorded on an unoptimized trace There is an upcoming channel nine video, I think comes out next week, it’s in post production, where we debug this both optimized and unoptimized, so you can see that it works on both cases This is a trace of the recording, unoptimized, so it makes more C++ debugging And James already showed us that we can look at events to find the exception So I’m going to do it differently, which is just, tell the debugger to go And it will stop at the exception You know? First chance exception, this exception may be handled OK To show to you guys a little bit of how this also works the other direction, is let me go to the end of the trace And go backwards G minus, and we are stopped in the same spot So that’s a little bit of how it works, forwards and backwards, everything works normally As before, we have this wonderful minus two here In fact, it’s the disk pointer here, minus two Let’s go back a little bit and see if I can figure out where this value comes from So T minus to go back or if you prefer, press the button in the UI Now, here we have the C++ expression, memblock, ooh, memblock is also minus two So, that DX command that James was talking to us about, can be used to also look at the local variables, memblock And it has a C++ indexed parser, so the ampersand sign in C++ gives us the address The ampersand sign will give us the address So this is the address of that minus two value So I’m going to do the same thing that James was doing Put the break point, eight bytes, BA, when someone writes these eight bytes on this address, and I’m going to ask the debugger to go backwards What this does is it stops at the spot where we wrote the value Now, memory break points, when you are debugging on a processor, stop at the instruction after the action took place, is the same here, so this instruction after the right happened If I do want to step backwards, now we are at the instruction where this thing happened Which is memblock, will get the value assigned, and in fact, you see memblock has the previous values still here I don’t see free object list, so let me explain this pointer, oh, this pointer members, so fine DX Free Object list, this would work, or DX this free object list Same thing And as before, an ampersand will get me the address So now, I can do the same thing as before Put the breakpoint on this address that I have right here And as the debugger to go backwards Now, a couple of observations The address I just was looking at, this address, is no longer of interest to me This is the value that now has the previous values No longer the minus two, so I’m going to disable this And the other thing is, the debugger, when it’s running live, has a limit of how many hardware break points,

how many access break points you can put Intel processors tends to be four Other processors may be different With debugging a trace, you have no limit You can put 100 of these break points, and it will just work So that’s another advantage of debugging a tace, and another reason why it’s used often inside of the Microsoft As before, we stopped, now I stopped on the instruction after, so let me go one step backwards And I see this here I look at memblock It has an address I see free object list And if I, OK, has the new value The old value, not the new value Let me step into the function This is the same function as before Let me look at the values here It’s minus one Not minus two But if I look at this piece of code here, in fact, if we go back, we can see it That’s how the minus two came to be OK So now, the question is, who put this minus one here? Let me copy this address Let me step backwards Let me put the, sorry, let me put the breakpoint and type correctly that the term, the breakpoints, let me remove the last one I checked because it’s no longer of interest And go backwards And now here, I see the value minus one Which looks promising And I see this demo introduced a bug, so that’s the bug that we’re looking for And to be clear, I am not familiar with this code base, I have only used it to prepare for this demo The person that did the Javascript integration on the debugger in our team is not me, someone else, and we made something that was hard and difficult, quite easy to do It was mechanical It just followed the recipe And it’s not always that easy, but it often is easy to debug bugs this way Excuse me Now, if we had to do this normally, for those that have enough memory, this stack looks nothing like the stack that we had on the crash In fact, this crash happens over 100,000,000 instructions before the root cause So this is not going to be something that you can do manually by hand, the old fashioned way So Let me stop debugging and we go back to the slides Let’s take a look What we have seen is that there is overhead when we record You get a lot of value out of it, but there’s some price to pay I said, inside Microsoft, we use this a lot, so we find the price reasonable And it’s something we will be working on If we have time at the end, I will show another scenario where it’s much more clear, there is overhead Hard problems become easy I said, I have not seen this code base before, I just followed a mechanical recipe of just follow the bread crumbs backwards ’til you find the place where it does not look like the right thing to do And the other thing is, I had one trace to do I did not have to restart the debugger over and over every time I learned something Where this came from, since it’s backwards, just go and find it And so, with that Ken (applauding) – Thanks, Jordi, that was pretty awesome I like that OK, so OK So what we’ve seen so far is a pretty familiar debugging experience, setting break points, looking at the stack, executing forwards Now executing backwards But time travel debugging allows us to look at code execution another way It lets us look at it as a queryable data source So what do I mean by that? Imagine you had a sequel database table where each row was a point in time in your execution and that there was a set of columns that represented the register state of your processor and the contents of your address space That would be a really big table, but you could imagine doing queries that would find specific call patterns, memory access patterns, things like that What we have done is we have enabled this kind of querying in time travel debugging Another kind of way to think about it is if you’ve used the system tool’s process monitor, right, you have a work flow of capturing activity in a file, then you go to the filter, and you narrow it down to your specific problem that you’re interested in And we are trying to bring that kind of power to debugging your programs So to illustrate that, we took a Windows SDK sample for a media player and we introduced a bug into it, and so let’s see what that looks like OK, so imagine it’s four PM on a Friday, you’re getting ready to go home, and you get an e-mail like this one I was trying to use your program,

and this is what I get And we see we have the dubious error message that operation completed successfully So how many people have seen this one in the wild? Yeah, excellent, excellent So, all right, well, let’s see if we can use time travel debugging to solve this quickly so we can get out of the office So We are going to go here So You know, I could have asked Jordi for repro steps, but he didn’t remember what he did exactly, or I could sit in this office and have him hover over me while I try to debug it Which is kind of intimidating, so I just asked him, please send me a trace file And so he sent me a trace file And so let’s open it up And see what it looks like And let’s, actually, let’s stop debugging Just to make sure And we’ll open it up And we’ll close some of these old files there, great So We’re sitting at the start of the trace And I’m gonna start off by making assumption that Jordi probably stopped recording after he saw the message box come up, so I’m just gonna push the go button And go to the end of the trace OK And so we look down at the stack and we see that the message box is still in the stack there So our theory was right And again, this emphasizes the fact that, you know, when you have your repro captured, and there’s really no cost to like going past to where you want to execute, you can just test theories, and if they don’t work out, you can try something else So in this case, the theory was right, so let’s take a look at what this code is doing when it calls message box So, I’ll scroll up a little bit And so, this is fairly typical code It calls get last error to find out what the error code was It formats it using format message and then it displays it using message box Now, the fact that it said operation completed successfully, that likely means the error code was zero We can prove that by using a standard WinDbg command GLE for get last error Yep, there it is, operation completed successfully OK, so Let’s see how queries can help us find this problem So the first thing, so our basic idea is that in Win32, a common pattern is, when APIs fail, they return false And if you want more information, you have to call get last error to get the information More detailed error code Now, if you forget to call get last error soon enough, and you call some other Win32 API, it usually gets, the error code gets set back to zero, and you end up with lovely message boxes like that So, what we’re looking for is a error code that was returned that was non zero before message box was called So let’s use queries and see what that looks like So for now, I’m going to slide this over, so we have plenty of room And I’m going to clear the console window So we access queries through the debugger data model that we talked about earlier Yeah, through the DX command And what we’re gonna do is we’re going to capture all of the calls to get last error and to message box And I’m going to assign it to a variable so that we can do some other things to it after it’s ready So, we’re gonna go to the current session Use the tab key to have Intellisense have us find some of this And I know kernel base has get last error And user 32 has message box W on the stack And we hit enter And it runs for a few seconds and it has, it returns some results OK, so that doesn’t look like very interesting results, so let’s just clear the screen and display it as a table And to do that, we use dash G for the grid command and I’m going to use my previous variable And we can see this time that the results came up

much quicker, so it caches query results in memory as you execute them To save time OK So here’s our table of information And one of the things I notice is look at all these zeroes A lot of the times, there’s no error, but code is continually calling get last error, getting a zero, and so, that’s a lot of noise We like to get that out of the way So let’s filter out the zeroes So, I’m gonna clear the screen again And use the up arrow key to bring up a previous query, and we’ll add to this And so, here we’re going to use a link query and this is link with a Q, not a K, this is not the linker And link is a query technology that’s common in C sharp And we have more information about it on our blog, if you haven’t seen it before But what we’re gonna do is use the tab key, we’re gonna do a where clause, and we’re gonna limit what we see The where clause talks a land, so I’m just gonna use X to represent one particular call, and say OK, I want to keep anything where the function name contains message box because I notice that message box return zero, we don’t want to filter that out Or The return value is not equal to zero, OK? So we’ll do that OK, now when we scroll up, we don’t see any zero error codes, that’s nice We actually filtered out about 90% of the calls in this particular trace file So now, the other, going back to what our goal was, our goal was to find an error code that was set before message box was called And so what we need to do is sort these things by the time that they occurred in OK, so I’m gonna clear the screen, up arrow again So we can just refine things over time here And I’m gonna use an order by Which is another link method that also takes alanda And the land that tells us which field we want to sort on, so I’m gonna sort on the time start field OK And I also notice that down here, if you get, it shows 100 results by default, and we see that there’s some ellipses here And so I’m gonna make one more refinement to this Which is I’m gonna add a comma on the end, and I’m just gonna say, give me like 1,000 results OK, so now I have all of my results And now let’s see if we can find the message box OK, so I found the message box call, and we see that before that, the first non zero error code was a two That’s an interesting error code Let’s go time travel there and take a look at what we got So we click on that and it shows us the time travel thing here And here is my scroll Oh, there it is Empty scroll bar OK, and we click on time travel, and it takes us to the time travel position And we see our error completed successfully And let’s go down, let’s see here Click on the right one? I think I missed the right one here Clicking can be challenging at times Scroll up to message box OK And we go to on file open And our file name Zero there, that’s not good Oh, I forgot the time, thank you, thank you OK All right This looks better So we do get last error and there’s our system cannot find the file specified We see we’re calling into the media player

Let’s go to open file And we see that the file name was passed in was CPPConDemo.txt That’s not a very good media player option there, Jordi OK, so problem solved (applauding) – OK Going back to the slides What we have learned is I don’t know how to open files Send me a trace not a repro, as we said before that’s something that works very well The power of queries and then a little bit of C Sharp at CPPCon, no? Link So What’s next? What are we planning for the future? Improve our recording experience We talked, there is some overhead We have done work from the days that the research set out to prove that this was doable to shipping on the performance on the replay side On a couple of orders of magnitude improvement And the reason is, people record for five minutes and they replay for four, five hours So, cutting an hour out of those four was a bigger win than cutting a minute out of five Very powerful query support We want to invest in that space heavily We want to do dynamic code analysis We have a trace that explains everything that happened, so we want to use it to find leaks, or to find double freeze, or user after free, or this kind of stuff And then, we want to add kernel level tracing support to these tools so you can do this on a driver that you are developing for Windows or whatever it happens to be If you want to know how we are going to do some of these things, there’s a clue at the bottom right hand side corner of the slide And with that, I’m going to move to questions that I didn’t save Oh Do you remember the questions? – Jordi, yeah You forgot to save them again, didn’t you? – Yes – Yeah Well, this happened a fair amount while we were practicing, so I didn’t tell you this, but I secretly recorded the note pass session when you were typing the questions so that we have a hope of getting them back, so may I? – Please – OK OK All right, I hope this works Let’s see here OK Alt tab OK, so we’ll stop debugging here And Open trace file From Documents Sorry Down here In documents TTD traces Ah, here we are So before I open this, I just wanted to point out briefly, you can see that when we launched Notepad, you might have noticed it took a little while to launch, but the typing speed was pretty interactive and it produced a file that’s roughly 200 megabytes That’s not so bad It probably contains on the order of 200,000,000 instructions Something like that So let’s open the file And So what I did was I made a little Javascript file that’s going to try to dig out these things, so Go into here And we’ll open up the Javascript Great And let’s give ourselves some room to see the code here So, this is a feature of WinDbg that you can actually write Javascript and do imperative things like executing instructions immediately, or you can actually extend the data model that we’ve been talking about So invoke script is like main in C++ And debug logs, just like print app So the basic idea is Notepad uses a window edit control, and the edit control holds a piece of locally alloced memory And so, if Notepad wanted to load or save text from what you’re typing, it would have to be able to get at that memory And in Windows, the way you get at that memory is through a message, so I’ll briefly show you, you know, I have some code that finds the text handle Remembers what position it’s at, so it can restore to that point, when it’s done

And it finds the text locations where the text was updated and it does some simple things like appending to a vector, finding unique strings, and prints them out And in particular, wanted to show you that you can do queries inside of Javascript, as well, and it looks very similar to DX Here we start with the host current session And we go to TTD calls Look for send message And the edit handle message Which is parameter one Of send message And then we do another query to find the memory and hopefully, if we all got it right, bring our Bring that over And it says here are the unique lines from the Notepad session This looks like the questions we asked at the beginning So there you go, Jordi – Thank you (applauding) So, before answering the questions, let’s go back to the slides for a moment And let’s see what we learned on this Ken saves the day again The debugger is scriptable and now we did C Sharp at the CPPCon and now we’re doing JavaScript at the CPPCon, so not only C++ is not the only language we can use Now, going back to the questions How does this differ from RR? Single core versus multicore Different characteristics on the performance overhead during record Depends on the type of program, on the number of cores you have on your machine, one will be better than the other and the overhead that you see But the basic principles are the same On how this works Is this Windows only or which version of Windows? Actually, this is on the WinDbg preview So you actually have to have an up to date version of Windows 10, this will be available on the kits, and some of the restrictions will be lifted, but right now, it’s an up to date version of Windows 10, OK? And with that, now for real, let’s open the floor for questions and you know – I don’t think you ever said how to perform a trace, unless I missed it – Since, let me show, and what I’m going to do is use this moment to also show a case where our overhead is significant so that you guys see that sometimes, in Notepad you saw us type, and that came fine, or even the debugger you saw use this So if I’m going to launch, actually, let me launch Notepad this way, no, sorry, from here, launch executable, click this checkbox, record process with time travel debugging, and that will launch, in this case, the debugger If I do something as uninteresting as this And click on the record, it should launch Notepad And now typing works normally, however, if I want to show the open file dialogue, this will take a while And the reason is this dialogue does a ton of work This dialogue launches 14 threads, calculates icons and thumbnails, it actually, you know, it still usable but there is some overhead sometimes and as we said, that’s– – Is there? Does this require any compiler or language support? – None This will require, in fact, it’s one of the conditions we had, since we support code that sometimes we haven’t returned at Microsoft, we don’t have symbols for it, or we don’t have anything, this works just fine – [Man] Yeah, so that’s just like the normal Notepad that’s like installed with Windows, it’s no special modifications to it – No special modification The only thing we did is we created a new start menu shortcut to launch it under the recorder, but that was it – So you don’t need any symbols to do the recording? I mean to do the playback, I guess – No, you need symbols if you want to debug with symbols, but I have debug without symbols just fine with this And this is a normal machine with public symbols of Microsoft, so we have the public symbols of Notepad Sorry, there was a question on the back, in the middle – [Man] Yeah, actually if people could go to the microphones, that would be easiest, because then we don’t have to keep repeating questions – In fact, if you don’t mind, I’m going to change my glasses so I see you – [Man] What size of files are the traces? Like how much resources does that take? – So The size of the trace varies by the number of instructions you execute The more instructions you execute, the better What we do is when we start recording, we take a dump of oops A dump of the entire process, so that is, and then on a steady record, we do a bit per instruction, OK? But it will vary The open file dialogue is actually disastrous for us But there are other places where we can do much less than that But that’s approximately the case The overhead is mainly recorders is some overhead,

replay will be faster than live debugging, some cases not On the overhead of the recorder is memory intensive right now, we have work on the record side, as we said, to improve that significantly Thank you – [Man] OK You mentioned that you can’t modify memory on the replay Can you still execute code? So can I pull a function that doesn’t load my memory? – [James] No (laughing) – That was a fast answer – [Man] I have actually two questions One of them, can I attach the recorder while the process is executing? – So, a few notes You can attach to anything that you can debug, pretty much, there a couple of corner cases, but in general, if you can debug it, you can attach to it, or you can launch it You can attach to UW processes You cannot launch them under the traces just yet The other thing to note is when you stop, we terminate the process OK, that’s relevant, we have a couple of other work items to remove that, but that’s– – [Man] Thank you And in general, I also had a question, what are the limitations so that we know what to not expect? – If you cannot debug the process, we have actually done work to avoid us recording things that you cannot debug That means sometimes if you have anti-virus software on your machine, that logs process right across, we will not work, you will have to disable anti-virus software, if that makes any sense – And also, I would like to add If you’re using public symbols with queries right now, it doesn’t resolve the names, it’ll just say unknown symbols, so it’s another thing we need to fix If, you know, your own code, where you have full PDB information, it’s no problem, but like, pulling out kernel bases named there today requires private symbols, but we’ll fix that soon – [Man] Awesome, thank you – [Man] You showed the time travel on launch of application, I want to know what the impact is when you run it on huge scale, in application Like in my case, there’s an application that sometimes runs for 10 minutes or an hour before it has problems, and takes 10 to 20 gigabytes of memory What is the most efficient way to work with it? And how, what about size, performance? – That will be a long question to answer We’ll be at the table outside to answer it I have recorded this for hours, but the traces come back really, really, really big Just to set expectations We use it, for us, on chain servers, which are 64 gigs of ram, fully packed, and it works, but the overhead is measurable You will see it with the naked eye The overhead, just to set expectations Since we still have time, but if not, if you see people with these shirts, you can ask them questions, as well Just putting up here other Microsoft docs so that you guys know where they are Please – I have two questions First of them, is it possible to record 32 seconds of application? Last time the application working before crash – You need to know how to start recording, but for example, if you touch the debugger and you have a particular event and the debugger, then you work to suspend the process and there are things to do there It’s an area where we will making things easier for you – [Man] Situation or I start test agents under debug and when test crashed sometimes I need the last seconds to understand what’s happened – You still need to start recorder yourself But you can start it Because you cannot touch, you can start the recorder when you see fit – Second question was about administrative rights Is it possible to not use administrative rights because test agents must be clean – Not at this moment in time We have been talking inside of Microsoft on how to remove that requirement, but right now, we rely on a couple of APIs that are admin rights We are working through that – OK, thanks – [Man] Hi there Are there commands to integrate to Visual Studio? – So we are working with the Visual Studio team on the analyzing the product and how it would integrate If you have, if you would like to give them feedback, e-mail [email protected] and give them your thoughts, and they will be happy to hear about that To be clear, they expect someone to e-mail them if you want to do that, so please do – [Man] Visual Studio If you want to see this happen,

and you send mail, that increases the chance of that happening – Sorry, for those – If people want this, there’s a lot more leverage than just an idea from the likes of the TTD team (laughing) – For those that did not hear the person without the microphone, if you send an e-mail, it will really increase the chances of this happening, so please do if you want to see this happen – Thank you – Thank you, Andrew – Hello, once– – Sorry, we’re out of time, I’m happy to answer questions, but just to be clear, we’re out of time