Carl's Notes
Life complexity navigation algorithms

Hunting down a memory corruption bug in Go

In the little spare time I have available, I’ve picked up a new programming hobby. I’m working on an app that can bring a more modern UX to GnuCash’s otherwise very solid model. It’s a Go app that uses the Wails framework to create a UI based on HTML/CSS/Javascript-via-Svelte. It uses the native OS web component instead of packaging an entire browser like Electron does.

I had come to the point where I could load the account tree and all transactions straight from GnuCash’s XML file format which, I must point out, was an absolute breeze using Go’s struct tags. Some basic Svelte components were showing the accounts as a tree structure on the left and the transactions list for the selected account on the right.

That’s when I ran into a mysterious bug.

GnuCash stores the transaction values as fractions. So instead of storing “-57.25”, it stores “-5725100”. Since I’m still very much prototyping, haven’t settled on a data format for storing currency/commodity amounts yet and were not interested in writing even a simple parser at this point, I just settled on running those fraction-like values through the ever-so-evil JavaScript eval() function to display the actual transaction value as a number. Good enough for now, and lets me focus on the overall presentation for now.

Except that I started noticing that sometimes, randomly it seemed, eval() would choke on the input. It would always happen eventually, and always with a different transaction.

I discovered that the string containing the fraction would contain a NULL byte, causing the error. A NULL byte in a JavaScript string? At random? Seriously?

I kept working on the presentation of the transactions view when what turned out to be the same bug made Svelte’s templating mechanism choke on the transaction description strings. More NULL bytes in those JavaScript strings.

🤔

It was starting to impede me a bit too much. Wails uses Vite to hot-reload Svelte components when they are changed, but if the program keeps crashing after 2-3 changes, it’s not much of a help.

So I turned my attention to this bug. Little did I know what rabbit hole it would turn out to be.

So the JavaScript backend was manipulating strings with NULL bytes in them, appearing in different places seemingly randomly. I turned to my Go backend, where all the data originates from and which loads the transactions from XML.

A simple check of the data coming in from the XML parser showed me that those strings were fine. Could it be the mechanism that marshals the strings between Go and JavaScript?

Wails provides this nifty service of making select Go functions and methods available to JavaScript. Simply call it, wait for the promise to deliver its result, and display. Simple and elegant. Was this mechanism somehow causing issues?

In the Go function responsible for filtering and returning the transactions for the selected account, I added a quick check while iterating over the transactions. It would look at every description and value string and check for NULL bytes, panic()-ing if it found one. I fired up the app, let it load the account that had the most transactions… but no panic().

At least not immediately. Being a bit flustered by the absence of NULL bytes when I was looking for them, I resumed my work on presenting the transactions, only to suddenly get panic()s starting to occur after a few minutes of back-and-forth updating the Svelte components.

At this point I should note that I still have very little experience with all of the technologies I’m using. The most complex Go program I’ve written so far is the web server serving this page, and it’s a single source file. My HTML/CSS/JavaScript skills are basic at best, and I’ve never used Wails, or Svelte, or Vite before. I do however have plenty of experience on large C++ codebases and I’ve seen my share of weird bugs related to faulty memory management.

The Wails community has a Discord server where the maintainers are active and responsive. I posted about my problem and quickly got some initial replies, including a quick reaction from Lea Anthony, the creator of Wails. Could it be race condition? Some issue with the XML?

I quickly ruled out the race condition. My Go code, still very small at this point, was merely loading the XML data right at startup and never changing it again. The only thing it did after that was walk the structure and package the appropriate objects into a slice to be sent over to the JavaScript frontend. And there were no NULL bytes in the XML, I’d checked that. Besides, this smelled too much like a memory corruption bug. The elusive, mysterious, malicious smell of stale pointers.

Furthermore, I had witnessed other strange behavior. The marshalling code would sometimes start complaining that it would receive invalid JSON–another string that got messed up! And once or twice, I got a good old-fashioned, honest-to-god segfault. There was something very, very wrong going on in my app’s address space. Something, somewhere, was overwriting memory. My memory.

I got a little excited and curious at that point. I’ve had plenty of memory management-related bugs in C and C++, but this was Go, which provides automatic memory management. While it does give you a few ways to shoot yourself in the foot, it’s not being overly generous with those (unlike C++ which will eagerly give a 14-year-old the keys to a fueled up and fully armed fighter jet on an airplane carrier). I decided to spend some time investigating this to see if I could find the root cause, as a way of becoming more familiar with Go’s memory model. Nothing like a good bug to teach you a thing or two about what’s going on under the hood!

Being able to reliably reproduce a bug is probably the most important step in such a situation. It seemed to me that the bug would always appear after I made a change in a Svelte component and switched back to my app to view the result. So I wrote a shell script that would make changes to the Svelte code every 3 seconds, automating what I thought I was doing manually when I observed the issue. I tried various timer values and various types of changes, but no NULL bytes and no panic().

Was I being haunted by a Heisenbug? Like the elusive quantum particles you can’t observe without having some effect on them, was my poking around for it scaring the bug into the shadows?

So Vite’s hot-reload feature didn’t seem at fault here. Was is the switching back and forth between my app and the IDE that was causing this? I started hitting ALT-TAB as if possessed by a multi-tasking daemon. I got a few hits, but not really reliably. It seemed to work better the more often I selected accounts to force a reload of the transactions between ALT-TABs. I still wasn’t sure the code that was marshalling the data structure between Go and JavaScript wasn’t at fault, somehow. What I was sure of was that the panic() always happened when ALT-TAB-ing back to my app. Was it somehow related to the windowing code? But then what was the link to my rather basic and completely independent data structure?

Most memory corruption bugs end up with a segfault and are fairly trivial to debug with a good debugger. I tried running both GDB, which has very limited support for Go programs, and Delve, a similar debugger written explicitly for Go, but after I had spent some time installing both, neither would even raise an eyebrow at my strings being overwritten right under my nose. Mighty helpful, thanks a bunch!

I needed to narrow down this bug’s territory, find its nest. I rewrote my panic trap as a function whose sole purpose was to print the string passed to it and then walk my data structure and panic if it found a NULL byte in a string. I then started sprinkling calls to it throughout the Wails codebase, via a function pointer declared inside a custom-written package in the Wails codebase which I named bughunt. Even though it’s an HTML/CSS/JavaScript frontend, the backend is a rather traditional Win32 app with a message loop, except it’s written in Go. I’ve probably written a few dozen Win32 message-loop apps in my career, including reproducing the famous (if rather trivial) xeyes app on Windows, so I felt right at home. I instrumented many message handlers and various interesting looking Wails functions with pairs of calls to my function, one direct and one deferred, hoping to find an event for which my structure would be intact when entering its handler but messed up once that handler was done.

I eventually ended up finding such a function, but my excitement was short-lived. It turns it out my memory got corrupted inside a call to GetMessage(). Literally, at some random point, merely fetching the next message was corrupting my memory. What in the name of the Holy Pancake was going on?

I was at a bit of a loss. On Windows, Wails uses the native WebView2 component, which is the beating heart of Microsoft’s Edge browser. The control is loaded as a COM component, which probably meant that it was wrapping its own (Win32) window and had its own message handling. Was this component responsible for the memory corruption?

That would be bad news, I thought, since I didn’t have access to the source code.

I was left with hundreds of lines of tracelogs to look at, thanks to my instrumentation (I had instrumented every Win32 API wrapper in the codebase). It had occurred to me that the GetMessage() function that would eventually trigger the memory corruption and the ensuing panic() was always preceded by a WM_KEYDOWN message. There was also a call to GlobalFree() happening inside GetMessage() that would detect the NULL bytes directly when called. Of course, this caught my eye, since that’s the Win32 equivalent of free(), returning memory to the system, but something about this made it feel like a red herring: the memory was already screwed when entering that call, and since I was using defer quite a bit to instrument the various API calls, those GlobalFree() could very well be the result of some stack unwinding during the panic(), and not directly causing the bug.

So back to the keystrokes: I looked up the the raw parameter values and figured out it was the ALT key (wParam=0x12), interpreted as the VK_MENU virtual key. I fired up my app again and instead of pressing ALT-TAB many times, I merely held down the ALT key… and it eventually panic()-ed. Pressing CTRL or one of the arrow keys would also trigger the bug, but not e.g. SPACE or ENTER. There was something weird going on with these special keys.

At that point, another user posted on the same Discord server. They were getting segfaults when pressing arrow keys. That caught my attention. It sounded similar: something going wrong with memory, in a way that shouldn’t really happen in Go, and it involved pressing keys. As that user pointed out, it didn’t happen with all keys, just the arrow keys in their case.

The timing couldn’t have been better, and it gave me another idea. I fired up Spy++, the good old tool that comes with Visual Studio and which can log Win32 messages passed to any window object in the system. I found my own app’s window and, just as I thought, some Edge/Chromium child window that wasn’t from me/Wails but which was handling WM_KEYDOWN events with a parameter value of VK_MENU. I instrumented the Wails code some more, calling EnumChildWindows() to get a list of all handles to windows inside my app, verified that the Edge/Chromium window was always the second handle to land in that list, and then wrote a little goroutine to send WM_KEYDOWN messages to that window every 100 milliseconds. It would send the SHIFT key 450 times and then switch over to the ALT key, printing a little banner just when the switch would occur. My little struct-checking function bug trap would print “.” every time it got called.

var handles []uintptr

xwindows.EnumChildWindows(  // xwindows is "golang.org/x/sys/windows"
  xwindows.HWND(f.mainWindow.Handle()), 
  syscall.NewCallback(
    func(hwnd, lparam uintptr) uintptr {
      fmt.Printf("Found window %X\n", hwnd)
      handles = append(handles, hwnd)
      return 1 // Continue enumeration
    }), nil)

go func() {
  numIterationsBeforeStarting := 450
  for i := 0; true; i++ { // Overflow is OK, this is no production code
    if i < numIterationsBeforeStarting {
      w32.PostMessage(handles[1], w32.WM_KEYDOWN, w32.VK_SHIFT, 0)
    } else {
      if i == numIterationsBeforeStarting {
        fmt.Printf("\n[STARTING VK_MENU]\n")
      }
      w32.PostMessage(handles[1], w32.WM_KEYDOWN, w32.VK_MENU, 0)
    }
    time.Sleep(100 * time.Millisecond)
  }
}()

I also changed my struct-checking bug trap in an interesting way: instead of calling panic(), it would print a message whenever it would find a NULL byte in a string, and then put that string’s address in a map to avoid checking it again. I wanted to see how many strings were affected and how often. With this little setup in place, I could sit back, relax and observe how my app would merrily print hundreds of little dots over roughly 45 seconds, and then the little banner, after which the traces telling me that this or that now contained NULL bytes would start pouring in, with random numbers of dots in between, but also frequently enough that I would suffer for my poor little app being eaten alive from within in front of my eyes.

So who was handling these keypresses? It turns out the WebView2 component can be given a callback that will be called to handle accelerator key events. The Wails code was using this to open the dev tools when pressing Ctrl+Shift+F12.

The code inside that little handler looked innocent enough, so I tried two things. First, I commented out the code, such that the handler did nothing more than returning false and verified that the bug could still be triggered (it could). Second, I commented out the line setting this trivial callback and tried again. This time, no corruption.

So merely setting the callback for accelerator key events was triggering the bug in my app, and likely in that other user’s app as well. At that point, I posted my findings on Discord and went to bed.

It was user stffabi who shortly thereafter figured out what the problem was. Wails uses some clever tricks to be able to call into the WebView2 COM object, using Go’s syscall facilities and passing it Go structs made to look like the original C structs expected by the COM object (presumably, this is done in order to avoid using cgo). However, there was a subtle mistake in the memory layout of the struct used to retrieve the physical state of the pressed key: some fields were declared as bool, which is a single byte in Go, but in Win32 land, BOOL is defined as typedef int BOOL. The COM object was passed an object much smaller than expected, and this is where the right values where put in the wrong place on the stack, overwriting some innocent pointers which just happened to live around that area and which some other code (the caller, probably) would eventually go on to use to write stuff in random locations. My data structure, with its ~1500 transactions with as many description strings and twice as many value strings was just large enough of a net to catch some of these writes. Kudos to @stffabi for spotting the mistake merely by looking at the code.

I’m glad I invested this time. I didn’t make any progress on my app for quite a while, but I learned a fair amount as a result: how to use the Delve debugger, how to use Win32 APIs in Go, how to use a local copy of a package otherwise downloaded into the Go cache, etc.

And now, I’m just happy that I will soon be able to continue using the delightful Wails framework to make progress on my little hobby project.

Read another post