How is that supposed to be helpful?
Anyone who’s used Windows long enough has eventually run into the quintessential Microsoft error, the Blue Screen of Death (BSoD). Not that other operating systems don’t have their own versions; those who have played with their share of Unix flavors know all too well about kernel panics.
Nevertheless, someone usually needs to troubleshoot these, especially if you own the machine. The tricky thing about BSoDs is that they:
- Can be hard to reproduce
- Seldom come with helpful error messages
- Can be driver related OR hardware related
- Don’t always wait for user feedback before disappearing (sometimes the machine will restart right away depending on OS settings).
I recently did a new build at home. Latest gear. Full of promise. That is, until the machine (running Windows Server 2008 x64) started to BSoD when restoring sizeable databases (8 GB) on SQL Server 2008 R2 x64.
Even though the error messages themselves are usually useless, it still helps to see them (even the appearance of a useless error message is evidence of failure). Leaving the blue screen up after a system fault isn’t the default behavior in every version of Windows. So for many, the first step is to actually witness the blue screen itself (instead of finding your system sitting at a login prompt when 30 min ago it was restoring a database).
Disabling Automatic Restart
First ensure that you’ve disabled the Automatic Restart checkbox in Startup and Recovery. This is typically found in:
- System Properties (Control Panel->System)
- Click Advanced System Settings
- Click on the Advanced tab.
- Click on Settings under the Startup and Recovery section.
- Ensure that the Automatically restart checkbox is unchecked. It’s usually not a bad idea to leave Write an event to the system log checked as well.
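If you’d rather confirm the setting without clicking through the dialogs, the sketch below reads it back from the registry. It assumes the checkbox is stored as the AutoReboot DWORD under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl, which is where it lives on the Windows versions I’m aware of; the function names are my own, and on anything other than Windows the script simply reports that the value isn’t available.

```python
import sys

# Assumed location of the "Automatically restart" setting.
CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"


def describe_auto_reboot(value):
    """Turn the raw AutoReboot value into a readable status line."""
    if value is None:
        return "AutoReboot not available (not Windows, or value missing)"
    if value:
        return "Automatic restart is ON (the blue screen may vanish)"
    return "Automatic restart is OFF (the blue screen will stay up)"


def read_auto_reboot():
    """Read AutoReboot from the registry; return None if it can't be read."""
    if sys.platform != "win32":
        return None
    import winreg  # standard library, but only present on Windows

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "AutoReboot")
            return int(value)
    except OSError:
        return None


if __name__ == "__main__":
    print(describe_auto_reboot(read_auto_reboot()))
```

This only reads the value. Changing it is still best done through the Startup and Recovery dialog described above.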
Reproducing the Error and Narrowing the Field
If you’re like 99% of users, you can’t infer what the problem is from a BSoD error or analyze a memory dump. If that’s the case, the next steps are reproducing the error and trying to figure out the root cause.
If you can’t easily reproduce the error (or if it’s a new build), then all your components are suspect. It could be anything from your hard drive to your video card. If you’ve recently installed new hardware or drivers, those are the most likely culprits.
For testing many components at once I’d suggest a product like PassMark’s Burn In Test, a tool we used to use back in my tech days. There are many similar products available if you have a preference. Essentially, you just want to be able to selectively test many components at once and get either:
- A BSoD (in which case you reduce the number of components you’re testing until you isolate the culprit)
- A fail from the testing software, which will usually tell you which component is causing the trouble
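The “reduce the number of components until you isolate the culprit” step is really just a halving search. Here’s a minimal sketch of the idea, assuming exactly one faulty component and a failure that reproduces every time (real BSoDs are rarely that cooperative). The run_fails callable is a hypothetical stand-in for “run the burn-in test on just these components and see if the machine falls over”; it isn’t part of any testing product.

```python
def isolate_culprit(components, run_fails):
    """Halve the suspect list until a single component is left.

    components: names of everything under suspicion.
    run_fails:  hypothetical callable; given a subset of components,
                returns True if testing that subset crashes or fails.
    Assumes one faulty component and a reproducible failure.
    """
    suspects = list(components)
    while len(suspects) > 1:
        middle = len(suspects) // 2
        first_half = suspects[:middle]
        if run_fails(first_half):
            suspects = first_half          # the culprit is in this half
        else:
            suspects = suspects[middle:]   # so it must be in the other half
    return suspects[0] if suspects else None


if __name__ == "__main__":
    parts = ["memory", "video card", "hard drive", "hot swap bay", "network"]
    faulty = "hot swap bay"  # pretend we don't know this yet
    print(isolate_culprit(parts, lambda subset: faulty in subset))
```

With five suspects this takes three test runs instead of five, which matters when each run is a 20 min burn-in.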
The next step (once the component is identified) is either:
- Replacing drivers for that component, or failing that
- Replacing the component altogether (hopefully it’s not something that’s integrated into your motherboard)
Lather, rinse, repeat. That is, after you’ve identified (what you think is) the issue, and either swapped out a component or device driver, you need to run the tests again to see if you’ve actually improved anything. Once the errors start going away you can start to have some confidence in your fix.
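One way to put a number on that confidence is to run the same test several times before and after the change and compare failure counts. A rough sketch, where run_once is a hypothetical callable that runs your test of choice and returns True on a clean pass:

```python
def count_failures(run_once, rounds):
    """Run the same test `rounds` times and count the rounds that fail.

    run_once: hypothetical callable returning True on a clean pass
              and False on a failure.
    """
    return sum(1 for _ in range(rounds) if not run_once())


if __name__ == "__main__":
    # Stand-in results: a flaky test before the fix, a clean one after.
    before = iter([True, False, True, False, True])
    after = iter([True, True, True, True, True])
    print("failures before fix:", count_failures(lambda: next(before), 5))
    print("failures after fix:", count_failures(lambda: next(after), 5))
```

A handful of clean rounds doesn’t prove anything for an intermittent fault, but going from regular failures to none is a good sign you swapped the right thing.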
These tests take time. Running a full burn-in test alone usually takes 20 min. In my case, the issue ended up being a faulty hot swap bay, which is unfortunate because it’s likely the last device you’d suspect. More often than not, it’s memory, a video card, or some other kind of removable component. If you can’t narrow it down, the power supply can also be suspect.
Hardware’s great, especially when it works. But with increasingly complex computers these days, DOAs and semi-faulty components seem to be more common. Ideally this saves someone out there some time, or even better, a tech bill.