Doc Searls asks, “Did the air traffic control center really have a ‘Microsoft server crash’?” This looks like an incredible use of a 32-bit counter of milliseconds that overflows every 49.7 days, without a built-in feature to reset it. The “neglected maintenance” is likely a reboot of the system. Now ask yourself: Do you really want to be at 35,000 feet when they reboot the air traffic control system?
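For anyone wondering where 49.7 days comes from, here is a minimal sketch of the arithmetic, plus the classic way code goes wrong when the counter wraps. The function names (tick_count, naive_elapsed, wrap_safe_elapsed) are made up for illustration; this is not the code from the affected system, just a model of any 32-bit millisecond counter.

```python
# Model of a 32-bit millisecond tick counter (illustrative names, not a real API).
MS_PER_DAY = 24 * 60 * 60 * 1000   # 86,400,000 milliseconds in a day
WRAP = 2 ** 32                     # a 32-bit counter holds 4,294,967,296 distinct values


def days_until_wrap():
    """Days of uptime before a 32-bit millisecond counter rolls over to zero."""
    return WRAP / MS_PER_DAY


def tick_count(uptime_ms):
    """Hypothetical counter reading: true uptime reduced modulo 2**32."""
    return uptime_ms % WRAP


def naive_elapsed(start, now):
    """Plain subtraction; goes negative if the counter wrapped between readings."""
    return now - start


def wrap_safe_elapsed(start, now):
    """Subtraction modulo 2**32; correct as long as the real interval is
    shorter than one full wrap (about 49.7 days)."""
    return (now - start) % WRAP


if __name__ == "__main__":
    print(f"wraps after {days_until_wrap():.2f} days")

    # One reading taken 1 second before the wrap, one 0.5 seconds after it.
    start = tick_count(WRAP - 1000)
    now = tick_count(WRAP + 500)
    print("naive:    ", naive_elapsed(start, now))
    print("wrap-safe:", wrap_safe_elapsed(start, now))
```

Code that compares timestamps the naive way sees time run backwards at the rollover, which is one plausible route to timers that never fire or waits that never end. The modular version only helps for intervals shorter than one wrap; anything longer needs a wider counter.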
The list of Microsoft Knowledge Base articles that refer to various (or the same) incarnations of this bug is scary:
SNMP SysUpTime Counter Resets After 49.7 Days
Computer Hangs After 49.7 Days
“PING -T” Stops Timing Out After 50 Days
Print Spooler Stops Scheduling Print Jobs
The Rpcss.exe process consumes 60 percent of CPU time and performance is affected
X-Duration Values Are Larger Than Expected in Windows Media Server Log
Windows 2000 Terminal Services Time-Out Setting Limits
Contents of the Microsoft Windows 98 System Update
List of Bugs Fixed in Windows NT 4.0 and Terminal Server Edition Service Pack 4 (Part 1)
You might be able to forgive Microsoft for the Windows 95 and 98 systems; who would ever have expected 50-day reliability out of those? NT 4.0 is a little more worrisome, as the bug had been documented for some time before its release, I think. But Windows 2000? The RPCSS and print spooler bugs are not documented as fixed in a later service pack, only in a hotfix, although this may be a documentation issue. It is truly disturbing if such a known issue is still sitting around to bite programmers.
I’d really like to know how and why Harris Corporation was allowed to replace UNIX machines that did not have these problems with Windows machines where this was a known issue, and roll them out into the FAA’s production systems, no less. That this was a documented issue is not an acceptable excuse, as the incident last month demonstrated, fortunately without the loss of life.