This has to be one of the weirdest bugs I’ve ever had to track down and fix. Check out this stack trace:

Yes, that’s right, a call to the default Hashtable constructor is throwing an ArgumentOutOfRangeException with a message of “Load factor needs to be between 0.1 and 1.0.”. Where do you even begin with that one?
To make it particularly baffling this error just started occurring without the software being upgraded. The initial thoughts were a virus messing up the system, or a virus scanner causing issues. However there was no virus scanner operating (the machine is on a secure network and shouldn’t really need one…. yea, I know…) and there didn’t appear to be any virus. maybe the hardware had started to fail? A memory issue perhaps? But then it started happening elsewhere too. Upgrading the .Net runtime didn’t seem to help, and in fact some of the machines already had the latest runtime – which should have raised a flag actually, these machines are servers in a server room and run only our software. We don’t upgrade them without upgrading our software as well so some of these machines were being upgraded with the latest patches without us realising. To be fair our first response when we find an issue is “upgrade to the latest runtime and install all patches” so it is to be expected. The point is though, the runtime was upgraded but our software wasn’t. This shouldn’t be a problem but it was…
There is a bit of a clue to what is going on here – although we are calling the default constructor the framework is calling an overloaded constructor that takes 2 arguments, the first being an initial capacity and the second being a floating point value for the load factor for the hashtable (which has to do with the likelihood of there being collisions between unique hashes). That floating point is going to be important in a moment…
Checking with Reflector we can see that calling the default constructor calls the second constructor with a hardwired value of 1f:

Can’t see a problem there. Just to make sure I’m not barking mad, what does reflector say this constructor does?
public Hashtable(int capacity, float loadFactor)
{
if (capacity < 0)
{
throw new ArgumentOutOfRangeException("capacity", .....
}
if ((loadFactor < 0.1f) || (loadFactor > 1f))
{
throw new ArgumentOutOfRangeException("loadFactor", .......
}
this.loadFactor = 0.72f * loadFactor;
......
}
Clearly, 1 is not greater than 1 – so what on earth is going on? It would seem the computer has forgotten how to do floating point maths…. This is the point where I have to mention that we are calling into a delphi dll in this project using P/Invoke. Delphi is well known for messing about with the floating point control register. A problem with one being greater than one you say? Delphi you say? hmm….
Time to break out the debugger of last resort and find out what’s going on. It’s WinDbg time.
I managed to get a reproduction at will by creating a thread that sat in a tight loop doing nothing other than new Hashtable(). After a couple of minutes this would cause the exception I was looking for so at least I didn’t have to rely on clients generating a dump file for me. I’m creating a dump file here instead of just attaching the windbg debugger because it’s simply easier. I can load the dump file whenever I like and don’t need to leave the system running in between debugging sessions.
First things first, what’s the exception exactly? Using !printexception I got this:
0:040> !printexception
Exception object: 22692da4
Exception type: System.ArgumentOutOfRangeException
Message: Load factor needs to be between 0.1 and 1.0.
InnerException:
StackTrace (generated):
StackTraceString:
HResult: 80131502
OK, that’s already known so nothing useful there then. What’s the stack trace for that thread got to say for itself?
0:040> !clrstack
OS Thread Id: 0x1ea8 (40)
ESP EIP
2e4be7d4 7713b727 [HelperMethodFrame: 2e4be7d4]
2e4be878 6fa3993a System.Collections.Hashtable..ctor(Int32, Single)
2e4be89c 6f30165c System.Collections.Hashtable..ctor()
2e4be8a0 0040c128 Probel.Automation.EventStore.Impl.ModelObjectChangesCollection..ctor(…)
2e4be8c4 003c882e Probel.Automation.EventStore.RootImpl.Notify(…)
2e4bea3c 003c8217 Probel.Automation.EventStore.RootImpl.NotifyWriteComplete(…)
2e4bea70 003c7c46 Rescale.RescaleModel.EndWrite(…)
2e4beaf0 04c66d3d Rescale.Classic.ModelManager.EndWrite(…)
2e4bebbc 04c6692d Rescale.Classic.ReadWriteAccessor.Dispose()
2e4bec08 04c66872 Rescale.ReadWriteAccessor.Dispose()
2e4bec14 04bb7de7 Probel.Automation.EventStore.ClientFacing.ClientFacade.ExecuteWriteTransactionEx(…)
2e4bed34 04bb7ae4 Probel.Automation.EventStore.ClientFacing.ClientFacade.ExecuteWriteTransaction(Rescale.Priority, UInt32, UInt32, Probel.Automation.EventStore.ClientFacing.XaCallback, System.Object[])
2e4bed5c 04bef5bf Probel.Automation.EventStore.XmlExternalisation.XmlEventCopier.Paste(Probel.Automation.EventStore.ClientFacing.IClientFacade, Probel.Automation.Util.Oid, Probel.Automation.Util.Oid, Probel.Automation.Util.Oid, Probel.Automation.EventStore.XmlExternalisation.IPercentageComplete)
2e4beeb0 04beef56 Probel.Automation.EventStore.XmlExternalisation.XmlEventCopier.Paste(Probel.Automation.EventStore.ClientFacing.IClientFacade, Probel.Automation.Util.Oid, Probel.Automation.Util.Oid, Probel.Automation.EventStore.XmlExternalisation.IPercentageComplete)
2e4beed4 04beee80 Probel.Automation.MorpheusEditor.BackgroundLoader+PasteJob.Run()
2e4bef18 04beec7c Probel.Automation.MorpheusEditor.BackgroundLoader+ExecutableJob.ThreadProc()
2e4bef60 6f306e76 System.Threading.ThreadHelper.ThreadStart_Context(System.Object)
2e4bef6c 6f3102ff System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
2e4bef84 6f306df4 System.Threading.ThreadHelper.ThreadStart()
2e4bf1ac 70841b4c [GCFrame: 2e4bf1ac]
The first thing about that stack trace that is instantly obvious is that this is not the stack trace of a thread creating Hashtables in a tight loop. So although there’s a thread creating potentially millions of hashtables a second, that’s not the thread that explodes. The second thing of interest is that there are a number of calls into Rescale code. That’s a wrapper around the delphi dll I was talking about so it looks like this exception may be happening on threads that are calling into delphi.
Out of interest I rewrote the code which caused the exception to occur repeatedly so as to create objects instead of hashtables and still got the issue. However, if I rewrote it to increment a variable instead the issue failed to arise so it looks like garbage collection may have something to do with it.
It’s obviously an issue with floating point maths so the first place to look would be the fpcw register. This is the register that controls how floating point operations are done by the FPU. Using ~*e .block{.printf “TID= %x\tfpcw=%x\n” ,@$tid, @fpcw}

however tells us that the fpcw is set to 27f on every thread. This is the value you would expect so that’s not the issue. That stumped me for a while but a bit of research into how floating point operations are done on an x86 architecture lead to me the SSE registers. These are ‘newer’ floating point registers that are not implemented on all processors. Actually they are by now since they started being put in around 1999 or so. The control register for the SSE registers is MXCSR. Lets find the value of that reg instead:
0:027> ~*e .block{.printf “TID= %x\tMXCSR=%x\n” ,@$tid, @mxcsr}
TID= 1b3c MXCSR=1fa0
TID= 203c MXCSR=1f80
TID= 1f14 MXCSR=1f80
TID= 924 MXCSR=1f80
TID= dac MXCSR=1f80
TID= bbc MXCSR=1f80
TID= 2258 MXCSR=1f80
TID= ad0 MXCSR=1f80
TID= db0 MXCSR=1f80
TID= ac8 MXCSR=1f80
TID= 225c MXCSR=1fa0
TID= 21c0 MXCSR=1f80
TID= 920 MXCSR=1f80
TID= 1694 MXCSR=1f80
TID= 1f18 MXCSR=1f80
TID= 1810 MXCSR=1f80
TID= 107c MXCSR=1f80
TID= 1cbc MXCSR=1f80
TID= 2008 MXCSR=1f80
TID= 124c MXCSR=1f80
TID= 10e4 MXCSR=1f80
TID= f00 MXCSR=1fa0
TID= 1730 MXCSR=1f80
TID= 220c MXCSR=1f80
TID= 210c MXCSR=1f80
TID= c44 MXCSR=1f80
TID= 10d4 MXCSR=1f80
TID= 1f44 MXCSR=1fa0
TID= 1310 MXCSR=1f80
TID= de0 MXCSR=1fa0
TID= cd8 MXCSR=1fa0
TID= 1708 MXCSR=1f80
TID= 185c MXCSR=1fa0
TID= 1274 MXCSR=1f80
TID= 2198 MXCSR=1f80
TID= 20b4 MXCSR=1f80
TID= 20f8 MXCSR=1f80
TID= d04 MXCSR=1f80
TID= 1600 MXCSR=1f80
TID= 654 MXCSR=1f80
TID= 1ea8 MXCSR=1fa0
TID= 2034 MXCSR=1fa0
TID= 3f8 MXCSR=1f80
Now this is looking more interesting…
From the stacktrace we can see the thread Id for the thread that threw the exception is 1ea8. That thread has it’s MXCSR reg set to 1fa0 so this looks promising. Looking through the stack traces for all the threads showed a high correlation between threads with calls into the delphi dll in their stack and an incorrect MXCSR register. Unfortunately I was completely stumped as to what was setting the MXCSR reg. I even tried pushing and popping that register using assembler every time a delphi call was executed but that didn’t help. I’m guessing something in the delphi runtime sets and unsets that reqister but I can’t find it from disassembling the dll. Assembler isn’t my language of choice though so I was probably just missing the instruction. One of my colleagues suggested there might be an instruction for pushing and popping all registers so the MXCSR reg might not show up in the disassembly…
At this point I decided it was time to call microsoft and ask them if it was a problem with the framework or something we were doing. Thankfully our MSDN licences enable us to talk to their support people. They seemed to think it was a problem with delphi and not an issue for the dotnet runtime and suggested a number of options for trying to resolve the issue. The most obvious one, and the one we went with, was to only ever acccess the delphi dll from a dedicated (or number of dedicated) thread(s). This was a while ago now so I can’t remember if I used more than one thread for the rescale dll or not but we now execute all calls to the dll through a simple work queue. Push a work delegate onto the queue and it get processed as soon as possible in a separate thread that only ever deals with rescale. This resolved our problem. I’d still like to find out exactly what’s going on but I haven’t got the time to look into it any further. I’ve already spent far too long trying to resolve this.
It looks to me like microsoft were not originally using the sse registers in DotNet 2 but at some point decided that they were now ubiquitous enough to start using and added them to the JIT. We never had a problem with regular floating point values previously but for some reason the combination of Delphi and the DotNet runtime just don’t play nicely and one of them is not resetting the MXCSR register to what it should be when it is done with it. Thankfully that Delphi dll has now been replaced with a fully managed dll so this problem wont ever crop up again (I hope).
Posted in Uncategorized
Tags: C#, Debugging, Delphi, WinDbg