Monday, March 3, 2008

The Curious Incident of the Disappearing-Item in Production-time

This is a blog from 2008 that I wanted to preserve, so I copied it here. It refers a retail solution, and I only changed it a bit to remove specific customer or company names. (It is actually published in this blog on February 2016).

Background:


When our customer was moving to version 4.7.7, and installing it widely, a new and weird bug appeared.
Every now and again, all the POSs in one of the store servers, gave "Item Not Found" for each scan, save the first one.
There was no specific Server or Store or Time that could be blamed for this behavior. It happened in most stores, although it did not happen again in the same store.
To resolve it, you could use IIS-Reset, or Clear-Persistence, or client-restart or all of the above, or none. No apparent pattern could be found in the reason for the bug or the method to resolve it. Actually it looked like it resolved itself after some time...

Working the Bug:

 
The information we did have, was system logs that included a "ItemDataCartridge is NULL" log line for every incident, but nothing else.

To solve the problem, I worked my way very carefully through the code, noting every situation that might cause this specific behavior, Namely: Producing ONLY the mentioned log, and return "ItemNotOnFile" ResultCode.
 
After a while, I found an "IsValid" property in the PID (PosItemData) that was called each time this object was retrieved from the items cache list. Because it was called even when it should not have, it was evident that if it returned false each time, the result behavior will be the same as the bug.

I looked into this "IsValid" implementation, and this is (more or less) what I saw:
 
 
  get{ ... return expired > System.Environment.TickCount; }  
 
 
So I looked into the place "expired" is set, and saw:
 
 
  expired = System.Environment.TickCount + ActiveTimeout; 
 
 
Looks suspicious?  I thought so.

"TickCount", as far as I remembered, is an Int32. Int32 have a max value, that when you add positive int to it, becomes negative. When that happens, the "IsValid" will always return false, and Bob's Your Uncle!

So what I did now, and pay close attention because this is the whole trick, is placed the cursor (a.k.a. bin) between the "System.Environment.Tic" and the "kCount", and pressed "F1". And for those of you that do not have a working help file, here is the relevant quotes:

"A 32-bit signed integer containing the amount of time in milliseconds that has passed since the last time the computer was started. ...TickCount will increment from zero to Int32.MaxValue for approximately 24.9 days, then jump to Int32.MinValue, which is a negative number..."
 

The Conclusion:

 
The bug was this:

1) ActiveTimeout is an Int32 that equals 24 minutes in milliseconds (1440000).

2) The "expiration" field is calculated for every lookup in the transaction after the first one (for reasons I won't get into right now) as the current TickCount + ActiveTimeout.

3) For this reason, when a POS-Server runs for 24 days and 21 hours and 30 minutes (approximately), adding these 24 minutes to the TickCount value (that is now very close to the "Int32.MaxValue") will ALWAYS result in a negative value.

4) And because, as mentioned before, this calculation is always evaluated (even when the lookup is done through the Database), it will always return false, and the lookup will always fail.
 
The Grand-Total Effect is:
In each store one of the pos servers stop selling after 24.9 days for 24 minutes.
_____________________________________________________________________
 
And one final thing I'd like to point out:

The programmer that wrote this feature originally, declared the "expired" variable as long. This is interesting, because if the calculation was done in a scope of int64, this problem would have never occur. Unfortunately, the actual participants in the calculation where all int32 typed, what means that no meter what type of variant is used to hold the result, the Math is done in an Int32 scope.
 
 
p.s.

This was fixed, basically, by replacing the "TickCount" with "DateTime". This type also have a "MaxValue", but Microsoft operating-systems do not support it.
We might have a problem moving to Linux, though...