How Can This Test Fail?
Posted by Davy Brion on June 3rd, 2009
We have a certain class which sometimes makes some tests fail… the code is old (consider it legacy) but it’s not that bad… but we just can’t figure out why it sometimes causes tests the fail.
Here’s an example of a test that occasionally fails:
[Test]
public void TestGet_CacheableObjectReturned()
{
TestCacheableObject cacheableObject = new TestCacheableObject("test", 60000);
CacheProvider.GetInstance().Put(cacheableObject);
try
{
Assert.AreSame(cacheableObject, CacheProvider.GetInstance().Get("test"));
}
finally
{
CacheProvider.GetInstance().Remove("test");
}
}
If it fails, it’s on the following line:
Assert.AreSame(cacheableObject, CacheProvider.GetInstance().Get("test"));
Sometimes, the Get method of the CacheProvider returns null in this test, even though the cacheable object has just been added to the CacheProvider.
Here’s the code of the CacheProvider:
public class CacheProvider
{
private static object monitor = new object();
private static CacheProvider _instance;
private IDictionary<object, CacheableObject> _cacheDictionary;
private CacheProvider()
{
_cacheDictionary = new Dictionary<object, CacheableObject>();
}
public static CacheProvider GetInstance()
{
lock (monitor)
{
if (_instance == null)
{
_instance = new CacheProvider();
}
}
return _instance;
}
public void Put(CacheableObject value)
{
if (ContainsKey(value.Key))
{
Remove(value.Key);
}
value.Provider = this;
_cacheDictionary.Add(value.Key, value);
value.StartTicking();
}
public CacheableObject Get(object key)
{
if (_cacheDictionary.ContainsKey(key))
{
return _cacheDictionary[key];
}
return null;
}
public void Remove(object key)
{
_cacheDictionary.Remove(key);
}
public bool ContainsKey(object key)
{
return _cacheDictionary.ContainsKey(key);
}
}
This is the code of the CacheableObject class:
public abstract class CacheableObject
{
private object _key;
private CacheProvider _cacheProvider;
private Timer _timer;
protected abstract double GetTimeToLive();
public CacheableObject(object key)
{
_key = key;
}
public void StartTicking()
{
_timer = new Timer(GetTimeToLive());
_timer.Elapsed += Timer_Elapsed;
_timer.Start();
}
private void Timer_Elapsed(object sender, ElapsedEventArgs e)
{
_timer.Stop();
Provider.Remove(this.Key);
}
public object Key
{
get
{
return _key;
}
}
public CacheProvider Provider
{
get
{
return _cacheProvider;
}
set
{
_cacheProvider = value;
}
}
}
And this is the code of the TestCacheableObject that is used in the failing test:
public class TestCacheableObject : CacheableObject
{
private double _ttl;
public TestCacheableObject(string key, double ttl)
: base(key)
{
_ttl = ttl;
}
protected override double GetTimeToLive()
{
return _ttl;
}
}
I can’t for the life of me figure out why the test above fails sometimes. It happens very occassionaly, sometimes it even takes weeks or months before it fails again. It’s not just the test above, it’s basically every test that uses the CacheProvider’s ContainsKey method (either directly or indirectly) that fails sometimes.
Important note to consider: when these tests fail, they take about 10ms…
If anyone has a clue what could possibly cause this, i’d be very willing to hear it
June 3rd, 2009 at 2:13 pm
what other tests are using the *singleton* cacheprovider?
June 3rd, 2009 at 2:29 pm
none, only the tests in the fixture for this specific class… it’s only used in one place, and there are no tests for that (don’t look at me, it’s legacy stuff that i didn’t write
)
June 3rd, 2009 at 2:35 pm
Are you sure nothing else runs? Cause it looks like a threading issue.
Should you not use a readerWriter lock when getting or removing items from the directionary?
June 3rd, 2009 at 2:58 pm
it’s a single threaded test-run. Synchronizing access to the dictionary would be a good idea either way, but i don’t think it would make a difference in this case.
There are only 7 tests that use this class, all in the same fixture, and they all create a TestCacheableObject instance with a time to live of 60 seconds, with the exception of one test which creates one with a time to live of 500 milliseconds, and it then waits for 1000 milliseconds to see if the object removed itself from the cache… each test uses the same key value so depending on the order of events and duration of everything that might cause removal of an object with the same key as used in the other test, though the odds of those 7 tests exceeding 60 seconds are extremely low… they typically run in a couple of milliseconds.
June 3rd, 2009 at 3:59 pm
Hi !
In theory….
When you remove an object from the cache, you should also clear the provider and the timer of that object. The object is still alive (until the next GC) and the time.elapsed could cause the object to remove an other object from the cache b/c they share the same key.
Tapio Kulmala
June 3rd, 2009 at 4:24 pm
funny, i recently ran into the same problem, using IDictionary. The ContainsKey method would return false sometimes, even though i had just added the object to it. And whats funnier is that it only happens once in a while, for no apparent reason.
June 4th, 2009 at 1:18 am
Are you running these tests inside VS and do you find that the test fails on a second pass or only when run with other tests that use the CacheProvider. Is the dictionary key you are using unique to this test or is it used elsewhere?
1) I have a feeling that test runners in VS re-use app domains/processes which may lead to your singleton-based cache being ‘dirty’. Perhaps a Flush() method that empties the dictionary and stops all the timers of cached objects?
2) Try replacing “test” with new object() to avoid object.GetHashCode() collisions.
June 4th, 2009 at 3:54 am
I think a mix of most of the other comments gives a picture of the whole problem. Here’s my lame attempt at RaymondC-style “psychic debugging”.
There’s a race condition. Yes, it’s multithreaded code. There’s a Timer. Yes, it’s unlikely that it will happen. If that’s what’s causing the problem you probably wouldn’t see it often. Maybe once in a blue moon something unexpected causes the tests to take longer to run*. If it’s not the cause of the problem, the race in the test is still a bug IMO. But I’m an SDET – people pay me to say things like that.
The Elapsed event changes the state of that key in the CacheProvider singleton. Presumably one that other tests are using. It could be that one test’s timer is messing up another test. Possibly, though, it’s something else slowing the main thread while the timer ticks away.
Clearing the timer or somesuch is probably one solution. Not using the singleton might be a good idea, too.
* True story: There was a race in one of my tests once that I couldn’t reproduce, but another tester could repro some of the time. It turned out that sometimes (I don’t remember why) that tester would mouse-click on the cmd console running the test so the main thread of the test wasn’t running but the worker it spawned was. Hit any key in the console and the test continued, determined that its activity took too long, and the test case failed.
June 4th, 2009 at 7:47 am
@Drew
that definitely sounds like the most plausible explanation but when one of these tests fails, it takes about 10ms to do so, while the other tests all complete almost instantly. 6 of the 7 tests use a time to live of 60 seconds, while the other test uses a time to live of 500ms. But that test waits for 1000ms and then verifies that the object removed itself from the cache.
If the test with the short time to live would cause the problem, it should fail as well as any other test whose state got corrupted by it. But it’s always only one failing test.
@Andy
no, this is with NUnit on a buildserver.
June 4th, 2009 at 8:17 am
Do the Cacheable object instances created in the tests share the same key?
June 4th, 2009 at 8:38 am
as mentioned in one of the earlier comments: yes
June 4th, 2009 at 12:28 pm
I came back to see if you had posted a definitive explanation.
Is the following summary correct?
An object ‘A’ instance with the key “test” is inserted in the cache and its timer starts to tick. Another object ‘B’ with the same key “test” replaces ‘A’ in the dictionary. ‘A’ is now dereferenced and a candidate for GC. However, in some test runs the timer for ‘A’ expires and calls Remove on the cache provider for key “test”, for which the value is object ‘B’.
Looks like a bug in the cache to me, not just a test-time issue.
June 4th, 2009 at 12:35 pm
@Andy
your summary is correct, but the timer doesn’t expire for 60 seconds, and this particular TestFixture never takes that long to run. The CacheProvider is also not used anywhere else.
I think it would’ve been better if the CacheProvider was less permissive when it comes to adding items with keys that already exist. But i’m not convinced that the actual cache implementation (or that of the CacheableObject) has a bug…
June 4th, 2009 at 4:02 pm
Hi again.
What testrunner do you use?
If you run the tests directly from visual studio, there will be a different singleton instance for each run. But if you use for example nunit.exe as your testrunner every testrun will share the same instance of your singleton instance. Therefore the timer_elapsed events from an earlier testrun will interfere with your current testrun if the time between your testruns is about 60 secs.
Tapio
June 4th, 2009 at 4:04 pm
it’s Team City’s nunit testrunner… but as i already said, the time between these tests is never 60 seconds
June 4th, 2009 at 4:13 pm
Is there anything else running in your TeamCity that could use same CacheProvider singleton and coincidentally use the same “test” key?
Tapio
June 4th, 2009 at 4:14 pm
nope
June 4th, 2009 at 4:57 pm
I’ running out of ideas
Here’s one (hopefully) last theory.
Because the remove does not stop the timer, the timer.Elapsed event will always fire once for each cached object. You probably have tests that cache more than 1 object or tests that verify that the put replaces an existing object in the cache. This will result in the situation where multiple timer.elapsed events will fire very close to each other. Each event is handled by a separate thread. Because the CacheProvider and the Dictionary classes are not thread-safe those events might corrupt the internal state of cache/dictionary.
Tapio
June 4th, 2009 at 5:00 pm
… and the theory is wrong because those event will fire 60 secs after the tests…
Tapio
June 4th, 2009 at 5:03 pm
yeah :p
but there is indeed one test that stores 2 objects immediately after each other with the same key…
June 5th, 2009 at 4:22 am
One question I have is this. If you run the test locally and step through the code does it pass? If it passes locally but not on the build server there is a good chance you indeed are running into a race condition. If you wanted to see for yourself you could always add some logging to spit out the contents of the keys collection to the console so when your unit test runs you can see what is in there.
However, since you have a timer in your objects it would be best to add proper locking anyway.
Another very easy way to maybe check this is to add simple log statements in the CacheProvider that spits out the key name it’s looking for, the key name you added, etc. Then in the Elapsed_time event add a log statement that tells you that it’s removing said key. Match up your console output to what you expect, see if there is any differences.
One problem you could run into in a production system is that if the timer goes to Add said value the key might exist and cause a duplicate key exception or some such because your contains runs in a seperate call so by the time the contains returns true or false it could be invalid.
ie in your Put on CacheProvider it should look like this:
public void Put(CacheableObject value)
{
lock (monitor)
{
if (ContainsKey(value.Key))
{
Remove(value.Key);
}
value.Provider = this;
_cacheDictionary.Add(value.Key, value);
value.StartTicking();
}
}
This will allow you to create a single Atomic transaction. Also, on your remove you might want to add something similar that checks the contains key. As well on your Get command. You could have a problem that the Contains works fine and the get doesn’t.
To verify just add some logging like so to find out:
public CacheableObject Get(object key)
{
Console.WriteLine(“Getting data for key: ” + key.ToString());
if (_cacheDictionary.ContainsKey(key))
{
Console.WriteLine(“Found Key: ” + key.ToString());
return _cacheDictionary[key];
}
return null;
}
Hope that might help some.
June 5th, 2009 at 10:16 am
An interesting topic, very funny.
Maybe we can investigate the IL
June 5th, 2009 at 10:16 pm
This eventhandler has a bug, I think.
private void Timer_Elapsed(object sender, ElapsedEventArgs e)
{
_timer.Stop();
Provider.Remove(this.Key);
}
and it will “fail” in the following scenario:
put object A in cache with key “x” and timeout 10 seconds
wait 9 seconds,
add object B with key “x” and timout 10 seconds.
wait one second,
B gets removed from the cache by the timer event handler of the replaced object A because the timer didn’t get cancelled when A was evicted from the cache.
But I would expect B to be removed after 10 seconds have passed after insertion of B.
This also means that if any other test in a testrun also uses the key “test” while using the singleton cache , but with a much shorter timeout that falls within your testrun duration, you will see some very intermittant failures.
A fix would be to let the timeout handler check if the key is still associated with the same value for which the timeout was set, and to do nothing when the value was updated, or to cancel the timer when a CacheableObject is replaced or removed from the cache.
June 12th, 2009 at 6:02 pm
I have a theory. You’re not locking your reads and writes to the cache. And your puts are destructive without providing notice that a previous value is being overwritten. If your unit test runner runs the tests concurrently, you have a race in two places. If two calls to your put and get are interleaved the wrong way (a.Put b.Put a.Get) you’ll see that error.
Another thing is that your Cache isn’t checking for semantically equal objects.