Tentatively subtitled: “How scale can make fools of us all”

This is going to be a real life war story…cause I haven’t done one of those in a while, and this particular case really ticked me off.  Here’s the scoop:  I’ve got a “service” which is called by other parts of the system.  And by “service” I don’t mean something running in its own process and waiting for SOAP/REST requests or messages, I simply mean something that has a defined entry point (a static method in this case), where you pass in some data, and get something back.

Like many others, I’m sure, I’m using an IoC container to wire up bits so that I can have a big ball of interfaces “to make testing easier” (one of these days I’ll break that rather nasty habit and figure out a better way to do thing, but I’m getting off topic).  Specifically, I’m using Windsor for my dependency injection because it seems to have become the Container de jure among the devs that actually are using containers at work (StructureMap was in there for a while too, but it seems to have faded).  As many of you may know, Windsor is one of those containers that tracks instances for you so that it can use a Lifecycle rule to decide whether to give you an already existing instance of an object, or create a new one for you. It will also automatically call Dispose() on IDisposable objects that it may be tracking, thus helping ensure proper cleanup of resources.

In my case I had everything set up using the Transient lifestyle, because each request was essentially stateless, and there really wasn’t a lot of expense involved in creating a new instance of the objects.  Because I’ve done my homework, I know that if you’re using Transient objects in Windsor, you should explicitly call Release on the container to release the object when you’re done with it, otherwise you’re likely to get a memory leak, since the container would be holding on to an instance of the object, not letting the GC do its thing.  So, I made sure I did that, and my code looked something like this:

1
2
3
4
5
6
7
8
9
var myService = _container.GetService<IMyService>();
try
{
    myService.DoWork();
}
finally
{
    _container.Release(myService);
}

The one thing to point out here, is that my reference to _container was a singleton, so I would get it set up the first time and then use the pre-configured container after that. So, where is the problem? Anyone? Well, I didn’t see anything wrong with it. And neither did the person doing the code review.  But, as you might guess from the fact that I’m writing about this, there was a problem, and here’s how it manifested itself:

Approximately 6 days after this went to production, one particular set of servers in one of our data centers (lets say for the sake of this post we have 2) started kicking out OutOfMemoryExceptions during calls to the service.  My first thought was, “strange, but I’m doing the right thing here and releasing, so its probably just something else eating up memory and my code is suffering”.  To help demonstrate this I even set up a test running 1000 calls to the service in a while loop and watching the memory…nothing unusual, hovered around 33MB.  So I fired up the most excellent dotTrace memory profiler, and it confirmed.

4 more days go by and our operations folks come and beat the crap out of me because they have had to reboot production servers every couple of hours.  Ok, they didn’t beat the crap out of me, but they wanted to, and they did send along a dump, which one of the other devs who is a wiz with windbg was able to translate into something meaningful for me.  The dump showed thread contention in ReaderWriterLockSlim.WaitOnEvent(), and about 200MB worth of an object called Castle.Microkernel.Burden.  And here are some other interesting details:  The service is called by all kinds of different servers; Web servers, SOAP servers, REST servers, but none of these were showing problems.  The only one that was having issues was a server that was set up to process asynchronous SOAP requests (don’t ask).  And each server could process up to 20 at a time.

Armed with this information I did some googling, and discovered that the Burden object is the thing you leak when you don’t call Release() on the container in Windsor….But I was calling release!  I found a blog post by Davy Brion that talked about getting leaks when using your own Windsor container with NServiceBus, and how to deal with it….seemed interesting, but it also seemed like something that didn’t apply, since the problem there was that NServiceBus didn’t know about calling Release() since it was written with a container that didn’t keep references.  It did lead me to the source code for the release policy, which showed me something very interesting.

The Windsor object tracking is basically doing some reference counting.  The ReaderWriterLockSlim is being used to manage the count of instance references, so when you create a new instance it is incremented, and when you release an instance it is decremented.  In either case you’re doing a write, so you’re calling a ForWriting() method on a lock wrapper, which is effectively trying to do a write lock (at some point down the call stack)….very interesting.  At this point I decided to see if I could reproduce the problem, and so I took my earlier test running 1000 calls in a loop, and kicked it up a few notches on the concurrency scale, and set it up to run calls in a while loop until the thread was canceled. I fired up 25 threads to do this, launched the little console app and waited.  Sure enough I was able to see in process monitor that memory was rising….there were some spots where a large collection was taking place, but it wouldn’t release everything, and so soon my little app which started at around 40 MB was using 50 MB, then 60 MB.  It was the concurrency!  The multiple requests were stacking up new instances of object, and new instances of the Burden object faster than they could be collected because the whole thing was bottle-necked by the ReaderWriterLockSlim!

So I plugged in a version of Davy’s code to fix the NServiceBus issue, only I decided since I was managing this container local to my service, and I was also dealing with any Disposables myself, that I would not let it track anything (there is actually a built-in policy object for not tracking anything…just realized that).  Plugged it in, fired up the test, and I had a little console app that ran for about an hour and hovered at about 40MB of memory in use.

We actually did an emergency deployment to push this to the effected set of servers in production, and I’m happy to say that so far I’ve not seen an issue….of course our logs stopped showing the OutOfMemory exceptions about 24 hours before we pushed the fix, so we have that to help out our feeling of doubt that the issue is resolved.  And even though I could create something suspicious locally, we were never able to recreate the production issue in QA.  One of the interesting things about our environment is that we have a lot of customers who do things that we don’t exactly expect, or want, them to do.  It looks like in this case we had some customers who were doing a lot of asynchronous calls and they just managed to stack up in a way where things got ugly.

I’m currently finding myself in the midst of an evolutionary change.  And I’m not talking about my super-human mutant powers, I’m talking about the way I’m thinking about solving a specific set of problems.

Let’s start with a sample….Lets take something like processing credit-cards as a benign and IP free place to start.  As a subject that I really have no practical experience with, it seems like an appropriate choice.  I’m going to assume that there are different rules for doing checksum validation on credit card numbers, depending on what the card is (Mastercard/Visa/Discover/etc). Now, here is evolutionary step 1: Use a basic case statement to process the various cards.  Here is what something like that would look like:

1
2
3
4
5
6
7
8
9
10
11
12
switch(CardType)
{
    case CardType.MasterCard:
        CardValidators.MasterCardValidator(card.CardNumber);
        break;
    case CardType.Visa:
        CardValidators.VisaValidator(card.CardNumber);
        break;
    case CardType.Discover:
        CardValidators.DiscoverValidator(card.CardNumber);
        break;
}

 

This looks pretty straight-forward, and as it stands it isn’t too bad from a maintainability stand point.  But what happens when there are many different types of cards?  An then what happens when you find a large amount of duplication between the validation functions?

Well, any student of GoF should be able to tell you that a Chain Of Responsibility pattern looks like a perfect fit for this sort of scenario.  So, evolutionary step 2: Create separate classes to handle the different types of validation, and configure them in a Chain of Responsibility pattern where each instance decides for itself whether it can process the input.

Here is a quick and dirty look at what something like that would look like:
ClassDiagram1

 

The two most interesting things here are the GetValidator() method in the AbstractCardValidator, and the individual CanValidate() methods in the concrete implementations.  What this does is it allows each class to decide for itself how it is going to determine whether or not it can be used as a validator for a specific card (thats the CanValidate() part), and also provides a single point which the consumer of the API can use to get the validator for the card instance they have.  You would probably want to build and Abstract Factory around this, which would instantiate all of the ICardValidator classes, and then run the GetValidator() method to get the correct one.

Now we are at a point where things are looking pretty good; we’ve got the ability to do some fairly complex logic to make the decision about which validator to use, and we have a way to simply ask for one, and the correct one appears.  Pretty cool.

This is actually the place where I have found myself in the not to distant past.  I have previously been perfectly content with this arrangement, and been fairly happy with the separation of concerns among the classes…I mean, after all, who better to decide whether or not a specific class should be used to validate a card than the class itself.  So what is the issue?  Well, recently I have become aware of two problems with this arrangement: Tight Coupling, and a violation of the Single Responsibility Principle.  Let’s start with the first:

Tight Coupling
The credit card example may be a bit contrived when it comes to this issue, but bear with me.  Overall, the issue is that specific instances of ICardValidtor objects are being created and handed around.  The use of an interface and an Abstract Factory pattern would actually help the situation out some, but effectively all it does is move the coupling from the consuming class to the Factory (okay, it also consolidates coupling to a single class, which makes maintenance a lot easier).  As I said, contained, but still there.  It would be nice if the factory didn’t need any knowledge of what concrete instances of ICardValidator were out there.  Before we tackle that, though, lets also look at the second issue:

Violation of the “Single Responsibility Principle”
The SRP states that a class should have one, and only one, thing it is responsible for.  Sounds pretty easy doesn’t it?  The problem is that this can be difficult to obtain without a fair amount of discipline.  The violation of SRP which I’m seeing is that the ICardValidator is responsible for both validating a credit card and determining which validator is appropriate.  But wait!  Didn’t I just say that moving this check into the ICardValidator instance was a “Good Thing”?  Well, lets go as far as saying it is better than the previous method, but still not perfect.  Applying the SRP would move the task of selecting a validator from the ICardValidator instance, and put it on it’s own somewhere.  So, thusly we come to our:

Inversion Of Control Container.
That’s right, we are now going to get crazy and move the responsibility of creating these instances to another component all together.  The nice thing about this is that it allows us to move all of the knowledge about dependencies off somewhere else.  How does this apply to this example?  Well, lets assume we have an object of type Card which requires as a dependency an instance of an ICardValidator.  We’ll also assume that Card is subclassed based on the type of credit card.  It now becomes trivial to configure our IoC container to supply a specific implementation (read sub-type) of ICardValidator for each implementation (again, read sub-type) of Card.  Now, when you want a Card instance, you ask the IoC container for one, and depending on what type of card it is, you will get the appropriate ICardValidator as well.

What’s the catch?  Well there is some additional complexity which will show up somewhere in the application due to the IoC, but typically IoC configuration can be delegated down to the configuration file level, so even then the ugliness is pushed away to it’s own dark corner.

But wait!  Why should we have different instances of Card?  What if the Card class is just a container for the card data?  Well, our Ioc still gives us some advantages.  If we look back at our first example with the switch statement, we’ve got a nice CardType enum, which could be a property of our Card class.  Using an IoC container like the one provided by the Castle project, you have the ability to configure a key string for your instances.  This would make it trivial to map the enum choices to specific keys within the container, which the Card class would use to get an ICardValidator instance.  This would also make it possible to make the validators slightly more advanced by adding something like a Decorator pattern, in which specific aspects of the validation could be factored into separate classes, and then “stacked” to produce the final validation logic (This is the same concept used by the Stream classes in .Net and Java.  You can modify the behavior of a stream by passing it to the constructor of a stream with different behavior).

It is definatly worth mentioning that there is a sudden appearance of tight coupling to the IoC container itself from our consuming classes.  You probably want to try to abstract away the fact that the IoC container exists from the majority of the application.  Factory classes go a fair ways in making this happen, but another good idea is to introduce a single service to do type resolution.  The Factory classes can then ask this service for the object they want, and they never need to know the IoC container is there.  This approach also gives you the ability to create some objects using IoC and others in another (more traditional) way.

So is this it?  Have I finally found the answer I’ve been looking for?  It’s hard to say right now.  For the time being this is a decent way to handle things, provided the complexity of the underlying system, and the need for loose-coupling are both high enough to justify the additional complexity of the IoC.  But who knows, in another couple months I may find something new, or even something old, which seems better, cleaner, simpler.  That, after all, is my final goal….And I need to remind myself of that regularly, lest I become complacent.