Archive for May, 2009

What Microsoft Should Do For .NET Open Source

Written on May 25th, 2009
Categories: Opinions

I recently read Rob Conery's "What Should Microsoft Do For .NET Open Source" post where he invites people to answer the question he poses in his post. Obviously, i could not resist posting my thoughts and views on this.

Before i start, i would like to state that i appreciate all of Microsoft's recent efforts to move to a more open development model. I think they've done a pretty good job already but nothing's perfect and there's always room for improvement. So let's just get to the list of things that i would love to see from Microsoft with regards to dealing with Open Source Software in the .NET world.

First of all, i'd like to see a completely transparent development model. I love the fact that ASP.NET MVC is released under an Open Source License, but it would be better if everyone could track the development. By that i mean access to the internal bug/issue tracker and source code repository. I want to know what problems are known, and how they are being dealt with. I want to see the progress in the code. I want the ability to either run off the trunk, or to merge specific pieces of the trunk into a released version if that enables me to avoid problems that i might run into with an officially released version. Getting the source code to a released version is (IMO) only a small part of what Open Source development is all about.

I also want them to be open to outside contributions. I truly hate running into annoying little bugs that won't be fixed until the next major version (which sometimes takes a year or 2 to come out) but which could be avoided with a simple patch. Allow people to contribute patches and you'd be amazed at the amount of goodwill and support you'll create among a vocal group of users. Obviously, this only works when the development model in general is as open as i mentioned earlier. Nobody is going to submit patches based on a released version if the actual trunk might already contain fixes or has been modified heavily after the latest release. And being able to patch a released version for your own use isn't always a viable option to most people since you really don't want to deal with having to merge the patch when newer versions are released (and all the regression testing that comes with it).

Microsoft should also play nice with other Open Source projects. Sure, they included jQuery with ASP.NET MVC, which is a good move. But there are a lot of examples where they developed some library or framework for which a generally accepted Open Source alternative was already present. I still cringe whenever i'm forced to use MSTest (which to this day is still a pretty crappy testing framework IMHO) while they could've just as easily supported NUnit's development. And if you're not happy with NUnit, consider the state of MSTest before you complain about my choice of NUnit over say... MbUnit or any of the other testing frameworks. Does anyone remember NDoc? Was Sandcastle really necessary or would it have made more sense to just invest some effort into NDoc instead of starting from scratch? My biggest issue with this is that Microsoft's alternatives can often be used without resistance in certain workplaces, just because they're from Microsoft. It doesn't even matter if there are actually better Open Source alternatives available... In Microsoft we trust, thus Microsoft you'll use. Not always the smartest choice, as i'm sure many of you will agree.

Would it really be so bad for Microsoft if they would just assign some developers to help out with projects that already exist? We now generally have to wait until the third version of any Microsoft product before we really get great quality, while in some cases, a high-quality Open Source alternative was already available from the start. This often leads to about 3 to 4 years of crappy releases that many people will adopt solely because it's from Microsoft and because "it'll improve in the next versions". Even if you don't want to assign developers to Open Source projects, it might already be very useful to provide helpful things such as servers, bandwidth, anything that could help out. Hell, references in official Microsoft documentation on MSDN would already be a huge step up.

There also needs to be some kind of PR cleanup campaign for all of the FUD that Microsoft has spread about Open Source software in general in the past few years. While it can be argued that that FUD was mainly targeted at GPL software, to many managers in big companies the message came across as "Do not use Open Source because it's a legal minefield". That is a misconception that many .NET developers are still fighting on a daily basis which isn't good for anyone.

Then there is the issue of support. I'm not going to repeat myself so i'll just link to a previous post of mine which covers that topic: Support Of Commercial Software vs Open Source Software.

And as for the host of legal aspects that Microsoft is probably afraid of, i would suggest looking into the reasons why other (large) companies are able to be cooperative citizens within the Open Source world without legal troubles. Hell, commercial software can be a legal minefield as well so is the Open Source world truly worse? Maybe they should ask companies like IBM... surely they have plenty of experience in both cases.

Do i expect or even want Microsoft to open source everything in the .NET world? No. But i would definitely appreciate it if Microsoft would play nice with other Open Source projects, be a more active participant and be open to outside participation as well. I'm sure i won't be the only one who thinks so.

Comment Spam

Written on May 24th, 2009
Categories: About The Blog

I used to go over the comments that were marked as spam on a regular basis to check for false positives. This was pretty doable in the past since i got between 30 and 50 spam comments on a daily basis and you could usually scroll through them pretty quickly to identify real comments. Lately though, i've been getting several hundred spam comments per day and i really don't have the time to go through all of that. Unlike some other bloggers, i refuse to disable commenting in general due to spam so some legit comments might accidentally be flagged as spam and won't show up. So, if you leave a comment and it's not showing up immediately, just send me an email (preferably from the same address you filled in for the comment) and i'll try to rescue it from the spam filter. I do delete spammed comments pretty regularly so don't wait around if your comment isn't displayed immediately :)

Continuous Integration On A Real, Big Project

Written on May 24th, 2009
Categories: Continuous Integration

Some of you may remember a post of mine where i showed the complete lack of attention that was being paid to the Continuous Integration of one of our projects. This particular project is pretty big. The development of this project has gone on for years, and will keep going for years to come as it is a strategically important project for us. The actual system is used internally by our company as well as a growing list of customers.

There are essentially 2 big problems with this project. One is a mountain of legacy code with so much technical debt (due to the evils of code generation) it sorta resembles the current economic recession (as in: it'll take a long time to get everything sorted out). The other problem is more a matter of organization. We're a pretty small company and while we have great ideas for products, we simply can't assign a steady, stable team of developers to work on any of these projects on a continuous basis since we all often need to work on stuff that is simply more lucrative at that particular point in time. The result is that this particular project typically has an ever rotating group of developers working on it, usually for short periods of time. Some people work almost full-time on it, but that list is pretty limited.

Not exactly the ideal situation for a large project to apply CI and other agile development practices to, right? Luckily for us, we're quite stubborn and we try to make the best out of every situation. So back when this project's CI success rate was only showing a lack of success, we all agreed to follow the CI rules and at this point, i'm pretty proud to be able to show you guys the following picture:

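[Image: TeamCity build statistics showing the success rate, time to fix failing tests, number of tests and build duration]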

Note: we moved to TeamCity in July 2008 so i can't show you any earlier data than that.

Anyways, as you can see, after the dismal months of February and March, the success rate of the CI build gradually started improving again. The average time to fix failing tests also decreased sharply. We still have failing tests from time to time, but at least now they're all dealt with in a timely fashion. We still have build failures, though those have decreased a lot as well and are always dealt with pretty soon now. I'm not sure if this is because of my 'Continuous Bitching' whenever the build fails, or because everyone bought into the concept of CI again (for my own sake, i'll just assume that it's the latter instead of the former) but i'm pretty happy with the results that we're getting.

Also, take a look at the number of tests (we're at 16000+ now) and the duration of the build. The build time is about 48 minutes on average now, which is obviously way above the recommended 10 minutes. I'd love to see this go down to about 20 minutes (which i'd find very acceptable considering the size of the project) but that's gonna take a long while. Of those 16000 tests, there are about 13000 tests that cover the legacy code and they all use the database. And since those 13000 tests use a generated data layer, we can't just let it run on an in-memory database nor can we mock the database in those tests because all of that generated code, and pretty much everything that was written on top of it, is coupled more tightly than Siamese Twins. We also lose a couple of minutes of build time due to our Genesis processing but that is simply something that we can't go without anymore so we don't really mind the extra build time of that part.

So there you have it... the reason i wanted to post this is because when the topic of CI comes up, you always read about 'instant feedback' and really quick builds and things like that. It's simply not always like that in the real world. But with a bit of effort and focus, you can get many of the benefits that are usually attributed to CI, even on huge projects with lots of legacy code.

Keep An Eye On Those Indexes

Written on May 21st, 2009
Categories: Performance

We have a multi-tenant application, where each tenant has its own database. We recently were informed about a particular performance problem that one tenant (which we'll refer to as Tenant A) was experiencing in every screen where data of a certain type needed to be shown. None of the other tenants experienced this problem though.

We tracked down the query that was causing the bad performance and ran it on the database of Tenant B. Tenant B actually had a lot more data in the main table that was used in the query and the query executed immediately whereas it took about 25 seconds to complete for Tenant A. So the query runs fast on another database that actually has more data... at this point i was convinced that it had to be related to indexes.

Turns out that someone recently ran an import process to import a bunch of data into Tenant A's database. I know very little about databases, but one thing i've seen time and time again (with both Oracle and SQL Server) is that you really need to make sure that your indexes are in good shape after any process that performs a lot of inserts (or removals). A couple of years ago, i had a very intensive nightly import process for a particular project that used an Oracle database. As time went on, the application's queries became painfully (unacceptably even) slow. I managed to restore the performance of those queries by simply instructing Oracle to recalculate all of the statistics of the indexes of tables that were affected heavily during the nightly import.

With that in mind, we simply rebuilt the indexes for Tenant A's database, and the same query that took 25 seconds completed almost instantly from then on. Now, we did have a weekly job running on that database server to keep the indexes in a healthy shape but that job didn't really do a good umm... job of it, apparently.

Lessons learned: make sure that you:

  • Have a proper maintenance job set up which keeps your indexes healthy and schedule it to run regularly
  • Run that job manually if you need to perform a manual import process
  • Execute that job in an automated fashion whenever an intensive automated import process has completed

Oh, and consult with your DBAs or at least people who know what they're doing when it comes to your particular database on how to keep those indexes healthy. In this case, we rebuilt them. In other cases it's sufficient to recalculate the statistics... i'm not sure which way is the best but you should at least keep an eye on this possible problem :)
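For SQL Server, the two options mentioned above could be scripted roughly like this. It is only a sketch: the helper class and the table name in the usage line are made up for illustration, and whether you want a full rebuild or just a statistics update is exactly the kind of thing to check with your DBA.

```csharp
using System.Collections.Generic;

// Hypothetical helper (not from the project described above): builds the T-SQL
// maintenance statements to run after an import has touched a set of tables.
public static class IndexMaintenanceSketch
{
    public static List<string> BuildStatements(IEnumerable<string> tables, bool rebuild)
    {
        var statements = new List<string>();
        foreach (var table in tables)
        {
            // REBUILD recreates every index on the table;
            // UPDATE STATISTICS only refreshes the optimizer's statistics.
            // Table names are concatenated as-is, so they must come from
            // trusted configuration, never from user input.
            statements.Add(rebuild
                ? "ALTER INDEX ALL ON " + table + " REBUILD;"
                : "UPDATE STATISTICS " + table + ";");
        }
        return statements;
    }
}
```

Calling `IndexMaintenanceSketch.BuildStatements(new[] { "dbo.Orders" }, true)` gives you the statements to execute through a regular ADO.NET command once the import has finished. On Oracle, the statistics step would instead be a call to DBMS_STATS.GATHER_TABLE_STATS.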

Using The Guid.Comb Identifier Strategy

Written on May 21st, 2009
Categories: NHibernate, Performance

As you may have read by now, it's a good idea to avoid identity-style identifier strategies with ORMs. One of the better alternatives that i kinda like is the guid.comb strategy. Using regular guids as a primary key value leads to fragmented indexes (due to the randomness of the guid's value) which leads to bad performance. This is a problem that the guid.comb strategy can solve quite easily for you.

If you want to learn how the guid.comb strategy really works, be sure to check out Jimmy Nilsson's article on it. Basically, this strategy generates sequential guids which solves the fragmented index issue. You can generate these sequential guids in your database, but the downside of that is that your ORM would still need to insert each record separately and fetch the generated primary key value each time. NHibernate includes the guid.comb strategy which will generate the sequential guids before actually inserting the records in your database.

This obviously has some great benefits:

  • you don't have to hit the database immediately whenever a record needs to be inserted
  • you don't need to retrieve a generated primary key value when a record was inserted
  • you can batch your insert statements

Let's see how we can use this with NHibernate. First of all, you need to map the identifier of your entity like this:

    <id name="Id" column="Id" type="guid" >
      <generator class="guid.comb" />
    </id>

And that's actually all you have to do. You don't have to assign the primary key values or anything like that. You don't need to worry about them at all.

Take a look at the following test:

        [Test]
        public void InsertsAreOnlyExecutedAtTransactionCommit()
        {
            var insertCountBefore = sessionFactory.Statistics.EntityInsertCount;
 
            using (var session = sessionFactory.OpenSession())
            using (var transaction = session.BeginTransaction())
            {
                for (int i = 0; i < 50; i++)
                {
                    var category = new ProductCategory(string.Format("category {0}", i + 1));
                    // at this point, the entity doesn't have an ID value yet
                    Assert.AreEqual(Guid.Empty, category.Id);
                    session.Save(category);
                    // now the entity has an ID value, but we still haven't hit the database yet
                    Assert.AreNotEqual(Guid.Empty, category.Id);
                }
 
                // just verifying that we haven't hit the database yet to insert the new categories
                Assert.AreEqual(insertCountBefore, sessionFactory.Statistics.EntityInsertCount);
                transaction.Commit();
                // only now have the records been inserted
                Assert.AreEqual(insertCountBefore + 50, sessionFactory.Statistics.EntityInsertCount);
            }
        }

Interesting, no? The entities have an ID value after they have been 'saved' by NHibernate, but they haven't actually been saved to the database yet. NHibernate always tries to wait as long as possible to hit the database, and in this case it only needs to hit the database when the transaction is committed. If you've enabled batching of DML statements, you could severely reduce the number of times you need to hit the database in this scenario.
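Batching is switched on through NHibernate's adonet.batch_size setting. A minimal sketch, assuming you build your session factory from a Configuration object in code; the helper class and the batch size of 50 are made up for this example, and batching only kicks in for drivers that support it:

```csharp
using NHibernate.Cfg;

// Hypothetical helper, only here for illustration.
public static class BatchingSketch
{
    public static Configuration EnableBatching(Configuration configuration, int batchSize)
    {
        // "adonet.batch_size" controls how many DML statements NHibernate
        // may send to the database in a single round trip.
        configuration.SetProperty("adonet.batch_size", batchSize.ToString());
        return configuration;
    }
}
```

With a batch size of 50, the 50 inserts from the test above could go to the database in a single round trip when the transaction commits, instead of one round trip per insert.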

And in case you're wondering, the generated guids look like this:

81cdb935-d371-4285-9dcb-9bdb0122f25f
a44baf99-58e9-4ad7-9a59-9bdb0122f25f
a88300c2-6d64-4ae3-a55b-9bdb0122f25f
032c7884-da2f-4568-b505-9bdb0122f25f
....
70d7713c-b38d-4341-953d-9bdb0122f25f

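Notice the last part of the guids... it is derived from the time at which the guid was generated (which is why it is identical for all of the guids in this quick test), so newer guids sort after older ones. This is what prevents the index fragmentation.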
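To make that a bit more concrete, here is a rough sketch of how such a guid can be put together. It follows the general comb idea (overwrite the last 6 bytes of a random guid with a value derived from the current date and time) and should not be read as NHibernate's actual source. The class and method names are made up, and the 1900 base date and the division by 3.333333 (to mimic the resolution of SQL Server's datetime type) are assumptions of this sketch.

```csharp
using System;

// Hypothetical sketch of a comb-style guid generator.
public static class CombGuidSketch
{
    // Assumed base date for the day counter.
    private static readonly DateTime BaseDate = new DateTime(1900, 1, 1);

    public static Guid NewComb(DateTime now)
    {
        // Start from a regular random guid.
        byte[] bytes = Guid.NewGuid().ToByteArray();

        // Whole days since the base date, and the time of day in slices of roughly 3.33 ms.
        int days = (now - BaseDate).Days;
        long timeSlices = (long)(now.TimeOfDay.TotalMilliseconds / 3.333333);

        // Bytes 10-11: the low 16 bits of the day count, most significant byte first.
        bytes[10] = (byte)((days >> 8) & 0xFF);
        bytes[11] = (byte)(days & 0xFF);

        // Bytes 12-15: the low 32 bits of the time-of-day value, most significant byte first.
        bytes[12] = (byte)((timeSlices >> 24) & 0xFF);
        bytes[13] = (byte)((timeSlices >> 16) & 0xFF);
        bytes[14] = (byte)((timeSlices >> 8) & 0xFF);
        bytes[15] = (byte)(timeSlices & 0xFF);

        return new Guid(bytes);
    }
}
```

Calling `CombGuidSketch.NewComb(DateTime.Now)` a few times in a row gives guids with a random first part and a shared (or slowly increasing) last part, just like the list above. SQL Server compares the last 6 bytes of a uniqueidentifier first when sorting, which is why putting the time-based part there keeps new rows at the end of the index.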

Obviously, this particular test is not a realistic scenario but i'm sure you understand how much of an improvement this identifier strategy could provide throughout an entire application. The only downside (IMO) is that guids aren't really human readable so if that is important to you, you should probably look into other identifier strategies. The HiLo strategy would be particularly interesting in that case, but we'll cover that in a later post ;)