Why On Earth Would A Developer Do This?
Posted by Davy Brion on November 22nd, 2008
I just saw the following piece of code:
public void Execute()
{
ArrayList empIds = PayrollDatabase.GetAllEmployeeIds();
foreach (int empId in empIds)
{
Employee employee = PayrollDatabase.GetEmployee(empId);
if (employee.IsPayDate(payDate))
{
DateTime startDate = employee.GetPayPeriodStartDate(payDate);
Paycheck pc = new Paycheck(startDate, payDate);
paychecks[empId] = pc;
employee.Payday(pc);
}
}
}
The code is old, pre .NET 2.0. But no, the usage of the ArrayList isn’t what bothers me here. It first gets all of the Id values for all of the employees in the database. Then it loops through the list of Id values, and fetches each employee from the database. Whenever i see this kind of code, it makes me want to kick the person who wrote it.
In case you don’t know what is so bad about this, think about this: if you have 5 employees you first send one select statement to the database to get their id values. Then you send another 5 select statements to the database to get the data of each employee. Oh well, 6 queries isn’t that big of a deal right? It sure as hell is if you could’ve just as easily gotten all of the required data with one query (and only one expensive roundtrip)! Imagine you have to calculate the paychecks of 100 employees instead of 5. How about 1000 employees?
What i really don’t understand about this is that this kind of code gets written every day. Do these people simply don’t care or do they genuinely not know how bad this is? If they don’t know, that’s truly sad. If they don’t care, that’s even worse because a developer who knows how bad this is but still writes it, obviously takes no pride at all in his/her work and (IMO) has little respect for the code, the team, the company and the client.
In case you’re wondering where i saw this code… it’s in Robert C. Martin’s Agile Principles, Patterns, and Practices in C# book. Yes, the Robert C. Martin, aka Uncle Bob. I probably shouldn’t criticize such a well known figure in the OO world, but seriously Bob, what were you smoking? The purpose of the whole book is to teach people how to write good code and there is tremendous amount of valuable information in there. But putting code like this in your examples is really inexcusable.
November 22nd, 2008 at 8:35 pm
I’m curious about how you wpould implement this same story if you knew it had to scale to 100000 employees a
November 22nd, 2008 at 8:37 pm
I’m curious about how you would implement this story if you knew that it had to scale to 100K employees and each employee entity was, well, sort of large.
Greg
November 23rd, 2008 at 1:28 am
if i knew in advance it had to scale to that many employees? i’d pull in the customers, about 500 at a time and process them. Then i’d profile that, change the batch size a couple of times to see what the best size would turn out to be.
November 23rd, 2008 at 5:05 am
[...] Why on Earth Would a Developer Do This? (Davy Brion) [...]
November 23rd, 2008 at 8:41 am
seems to me you don’t have to get all the employees either. I’m quite sure that the condition “employee.IsPayDate(payDate)” can somehow be defined in a where clause in the SQL statement
November 23rd, 2008 at 11:10 am
@Peter
depends on the complexity of the IsPayDate method… if it’s simple enough to put in the where clause then i’d have no problems with it. if it would involve calling a stored function…. that’s a different story
November 24th, 2008 at 5:11 pm
What I have seen is that there are many of programmers who think in terms of “works” or “doesn’t work”. If it “works” then the code is done, so time to move on. They have no idea what refactoring is and never question whether or not their own code would make sense to someone else, let alone whether or not the code would work beyond a simple use case. I used to get this kind of stuff when I worked with an offshore team. These guys would look at your PayrollDatabase class, see there was no way to get all of the employees at once, and just use what was there, instead of just adding the method that was needed.
November 24th, 2008 at 9:52 pm
This is a fundamental problem of ORM. Too much abstraction hides the amount of processing going on here. In addition to the issue of getting each employee one-by-one by id - also I would say that calling a PayDay method on each one - creating a new Paycheck object - is also bad. Heck, the method is not even returning the modified employee objects.
This could have been wrapped up in 1 SQL statement with little scalability issues.
November 24th, 2008 at 9:58 pm
@Dan Howard
as i mentioned earlier, i really think that depends on the ‘complexity’ of the logic in the Paycheck class. I prefer to not have any kind of logic that is beyond ‘trivial’ in queries. But that kinda depends on whether you’re a ‘database’ guy or an ‘OO’ guy… but i really don’t agree that this a fundamental problem of ORM. There is nothing ORM-specific in the code above.
and instantiating a Paycheck object to perform the calculation is a cost that can truly be ignored. Even if you have to instantiate a few thousand of them, any decent language should be able to do that without any noticable overhead. It’s not even a fraction of all the database roundtrips involved in the code.
November 24th, 2008 at 10:40 pm
@Davy,
You have pointed out a common problem in many development environments, but you too have failed (IMO) to come up with a decent solution to the problem. Don’t ignore what Dan Howard mentions. This IS a common issue in many OR Mappers. The reason is that many ORMs are designed such that an entity represents only a single table in the database. As such, it becomes necessary to think of each table in terms of separate queries, which explains the unnecessary overhead in your above post.
“if i knew in advance it had to scale to that many employees? i’d pull in the customers, about 500 at a time and process them. Then i’d profile that, change the batch size a couple of times to see what the best size would turn out to be.”
This is not a realistic solution. In fact, I’d generally frown upon this more so than than the original solution because it’s far more complex. Instead, my solution would involve performing the query on the database –where it belongs– through use of a View. I would then create an entity off of my View, with related CRUD procs/methods.
November 24th, 2008 at 11:22 pm
there’s only one entity type in the above code… there’s no reason at all to use separate queries. the unnecessary overhead in this scenario is because the developer is too lazy to write a GetAllEmployees method which would fetch the required data in one go.
and if you would need to pull them in in batches because there are too many records, you could do that _very_ easily, especially with a decent ORM like NHibernate
I fail to see how using a View would make this situation any better, unless you misread the code and thought that the data was from two different tables or something. Even in that case, a decent ORM would’ve made it trivially easy to join the tables and get only the results you’d need.
November 25th, 2008 at 12:06 am
[...] Why On Earth Would A Developer Do This? [...]
November 25th, 2008 at 4:48 am
Indeed…I did read the code wrong. And I am officially a moron!
November 25th, 2008 at 3:34 pm
@Davy,
I am curious to see the actual code you would write for this.
November 25th, 2008 at 3:47 pm
for what… the paging?
November 25th, 2008 at 5:03 pm
Developers do this because they don’t know its a database. If you are correctly using abstractions and coding to an interface, then you don’t know what’s in the black box. So you do what any good programmer would do, write the simplest, cleanest, most readable code. Caching is your data layers problem. Since good programmers write clean readable code whenever we can, we can deduce that you must suck and are one of the pointy hair grumbles that have been screwing up IT since day one. Good luck with your next career.
November 25th, 2008 at 5:08 pm
I fully agree that caching is the responsibility of the datalayer… that doesn’t give developers a pass for writing really shitty code though
btw, how is the above code cleaner, simpler and more readable than simply looping through the return value of PayrollDatabase.GetAllEmployees?
November 25th, 2008 at 5:10 pm
True, you got me there.
November 25th, 2008 at 5:13 pm
so those petty insults kinda look stupid now huh? :p
November 25th, 2008 at 5:25 pm
Yup, see ya in the soup line!
November 25th, 2008 at 5:29 pm
“btw, how is the above code cleaner, simpler and more readable than simply looping through the return value of PayrollDatabase.GetAllEmployees?”
If we’re only interested in the employees whose pay date is today, why pull all the employees out of the DB? A simple where clause would return only the employees of interest, shifting computation to the most optimised place (the DB server) and generating a much smaller set to loop over.
November 25th, 2008 at 6:01 pm
like i said earlier, if it was a simple comparison then i’d put it in the where clause as well… if the check is more complex, then i’d rather keep that logic outside of the queries as well
November 26th, 2008 at 1:26 pm
Optimize later:
Using an ORM-like approach is a good thing, code is read much more times that it get’s writed, and if the performance of the program is Ok I don’t see a reason why “optimize” it. The question here is: Is it worth to optimize the code to make just one SQL query instead of six?
Premature optimization is the most common cause of project failures.
November 26th, 2008 at 1:30 pm
This really isn’t just a case of premature optimization… but going to the database in a loop is about the dumbest thing you can do.
and again, as far as the readability is concerned, i’ll just copy/paste something from one of the earlier comments:
“how is the above code cleaner, simpler and more readable than simply looping through the return value of PayrollDatabase.GetAllEmployees?”
November 26th, 2008 at 4:11 pm
Hmm, how would you handle this situation:
Suppose you’ve an entity that is mapped against multiple tables. In other words, the entity has a number of collections which contain related information.
For instance, you have a class ‘Product’, which has a collection of ‘Prices’, and a collection of ‘Names’ (for every language, a different name), etc…
You need to write the Product information to a file, and this file contains, for each product, the price that is currently valid, and the ProductName for a specific language.
).
(You cannot use a DTO to retrieve the Products
If you use the ‘paging’ capabilities of NHibernate, NHibernate will issue several SELECT statements, which retrieve a recordset of n rows (where n is the ‘page-size’ that you’ve specified).
However, this recordset of n rows maybe contains information for n / 2 entities:
Since the entity has a collection of Names and Prices, the recordset contains a row for every Product/ProductName/ProductPrice.
This means that, if a product has 2 names and 3 prices, the recordset will contain 2*3 records for one Product.
Then, the pages will contain fewer entities then you’ve specified.
How would you solve this ?
One could argue that the Names and Prices should be loaded lazily, but, when you do that, you’ll find yourself in the same situation as the piece of code you’ve posted:
multiple queries will be fired in the loop where you iterate over every Product that you’ve loaded …
November 26th, 2008 at 4:34 pm
why wouldn’t i use a DTO for this? it appears to be the most optimal solution… you could map the DTO to a view which retrieves the correct data and that would be all there is to it. If you want to use paging on that view, there’s no problem either…
November 26th, 2008 at 5:25 pm
Well, that’s just what I was trying to say: you cannot use a DTO, since determining the price that has to be exported, is a quite complex task which cannot be handled by SQL.
So, you’ll need to retrieve all the prices of a product, so that you can use the ‘DeterminePrice( params )’ member method of the Product class …
That’s the reason why you cannot use a DTO.
Therefore, what solution would you use in this particular case ?
I’m trying to create a ‘PagedResults’ class, which is given an ICriteria.
The PagedResults class would then use the SetFirstResults & SetMaxResults methods to retrieve the required entities in ‘chunks’. But, then I am faced with the problems I’ve described in my earlier comment.
So, I’ve decided to modify my ‘PagedResults’ class a bit, so that I first retrieve the next ‘n’ Id’s, and then I let NHIbernate retrieve the entities for those Id’s (in one go), but this is a bit problematic as well …
November 26th, 2008 at 8:17 pm
in that case i’d probably try to get it working without the paging at first. then i’d test it it with a very large set of data to make sure i really do need to use paging. How many products are we talking about?
If paging really is necessary, a good solution to this problem is something that’s probably gonna require a bit of experimentation… i can’t just think of something that would definitely be great for this situation… sorry
December 5th, 2008 at 9:12 am
Frederik Gheysels:
I don’t see why you can’t create a specialized entity, call it a DTO, that shares some of the implementation of Employee but this time mapped as Davy said to a View.
Using full blown Entities for batch processing is fine for small jobs but that just doesn’t scale. Using a “DTO” would allow to use the same infrastructure as with the Entities but would provide tremendous performance benefits.
Daniel
December 6th, 2008 at 4:15 pm
That code is stupid, with or without a database. What’s the point in getting all the IDs and then the object for each ID, when you could iterate over the whole collection using a foreach statement? This is what I would call a proper abstraction.
December 7th, 2008 at 7:33 am
Where exactly does the code say that it is connecting to a relational database? How do we know that the PayrollDatabase is not an in-memory structure? If it is connecting to a relational database, how do we know that the query is not done at the “PayrollDatabase.GetAllEmployeeIds()” call and then the results cached for the scope of the executing block?
Basically, you are assuming a particular implementation of the PayrollDatabase, and are criticizing that implementation. I have worked on many projects where the ORM is smart enough to optimize cases like this, or where the data is already in memory, and I only “pay” for the method call. (The only thing I would change is to remove the part where we get the IDs, and instead get an iterator over the objects. If you want to criticize an API that exposes arbitrary IDs, I am fine with that.)
December 9th, 2008 at 9:49 am
Having read your post on a great technical lead, I doubt I would ever wish to work under you, with your criticising and your to-the-heart laments. I respect you say in your technical lead blog you do try to aspire to those values you present. I think you just have a long way to go.
Peter.
December 9th, 2008 at 10:30 am
hmm… apples and oranges
criticizing code in a blog post and working with teammates are two entirely different things. for starters, when one of my teammates writes code that needs improvements, i explain them what’s wrong with it and how we can improve it. There’s no criticizing going on there.
December 12th, 2008 at 7:30 pm
I guess it adds weight to the maxim…”Refactor mercilessly”