Totally awesome software for iPhone and OS X

Friday, November 21, 2008

Is not using an ORM premature optimisation?

Corey Trager writes in this blog post about how he's "stockpiling ammunition" (collecting articles) to use in a workplace argument (discussion) against using an ORM solution in a project.

Now, apart from what he describes in his post, I don't know much about his project and in his case he may well be right to argue that using an ORM, for them, is a complete waste of time and effort. I don't know.

But in the general case (i.e. almost everywhere) I personally think that choosing not to use an ORM is a premature optimisation. Simply giving hand-waving arguments about performance without metrics, and claiming that using an ORM is just following the herd, is not going to sway my point of view on this, and here's why...

1. Performance

Ok, so you could argue that, even without supplying metrics, not using an ORM can ultimately be more performant for some scenarios (which may or may not represent real life usage patterns). But that's like arguing that writing your app in assembly language is going to run faster than the same thing done in Java, Ruby, Python etc. It may be technically true, but you're going to burn up a lot of time and build in a lot of complexity right from the start. So in terms of developer performance and code maintainability you're going to lose out.

'Whaa?', I hear you say, 'who cares if it's easy for the developers to work with, I want blazing fast systems not lazy, bigass developers futzing around with the 'tool du jour' just because it makes their lives easier'.

Yes, but also remember that you don't necessarily know at this point what the performance characteristics of your application need to be. Let's say you're building a web based API as Corey is. Do you know upfront what the average load will be, how it will be used and so on? Sure, you can guess or even do a little research and get some approximate figures and patterns. Hell, you could even run some benchmarks that test the relative performance of direct connections to a database against an ORM, but they won't tell you anything meaningful at this point. You may even discover that the direct connections are marginally quicker, but it's not a real life usage scenario and half of the (free) benefits of using an ORM probably won't even be brought into play by your test scenarios.

One thing I've discovered in my ten or so years of building big ass apps is that the bottleneck will at some point be the database. Yep, the database itself. Whether it be locked up doing some big query or update (hey, ORMs make locking problems easier to solve too), sending data down the wire to your app servers, or out of memory, disk space, available connections or whatever. It's very rarely the app that you'll find yourself waiting on. Of course, there's a case that some of these problems may be the fault of your app, but this is going to be true regardless of whether you're talking to the database via some kind of ORM library or a bunch of connections and queries you're managing yourself.

Anyway, we'll come back to performance, so let's move on:

2. Software Developers != DBAs

In my experience developers don't think or act like DBAs. They don't often write optimal SQL and they tend to think in terms of code when trying to solve problems in SQL. The opposite is also true: DBAs often write funny code. I once worked with one who was given the task of parsing text files and, of course, set about solving it using an RDBMS. The parsing worked, but took hours, and it was quickly replaced with a few lines of C or Perl (I forget which) that did the same job in seconds.

So, abstracting developers from the database can be a good idea. ORMs generally don't make too bad a job of optimising queries and laying out schemas in most cases, and they at least follow consistent principles in how they go about such things. The same can't usually be said of people. And besides, ORM frameworks keep getting better as they evolve, so you get the optimisation benefits (collected over years of other people's real life scenarios) for FREE every time you update the library. You can't easily build that in yourself and you almost certainly can't build it for free.

By all means, design the database properly, or better still get a DBA in to do it for you. There's no rule anywhere that says you have to model your entity objects in some hyper generic, polymorphic uber-pattern that creates thousands of tables with many-to-many relationships between them and slows to a crawl trying to inflate an object.

You can be pragmatic here. If you have an existing schema or a nice DBA designed one, why not just map your entities directly to that? Think a bit about how your entity model is going to look when mapped out as a database schema, change it if necessary. Most ORMs offer some kind of migration feature that will create and update your database schema for you.

If you know nothing at all about databases, then letting the ORM create one for you based on your object model isn't going to be any worse (and will probably be better) than what you'd manage without it.

'Wait a minute!' you say, 'That's not much of an abstraction.' Well, no it's not and yes it is. With an ORM the real abstractions are the mechanisms for acquiring database connections, querying relationships, fetching and updating data, managing transactions, caching things and more. The responsibility of knowing that your design is stupid and won't work when translated into a database schema ultimately rests with you.

Again, this is also true if you decide to go down the 'roll your own' route, you just don't get any help along the way.

3. Caching

Big apps are all about caching. Caching everywhere. Facebook have something like 25 terabytes of in memory caches and you're probably going to want some too at some point. Most, if not all, ORM implementations have caching built in. They cache queries, entities, relationships and all the things that you're going to need to remember to find and optimise and cache yourself if you don't use one. They deal with stale objects in the cache for you and take the load of developing, testing and debugging all this crap off your shoulders. Of course, you can tweak them for performance later (when you know what needs doing) and add more caching to other parts of your app as you need to, but they're all there ready to go.

Caching is quite a hard problem, but if you decide on rolling your own anyway you'll find caching libraries everywhere and they're not that difficult to work with, but applying them consistently and appropriately is.

This is done for you when using a decent ORM.
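To give a feel for what 'rolling your own' involves, here is a minimal sketch of a read-through entity cache in plain Java. The names (EntityCache, loader, evict) are made up for illustration and aren't taken from any ORM. A real second level cache also has to worry about expiry, memory limits and invalidation across several app servers, none of which is shown here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical read-through cache: look in memory first, fall back to the loader.
final class EntityCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stands in for "go and query the database"

    EntityCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Returns the cached value, loading and remembering it on a miss.
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    // Has to be called on every code path that updates or deletes the row,
    // otherwise readers keep seeing a stale object.
    void evict(K key) {
        entries.remove(key);
    }
}
```

The class itself is the easy part. The hard part is remembering to call evict() in every single place that writes to the database.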

4. When you're not using an ORM aren't you really just using an ORM?

If you're not using an ORM to manage the 'interface' between your app and your database, what are you doing? You're probably getting a connection, querying data (being careful to only query for the data you need and doubly optimising the query), doing some stuff with caches, putting the data in a sort of code based representation, cleaning up any used resources and sending it on to the user. The representation, probably a DTO of some sort, is doubtless an object.

So you're basically just writing your own specialised ORM then?

Except you've written (and need to maintain) more of your own code to do it. It used to be the case that ORMs required a tightly coupled set of confusing mapping files in XML or similar and involved a lot of work to do apparently simple schema updates. But this hasn't been the case for some time. The two ORMs I use most frequently, JPA/EJB3 (Hibernate) and ActiveObjects, are mapped using annotations within the code. But instead, if you don't use an ORM, you've now got a codebase that's tightly coupled to your database. Of course, there may be some benefits to this, but I doubt they'll have much real value. If you've made your database abstraction layer more generic than this, then, well, why didn't you just use an off the shelf solution?

'Yeah, but what if you're doing data intensive stuff, that's where ORMs really suck. Like getting tens of thousands of rows and performing calculations on them?' You ask.

Well, any ORM worth its salt will let you use...

5. Native queries

Sometimes the whole coding thing just breaks down and SQL is more appropriate, as it usually is for large set based operations. Imagine a scenario where you have to do something to huge datasets and need to update them all at once. You could (but probably don't want to) use an ORM to do something like:


//get every MyEntity in the database
//increment something by 5
for (MyEntity entity : entityManager.find(MyEntity.class)) {
    entity.setSomething(entity.getSomething() + 5);
    entity.save();
}


This will be slower than the equivalent UPDATE statement on any given RDBMS, if for no other reason than the database doesn't have to transfer the data between itself and your app. This should be pretty obvious.
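For comparison, the set based version might look like the sketch below, using plain JDBC. The table name my_entity and the column something are guesses at how MyEntity would be mapped, not anything from a real schema, and BulkIncrement is just a name for the example. Most ORMs expose an escape hatch of their own for this (JPA, for example, has native queries and bulk update statements).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class BulkIncrement {
    // Table and column names are assumptions made for this sketch.
    static final String SQL = "UPDATE my_entity SET something = something + ?";

    // Runs one set based UPDATE and returns the number of rows changed.
    // No rows are ever transferred to the application.
    static int incrementAll(Connection connection, int amount) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(SQL)) {
            statement.setInt(1, amount);
            return statement.executeUpdate();
        }
    }
}
```

One statement, one round trip, and the database does the work where the data already lives.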

The arguments for not approaching this specific type of problem this way are the same irrespective of whether or not you're using an ORM. And no ORM worth using is going to tie you in to iterating a massive result set just because it's somehow more 'OO' to do so.

Again, ORMs allow you to bypass their funkiness and call directly through to the underlying database. Breaking the abstraction? Not really, just being pragmatic. Maybe I'm wrong, but as mentioned earlier I don't think the purpose of an ORM is to let you create 'whatever crazy OO model you dream of', magically and transparently map it to a database with little thought or effort, and have everything perform wonderfully. The abstraction is the abstraction of not having to deal with the whole kit and caboodle of connection pools, caches, query optimisations etc. except when you have to.

It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience. - Albert Einstein


6. What else?

Other complaints that I've seen levelled at ORMs include the following. I'll deal with them quickly as this is becoming a long(er than intended) post...

Concurrent updates : From the 'ammunition' collected in the post that sparked this article, I found this quote, from Bruce Eckel I believe, "Most O/R mappings try to give the illusion that there is just that one Customer object with custID 100, and it literally is that customer. If you get the customer and set a field on it, then you have now changed that customer. That contrasts with: you have changed this copy of the customer, but not that copy. And if two people update the customer on two copies of the object, whoever updates first, or maybe last, wins."

This is simply not true. Hibernate et al. support optimistic concurrency control as much as the next man. How else were you going to solve this? Make your domain objects singletons, or lock the row when you 'check out' some data to read it just in case you decide to modify it?
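For anyone who hasn't met it, optimistic concurrency control usually boils down to a version number on each row that is checked at update time. The sketch below mimics that check with an in-memory map so it can be run as is. The class names (VersionedCustomer, CustomerTable) are hypothetical. JPA implementations such as Hibernate do the equivalent against a real version column (marked with the @Version annotation) and raise an error for the stale writer instead of returning false.

```java
import java.util.HashMap;
import java.util.Map;

// One stored row: the data plus a version number that goes up on every update.
final class VersionedCustomer {
    final String name;
    final long version;

    VersionedCustomer(String name, long version) {
        this.name = name;
        this.version = version;
    }
}

// Hypothetical stand-in for the customer table.
final class CustomerTable {
    private final Map<Integer, VersionedCustomer> rows = new HashMap<>();

    synchronized void insert(int custId, String name) {
        rows.put(custId, new VersionedCustomer(name, 0));
    }

    synchronized VersionedCustomer read(int custId) {
        return rows.get(custId);
    }

    // Mimics "UPDATE ... SET name = ?, version = version + 1
    //         WHERE id = ? AND version = ?".
    // Returns false when the row has changed since the caller read it.
    synchronized boolean update(int custId, long versionRead, String newName) {
        VersionedCustomer current = rows.get(custId);
        if (current == null || current.version != versionRead) {
            return false; // the caller is holding a stale copy
        }
        rows.put(custId, new VersionedCustomer(newName, versionRead + 1));
        return true;
    }
}
```

If two people read customer 100 and both try to save, the first update succeeds and bumps the version. The second is rejected rather than silently overwriting the first, so it's neither 'first wins' nor 'last wins' by accident.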

The SELECT * 'problem' : It's a common belief that looking up objects in an ORM involves eagerly loading everything, even stuff you don't want. Again, not true. Completely configurable. Make sensible choices. If you roll your own you're going to need to do this anyway.
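As a rough illustration of what lazy loading means, here's a sketch in plain Java. Lazy and Customer are hypothetical classes written for this example, and the Supplier stands in for the query the ORM would run. In JPA the same decision is a fetch setting on the relationship (FetchType.LAZY or FetchType.EAGER) rather than code you write by hand.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical lazy holder: runs the loader on first access only.
final class Lazy<T> {
    private Supplier<T> loader;
    private T value;
    private boolean loaded;

    Lazy(Supplier<T> loader) {
        this.loader = loader;
    }

    synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
            loader = null; // no longer needed once the value is in memory
        }
        return value;
    }
}

final class Customer {
    final String name;               // cheap column, loaded up front
    final Lazy<List<String>> orders; // expensive association, fetched on demand

    Customer(String name, Supplier<List<String>> orderLoader) {
        this.name = name;
        this.orders = new Lazy<>(orderLoader);
    }
}
```

Reading customer.name never touches the orders. The order query only runs if and when something calls customer.orders.get(), and then only once.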

7. So, I built an ORM based solution and it totally sucks...

Well, ok, if you're absolutely sure you did your best and the whole thing is falling apart and it completely sucks and it's all the fault *exclusively* of the ORM and you've proven this mathematically and no amount of tuning and tweaking makes it work.

Then rip it all out.

You should have a clean enough code base, after all you'll be left with POJOs and stuff which you can reuse. I'd hope that your code will be much smaller and easier to refactor as a result of having used an ORM than it would be otherwise. Did I mention that ORMs generally make unit testing and the like much easier as well? Improving code quality overall.

To summarise then:

1. ORMs are not magic, they don't understand intent. They just help abstract away some of the pain of dealing with databases and all the associated cruft that comes with that. You don't have to make random design decisions. If you're stuck with an existing database, consider accommodating it a bit in your OO design rather than performing a bunch of crazy mappings. Better still, migrate it. Do whatever you'd do if you weren't using an ORM; they generally don't enforce a strict set of rules on how you should use them.

2. Work done by other people, built upon work done by better people and tested in thousands of real life scenarios is going to be better and more bug-free than something you knock up yourself. Every time.

3. Don't immediately wave your hands in the air and claim 'performance sucks'. I've worked on sites using JPA/EJB3 that deal daily with many hundreds of hits per second and don't suffer from performance issues. I've also used ORM in environments where very high processing throughput and heavy calculations were required, again without issue. It may well have been the case that EJB2's container managed persistence was a bit 'merdique' (shitty) but god, that was years ago and things have come a long way since those days...

4. If you don't use an ORM and your app isn't running entirely in the database (Oracle Forms?) then you're going to be building some form of database abstraction layer anyway. Whatever you call it or think of it as, it's just a custom, one off ORM. That's what it is. I'm sure that this approach could, in the end, be slightly more tuned to your specific needs. But I'd argue that it's a lot of extra work for not much extra gain.

Richy

6 comments:

victori said...

Read your blog partially, the font is too small. The whole anti-ORM argument is stupid. Use an ORM and if you need performance in specific areas you drop down to SQL. Hibernate and ActiveRecord both give you the option of doing so. No point in wasting your expensive developer time writing redundant, potentially bug laden code. Furthermore, Hibernate has second level caching, which will most likely produce faster results than pulling data via a JDBC DAO every time.

Richy said...

Hi, thanks for the comment. I realised just after I posted that the current format of the blog is pretty much unreadable for anything but single paragraph posts. I'll be changing this shortly.

Needless to say, I agree with everything you say :)

Richy

Anonymous said...

While you may be right for typical use patterns, it is also possible to conclude fairly early that ORMs may be problematic. My own work involves the following pattern:
1. Load problem data and existing solution (if any). This involves thousands or tens of thousands of objects.
2. Compute new solution. This is heavy mathematics done multithreaded across all the cores I can get. This step takes from minutes up to over an hour.
3. Write back the solution -- more thousands of objects created/updated.

Immediately you can see that the usual response to the large number of objects --- use native sql --- doesn't work as the computation can't sanely be done in SQL. The use of multiple threads adds extra warning bells (JPA doesn't allow this, though some implementations do).

So without doing anything I see we appear to be outside the comfort zone of ORM application and possibly skating close to the edge.

Richy said...

Mark,

Absolutely, I'm not saying that ORMs are useful in every case, but rather that dismissing them out of hand is not always the best approach.

In your case the bottleneck would appear to be the calculations rather than database access, but without specifics I don't know. I've worked on large derivatives processing platforms that have run calculations distributed across a grid of a thousand blades and still took hours (>12 in some cases) to run. Reading and writing the inputs and resultsets (which were huge) was a trivial problem in this case, in the scheme of things it wouldn't have mattered if we'd ftp'd them as csv files from mars :) In reality, some inputs were read from the filesystem (or networked file systems), others from interim caches, some from web services available around the place and others from a db (via an ORM). Although, as I say, gaining a few seconds here and there getting the data together wouldn't have affected much in this particular case, your mileage may vary :)

Oh, and by suggesting using native queries, I wasn't suggesting that you necessarily do everything in SQL, but sometimes it's more optimal to use native syntax to do some join operations and fetch back trivial subsets of data that you don't always want to create an object out of. If all you're interested in is a row count, then asking Hibernate et al. to get you back a Set of entities and then looking at the size of the Set probably isn't the best approach...

Similarly bulk updates/deletes etc. can be quicker in native sql.

Richy

Mark Nuttall said...

Mark,
You probably don't even want to use an RDBMS let alone an ORM (although I think you might still want to). You also might want to be using some sort of grid like IBM eXtreme or a tool like Coherence.

Anonymous said...

The primary reason for using an RDBMS is that customers are more comfortable with it. They know how to back it up and are satisfied they can extract their data from it, even if in practice they never will.

© 2007 Wired Up And Fired Up