ORM + RDB != ODM + MongoDB
MongoDB has been the target of a considerable amount of hate from the community of late. Here’s the backstory – in 2008 began a “mass exodus” of sorts towards (immature) NoSQL technologies. Bright-eyed architects began adopting databases like MongoDB without developing a complete understanding of their features and limitations. Over time, they realized the error of their ways. The full implications of losing the relational and ACID properties of the RDBMSs they had abandoned only became clear as their applications grew. Their dreams of 1000-machine Mongo clusters were replaced by the realities of rewriting GROUP BY in code for the twelfth time; these early adopters began ranting and raving against poor lil’ Mongo.
That said, this is not another MongoDB-is-a-PITA-post. Sure, we’ve run into our own share of issues with Mongo. But to us, the most interesting ones have arisen from our use of an ODM (Object-Document Mapping) system as an abstraction of Mongo. We decided to use Mongo because it seemed to be a good fit for early versions of our application – we used its geospatial features heavily, and had a handful of primary domain objects that were more or less disjoint. However, as the application evolved and our data started to look more and more relational, we decided to adopt Doctrine’s ODM (heavily inspired by their ORM which in turn is modeled after Hibernate) to act as a bridge between the document paradigm and the traditional relational paradigm. In a lot of cases, this abstraction works well but in others, it breaks down; Doctrine’s ODM is a leaky abstraction. As Joel Spolsky puts it in his Law of Leaky Abstractions:
“All non-trivial abstractions, to some degree, are leaky.”
An ODM is an incredibly difficult abstraction to implement. It attempts to create an ORM-like mapping – already conceptually and technically tricky due to the impedance mismatch between the OO and relational paradigms – for a datastore that is very different from a traditional RDB. In some cases, these differences can be worked around in code, at the cost of additional complexity and inefficiency. In other cases, the differences remain, but are rendered invisible to the ODM user; the biggest danger of this sort of abstraction is that it creates an illusion of similarity. Users of the ODM, like us, are prone to concluding, incorrectly, that since it looks like, swims like, and quacks like an ORM, it probably is one. Under the hood lies a very different beast. Here are some of the ways in which such thinking bit us:
1. Workaround for lack of nested positional operator caused unexpected results
Mongo’s positional operator does not support updating arrays nested within arrays (an issue open since March 2010). However, our domain model consists of many one-to-many-to-many style relationships that the ODM implements as embedded arrays at the DB level. For instance, we have a ‘proposal’ object which has a one-to-many relationship with a ‘property’, which in turn has a one-to-many relationship with a ‘message’, which in turn has a one-to-many with ‘attachment’; a proposal document contains an array of property documents, and so on. Since the positional operator can only perform updates one level deep (in this case, to properties only), updates to more deeply nested arrays must be performed by a workaround in code. Because of a bug in the ODM’s workaround for this limitation, we found that when trying to add a message with attachments to a property, the attachments would always get added to the first message in the array. Going forward, we’re working around this workaround by avoiding more than one level of nesting – normalizing our data across multiple documents in such cases. This comes at the cost of performance, and of an increased probability of data inconsistency (due to the lack of multi-document atomic updates, i.e., transactions). This inconsistency in how Mongo treats top-level documents vs. embedded documents is an issue not only with the Doctrine ODM, but also with the popular Mongoid ODM.
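To make the limitation concrete, here is a minimal Python sketch – not Doctrine’s actual code; the document shape, IDs, and helper name are hypothetical – of the in-code workaround that a one-level positional operator forces for doubly nested arrays:

```python
# Plain dicts stand in for Mongo documents. The structure mirrors the
# proposal -> properties -> messages -> attachments nesting described above.
proposal = {
    "_id": 1,
    "properties": [
        {"_id": "p1", "messages": [{"_id": "m1", "attachments": []},
                                   {"_id": "m2", "attachments": []}]},
    ],
}

# Mongo can address ONE nesting level with the positional operator, e.g.:
#   update({"properties._id": "p1"}, {"$push": {"properties.$.messages": msg}})
# because '$' resolves to the index of the single matched array element.
# There is no "properties.$.messages.$.attachments", so placing an
# attachment on a specific message must happen in application code:

def add_attachment(doc, property_id, message_id, attachment):
    """In-code workaround: resolve both array indices ourselves."""
    for prop in doc["properties"]:
        if prop["_id"] != property_id:
            continue
        for msg in prop["messages"]:
            if msg["_id"] == message_id:
                # The ODM bug we hit behaved as if this match were skipped
                # and the first message were always chosen instead.
                msg["attachments"].append(attachment)
                return True
    return False

add_attachment(proposal, "p1", "m2", {"name": "spec.pdf"})
```

The correct workaround targets the second message; the buggy version we ran into effectively resolved the message index to 0 every time.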
2. Unexpected query implementation caused race conditions
In a feature that kicked off multiple asynchronous updates to a certain object, we discovered a bug that seemed to be causing the data corresponding to one of the object’s properties to disappear. After a few hours of digging into the internals of the ODM, we discovered that the bug was caused by an unexpected implementation of what we assumed was an atomic write. While aware that Mongo doesn’t support transactions, we assumed that single-document updates at the ODM level mapped to single-document updates at the DB level and were hence atomic (a pretty safe assumption in the case of most ORMs). However, we noticed that when updating a property whose value is an ArrayCollection (a Doctrine collection type that wraps native PHP arrays), the ODM executed two queries on the corresponding document – one to unset the property (rendering it empty), and a second to set it to the specified value. This, of course, led to a race condition when multiple processes tried to read and write this object concurrently. Often, a process that read the object after an earlier process had initiated an ArrayCollection update would see the property in its unset state, and would then write the object back with the property still unset, making it seem like data was disappearing.
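The interleaving is easier to see in a simulation. Below is a hedged Python sketch – plain dicts stand in for Mongo documents, and the field names are hypothetical – of how a reader that lands between the two queries observes the property in its unset state:

```python
# Simulates the unset-then-set write described above, with a second
# process interleaved between the two queries.
doc = {"_id": 1, "tags": ["a", "b"]}  # 'tags' plays the ArrayCollection property
snapshots = []  # what the second process read and would write back

def odm_update_collection(d, field, new_value, interleaved=None):
    """Doctrine-style two-query write: $unset followed by $set.
    'interleaved' simulates another process running between the queries."""
    d.pop(field, None)            # query 1: {"$unset": {field: 1}}
    if interleaved:
        interleaved(d)            # another process runs here
    d[field] = new_value          # query 2: {"$set": {field: new_value}}

def second_process(d):
    # Read-modify-write of an unrelated field, at exactly the wrong moment:
    # the snapshot it takes is missing 'tags' entirely.
    snapshot = dict(d)
    snapshot["owner"] = "proc2"
    snapshots.append(snapshot)

odm_update_collection(doc, "tags", ["a", "b", "c"], interleaved=second_process)
```

The first process’s $set restores `tags`, but if the second process writes its snapshot back last, the document persists without the property – exactly the “disappearing data” we observed.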
3. Emulation of joins causes performance issues
Most ORMs support retrieving an object graph from the underlying database with a single query. Since MongoDB does not support joins, however, retrieving an object and its associations involves many roundtrips to the database, a process that is obviously less efficient (admittedly, we have yet to quantify the extent of the performance hit). The number of queries that the ODM must perform in order to retrieve an object graph is more or less opaque to a developer who is not conscious of this limitation of the database – another downfall of this abstraction.
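As a rough illustration, here is a Python sketch – collection names, document counts, and the lookup helper are all hypothetical – that counts the roundtrips needed to hydrate a small object graph by chasing references:

```python
# An in-memory stand-in for two Mongo collections linked by references.
db = {
    "proposals": {1: {"_id": 1, "property_ids": [10, 11, 12]}},
    "properties": {10: {"_id": 10}, 11: {"_id": 11}, 12: {"_id": 12}},
}
query_count = 0

def find_one(collection, _id):
    """Each call represents one roundtrip to the database."""
    global query_count
    query_count += 1
    return db[collection][_id]

# Hydrating one proposal and its referenced properties, ODM-style:
proposal = find_one("proposals", 1)
properties = [find_one("properties", pid) for pid in proposal["property_ids"]]

# 1 query for the proposal + 1 per referenced property = 4 roundtrips,
# where a SQL ORM could issue a single JOINed SELECT.
```

The cost grows with the size of the graph – N referenced documents mean N + 1 queries – and nothing at the ODM call site hints at it.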
Even though abstractions enable developers to write cleaner code faster, they also increase the risk of developers making incorrect behavioral assumptions. This is especially true when abstractions that are generally well understood are recreated for disparate underlying systems. However, we agree with Jeff Atwood that the most useful (non-trivial) abstractions – also the most leaky – are here to stay. It is our job as programmers to understand the deficiencies of such abstractions rather than reject them, for in return they promise us much simplicity.