I've always disliked database integration. I've also disliked shared development databases, just as I've disliked shared development environments. And it's not just me that thinks like this.

https://twitter.com/marcjohnson/status/193495187799019520

https://twitter.com/jeremydmiller/status/309353109300326400

Databases are great at persisting data reliably and securely, and can provide a workable integration point between systems, but why is this choice commonly the wrong one?

It ensures tight coupling between systems

Encapsulation (or don't "expose your privates") is a key tenet in object orientation, SOA (service oriented architecture), and other architectural styles that advocate loose coupling and high cohesion. The idea is simple: the less related systems know about each other, the easier it is to change, replace, or relocate one whilst minimising impact on the other. Object orientation and SOA both enable encapsulation through publishing and defining interfaces. They  expect all service consumers to interact solely through these interfaces. This concept is vital for separating internal provider implementation from the declarative published interface, and thus allowing the provider to change implementation with no impact on the consumer.

Web architecture provides an example of loose coupling where consumers (including web browsers and search engines) only require basic information about a website, i.e. URI, encoding, and transfer protocols, for them to request and consume web content. Providing that a website continues to support the same URIs and protocols, it is free to change its implementation (changing server side language, database system, physical location) with little to no impact on consumers.

How an application persists its data should be an implementation detail private to the application. Database schemas are commonly designed to most simply and efficiently store an application's data. If the application is subject to change, functional or non functional (e.g. performance, scalability, security), its database schema may also change. If such database changes are required, then regression testing and possible modifications could be needed on any other applications that integrate with this changed schema. The costs and risks involved in this regression testing and possibly maintenance are the undesirable side effects of tight coupling.

Interlinked databases cause development pain

Common modern development practises, including Test Driven Development and using test doubles (e.g. mocking and stubbing), are possible when developing or maintaining software systems composed of loosely coupled components conforming to defined interfaces. For example, if developing against an external API, it's common to stub the API using a network stubbing framework (like VCR) to allow speedy, deterministic functional tests. This allows development and testing without a full implementation of the external API existing locally e.g. attempting to install a full working copy of Twitter on a development machine would be pure madness!

Contrast this with tightly coupled systems where the interfaces are not well defined, and sit in technologies that do not aid test doubles. Examples include APIs that do not conform to OO interfaces, legacy communication protocols and, oh yes - databases. What if the integration is implemented via shared use of a set of tables? The only way to stub this is to create an identical table structure in a local database. However, these tables commonly have to be populated with particular data to support specific tests. Scripting and changing test fixtures frequently in between tests makes testing slow and error prone.

Maybe you are lucky that all access to the database goes through stored procedures. However, most (any?) databases do not provide the ability to change the implementations behind a stored procedure interface at run time, and therefore complex coding is required to implement test doubles of stored procedures. Worse still, what happens if triggers are used within the schema. These complicate automated testing against databases significantly, in particular when trying to create stable test fixtures.

If these problems prevent developers using test doubles as a way of handling dependencies within testing, then the alternative is to have the dependent system (or at least it's database) installed on their development environment. This takes time especially when configuration managing versions etc, and is exacerbated further if several tightly coupled schemas exist; a developer might have to bring back an entire set of data schemas that do not relate to the problem in hand, but are technical dependencies.

Architectural design should take account of development factors, and anything that prevents progression to development techniques such as TDD, and isolated development (considered best practice by Microsoft from at least 2002) should, at least, be considered against other factors and drivers.

It prevents technical progression

The last 25 years have seen massive change in the database industry. The once exciting world of object databases has come and gone, and relational databases have increased their footprint and ubiquity in persistence. However, the last 5 years has seen a considerable rise in competitive alternatives to relational databases. Their popularity has been boosted due to their role played in scaling success stories at web-scale companies like Facebook, Google and Twitter.

An impedance mismatch exists between databases using different data models, and the mismatch between relational and NOSQL is particularly important, and problematic due to the prevalence relational data stores. NOSQL databases commonly replace relational models in favour of document orientation, key / value maps, and column orientation.  Schema to schema integration is easy to implement on relational databases from the same vendor, but is complicated when attempting similar across databases from different vendors. Proprietary bridge solutions are required, and these solutions are constrained to databases sharing relational concepts. Time will tell whether such bridges are provided between relational and NOSQL databases; they may well be. Either way, SOA and Restful architecture promote integration through open standards rather than proprietary tools and these bridges should present architectural warnings even if the bridges do exist!

If an organisation chooses database integration and then wishes to choose NOSQL data stores in the future to support technical progression, then they will eventually hit the above impedance mismatch. Alternatively, using open standards integration (SOA, enterprise service buses etc) will provide the loose coupling and architectural patterns to accommodate change more easily. NOSQL data stores are becoming more popular, especially with their close relationship with in-vogue "Big Data" trends. Organisations will benefit greatly by making architectural choices that support, rather than constrain technologically progression.

It's not always the fastest performing integration choice

Of all the arguments advocating database integration, I've previously believed that performance was the most compelling. When using functionality in System A wherein, System A tightly integrates with the data held in System B, then it would make sense that holding this data closely together will provide the greatest performance.

However, data caching and replication concepts can now bring System B's data closer to System A's functionality without a direct database to database link. For example, REST / web architectural style would promote System A referring to certain System B resources. Correct and careful http usage allows System A to request the required resources from System B through HTTP, and cache it for subsequent interactions. The advanced cache controls offered through HTTP can prevent stale caches, and as these exploit the standard and well known caching policies that form the basis of the Web, this style of integration can transcend networks, routers and load balancing systems providing great scalability.

Further, SOA and Rest advocates integration through messages formatted in structures based on resources defined according to domain concepts. Services can hydrate, cache and reuse these domain resources without frequent round trips to the database. Most Object Relational Mapping tools provide sophisticated caching options. Server side caching can provide significant improvements in responsiveness and computer resource usage.

Databases are the most expensive system components to scale

Most web startups dream about the future need for massive scalability. However, even if demand does require a solution this wonderful problem, limited financial reserves constrain engineers to strive for the most cost effective solutions for scalability. Many architectural, database and scaling innovations have resulted from these scenarios.

Through so called polyglot persistence, concepts and techniques such as sharding and map-reduce, and wider architectural concepts like RESTful cache exploitation, both commonly implemented in free and open software, startups have found significantly cheaper, more flexible, and scalable (read ability to quickly grow) alternatives to the traditional approaches of scaling the database. Vertically scaling (upgrading a server) or horizontally scaling (adding more servers) commercial databases can result in high license costs.

If this provides some credibility that databases are the most expensive system components to scale, then software architectures should aim to reduce load on the database. This contrasts against the principle of database integration where responsibility of integration is added to the already considerable load exerted on the database servers used simply for separate system persistence.

So does database integration suck?

My personal view is that yes, database integration sucks. It can restrict an organisations ability to fully exploit new technologies like NOSQL databases. It can significantly complicate system development, maintenance, configuration management, and test automation. It might not even be the cheapest or most performing solution.

However, most importantly to me as a technical architect, it sits almost perfectly in opposition to the principles and concepts of modern architecture - SOA and the RESTful web architectural style. These styles both strongly advocate establishing and publishing interfaces using domain terminology and messaging styles to enable system encapsulation and loose coupling. Encapsulation and loose coupling are the absolute requirement of enterprise architectures that provide the technical agility required to meet the ever changing business demands brought by fast moving and innovative organisations.

Supporting evidence of my views can be found in Jeff Bezos's well known SOA commandment where he forbids database integration within Amazon:

3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
Finally, I'd like to thank @philip_sant and @phevans for their hard work reviewing this post - cheers guys!


comments powered by Disqus