Hank Williams: Graphs: A Better Database Abstraction

Some very good nuggets in his post from last year. I have underlined some really important points. In fact, worth reading the whole post:

…an abstraction is a framework that simplifies how you think about and work in a given domain. Abstractions can be (and often are) argued against by suggesting that you don’t really need them. In computer programming, we didn’t need C because we had assembly language. We didn’t need C++ because we had C. We didn’t need Java because we had C++. To me these arguments (which people really made) were silly. I abandoned assembly language in the 90′s.

The point is none of our existing abstractions are *needed*. But our human brains can only manage a certain amount of complexity at a time. Complexity is fine but only in bite size chunks. … Abstractions allow us to encapsulate complexity so that we don’t have to think about it and we can achieve greater and greater levels of complexity in an efficient way allowing us to keep more of the model of a given system in our heads.

And so the anti-abstraction argument rears its head in the RDBMS vs graph database debate. One of the arguments that the pro RDBMS folks make for why there is no need for the graph database model is that you can do everything that can be expressed in a graph database in a relational database. And there is some truth to this.

But there are two problems with this argument. The first is that this is only true in theory. It is not possible to build a graph database of scale using pure SQL – at least with the SQL tools that we currently have to choose from.

One reason for the scale problem is the only way to do it is to do what are called self-joins. This is when you join a table to itself. Conceptually seem just fine. But the problem is that it is impossible for the database engine to do anything other than brute force un-optimized traversals of the graph when confronted with a chain of self-joins. In other words, using this technique will not yield a useful database that is query-able at any kind of scale. Handling certain aspects of providing a graph database model requires some very specific and different kind of thinking and optimizations from those that go into designing an SQL database.

Another problem is that one giant table using self joins for traversal means a huge write bottleneck. Yes, you can avoid that with sharding depending on your design, but it is definitely not part of the SQL model, and so you can’t say SQL is helping you here.

The second and I believe more important argument against implementing a graph data model using SQL is that even if SQL could do a good job of representing a graph model, building your graph system in SQL is not a very good abstraction. The truth is that most of the kinds of things we want to do in app development look more like graph than relational structures. [Emphasis is mine] Graphs are elemental to computer science because most interesting algorithms and in fact real world data models can be very naturally thought of as a graphs. Graph theory is (if things are as they were when I was in school) the first thing you learn when you begin studying computer science, and there is very good reason for this. The fact that Facebook was able to anchor the idea of what they were building as a “social graph” is an incredible testament to the innately natural characteristics of the graph concept.

So if you are representing a graph, you really want an API that reflects the unique and useful characteristics of a graph. In other words, you want an abstraction that reflects how you really think about the data and not some jury-rigged representational model continuously intruding itself into your thought process. And so, having a data store that allows us to express our data in a way that is much more similar to how we actually think is enormously helpful.

And such is the case with attempting to implement a graph database using SQL. You can do it, but it is unlikely to work very well, and because you don’t have the benefit of the abstraction, it actually adds to the complexity of the design instead of simplifying it.

The bottom line is that graphs are a better representational model when the structure of your system will change frequently. Relational is a better model when the structure will be static. Today, I think most of us are not building applications that are ideally structurally static.

Because most applications today have a much more dynamic nature, graphs are, for most people, under most circumstances, a far better abstraction. And to me, there is little in this world more powerful and satisfying than a great abstraction.

Note that his argument is about the concepts developers use to build their applications. He is not arguing against SQL databases as the storage engine — just like the approach we have been taking with InfoGrid.

Leave a comment:

You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>