Hank Williams: Graphs: A Better Database Abstraction

Some very good nuggets in his post from last year. I have underlined some really important points. In fact, worth reading the whole post:

…an abstraction is a framework that simplifies how you think about and work in a given domain. Abstractions can be (and often are) argued against by suggesting that you don’t really need them. In computer programming, we didn’t need C because we had assembly language. We didn’t need C++ because we had C. We didn’t need Java because we had C++. To me these arguments (which people really made) were silly. I abandoned assembly language in the 90′s.

The point is none of our existing abstractions are *needed*. But our human brains can only manage a certain amount of complexity at a time. Complexity is fine but only in bite size chunks. … Abstractions allow us to encapsulate complexity so that we don’t have to think about it and we can achieve greater and greater levels of complexity in an efficient way allowing us to keep more of the model of a given system in our heads.

And so the anti-abstraction argument rears its head in the RDBMS vs graph database debate. One of the arguments that the pro RDBMS folks make for why there is no need for the graph database model is that you can do everything that can be expressed in a graph database in a relational database. And there is some truth to this.

But there are two problems with this argument. The first is that this is only true in theory. It is not possible to build a graph database of scale using pure SQL – at least with the SQL tools that we currently have to choose from.

One reason for the scale problem is the only way to do it is to do what are called self-joins. This is when you join a table to itself. Conceptually this seems just fine. But the problem is that it is impossible for the database engine to do anything other than brute force un-optimized traversals of the graph when confronted with a chain of self-joins. In other words, using this technique will not yield a useful database that is query-able at any kind of scale. Handling certain aspects of providing a graph database model requires some very specific and different kind of thinking and optimizations from those that go into designing an SQL database.

Another problem is that one giant table using self joins for traversal means a huge write bottleneck. Yes, you can avoid that with sharding depending on your design, but it is definitely not part of the SQL model, and so you can’t say SQL is helping you here.

The second and I believe more important argument against implementing a graph data model using SQL is that even if SQL could do a good job of representing a graph model, building your graph system in SQL is not a very good abstraction. The truth is that most of the kinds of things we want to do in app development look more like graph than relational structures. [Emphasis is mine] Graphs are elemental to computer science because most interesting algorithms and in fact real world data models can be very naturally thought of as graphs. Graph theory is (if things are as they were when I was in school) the first thing you learn when you begin studying computer science, and there is very good reason for this. The fact that Facebook was able to anchor the idea of what they were building as a “social graph” is an incredible testament to the innately natural characteristics of the graph concept.

So if you are representing a graph, you really want an API that reflects the unique and useful characteristics of a graph. In other words, you want an abstraction that reflects how you really think about the data and not some jury-rigged representational model continuously intruding itself into your thought process. And so, having a data store that allows us to express our data in a way that is much more similar to how we actually think is enormously helpful.

And such is the case with attempting to implement a graph database using SQL. You can do it, but it is unlikely to work very well, and because you don’t have the benefit of the abstraction, it actually adds to the complexity of the design instead of simplifying it.

The bottom line is that graphs are a better representational model when the structure of your system will change frequently. Relational is a better model when the structure will be static. Today, I think most of us are not building applications that are ideally structurally static.

Because most applications today have a much more dynamic nature, graphs are, for most people, under most circumstances, a far better abstraction. And to me, there is little in this world more powerful and satisfying than a great abstraction.

Note that his argument is about the concepts developers use to build their applications. He is not arguing against SQL databases as the storage engine — just like the approach we have been taking with InfoGrid.

Carsonified: Why Graph Databases

Martin Kleppmann summarizes the case for graph databases at carsonified.com. This is exactly why InfoGrid is built around a graph of MeshObjects:

… graph databases focus on the relationships between items — a better fit for highly interconnected data models.

Standard SQL cannot query transitive relationships, i.e. variable-length chains of joins which continue until some condition is reached. Graph databases, on the other hand, are optimised precisely for this kind of data. Look out for these symptoms indicating that your data would better fit into a graph model:

  • you find yourself writing long chains of joins (join table A to B, B to C, C to D) in your queries;
  • you are writing loops of queries in your application in order to follow a chain of relationships (particularly when you don’t know in advance how long that chain is going to be);
  • you have lots of many-to-many joins or tree-like data structures;
  • your data is already in a graph form (e.g. information about who is friends with whom in a social network).

Graph databases are often associated with the semantic web and RDF datastores, which is one of the applications they are used for. I actually believe that many other applications’ data would also be well represented in graphs. However, as before, don’t try to force data into a graph if it fits better into tables or documents.

In our experience, social applications and applications that deal with complex interrelated data are much easier to build using a graph of typed objects in InfoGrid than to shoehorn into relational tables. And since InfoGrid can use relational databases as storage engines, we have the best of both worlds: graphs on the front, and enterprise-friendly SQL on the back.
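
The second symptom in the list above, looping over queries to follow a chain of unknown length, is exactly what a graph traversal handles in one step. As a rough illustration only, here is a breadth-first walk over a "knows" relationship using plain Java collections. This is not the InfoGrid or Neo4j API; the FriendGraph class and its method names are made up for this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory graph for illustration; not part of InfoGrid or Neo4j. */
class FriendGraph {
    // adjacency list: person -> people that person knows
    private final Map<String, List<String>> knows = new HashMap<>();

    /** Record a directed "knows" relationship. */
    void relate(String from, String to) {
        knows.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    /** Everyone reachable from start by following "knows" one or more times. */
    Set<String> reachableFrom(String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String next : knows.getOrDefault(current, Collections.<String>emptyList())) {
                // visit each node once, so cycles do not loop forever
                if (!next.equals(start) && seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        FriendGraph g = new FriendGraph();
        g.relate("alice", "bob");
        g.relate("bob", "carol");
        g.relate("carol", "alice"); // cycle back to the start
        System.out.println(g.reachableFrom("alice"));
    }
}
```

The length of the chain never appears in the code. In SQL, each additional hop would be one more join, or one more query issued from an application loop.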

Neo4j and InfoGrid

Just came across Neo4j, an “open source NoSQL graph database”. Neo4j is clearly very close in philosophy and API to InfoGrid, in fact closer than anything else that I’ve come across so far.

Compare this:

Neo4j:

Node firstNode = neo.createNode();
Node secondNode = neo.createNode();
firstNode.createRelationshipTo( secondNode, MyRelationshipTypes.KNOWS );

InfoGrid:

MeshObject firstObject = life.createMeshObject();
MeshObject secondObject = life.createMeshObject();
firstObject.relateAndBless( secondObject, MySubjectArea.MESHOBJECT_KNOWS_MESHOBJECT.getSource() );

If this isn’t similar, what is?

Transactions:

Neo4j:

Transaction tx = neo.beginTx();
try {
    // do something
    tx.success(); // mark the transaction as successful
} finally {
    tx.finish(); // commits if success() was called, otherwise rolls back
}

InfoGrid:

Transaction tx = null;
try {
    tx = mb.createTransactionNow();
    // do something
} finally {
    if( tx != null ) {
        tx.commitTransaction();
    }
}

Properties:

Neo4j:

firstNode.setProperty( "Name", "Neo4j" );

InfoGrid:

firstObject.setPropertyValue( MySubjectArea.PERSON_NAME, StringValue.create( "InfoGrid" ));

Regarding differences, it seems the Neo4j folks have spent a lot more time than we have on making it a “database” (while InfoGrid delegates to other storage engines like MySQL or Hadoop).
On the other hand, InfoGrid.fnd is type-safe, and instead of a command-line shell, uses a set of web Viewlets to access the graph of objects. (Which then can be incrementally refactored into an application.)
And then of course there is InfoGrid.net, which does not seem to have an equivalent in Neo4j. (see also InfoGrid Core features and Neo4j wiki)

Worth digging into more deeply …

Web-based log4j configuration

Usually a reconfiguration of log4j requires an application restart. That’s annoying, particularly if it is necessary to track what an application is doing in a production environment. It would be nice to be able to change logging levels without having to re-deploy the application.

So I created a Viewlet that allows the user to reconfigure log4j logging levels of a running InfoGrid application from the web browser. Screen shot to the right. Nothing particularly fancy, but useful. Obviously, you can set any access control for it that you like in the application’s ViewletFactory.

The Viewlet has been added to MeshWorld and NetMeshWorld in trunk. The code is in a new project called org.infogrid.jee.viewlet.log4j.
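
The core of such a Viewlet is small: look up a logger by name and set its level while the application keeps running. The sketch below uses java.util.logging as a stand-in because it ships with the JDK. It is not the code in org.infogrid.jee.viewlet.log4j, and the RuntimeLogLevels class and its setLevel method are hypothetical names. With log4j 1.x, the corresponding call would be along the lines of Logger.getLogger( name ).setLevel( Level.toLevel( levelName )).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper that changes logger levels in a running JVM. */
class RuntimeLogLevels {
    // java.util.logging only holds loggers weakly, so keep a strong
    // reference to every logger we configure; otherwise the setting
    // could be lost when the logger is garbage-collected.
    private static final Map<String, Logger> PINNED = new ConcurrentHashMap<>();

    /**
     * Set the level of the named logger. Returns the level that was set
     * directly on it before, or null if it inherited from its parent.
     */
    static Level setLevel(String loggerName, String levelName) {
        Logger logger = PINNED.computeIfAbsent(loggerName, Logger::getLogger);
        Level previous = logger.getLevel();
        // Level.parse throws IllegalArgumentException for unknown level names
        logger.setLevel(Level.parse(levelName));
        return previous;
    }

    public static void main(String[] args) {
        // "org.example.demo" is a placeholder logger name
        setLevel("org.example.demo", "FINE");
        System.out.println(Logger.getLogger("org.example.demo").getLevel());
    }
}
```

A web front end then only needs to map a form submission (logger name, level name) onto a call like this, behind whatever access control the application requires.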

Tracking XPRISO Messages

XPRISO is the messaging protocol used by InfoGrid to keep distributed information consistent. The acronym stands for “eXtensible Protocol for the Replication, Integration and Synchronization of distributed Objects”.

The current version of InfoGrid uses XPRISO mostly for communication between an application’s main NetMeshBase, and the ShadowMeshBases that track (“shadow”) external data feeds.

Sometimes it is really useful to track XPRISO messages. The simplest way to track all XPRISO messages in your application is this:

  1. Determine the classes that your application instantiates that implement the NetMeshBases communicating via XPRISO. For example, that might be org.infogrid.meshbase.net.local.m.LocalNetMMeshBase and org.infogrid.probe.shadow.m.MShadowMeshBase, or perhaps org.infogrid.meshbase.net.local.store.IterableLocalNetStoreMeshBase and org.infogrid.probe.shadow.store.StoreShadowMeshBase.
  2. Set their logging level to INFO.

This might be as simple as adding the following two lines to your Log.properties file:

log4j.category.org.infogrid.meshbase.net.local.store.IterableLocalNetStoreMeshBase=INFO
log4j.category.org.infogrid.probe.shadow.store.StoreShadowMeshBase=INFO

Voilà, all XPRISO messages are written to your logger. (Some detail is omitted to make the log readable.)

Here is an actual example:

INFO  2009-10-12 15:32:14,992 [pool-1-thread-1] shadow.store.StoreShadowMeshBase (Log4jLog.java:152)
 @ org.infogrid.util.logging.log4j.Log4jLog.logInfo:152
 - org.infogrid.meshbase.net.xpriso.SimpleXprisoMessage@5baec83c{
    route: http://localhost:8085/app1/ -> http://localhost:8085/app2/, requestId: 1255386734969
    requestedFirstTime:
        #http://localhost:8085/app2/
}

INFO  2009-10-12 15:32:15,018 [pool-1-thread-2] local.store.IterableLocalNetStoreMeshBase (Log4jLog.java:152)
 @ org.infogrid.util.logging.log4j.Log4jLog.logInfo:152
 - org.infogrid.meshbase.net.xpriso.ParserFriendlyXprisoMessage@4e940b1e{
    route: http://localhost:8085/app2/ -> http://localhost:8085/app1/, responseId: 1255386734969
    conveyedMeshObjects:
        id: http://localhost:8085/app2/
}

Here, the first NetMeshBase asked the second for an object, and promptly got a replica in return.

If you don’t like the logging format, or want to log the messages differently, you can use your own XprisoMessageLogger simply by invoking setXprisoMessageLogger on the NetMeshBase whose messages you would like to track.