What’s the biggest obstacle to GraphDB adoption?

The recent workshop on Graph Databases in Barcelona sparked an interesting debate among vendors of graph databases about how to accelerate graph database adoption.

I don’t think that debate has been resolved yet. Perhaps this post will help a bit.

Some opinions that I heard were:

  • there are significant differences between the various graph database products on the market (e.g. graphs, property graphs, properties on edges or not, hypergraphs etc.). Unless they are more similar, customers will fear lock-in and not buy any of them. Here is a variation:
  • relational databases only took off once there was agreement on the SQL standard among vendors. We need to cooperate to create a similar language, otherwise graph database adoption will not take off either.
  • most potential users of graph databases have never heard of them. How could they use any if they don’t know they exist?
  • even if possible users know of graph databases, they do not know what the use cases are because use cases and success stories have not broadly been documented.

What do you think the biggest obstacles are for graph database adoption?

Generating a Hierarchical HTML Outline

Graph databases like InfoGrid are much better at traversing recursive data structures than relational databases. That is probably an established fact by now and not controversial.

But what does this mean in practice?

Here is an example just how simple it can be to generate a hierarchical HTML outline from a graph using InfoGrid’s custom tags:

<dl>
  <tree:treeIterate
            startObjectName="Subject"
            traversalSpecification="xpath:*"
            meshObjectLoopVar="current">
    <tree:down>
      <dd><dl>
    </tree:down>
    <tree:up>
      </dl></dd>
    </tree:up>
    <tree:nodeBefore>
      <dt>
        <mesh:meshObject meshObjectName="current"/>
      </dt>
    </tree:nodeBefore>
  </tree:treeIterate>
</dl>

The result is a set of nested HTML definition lists, which are a good match for an outline in HTML.

By way of explanation of this example, the outer tag is a big loop, which iterates over all elements in the outline. Every time we walk down the tree, the <tree:down> tag fires and emits the HTML list start element contained there. Every time we walk up the tree, the corresponding <tree:up> tag triggers and closes the list. “Subject” is the name of the start node as JSP bean, and the xpath-like expression is used to describe which branches to traverse in the graph, because it would just as simple traversing the opposite direction, or sideways.

I don’t think it gets much simpler ;-) Imagine the pain in SQL, particularly if the tree is deep …

Web Apps and InfoGrid

Graph databases are great for web apps: instead of somehow having to map tables and rows to URLs, developers can simply 1-to-1 map objects in the GraphDB to URLs.

Unlike most other GraphDBs, InfoGrid has extensive libraries to make the creation of web pages from dynamic graphs straightforward.

Here’s an example. Let’s say on your web page, you want to show a list of nodes related to some node ‘x’. For example, a social media friends list, or list of purchases, or bookmarks etc. Here is the JSP code:

<set:traverseIterate startObjectName="x" traversalSpecification="*" loopVar="y">
  <set:iterateNoContent><p>Sorry, no neighbors.</p></set:iterateNoContent>
  <set:iterateHeader>
    <ul>
  </set:iterateHeader>
  <set:iterateContentRow>
    <li>
      <mesh:meshObjectLink meshObjectName="y">
        <mesh:meshObject meshObjectName="y" />
      </mesh:meshObjectLink>
    <li>
  </set:iterateContentRow>
  <set:iterateFooter>
    </ul>
  </set:iterateFooter>
<set:traverseIterate>

If you look closely, you see that not only does it print the neighbor nodes in the graph, but also adds a hyperlink to them, so they can clicked on and examined. And if there are no neighbors, a message is printed saying so.

What if I don’t want any kind of neighbor, but only those related in some fashion, say business colleagues? Well, depends a bit on your model, but it’s usually as simple as replacing the first line of JSP above with something like this:

<set:traverseIterate startObjectName="x"
        traversalSpecification="com.example.MySubjectArea/Person_IsColleagueOf_Person-S"
        loopVar="y">

Notice that there is no other code that needs to be written, no object-relational mapping, no SQL syntax to get right, no handler code or other clumsy magic. Of course, you can have may of these statements on the same page, nested etc.

A great way of developing rich web pages really quickly.

The Bless Relationships API

InfoGrid supports types on both nodes and edges, aka MeshObjects and Relationships. They are called EntityTypes and RelationshipsTypes, respectively.

Assume we have two related MeshObjects, each blessed with a (hypothetical) type Person. Like this:

MeshObject obj1 = ...;
MeshObject obj2 = ...;
obj1.bless( PERSON );
obj2.bless( PERSON );
obj1.relate( obj2 );

Now if we want to express that the second Person is the daughter of the first Person, we cannot do this:

obj1.blessRelationship( PERSON_PARENT_PERSON, obj2 );

because we don’t know, from this API, whether obj1 or obj2 is the daughter or the parent.

Instead, in InfoGrid, we have to explicitly specify the direction, and we do this by specifying the “end” of the relationship type instead of the relationship type for the bless operation, like this:

obj1.blessRelationship( PERSON_PARENT_PERSON.getSource(), obj2 );

Hopefully, the documentation of the parent relationship type in the model indicates the semantics of the source and the destination of each relationship type. These “ends” are called RoleTypes, by the way.

One nice side effect of this approach to the API is that it becomes quite straightforward to extend it if InfoGrid ever decided to support ternary or N-ary relationship types: Just pick a RoleType beyond a source or destination RoleType.

If you bless relationships from the HttpShell — the simplest way of manipulating the graph from a web application — you thus need to specify the identifier of the RoleType, not of the RelationshipType. By convention, the source RoleType of a RelationshipType with identifier ID is ID-S, and ID-D for the destination.

One More Word on GraphDBs and Schemas

myNoSQL picked up our recent post on why having a schema is A Good Thing. In the comments to his post, various other graph database vendors speak up. They are far more ambiguous on the matter than we are …

For now, I think of Brian Akers’s take on this subject as the final word. With the comment “I know where everything is… don’t touch”, he shows this picture:

[very messy office]

If you can afford code and data like this, be my (schema-less) guest. Personally, I can’t.

’nuff said.

Required vs. Optional Property Values

InfoGrid distinguishes between properties that must have a non-null value, and properties that may or or may not be null.

When creating an InfoGrid model, a developer has to specify which by using the <isoptional/> tag in the model file.

Why?

By way of parallel, consider the following piece of Java code:

class Foo {
    private int max1 = 10;
    private Integer max2 = 20;

    public void doSomething() {
        for( int i=0 ; i<max1 ; ++i ) {
            //...
        }
        for( int i=0 ; i<max2 ; ++i ) {
            //...
        }
    }

Spot the problem? max2 of course might be null, which means our code will throw an exception in the innocent-looking second for loop. To get the code right, we will have to protect that section with an if-then-else section that checks for null first.

Of course, such a protection is often the right thing to do. But in this example, a “max” should hardly ever be null, so using an “int” as a data type like for max1 (which can’t be null) is much better than using an “Integer” like for max2 (which may be null).

It’s the same thing for properties in InfoGrid models. Some properties simply should never be null. For example, consider a time stamp indicating when a MeshObject was created. Given that the MeshObject was created, the time stamp must exist, and therefore a null value makes no sense. In which case the property would be specified as “mandatory”. On the contrary, a time stamp when a MeshObject is likely to become obsolete is very likely optional: we might not know that time (yet), or it might never become obsolete, so null values are fine.

If InfoGrid did not distinguish between required and optional values, application code would be littered with unnecessary tests for null values. (or failing that, unexpected NullPointerExceptions.) We think being specific is better when creating the model; higher-quality and less cluttered application code is the reward.

Also check out the following related posts:

InfiniteGraph Implementation of FirstStep

InfiniteGraph, the currently youngest member of the GraphDB party, has now also implemented our FirstStep example. Todd Stavish’s code is here. It joins implementations of the same example from InfoGrid, Neo4j, Sones, and Filament.

[Update: see Todd's comment below. Apparently there is checking in the second step.] On cursory examination, I’m surprised that InfiniteGraph allows you (requires you?) to create edges without source and destination nodes. Only after the creation of the edge does one assign the nodes to the edge. I’m unclear why this is an advantage for any particular scenario; however, I would think there’s a clear disadvantage as an application developer, because I now have to check for null pointers that I wouldn’t have to in most other graph databases (InfoGrid included).

Hope somebody more neutral than me will perform an API comparison using this example some day.

Graph Database Scaling: InfoGrid’s Contrarian View

There’s an ongoing debate how to scale graph databases horizontally, and I got “strongly encouraged” to present the InfoGrid point of view.

In a nutshell: to make it work, scale like the web, not like a database.

Let me explain:

If you start out with a single relational database server, and you want to scale it horizontally to thousands of servers, you have a problem: it doesn’t really work. SQL makes too many assumptions that one simply cannot make for a massively distributed system. Which is why instead of running Oracle, Google runs BigTable (and Amazon runs Dynamo etc.), which were designed from the ground up for thousands of servers.

If you start out with a single hypertext system, and you want to scale it horizontally to millions of servers, you have a problem, too: it doesn’t really work either. Which is why we got the world-wide-web, which has been inherently designed for a massively decentralized architecture from the get-go, instead of a horizontally-scaled version of Xanadu.

So he we are, and people are trying to scale graph databases from one to many servers. They are finding it hard, and so far I have not seen a single credible idea, never mind an implementation, even in an early stage. Guess what? I think massively scaling a graph database that’s been designed using a one-server philosophy will not work either, just like it didn’t work for relational databases or hypertext. We need a different approach.

With InfoGrid, we take such a fundamentally different approach. To be clear, its current re-implementation in the code base is early and is still missing important parts. But the architecture is stable, the core protocol is stable, and it has been proven to work. Turning it back on across the network is a matter of “mere” programming.

To explain it, let’s lay out our high-level assumptions. They are very similar to the original assumptions of the web itself, but unlike traditional database assumptions:

Traditional database assumptions Web assumptions InfoGrid assumptions
All relevant data is stored in a centrally administered database Let everybody operate their own server with their own content and create hyperlinks to each other Let everybody operate their own InfoGrid server on their own data and share sub-graphs with each other as needed (see also note on XPRISO protocol below)
Data is “imported” and “exported” from/to the database from time to time, but not very often: it’s an unusual act. Only “my” data in “my” database really matters. Data stays where it is, a web server makes it accessible from/to the web. Through hyperlinks, that server becomes just one in millions that together form the world-wide-web. Data stays where it is, and a graph database server makes it look like a (seamless) sub-graph of a larger graph, which conceivably one day could be the entire internet
Mashing up data from different databases is a “web services” problem and uses totally different APIs and abstractions than the database’s Mashing up content from different servers is as simple as an <a…, <img… or <script… tag. Application developers should not have to worry which server currently has the subgraph they are interested in; it becomes local through automatic replication, synchronization and garbage collection.

This approach is supported by a protocol we came up with called XPRISO, which stands for eXtensible Protocol for the Replication, Integration and Synchronization of distributed Objects. (Perhaps “graphs” would have been a better name, but then the acronym wouldn’t sound as good.) There is some more info about XPRISO on the InfoGrid wiki.

Simplified, XPRISO allows one server to ask another server for a copy of a node or edge in the graph, so they can create a replica. When granted, this replica comes with an event stream reflecting changes made to it and related objects over time, so the receiving server can obtain the received replica in sync. When not needed any more, the replication relationship can be canceled and the replicas garbage collected. There are also features to move update rights around etc.

For a (conceptual) graphical illustration, look at InfoGrid Core Ideas on Slideshare, in particular slides 15 and 16.

Code looks like this:

// create a local node:
MeshObject myLocalObject = mb.getMeshObjectLifecycleManager().createMeshObject();
// get a replica of a remote node. The identifier is essentially a URL where to find the original
MeshObject myRemoteObject = mb.accessLocally( remote_identifier );

// now we can do stuff, like relate the two nodes:
myLocalObject.relate( myRemoteObject );

// or set a property on the replica, which is kept consistent with the original via XPRISO:
myRemoteObject.setProperty( property_type, property_value );

// or traverse to all neighbors of the replica: they automatically get replicated, too
MeshObjectSet neighbors = myRemoteObject.traverseToNeighbors();

Simple enough? [There are several versions of this scenario in the automated tests in SVN. Feel free to download and run.]

What does this contrarian approach to graphdb scaling give us? The most important result is that we can assemble truly gigantic graphs, one server at a time: each server serves a subgraph, and XPRISO makes sure that the overlaps between the subgraphs are kept consistent in the face of updates by any of the servers. Almost as important is that it enables a bottom-up bootstrapping of large graphs: everybody who feels like participating sets up their own server, populates it with the data they wish to contribute (which doesn’t even need to be “copied” by virtue of the InfoGrid Probe Framework), and link to others.

Now, if you think that makes InfoGrid more like a web system than a database, we sympathize. There’s a reason we call InfoGrid an “internet graph database”. Or it might look more like a P2P system. But other NoSQL projects use peer-to-peer protocols, too, like Cassandra’s gossip protocol. And I predict: the more we distribute databases, the more decentralized they become, the more the “database” will look like the web itself. That’s particularly so for graph databases: after all, the web is the largest graph ever assembled.

We do not claim that this approach addresses all possible use cases for scaling graph databases. For example, if you need to visit every single node in a gazillion-node graph in sequence, this approach is just as good or bad as any other: you can’t afford the memory to get the entire graph onto your local server, and so things will be slow.

However, the InfoGrid approach elegantly addresses a very common use case: adding somebody else’s data to the graph. “Simply” have them set up their own graph server, and create the relationships to objects in other graph servers: XPRISO will keep them maintained. Leave data in the place where its owners have it already is a very powerful feature; without it, the web arguably could not have emerged. It further addresses control issues and privacy/security issues much better than a more database’y approach, because people can remain in control over their subgraph: just like on the web, where nobody needs to trust a single “web server operator” with their content; you simply set up your own web server and then link.

Want to help? ;-)

Operations on a Graph Database (Part 8 – Events)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

The database industry is not used to databases that can generate events. The closest the relational database has to events are stored procedures, but they never “reach out” back to the application, so their usefulness is limited. But events are quite natural for graph databases. Broadly speaking, they occur in two places:

  • Events on the graph database itself (example: “tell me when a transaction has been committed, regardless on which thread”)
  • Events on individual objects stored in the graph database (example: “tell me when property X on object Y has changed to value Z”, or “tell me when Node A has a new Edge”)

Events on the GraphDB itself are more useful for administrative and management purposes. For example, an event handler listening to GraphDB events can examine the list of changes that a Transaction is performing at commit time, and collect statistics (for example).

From an application developer’s perspective, events on the data are more interesting:

An example may illustrate this. Imagine an application that helps manage an emergency room in a hospital. The application’s object graph contains things such as the doctors on staff, the patients currently in the emergency room and their status (like “arrived”, “has been triaged”, “waiting for doctor”, “waiting for lab results” etc.) Doctors carry pagers. One of the requirements for application is that the doctor be paged when the status of one of their assigned patients changes (e.g. from “waiting for lab results” to “waiting for doctor”).

With a passive database, i.e. one that cannot generate events, like a typical relational database, we usually have to write some kind of background task (e.g. a cron job) that periodically checks whether certain properties have changed, and then sends the message to the pager. That is very messy: e.g. how does your cron job know which properties *changed* from one run to the next? Or we have to add the message sending code to every single screen and web service interface in the app that could possibly change the relevant property, which is just as messy and hard to maintain.

With a GraphDB like InfoGrid, you simply subscribe to events, like this:

MeshObject patientStatus = ...; // the node in the graph representing a patient's status
patientStatus.addPropertyChangeListener( new PropertyChangeListener() {
    public void propertyChanged( PropertyChangeEvent e ) {
        sendPagerMessage( ... );
    });
}

The graph database will trigger the event handler whenever a property changed on that particular object. It’s real simple.

InfoGrid Interview Published

Pere Urbon published his e-mail interview with me about InfoGrid:

This is the fourth in his series on graph databases:

It’s rather apparent that while these projects are all GraphDBs, they differ substantially in what they are trying to accomplish, and why, and therefore how they do it. This is a good resource for developers investigating GraphDBs and trying to understand their alternatives.