Web Apps and InfoGrid

Graph databases are great for web apps: instead of somehow having to map tables and rows to URLs, developers can simply 1-to-1 map objects in the GraphDB to URLs.

Unlike most other GraphDBs, InfoGrid has extensive libraries to make the creation of web pages from dynamic graphs straightforward.

Here’s an example. Let’s say on your web page, you want to show a list of nodes related to some node ‘x’. For example, a social media friends list, or list of purchases, or bookmarks etc. Here is the JSP code:

<set:traverseIterate startObjectName="x" traversalSpecification="*" loopVar="y">
  <set:iterateNoContent><p>Sorry, no neighbors.</p></set:iterateNoContent>
  <set:iterateHeader>
    <ul>
  </set:iterateHeader>
  <set:iterateContentRow>
    <li>
      <mesh:meshObjectLink meshObjectName="y">
        <mesh:meshObject meshObjectName="y" />
      </mesh:meshObjectLink>
    <li>
  </set:iterateContentRow>
  <set:iterateFooter>
    </ul>
  </set:iterateFooter>
<set:traverseIterate>

If you look closely, you see that not only does it print the neighbor nodes in the graph, but also adds a hyperlink to them, so they can clicked on and examined. And if there are no neighbors, a message is printed saying so.

What if I don’t want any kind of neighbor, but only those related in some fashion, say business colleagues? Well, depends a bit on your model, but it’s usually as simple as replacing the first line of JSP above with something like this:

<set:traverseIterate startObjectName="x"
        traversalSpecification="com.example.MySubjectArea/Person_IsColleagueOf_Person-S"
        loopVar="y">

Notice that there is no other code that needs to be written, no object-relational mapping, no SQL syntax to get right, no handler code or other clumsy magic. Of course, you can have may of these statements on the same page, nested etc.

A great way of developing rich web pages really quickly.

InfiniteGraph Implementation of FirstStep

InfiniteGraph, the currently youngest member of the GraphDB party, has now also implemented our FirstStep example. Todd Stavish’s code is here. It joins implementations of the same example from InfoGrid, Neo4j, Sones, and Filament.

[Update: see Todd's comment below. Apparently there is checking in the second step.] On cursory examination, I’m surprised that InfiniteGraph allows you (requires you?) to create edges without source and destination nodes. Only after the creation of the edge does one assign the nodes to the edge. I’m unclear why this is an advantage for any particular scenario; however, I would think there’s a clear disadvantage as an application developer, because I now have to check for null pointers that I wouldn’t have to in most other graph databases (InfoGrid included).

Hope somebody more neutral than me will perform an API comparison using this example some day.

Graph Database Scaling: InfoGrid’s Contrarian View

There’s an ongoing debate how to scale graph databases horizontally, and I got “strongly encouraged” to present the InfoGrid point of view.

In a nutshell: to make it work, scale like the web, not like a database.

Let me explain:

If you start out with a single relational database server, and you want to scale it horizontally to thousands of servers, you have a problem: it doesn’t really work. SQL makes too many assumptions that one simply cannot make for a massively distributed system. Which is why instead of running Oracle, Google runs BigTable (and Amazon runs Dynamo etc.), which were designed from the ground up for thousands of servers.

If you start out with a single hypertext system, and you want to scale it horizontally to millions of servers, you have a problem, too: it doesn’t really work either. Which is why we got the world-wide-web, which has been inherently designed for a massively decentralized architecture from the get-go, instead of a horizontally-scaled version of Xanadu.

So he we are, and people are trying to scale graph databases from one to many servers. They are finding it hard, and so far I have not seen a single credible idea, never mind an implementation, even in an early stage. Guess what? I think massively scaling a graph database that’s been designed using a one-server philosophy will not work either, just like it didn’t work for relational databases or hypertext. We need a different approach.

With InfoGrid, we take such a fundamentally different approach. To be clear, its current re-implementation in the code base is early and is still missing important parts. But the architecture is stable, the core protocol is stable, and it has been proven to work. Turning it back on across the network is a matter of “mere” programming.

To explain it, let’s lay out our high-level assumptions. They are very similar to the original assumptions of the web itself, but unlike traditional database assumptions:

Traditional database assumptions Web assumptions InfoGrid assumptions
All relevant data is stored in a centrally administered database Let everybody operate their own server with their own content and create hyperlinks to each other Let everybody operate their own InfoGrid server on their own data and share sub-graphs with each other as needed (see also note on XPRISO protocol below)
Data is “imported” and “exported” from/to the database from time to time, but not very often: it’s an unusual act. Only “my” data in “my” database really matters. Data stays where it is, a web server makes it accessible from/to the web. Through hyperlinks, that server becomes just one in millions that together form the world-wide-web. Data stays where it is, and a graph database server makes it look like a (seamless) sub-graph of a larger graph, which conceivably one day could be the entire internet
Mashing up data from different databases is a “web services” problem and uses totally different APIs and abstractions than the database’s Mashing up content from different servers is as simple as an <a…, <img… or <script… tag. Application developers should not have to worry which server currently has the subgraph they are interested in; it becomes local through automatic replication, synchronization and garbage collection.

This approach is supported by a protocol we came up with called XPRISO, which stands for eXtensible Protocol for the Replication, Integration and Synchronization of distributed Objects. (Perhaps “graphs” would have been a better name, but then the acronym wouldn’t sound as good.) There is some more info about XPRISO on the InfoGrid wiki.

Simplified, XPRISO allows one server to ask another server for a copy of a node or edge in the graph, so they can create a replica. When granted, this replica comes with an event stream reflecting changes made to it and related objects over time, so the receiving server can obtain the received replica in sync. When not needed any more, the replication relationship can be canceled and the replicas garbage collected. There are also features to move update rights around etc.

For a (conceptual) graphical illustration, look at InfoGrid Core Ideas on Slideshare, in particular slides 15 and 16.

Code looks like this:

// create a local node:
MeshObject myLocalObject = mb.getMeshObjectLifecycleManager().createMeshObject();
// get a replica of a remote node. The identifier is essentially a URL where to find the original
MeshObject myRemoteObject = mb.accessLocally( remote_identifier );

// now we can do stuff, like relate the two nodes:
myLocalObject.relate( myRemoteObject );

// or set a property on the replica, which is kept consistent with the original via XPRISO:
myRemoteObject.setProperty( property_type, property_value );

// or traverse to all neighbors of the replica: they automatically get replicated, too
MeshObjectSet neighbors = myRemoteObject.traverseToNeighbors();

Simple enough? [There are several versions of this scenario in the automated tests in SVN. Feel free to download and run.]

What does this contrarian approach to graphdb scaling give us? The most important result is that we can assemble truly gigantic graphs, one server at a time: each server serves a subgraph, and XPRISO makes sure that the overlaps between the subgraphs are kept consistent in the face of updates by any of the servers. Almost as important is that it enables a bottom-up bootstrapping of large graphs: everybody who feels like participating sets up their own server, populates it with the data they wish to contribute (which doesn’t even need to be “copied” by virtue of the InfoGrid Probe Framework), and link to others.

Now, if you think that makes InfoGrid more like a web system than a database, we sympathize. There’s a reason we call InfoGrid an “internet graph database”. Or it might look more like a P2P system. But other NoSQL projects use peer-to-peer protocols, too, like Cassandra’s gossip protocol. And I predict: the more we distribute databases, the more decentralized they become, the more the “database” will look like the web itself. That’s particularly so for graph databases: after all, the web is the largest graph ever assembled.

We do not claim that this approach addresses all possible use cases for scaling graph databases. For example, if you need to visit every single node in a gazillion-node graph in sequence, this approach is just as good or bad as any other: you can’t afford the memory to get the entire graph onto your local server, and so things will be slow.

However, the InfoGrid approach elegantly addresses a very common use case: adding somebody else’s data to the graph. “Simply” have them set up their own graph server, and create the relationships to objects in other graph servers: XPRISO will keep them maintained. Leave data in the place where its owners have it already is a very powerful feature; without it, the web arguably could not have emerged. It further addresses control issues and privacy/security issues much better than a more database’y approach, because people can remain in control over their subgraph: just like on the web, where nobody needs to trust a single “web server operator” with their content; you simply set up your own web server and then link.

Want to help? ;-)

Operations on a Graph Database (Part 8 – Events)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

The database industry is not used to databases that can generate events. The closest the relational database has to events are stored procedures, but they never “reach out” back to the application, so their usefulness is limited. But events are quite natural for graph databases. Broadly speaking, they occur in two places:

  • Events on the graph database itself (example: “tell me when a transaction has been committed, regardless on which thread”)
  • Events on individual objects stored in the graph database (example: “tell me when property X on object Y has changed to value Z”, or “tell me when Node A has a new Edge”)

Events on the GraphDB itself are more useful for administrative and management purposes. For example, an event handler listening to GraphDB events can examine the list of changes that a Transaction is performing at commit time, and collect statistics (for example).

From an application developer’s perspective, events on the data are more interesting:

An example may illustrate this. Imagine an application that helps manage an emergency room in a hospital. The application’s object graph contains things such as the doctors on staff, the patients currently in the emergency room and their status (like “arrived”, “has been triaged”, “waiting for doctor”, “waiting for lab results” etc.) Doctors carry pagers. One of the requirements for application is that the doctor be paged when the status of one of their assigned patients changes (e.g. from “waiting for lab results” to “waiting for doctor”).

With a passive database, i.e. one that cannot generate events, like a typical relational database, we usually have to write some kind of background task (e.g. a cron job) that periodically checks whether certain properties have changed, and then sends the message to the pager. That is very messy: e.g. how does your cron job know which properties *changed* from one run to the next? Or we have to add the message sending code to every single screen and web service interface in the app that could possibly change the relevant property, which is just as messy and hard to maintain.

With a GraphDB like InfoGrid, you simply subscribe to events, like this:

MeshObject patientStatus = ...; // the node in the graph representing a patient's status
patientStatus.addPropertyChangeListener( new PropertyChangeListener() {
    public void propertyChanged( PropertyChangeEvent e ) {
        sendPagerMessage( ... );
    });
}

The graph database will trigger the event handler whenever a property changed on that particular object. It’s real simple.

Graph Database Feature Comparison Table

Pere Urbón Bayes has put together an excellent table comparing features of notable graph databases: post, table (PDF): Neo4j, HyperGraphDB, DEX, InfoGrid, Sones and VertexDB.

One can quibble, as one always can with tables like this, but it is the best comparison of graph databases so far that I’m aware of. Check it out.

Operations on a Graph Database (Part 7 – Sets)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Sets are a core concept of most databases. For example, any SQL SELECT statement in a relational database produces a set. Sets apply to Graph Databases just as well and are just as useful:

The most frequently encountered set of nodes in a Graph Database is the result of a traversal. For example, in InfoGrid, all traversal operations result in a set like this:

MeshObject    startNode     = ...; // some start node
MeshObjectSet neighborNodes = startNode.traverseToNeighbors();

We might as well have returned an array, or an Iterator over the members of the set, were it not for the fact that there are well-understood set operations that often make our jobs as developers much simpler: like set unification, intersection and so forth.

For example, in a social bookmarking application we might want to find out which sites both you and I have bookmarked. Code might look like this:

MeshObject me  = ...; // node representing me
MeshObject you = ...; // node representing you

TraversalSpecification ME_TO_BOOKMARKS_SPEC = ...;
    // how to get from a person to their bookmarks, see post on traversals
MeshObjectSet myBookmarks   = me.traverse( ME_TO_BOOKMARKS_SPEC );
MeshObjectSet yourBookmarks = you.traverse( ME_TO_BOOKMARKS_SPEC );

// Bookmarks that you and I share
MeshObjectSet sharedBookmarks = myBookmarks.intersect( yourBookmarks );

Notice how simple this code is to understand? One of the powers of sets. Or, if you know what a “minus” operation is on a set, this is immediately obvious:

// Bookmarks unique to me
MeshObjectSet myUniqueBookmarks = myBookmarks.minus( yourBookmarks );

This is clearly much simpler than writing imperative code which would have lots of loops and if/then/else’s and comparisons and perhaps indexes in it. (And seeing this might put some concerns to rest that NoSQL databases are primitive because they don’t have a SQL-like query language. I’d argue it’s less the language but the power of sets, and if you have sets you have a lot of power at your fingertips.)

To check out sets in InfoGrid, try package org.infogrid.mesh.set. Clearly much more can be done than we have so far in InfoGrid, but it’s a very useful start in our experience.

Big Data Workshop April 23, Mountain View, CA

I’m planning to be at Big Data Workshop, the first unconference on NoSQL and Big Data. If past events moderated by Kaliya Hamlin are any guide, it will be a great opportunity for everybody:

  • to explore together how the Big Data market will be coming together
  • to understand how the key technologies and projects work
  • what interfaces and interoperability standards are emerging and/or needed
  • how we can grow the overall market and make it easier for everybody to adopt these technologies for interesting new projects.

Arguably, without Internet Identity Workshop (also moderated by Kaliya) was the enabler for the stunning adoption rate over the past five years of OpenID, OAuth and related technologies (at last count, more than 1 billion enabled accounts). I hope history repeats itself here.

P.S. Feel free to corner me on InfoGrid, graph databases or any other subject. That’s the whole point of an unconference.

InfoGrid and Multigraphs

Graph theory distinguishes between graphs that are multigraphs and those that are not, i.e. between graphs where there may be more than one edge between any two nodes, and those graphs where there can only be one edge between any two nodes.

On the first glance, InfoGrid does not support multigraphs. On the second glance, InfoGrid is a bit more powerful than it appears; not an unusual occurrence ;-) Let me explain:

Here is how we create an edge between two nodes in InfoGrid:

MeshObject node1 = ...;
MeshObject node2 = ...;

node1.relate( node2 );

If we try to do the last statement again, InfoGrid will throw a RelatedAlreadyException. That clearly says that multiple edges between the same two nodes are not allowed.

But.

We do this in InfoGrid because unless there is some associated information with each of these edges, we cannot distinguish them and so there is no point in having two. The way we do that in InfoGrid is to bless the edge with more than one RelationshipType, such as:

node1.blessRelationship( PARENT_TO_SON, node2 );
node1.blessRelationship( CUSTOMER_TO_VENDOR, node2 );

These two relationships express that node1 is a parent of son node2, and that node2 (the son) has sold something to customer node 1 (the parent). Very different semantics associated with the same edge. Just like multiple edges in a multigraph.

So: InfoGrid supports multigraphs just fine, just perhaps not entirely as you would have expected.

Operations on a Graph Database (Part 6 – Traversals)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Traversals are the most common operations on a graph database. They are just as important for graph databases as joins are for relational databases. But of course, they are something else as graphs are not tables.

A traversal in a graph database is uniquely described by two data items:

  • a start node
  • a specification for the traversal

A traversal always leads to a set of nodes. Depending on the structure of the graph being traversed, that set may contain many, one or zero nodes. For example, if we traverse from a node representing a Person to the nodes representing their grandchildren, we may or may not find any, depending on whether they have any.

From a given node, we can traverse in many different ways (i.e. same node, different traversal specifications). Or, given the same traversal specification, we can start with different nodes.

By way of analogy, consider street directions:

  • start node: my house
  • traversal specification: first turn left, then go either straight or left.

The result of this particular traversal is a single-element set containing the neighborhood park. If you had started at the same node (my house), but gone right first, you would not have arrived at the park. If you had started at a different node (somebody else’s house), you may or may not have arrived at the park. You may not have arrived anywhere (perhaps there is no left that one can take from certain houses). Or you might have arrived in multiple places (“go either straight or left” might not take you to the same part regardless which you take, but taken you into different directions.

Graph database products seem to differ on how to deliver to the developer the set of nodes that is the result of a traversal. In InfoGrid, all traversals produce a MeshObjectSet, which, as the name says, is a set of nodes. One can then iterate over that set, for example, subset it, unify it with another or ask how many elements it has. In other products, traversals produce an iterator directly which then can be queried for one member of the result set at a time. Regardless of API details, the result of a traversal is always a set (e.g. it can’t contain duplicates.)

Just like there are many ways of giving directions, traversal specifications can be captured in many different ways. In InfoGrid, we have — you guessed it — an abstract data type called TraversalSpecification and several different classes that implement that type, such as:

  • traverse by going to all direct neighbor nodes of the start node
  • go to all neighbor nodes related with a edge of a particular type in a particular direction (e.g. “traverse from employee to their manager(s)”)
  • go N steps in sequence, where each step can be any traversal specification
  • go N steps in parallel, and unify the resulting set
  • select a subset of the found nodes based on some criteria, etc.

The FirstStep example shows some simple traversals.

And just for simplicity, InfoGrid also allows traversals starting from a set of start nodes, not just one. So we can say things like this:

MeshObject me = ...;
MeshObjectSet myParents      = me.traverse( childToParents );
MeshObjectSet myGrandParents = myParents.traverse( childToParents );

In our experience, working with sets makes complex traversals very easily understandable.

Operations on a Graph Database (Part 5 – Identifiers)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Well, “identifiers” aren’t much of an “operation”, but there are some operations related to identifiers, thus the title.

All first-class objects in a graph database typically have a unique identifier. This means nodes have unique identifiers, and for those graph databases that represent edges as distinct objects (see previous discussion on the pros and cons), they have unique identifiers, too.

This means we can ask a node for their identifier, remember the identifier, and later find the node again by looking it up in the graph database. In InfoGrid, this looks as follows:

MeshObject someNode = ...; // some MeshObject aka Node
MeshObjectIdentifier id = someNode.getIdentifier();

and later we can do this:

MeshBase mb = ...; // some MeshBase
MeshObject nodeFoundAgain = mb.findMeshObjectByIdentifier( id );

As you can see, InfoGrid uses an abstract data type called MeshObjectIdentifier, which you can think of as String for a second. (see below.) In InfoGrid, all identifiers are long-lasting. This means, your object will still have the same MeshObjectIdentifier after you rebooted your database. This has some advantages, e.g. you can define well-known objects in your graph database to which you can easily return even weeks later.

Other graph databases may use different data types as identifiers (e.g. int or long), but the use of identifiers is common with the above operations. They may or may not be the same after rebooting of the database.

Why does the type of identifier matter? Well, it depends on the application you have in mind. For InfoGrid applications, we primarily care about web applications, specifically REST-ful web applications. And so InfoGrid web applications generally use MeshObjectIdentifiers that identical to the URLs of the application. Let’s make an example:

Assume you have a URL bookmarking application which runs at http://example.com/. Let’s say a user creates tag “books”, which can be found at URL http://example.com/books/. It would be most straightforward to create a MeshObject with MeshObjectIdentifier http://example.com/books/. Which is exactly what InfoGrid does by default. No impedance mismatch between URLs that the user sees, the objects in the application layer, and the database! This leads to dramatic simplification of development and debugging.