Three’s a Crowd: Neo4j, Sones, Filament all implement InfoGrid’s FirstStep Example

Little did I know what would follow when I put up InfoGrid’s FirstStep example. The example creates just a few nodes and a few edges to show, in principle, how to build a URL tagging application based on a graph database like InfoGrid.

Alex Popescu at MyNoSQL challenged the Neo4j folks to show how they would implement it, and they responded promptly. Then the guys at Sones implemented the same example themselves, and just now the Filament project did the same. Worth a blog post with the links!

Here they are:

for your reading and comparing pleasure.

I’m tempted to list my own observations, but I’d like to avoid a blogging contest in which — naturally — everybody will claim “but the way we do it is better”. Independent reviews anybody?

Strong and Weak Typing With Graph Databases

Whether programming systems should be strongly typed or weakly typed has been one of the longest-running controversies in the history of computer science going back something like 50 years. Generally speaking, strongly typed systems tend to require more programmer effort up-front, in exchange for earlier or more definite error reports.

We also need to distinguish between static typing and dynamic typing: a dynamically typed system enables changes of types at run-time, while a statically typed system can’t do that.

Not surprisingly, typing for graph databases (or any other kind of NoSQL database) can be implemented in different ways, too:

Dynamically typed, weakly typed:

  • At development time: types may be declared but are not checked except perhaps rudimentarily.
  • At run-time: errors may occur, which may or may not be discovered; mis-interpretations of data are possible; data corruption is likely in case of programming errors.

Dynamically typed, strongly typed:

  • At development time: types are declared and checked as well as possible.
  • At run-time: all operations are checked for type safety; types can be discovered dynamically; type mis-interpretations are not possible.

Statically typed, weakly typed:

  • At development time: only rudimentary checking, if at all.
  • At run-time: errors may occur, which may or may not be discovered; mis-interpretations of data are possible; data corruption is likely in case of programming errors.

Statically typed, strongly typed:

  • At development time: all type errors are caught; additional developer effort is required; some types of data are hard to represent.
  • At run-time: no checking required due to “correctness by construction”.

Let’s insert some systems into this table:

  • Dynamically typed, weakly typed: most NoSQL systems
  • Dynamically typed, strongly typed: InfoGrid
  • Statically typed, weakly typed: (empty)
  • Statically typed, strongly typed: SQL database (if used as intended)

Side note: when NoSQL proponents argue that weakly typed systems are much better than stronger-typed SQL, they sometimes throw out the baby with the bath water: there are four choices, not two. We agree that statically, strongly typed systems like a typical SQL database have considerable disadvantages in a fast-moving world, but so do weakly typed systems; the only difference is the type of disadvantage. In our view, a strong but dynamic type system is the best compromise for most applications with a non-trivial schema, which is why InfoGrid V2 implements it. (There are some applications that do not require a non-trivial schema; web caching, for example.)
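To make the “strong but dynamic” quadrant a little more concrete, here is a minimal, self-contained Java sketch. It is not InfoGrid code: the names PropertyType, EntityType, Node and bless are stand-ins made up for this illustration. It only shows the general idea that types can be attached at run-time while every write is still checked.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of strong, dynamic typing; not the InfoGrid API.
class StrongDynamicTypingSketch {

    // A property declaration: a name plus the Java class its values must have.
    static final class PropertyType {
        final String name;
        final Class<?> valueClass;

        PropertyType(String name, Class<?> valueClass) {
            this.name = name;
            this.valueClass = valueClass;
        }
    }

    // A node type: a name plus the properties it declares.
    static final class EntityType {
        final String name;
        final Set<PropertyType> properties = new HashSet<>();

        EntityType(String name, PropertyType... declared) {
            this.name = name;
            for (PropertyType p : declared) {
                properties.add(p);
            }
        }
    }

    static final class Node {
        private final Set<EntityType> types = new HashSet<>();
        private final Map<PropertyType, Object> values = new HashMap<>();

        // Dynamic: a type can be added while the program runs.
        void bless(EntityType type) {
            types.add(type);
        }

        // Strong: every write is checked against the node's current types.
        void setProperty(PropertyType pt, Object value) {
            boolean declared = types.stream().anyMatch(t -> t.properties.contains(pt));
            if (!declared) {
                throw new IllegalStateException("Node has no type declaring " + pt.name);
            }
            if (!pt.valueClass.isInstance(value)) {
                throw new IllegalArgumentException(
                        pt.name + " requires a " + pt.valueClass.getSimpleName());
            }
            values.put(pt, value);
        }

        Object getProperty(PropertyType pt) {
            return values.get(pt);
        }
    }

    public static void main(String[] args) {
        PropertyType url = new PropertyType("Url", String.class);
        EntityType webResource = new EntityType("WebResource", url);
        Node node = new Node();

        try {
            node.setProperty(url, "http://example.com/");
        } catch (IllegalStateException e) {
            System.out.println("Before blessing: " + e.getMessage());
        }

        node.bless(webResource);
        node.setProperty(url, "http://example.com/");
        System.out.println("After blessing: " + node.getProperty(url));

        try {
            node.setProperty(url, 42); // wrong value type, caught at run-time
        } catch (IllegalArgumentException e) {
            System.out.println("Wrong value type: " + e.getMessage());
        }
    }
}
```

A weakly typed store would have accepted all three writes; a statically typed one would not have allowed the node to gain a type after creation.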

In a graph database like InfoGrid, the following items can be typed:

  • Nodes
  • Edges
  • Properties

In other graph databases, only a subset of these items may be typed. More in the next post on types.

Access Control the InfoGrid Way

We do it very similarly in InfoGrid.

But we can go one big step further: have InfoGrid automatically enforce the access control rules that were set up. If we have the ACL information, why not use it and have the graph database do the enforcement for us? That functionality has been part of InfoGrid for a couple of years.

For some detailed examples of how this works, consult the security tests that are part of InfoGrid’s automated test suite (particularly MeshBaseSecurityTest5).

Here’s the basic idea. (Again, paraphrasing the code for easier readability; consult the above link for the full code.) When a graph database in InfoGrid is configured to run with an AclBasedAccessManager, we can do this:

First, we need a MeshObject that is going to be the owner of some access-controlled data object. Any MeshObject will do:

MeshObject owner = createMeshObject();

Then, we associate the owner with the current Thread. (Just like in UNIX, where the ownership of a process determines what the process can do.)

theAccessManager.setCaller( owner );

Now here comes the access-controlled data object.

MeshObject dataObject = createMeshObject();

Because the current Thread is associated with the owner MeshObject, InfoGrid automatically sets up an ownership relationship between the data object and the owner object — just like in UNIX, a newly created file automatically has an owner.

Going beyond UNIX, we can now put the data object into something we call a ProtectionDomain. It’s basically a collection of MeshObjects that all have the same access control policy. This is mainly for efficiency and ease of management.

MeshObject domain = createMeshObject( AclBasedSecuritySubjectArea.PROTECTIONDOMAIN );
domain.relateAndBless( AclBasedSecuritySubjectArea.PROTECTIONDOMAIN_GOVERNS_MESHOBJECT.getSource(), dataObject );

Now, let’s give another entity some access rights to the data object:

MeshObject actorMayReadNotWrite = createMeshObject();
actorMayReadNotWrite.relateAndBless( AclBasedSecuritySubjectArea.MESHOBJECT_HASREADACCESSTO_PROTECTIONDOMAIN.getSource(), domain );

Note that it is the owner of the object that needs to do that; others can’t.

So now we change ownership on the thread.

theAccessManager.setCaller( actorMayReadNotWrite );

This call will succeed:

dataObject.getPropertyValue( <some property type> );

while this call will throw a NotPermittedException:

dataObject.setPropertyValue( <some property type>, <some property value> );

If the thread were currently associated with the owner, both calls would succeed. Again, I refer you to the full code linked above. As you can see, it works very similarly to how permissions work in UNIX, although of course the underlying ACL information is represented as a MeshObjectGraph.
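For readers who want to play with the idea without setting up InfoGrid, the walkthrough above can be approximated in a few lines of plain Java. This is only a sketch under simplifying assumptions: every class here (AccessManager, Principal, ProtectionDomain, DataObject, NotPermittedException) is a made-up stand-in rather than the InfoGrid class of similar name, the ACL is held in ordinary fields instead of a graph, and an object’s owner is taken to be the owner of its domain.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of caller-based ACL enforcement; none of this is the InfoGrid API.
class AclSketch {

    static final class NotPermittedException extends RuntimeException {
        NotPermittedException(String message) {
            super(message);
        }
    }

    // An identity: the owner, or some other actor.
    static final class Principal {
        final String name;

        Principal(String name) {
            this.name = name;
        }
    }

    // Tracks the caller per thread, much like a UNIX process has an owner.
    static final class AccessManager {
        private final ThreadLocal<Principal> caller = new ThreadLocal<>();

        void setCaller(Principal p) {
            caller.set(p);
        }

        Principal getCaller() {
            return caller.get();
        }
    }

    // A group of objects that share one access control policy.
    static final class ProtectionDomain {
        private final AccessManager accessManager;
        private final Principal owner;
        private final Set<Principal> readers = new HashSet<>();

        ProtectionDomain(AccessManager accessManager) {
            this.accessManager = accessManager;
            this.owner = accessManager.getCaller(); // the creating caller becomes the owner
        }

        // Only the owner may hand out read access.
        void grantRead(Principal reader) {
            if (accessManager.getCaller() != owner) {
                throw new NotPermittedException("only the owner may grant access");
            }
            readers.add(reader);
        }

        boolean mayRead(Principal p) {
            return p == owner || readers.contains(p);
        }

        boolean mayWrite(Principal p) {
            return p == owner;
        }
    }

    static final class DataObject {
        private final AccessManager accessManager;
        private final ProtectionDomain domain;
        private final Map<String, Object> properties = new HashMap<>();

        DataObject(AccessManager accessManager, ProtectionDomain domain) {
            this.accessManager = accessManager;
            this.domain = domain;
        }

        // Enforcement happens inside the accessors, not in application code.
        Object getProperty(String name) {
            if (!domain.mayRead(accessManager.getCaller())) {
                throw new NotPermittedException("read denied");
            }
            return properties.get(name);
        }

        void setProperty(String name, Object value) {
            if (!domain.mayWrite(accessManager.getCaller())) {
                throw new NotPermittedException("write denied");
            }
            properties.put(name, value);
        }
    }

    public static void main(String[] args) {
        AccessManager am = new AccessManager();
        Principal owner = new Principal("owner");
        Principal actorMayReadNotWrite = new Principal("reader");

        am.setCaller(owner);
        ProtectionDomain domain = new ProtectionDomain(am);
        DataObject dataObject = new DataObject(am, domain);
        dataObject.setProperty("title", "hello");
        domain.grantRead(actorMayReadNotWrite);

        am.setCaller(actorMayReadNotWrite);
        System.out.println("read: " + dataObject.getProperty("title"));
        try {
            dataObject.setProperty("title", "changed");
        } catch (NotPermittedException e) {
            System.out.println("write: " + e.getMessage());
        }
    }
}
```

The point of the design is that the application code in main never checks permissions itself; it only switches the caller, and the data object refuses what the caller may not do.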

If you like this, we can do even one better: the whole security mechanism is pluggable in InfoGrid. You don’t like the way we represent and enforce ACLs? Be our guest … write a new subclass of AccessManager, and it will work the way you want. (Did we say that InfoGrid is extremely pluggable?)

P.S. It’s great to see that we aren’t the only ones to think that security-related information is an excellent match for a graph database. There’s also the rather intriguing example for where Microsoft is going with their LDAP directory, which very much looks like evolution in the graph direction. Time to get on board graph databases!

Comparing FirstStep in InfoGrid and Neo4j

Alex Popescu has a great comparison of what the InfoGrid FirstStep example would look like in Neo4j, another graph database. As I noted in an earlier post, there are far more similarities in our approaches to the basics of graph databases than there are differences.

A couple of comments, addressing some of Alex’ notes. He says:

everything in Neo4j must happen inside a transaction even if it’s a graph traversal operation (this gives a very strong Isolation level). The InfoGrid traversal code seem to happen outside the transaction…

That’s correct. You can do the traversal inside a transaction if you like, but you are not required to. This gives application developers more options for concurrency control: transactions, critical sections, or no protection at all.

Re InfoGrid terminology: its ancient roots are in object modeling (think UML); for example, we still talk about InfoGrid Models. However, over time it became clear that InfoGrid’s core ideas are distinct enough to warrant their own terms. So when we moved from InfoGrid V1 to V2 a few years ago, we changed terms. For example, an “instance” (in UML or programming terminology), aka “node” (graph terminology), can rarely have more than one type anywhere other than in InfoGrid. Think of a Java object that has more than one class, where you can dynamically add classes to and remove classes from the instance at run-time. So we call them MeshObjects rather than something people might have the wrong connotations with. The closest equivalent we are aware of is Perl’s “bless”, which is why we use that term.

the Neo4j uses also the LuceneIndexService for indexing both the tag and web resources nodes, but that’s only because the code there makes sure not to duplicate either tags or web resources (i.e. this functionality is not present in the InfoGrid code and I don’t know how that would look like)

Correction. You are invited to modify the example and attempt to create a second MeshObject with the same identifier. You can’t (it will throw an exception at you).
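The behavior described in the correction (a second object with the same identifier is refused) can be pictured with a small plain-Java sketch. Everything in it is an assumption made for illustration: Store, create and findOrCreate are invented names, and the exception type is a generic one rather than whatever InfoGrid actually throws.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: identifiers are unique within a store; not the InfoGrid API.
class UniqueIdentifierSketch {

    static final class Store {
        private final Map<String, Map<String, Object>> objectsById = new HashMap<>();

        // Creating a second object with an existing identifier fails loudly.
        Map<String, Object> create(String identifier) {
            if (objectsById.containsKey(identifier)) {
                throw new IllegalArgumentException("Identifier already in use: " + identifier);
            }
            Map<String, Object> properties = new HashMap<>();
            objectsById.put(identifier, properties);
            return properties;
        }

        // Look up first, create only if absent: no duplicates and no separate index.
        Map<String, Object> findOrCreate(String identifier) {
            Map<String, Object> existing = objectsById.get(identifier);
            return existing != null ? existing : create(identifier);
        }

        int size() {
            return objectsById.size();
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.create("http://example.com/");
        try {
            store.create("http://example.com/");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
        store.findOrCreate("http://example.com/");
        System.out.println("objects in store: " + store.size());
    }
}
```

If the web resource’s URL or the tag’s name serves as the identifier, duplicates are ruled out by the store itself.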

But never mind the comparatively minor differences between Neo4j and InfoGrid. We should compare this to all the stuff one would have to do with a relational database to build the same thing. Object-relational mapping anybody? No thanks …

Graph Databases vs. Object Databases — What’s the Difference?

Great question on Stackoverflow.com about the difference between Graph Databases and Object Databases. I answered it there, and decided to post it here as well:

Object and graph databases operate on two different levels of abstraction.

An object database’s main data elements are objects, the way we know them from an object-oriented programming language.

A graph database’s main data elements are nodes and edges.

An object database does not have the notion of a (bidirectional) edge between two things with automatic referential integrity etc. A graph database does not have the notion of a pointer that can be NULL. (Of course one can imagine hybrids.)

In terms of schema, an object database’s schema is whatever the set of classes is in the application. A graph database’s schema (whether implicit, by convention of what String labels mean, or explicit, by declaration as models as we do it in InfoGrid for example) is independent of the application. This makes it much simpler, for example, to write multiple applications against the same data using a graph database instead of an object database, because the schema is application-independent. On the other hand, using a graph database you can’t simply take an arbitrary object and persist it.

Different tools for different jobs, I would think.

Carsonified: Why Graph Databases

Martin Kleppmann summarizes the case for Graph Databases at carsonified.com. This is exactly why InfoGrid is built around a graph of MeshObjects:

… graph databases focus on the relationships between items — a better fit for highly interconnected data models.

Standard SQL cannot query transitive relationships, i.e. variable-length chains of joins which continue until some condition is reached. Graph databases, on the other hand, are optimised precisely for this kind of data. Look out for these symptoms indicating that your data would better fit into a graph model:

  • you find yourself writing long chains of joins (join table A to B, B to C, C to D) in your queries;
  • you are writing loops of queries in your application in order to follow a chain of relationships (particularly when you don’t know in advance how long that chain is going to be);
  • you have lots of many-to-many joins or tree-like data structures;
  • your data is already in a graph form (e.g. information about who is friends with whom in a social network).

Graph databases are often associated with the semantic web and RDF datastores, which is one of the applications they are used for. I actually believe that many other applications’ data would also be well represented in graphs. However, as before, don’t try to force data into a graph if it fits better into tables or documents.

In our experience, particularly social applications or applications that deal with complex interrelated data are much easier to build using a graph of typed objects in InfoGrid than to shoehorn into relational tables. But then, InfoGrid can use relational databases as storage engines, so we have the best of both worlds: graphs on the front, and enterprise-friendly SQL on the back.
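The “chain of relationships of unknown length” symptom from the quote above is easy to see in code. The following is a generic breadth-first traversal over an in-memory adjacency map, written in plain Java; it is neither InfoGrid nor Neo4j code, and the class and method names are invented for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: following a relationship transitively with one traversal.
class TransitiveTraversalSketch {

    private final Map<String, Set<String>> neighbors = new HashMap<>();

    // Record an undirected "is friends with" edge.
    void relate(String a, String b) {
        neighbors.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        neighbors.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    // Breadth-first search: everyone connected to start, however long the chain.
    Set<String> reachableFrom(String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String next : neighbors.getOrDefault(current, Collections.emptySet())) {
                if (seen.add(next)) { // add() returns false for already visited nodes
                    queue.add(next);
                }
            }
        }
        seen.remove(start);
        return seen;
    }

    public static void main(String[] args) {
        TransitiveTraversalSketch graph = new TransitiveTraversalSketch();
        graph.relate("alice", "bob");
        graph.relate("bob", "carol");
        graph.relate("carol", "dave");
        graph.relate("erin", "frank"); // a separate component
        System.out.println(graph.reachableFrom("alice"));
    }
}
```

With tables, each additional hop in that chain is typically another join or another query issued from an application loop, which is exactly the symptom the list describes.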

Neo4j and InfoGrid

Just came across Neo4j, an “open source NoSQL graph database”. Neo4j is clearly very close in philosophy and API to InfoGrid, in fact closer than anything else that I’ve come across so far.

Compare this:

Neo4j:

Node firstNode = neo.createNode();
Node secondNode = neo.createNode();
firstNode.createRelationshipTo(
    secondNode,
    MyRelationshipTypes.KNOWS );

InfoGrid:

MeshObject firstObject = life.createMeshObject();
MeshObject secondObject = life.createMeshObject();
firstObject.relateAndBless(
    secondObject,
    MySubjectArea.MESHOBJECT_KNOWS_MESHOBJECT.getSource() );

If this isn’t similar, what is?

Transactions:

Neo4j:

Transaction tx = neo.beginTx();
try {
    // do something
    tx.success();
} finally {
    tx.finish();
}

InfoGrid:

Transaction tx = null;
try {
    tx = mb.createTransactionNow();
    // do something
} finally {
    if( tx != null ) {
        tx.commitTransaction();
    }
}

Properties:

Neo4j:

firstNode.setProperty(
    "Name",
    "Neo4j" );

InfoGrid:

firstObject.setPropertyValue(
    MySubjectArea.PERSON_NAME,
    StringValue.create( "InfoGrid" ));

Regarding differences, it seems the Neo4j folks have spent a lot more time than we have on making it a “database” (while InfoGrid delegates to other storage engines like MySQL or Hadoop).
On the other hand, InfoGrid.fnd is type-safe, and instead of a command-line shell, uses a set of web Viewlets to access the graph of objects. (Which then can be incrementally refactored into an application.)
And then of course there is InfoGrid.net, which does not seem to have an equivalent in Neo4j. (see also InfoGrid Core features and Neo4j wiki)

Worth digging into more deeply …

Linux Magazine Article on MongoDB

Good introduction to MongoDB at Linux Magazine. It makes the case for post-relational web application architectures quite well:

Web applications and traditional relational databases are nearing an end to their tumultuous relationship. For over a decade now, most Web applications have been built on top of relational databases, with various layers of indirection to simplify coding and boost the productivity of developers. For every Web programming language, there are any number of object-relational mapping (ORM) choices, each with pros and cons, yet none good enough that a developer can forget about SQL or ignore protecting the database. Moreover, as Web applications grow more complicated and sites need to be created faster, adapt instantly, and scale massively, these old solutions are no longer satisfying the demands of the Web.

There are a number of different projects working on new database technologies, all of which forego the stalwart relational model. Relational databases are difficult to scale, largely because distributed joins are difficult to perform efficiently. Further, mapping from the many popular dynamically-typed languages to SQL is complicated, inefficient, and time consuming. While often called the “NoSQL” movement, the need for new technologies is caused by the relational model, rather than SQL.

Beyond the relational model, there are a number of data model choices: key-value stores, tabular databases, graph databases, and document databases…

InfoGrid is built around a graph database in this terminology. Except of course, that by virtue of the Store abstraction, InfoGrid can delegate to virtually any kind of database. InfoGrid does not contain a database itself. We believe this separation of concerns makes it easiest for developers to benefit from the high-level services that InfoGrid offers, while still being able to choose whatever storage technology they prefer.

Digg Migrating Away From SQL

Excellent post by the Digg development team explaining why they have to move away from SQL, and the results they accomplished when migrating one of their features to Cassandra.

Very impressive.

InfoGrid and Relational Databases

The other day I was asked:

Your pitch for InfoGrid really disses relational databases. But then, InfoGrid applications usually use MySQL (or PostgreSQL) to store their data. What gives?

To which I responded:

All the database vendors want you to store your data in their database, instead of files in the file system. But then, the databases themselves store their data as files in the file system. What gives?

This does not sound as contradictory. It’s fine that a database stores its data as files in a file system. It may or may not; as an application developer, you really don’t care much. You care about the high-level facilities (such as SQL) that the database provides, because writing code against them is much easier and faster than writing against files (for many applications).

The InfoGrid argument is the same one, just one level up: It is much better to develop against the InfoGrid APIs than against SQL directly, because of all the high-level facilities that InfoGrid gives you. Here’s an example:

Try this with a relational database:

MeshObject employee = ...;
employee.bless( CustomerSubjectArea.CUSTOMER );

Your employee has just also become a customer, with all that this entails (e.g. participating in the relationship Customer_Places_Order, which you can’t as a mere employee). For more on blessing objects, see the documentation.

With raw SQL, you wouldn’t even know where exactly to start, but chances are you would have to redesign your schema, and write and update a whole lot of application code.
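To show what “with all that this entails” might mean mechanically, here is a toy model in plain Java. It is an illustration under assumptions, not InfoGrid’s implementation: the enums, the Obj class and the relate method are invented for this sketch, and only the type name Customer_Places_Order is borrowed from the text above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: blessing an object unlocks the relationships of the new type.
class BlessSketch {

    enum EntityType { EMPLOYEE, CUSTOMER, ORDER }

    // A relationship type names the entity type required at each end.
    enum RelationshipType {
        CUSTOMER_PLACES_ORDER(EntityType.CUSTOMER, EntityType.ORDER);

        final EntityType source;
        final EntityType destination;

        RelationshipType(EntityType source, EntityType destination) {
            this.source = source;
            this.destination = destination;
        }
    }

    static final class Obj {
        final Set<EntityType> types = new HashSet<>();
        final Set<Obj> neighbors = new HashSet<>();

        Obj(EntityType... initialTypes) {
            for (EntityType t : initialTypes) {
                types.add(t);
            }
        }

        // Add a type to an existing object at run-time.
        void bless(EntityType type) {
            types.add(type);
        }

        // Relating is only allowed if both ends carry the required types.
        void relate(RelationshipType rt, Obj other) {
            if (!types.contains(rt.source) || !other.types.contains(rt.destination)) {
                throw new IllegalStateException("Not allowed to participate in " + rt);
            }
            neighbors.add(other);
            other.neighbors.add(this); // the edge is visible from both ends
        }
    }

    public static void main(String[] args) {
        Obj employee = new Obj(EntityType.EMPLOYEE);
        Obj order = new Obj(EntityType.ORDER);

        try {
            employee.relate(RelationshipType.CUSTOMER_PLACES_ORDER, order);
        } catch (IllegalStateException e) {
            System.out.println("As a mere employee: " + e.getMessage());
        }

        employee.bless(EntityType.CUSTOMER);
        employee.relate(RelationshipType.CUSTOMER_PLACES_ORDER, order);
        System.out.println("After blessing, related to order: "
                + employee.neighbors.contains(order));
    }
}
```

Note that the object keeps its EMPLOYEE type throughout; nothing is copied into a second record, which is roughly the step a relational schema would force.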