Operations on a Graph Database (Part 7 – Sets)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Sets are a core concept of most databases. For example, any SQL SELECT statement in a relational database produces a set. Sets apply to Graph Databases just as well and are just as useful:

The most frequently encountered set of nodes in a Graph Database is the result of a traversal. For example, in InfoGrid, all traversal operations result in a set like this:

MeshObject    startNode     = ...; // some start node
MeshObjectSet neighborNodes = startNode.traverseToNeighbors();

We might as well have returned an array, or an Iterator over the members of the set, were it not for the fact that there are well-understood set operations that often make our jobs as developers much simpler: like set unification, intersection and so forth.

For example, in a social bookmarking application we might want to find out which sites both you and I have bookmarked. Code might look like this:

MeshObject me  = ...; // node representing me
MeshObject you = ...; // node representing you

TraversalSpecification ME_TO_BOOKMARKS_SPEC = ...;
    // how to get from a person to their bookmarks, see post on traversals
MeshObjectSet myBookmarks   = me.traverse( ME_TO_BOOKMARKS_SPEC );
MeshObjectSet yourBookmarks = you.traverse( ME_TO_BOOKMARKS_SPEC );

// Bookmarks that you and I share
MeshObjectSet sharedBookmarks = myBookmarks.intersect( yourBookmarks );

Notice how simple this code is to understand? One of the powers of sets. Or, if you know what a “minus” operation is on a set, this is immediately obvious:

// Bookmarks unique to me
MeshObjectSet myUniqueBookmarks = myBookmarks.minus( yourBookmarks );

This is clearly much simpler than writing imperative code which would have lots of loops and if/then/else’s and comparisons and perhaps indexes in it. (And seeing this might put some concerns to rest that NoSQL databases are primitive because they don’t have a SQL-like query language. I’d argue it’s less the language but the power of sets, and if you have sets you have a lot of power at your fingertips.)

To check out sets in InfoGrid, try package org.infogrid.mesh.set. Clearly much more can be done than we have so far in InfoGrid, but it’s a very useful start in our experience.

InfoGrid and Multigraphs

Graph theory distinguishes between graphs that are multigraphs and those that are not, i.e. between graphs where there may be more than one edge between any two nodes, and those graphs where there can only be one edge between any two nodes.

On the first glance, InfoGrid does not support multigraphs. On the second glance, InfoGrid is a bit more powerful than it appears; not an unusual occurrence ;-) Let me explain:

Here is how we create an edge between two nodes in InfoGrid:

MeshObject node1 = ...;
MeshObject node2 = ...;

node1.relate( node2 );

If we try to do the last statement again, InfoGrid will throw a RelatedAlreadyException. That clearly says that multiple edges between the same two nodes are not allowed.

But.

We do this in InfoGrid because unless there is some associated information with each of these edges, we cannot distinguish them and so there is no point in having two. The way we do that in InfoGrid is to bless the edge with more than one RelationshipType, such as:

node1.blessRelationship( PARENT_TO_SON, node2 );
node1.blessRelationship( CUSTOMER_TO_VENDOR, node2 );

These two relationships express that node1 is a parent of son node2, and that node2 (the son) has sold something to customer node 1 (the parent). Very different semantics associated with the same edge. Just like multiple edges in a multigraph.

So: InfoGrid supports multigraphs just fine, just perhaps not entirely as you would have expected.

Operations on a Graph Database (Part 6 – Traversals)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Traversals are the most common operations on a graph database. They are just as important for graph databases as joins are for relational databases. But of course, they are something else as graphs are not tables.

A traversal in a graph database is uniquely described by two data items:

  • a start node
  • a specification for the traversal

A traversal always leads to a set of nodes. Depending on the structure of the graph being traversed, that set may contain many, one or zero nodes. For example, if we traverse from a node representing a Person to the nodes representing their grandchildren, we may or may not find any, depending on whether they have any.

From a given node, we can traverse in many different ways (i.e. same node, different traversal specifications). Or, given the same traversal specification, we can start with different nodes.

By way of analogy, consider street directions:

  • start node: my house
  • traversal specification: first turn left, then go either straight or left.

The result of this particular traversal is a single-element set containing the neighborhood park. If you had started at the same node (my house), but gone right first, you would not have arrived at the park. If you had started at a different node (somebody else’s house), you may or may not have arrived at the park. You may not have arrived anywhere (perhaps there is no left that one can take from certain houses). Or you might have arrived in multiple places (“go either straight or left” might not take you to the same part regardless which you take, but taken you into different directions.

Graph database products seem to differ on how to deliver to the developer the set of nodes that is the result of a traversal. In InfoGrid, all traversals produce a MeshObjectSet, which, as the name says, is a set of nodes. One can then iterate over that set, for example, subset it, unify it with another or ask how many elements it has. In other products, traversals produce an iterator directly which then can be queried for one member of the result set at a time. Regardless of API details, the result of a traversal is always a set (e.g. it can’t contain duplicates.)

Just like there are many ways of giving directions, traversal specifications can be captured in many different ways. In InfoGrid, we have — you guessed it — an abstract data type called TraversalSpecification and several different classes that implement that type, such as:

  • traverse by going to all direct neighbor nodes of the start node
  • go to all neighbor nodes related with a edge of a particular type in a particular direction (e.g. “traverse from employee to their manager(s)”)
  • go N steps in sequence, where each step can be any traversal specification
  • go N steps in parallel, and unify the resulting set
  • select a subset of the found nodes based on some criteria, etc.

The FirstStep example shows some simple traversals.

And just for simplicity, InfoGrid also allows traversals starting from a set of start nodes, not just one. So we can say things like this:

MeshObject me = ...;
MeshObjectSet myParents      = me.traverse( childToParents );
MeshObjectSet myGrandParents = myParents.traverse( childToParents );

In our experience, working with sets makes complex traversals very easily understandable.

Operations on a Graph Database (Part 5 – Identifiers)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Well, “identifiers” aren’t much of an “operation”, but there are some operations related to identifiers, thus the title.

All first-class objects in a graph database typically have a unique identifier. This means nodes have unique identifiers, and for those graph databases that represent edges as distinct objects (see previous discussion on the pros and cons), they have unique identifiers, too.

This means we can ask a node for their identifier, remember the identifier, and later find the node again by looking it up in the graph database. In InfoGrid, this looks as follows:

MeshObject someNode = ...; // some MeshObject aka Node
MeshObjectIdentifier id = someNode.getIdentifier();

and later we can do this:

MeshBase mb = ...; // some MeshBase
MeshObject nodeFoundAgain = mb.findMeshObjectByIdentifier( id );

As you can see, InfoGrid uses an abstract data type called MeshObjectIdentifier, which you can think of as String for a second. (see below.) In InfoGrid, all identifiers are long-lasting. This means, your object will still have the same MeshObjectIdentifier after you rebooted your database. This has some advantages, e.g. you can define well-known objects in your graph database to which you can easily return even weeks later.

Other graph databases may use different data types as identifiers (e.g. int or long), but the use of identifiers is common with the above operations. They may or may not be the same after rebooting of the database.

Why does the type of identifier matter? Well, it depends on the application you have in mind. For InfoGrid applications, we primarily care about web applications, specifically REST-ful web applications. And so InfoGrid web applications generally use MeshObjectIdentifiers that identical to the URLs of the application. Let’s make an example:

Assume you have a URL bookmarking application which runs at http://example.com/. Let’s say a user creates tag “books”, which can be found at URL http://example.com/books/. It would be most straightforward to create a MeshObject with MeshObjectIdentifier http://example.com/books/. Which is exactly what InfoGrid does by default. No impedance mismatch between URLs that the user sees, the objects in the application layer, and the database! This leads to dramatic simplification of development and debugging.

Operations on a Graph Database (Part 4 – Properties)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Today we’re looking at properties. There are a few different philosophies that a graph database might employ.

1. The purists often argue that properties aren’t needed at all: all properties can be modeled as edges to separate nodes, each of which represents a value. That’s of course true at some level: instead of a node representing a person, for example, that “contains” the person’s FirstName, LastName and DateOfBirth, one could create two String nodes and a TimeStamp node, and connect the with edges representing “first name”, “last name” and “date of birth”.

The non-purists counter that for practical purposes, it is much simpler to think of these data elements as properties instead of as independent things that are related. For example, it makes deletion of the Person much simpler (and we don’t need to implement cascading delete rules). Also, there are performance tradeoffs: if the three properties are stored with their owning node, for example, a single read is required to restore from disk the node and all of its properties. This would require at least 4 (perhaps 7, depending on how edges are stored in the graph database) reads if stored independently.

In InfoGrid, we believe in properties. We don’t prevent anybody from creating as many edges as they like, of course, but think that properties definitely have their uses.

2. Properties have to be named in some fashion, and the simplest approach — used by a number of graph database projects — is to give them a String label as a name. Correspondingly, the essence of the property API using Strings as labels would look like this:

public Object getPropertyValue( String name );
public void setPropertyValue( String name, Object value );

The advantage of this model is obviously that it is very simple. The disadvantage is that for complex schemas or models created by multiple development teams, name conflicts and spelling errors for property names occur more frequently than one would like. At least that is our experience when building InfoGrid applications, which is why we prefer the next alternative:

3. Properties are identified by true meta-data objects. We call them PropertyTypes, and they are part of what developers define when defining an InfoGrid Model. So the InfoGrid property API looks like this:

public Object getPropertyValue( PropertyType name );
public void setPropertyValue( PropertyType name, Object value );

We’ll have more to say on the subject of meta-data and Models in a future post.

Finally, we need to discuss what in a graph database can carry properties. Everybody other than the purists (see above) agree that nodes (called MeshObjects in InfoGrid) can carry properties. Some graph database projects (like the now-obsolete InfoGrid V1) also allow properties on edges (called Relationships in InfoGrid). Others (InfoGrid today) do not allow that.

It may sound peculiar that we had what looks like a more powerful approach in an earlier InfoGrid version but not any more. Here is what we observed in our practice with InfoGrid:

  • Properties on edges are fairly rare compared to Properties on nodes. We’ve been involved in several projects over the years where the Models were substantial and not a single property was found on any edge; nor did anybody ask for one.
  • If a property is needed on an edge, there is an easy workaround known as “associative entity” in data modeling circles: simply create an intermediary node that carries the property.
  • The deciding factor was performance: if properties are rarely needed on edges, it is possible to traverse from one node to a neighbor node in a single step. If properties are needed on edges, the edge needs to be represented as a separate object, and a traversal from one node to its neighbor requires two steps: from the start node to the connecting edge, and from the edge to the destination node. So not having properties on edges can improve performance by a 100%. Which is why we got rid of them for InfoGrid V2.

In the next post, we will look at data types for properties.

Operations on a Graph Database (Part 3 – Types)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

We need to talk about properties, but before we can do that, we need to talk about types. In a previous post we looked at the various alternatives for typing and their pros and cons. Graph databases may take the two following approaches to typing, and anything in between:

  • No types at all: nodes, edges and properties have no types; anything goes.
  • Fully typed: nodes, edges and properties are all strongly typed; the graph database refuses to perform an operation that is not consistent with the type system.

There are some graph databases on the market that require types for some elements (edges, for example) but don’t allow types for others (nodes, or properties), so hybrids are possible.

A totally untyped graph database has it easy: it allows an arbitrary number of properties, distinguished only by name and with arbitrary values. So the operations are:

Defined on Node and/or Edge:

setProperty( String label, Object value );  // returns nothing
getProperty( String label );                // returns the value
getAllProperties();                         // returns a set of all current property labels

A fully typed graph database needs to offer a much larger set of operations. They are:

Defined on Node:

bless( NodeType type );   // returns nothing
unbless( NodeType type ); // returns nothing
getTypes();               // returns a set of all current NodeTypes
setProperty( PropertyType type, Object value ); // returns nothing
getProperty( PropertyType type );               // returns the value of the Property

Defined on Edge:

bless( EdgeType type );   // returns nothing
unbless( EdgeType type ); // returns nothing
getTypes();               // returns a set of all current EdgeTypes
setProperty( PropertyType type, Object value ); // returns nothing
getProperty( PropertyType type );               // returns the value of the Property

To determine the available properties, one examines the meta-data available in the type system:

Defined on NodeType and on EdgeType

getPropertyTypes();   // returns a set of all possible properties defined by the type

The abstract API I’m listing here is the general case for a fully dynamically-typed hypothetical graph database. No product that I’m aware of implements them all, but InfoGrid comes close. If you read the JavaDoc for MeshObject, for example (MeshObject is InfoGrid’s name for Node), you will find all of the above Node operations. The operations are also there for Relationships (InfoGrid’s name for Edge) except that for efficiency reasons 1) they are defined on MeshObject, and 2) InfoGrid does not support properties on edges. (InfoGrid V1 used to, but we decided to eliminate that support; will talk about that some other time).

If a graph database was fully typed but using a static type system, we obviously could not bless either nodes or edges with new types at run-time as this API defines. (InfoGrid V1 used to be that way; InfoGrid V2 is fully dynamic).

In the next part, we’ll talk about properties then.

Three’s a Crowd: Neo4j, Sones, Filament all implement InfoGrid’s FirstStep Example

Little did I know when I put up InfoGrid’s FirstStep example. The example creates just a few nodes and a few edges to show, in principle, how to build a URL tagging application based on a graph database like InfoGrid.

Alex Popescu at MyNoSQL challenged the Neo4j folks how they would implement it, and they responded promptly. Then, the guys are Sones implemented the same example themselves, and just now the Filament project did the same. Worth a blog post with the links!

Here they are:

for your reading and comparing pleasure.

I’m tempted to list my own observations, but I’d like to avoid a blogging contest in which — naturally — everybody will claim “but the way we do it is better”. Independent reviews anybody?

Strong and Weak Typing With Graph Databases

Whether programming systems should be strongly typed or weakly typed has been one of the longest-running controversies in the history of computer science going back something like 50 years. Generally speaking, strongly typed systems tend to require more programmer effort up-front, in exchange for earlier or more definite error reports.

We also need to distinguish between static typing and dynamic typing: a dynamically typed system enables changes of types at run-time, while a statically typed system can’t do that.

Not surprisingly, typing for graph databases (or any other kind of NoSQL database) can be implemented in different ways, too:

Weakly typed Strongly typed
Dynamically typed

At development time: types may be declared but are not checked except perhaps rudimentarily.

At run-time: errors may occur, which may or may not be discovered; mis-interpretations of data are possible; data corruption is likely in case of programming errors.

At development time: types are declared and checked as well as possible.

At run-time: all operations are checked for type safety; types can be discovered dynamically; type mis-interpretations are not possible.

Statically typed

At development time: only rudimentary checking, if at all

At run-time: errors may occur, which may or may not be discovered; mis-interpretations of data are possible; data corruption is likely in case of programming errors.

At development time: all type errors are caught; additional developer effort is required; some types of data are hard to represent

At run-time: no checking required due to “correctness by construction”.

Let’s insert some systems into this table:

Weakly typed Strongly typed
Dynamically typed Most NoSQL systems InfoGrid
Statically typed SQL database (if used as intended)

Side note: when NoSQL proponents argue that weakly typed systems are much better than stronger-typed SQL, they sometimes throw out the baby with the bath water: there are four choices, not two. We agree that statically, strongly typed systems like a typical SQL database has considerable disadvantages in a fast-moving world, but so do weakly typed systems; the only difference is the type of disadvantage. In our view, a strong but dynamic type system is the best compromise for most applications with a non-trivial schema, which is why InfoGrid V2 implements it. (There are some applications that do not require a non-trivial; web caching for example.)

In a graph database like InfoGrid, the following items can be typed:

  • Nodes
  • Edges
  • Properties

In other graph databases, only a subset of these items may be typed. More in the next post on types.

Access Control the InfoGrid Way

We do it very similarly in InfoGrid.

But we can go one big step further: have InfoGrid automatically enforce the access control rules that were set up. If we have the ACL information, why not use it and have the graph database do the enforcement for us? That functionality has been part of InfoGrid for a couple of years.

For some detailed examples how this works, consult the security tests that are part of InfoGrid’s automated test suite (particularly MeshBaseSecurityTest5).

Here’s the basic idea. (again, paraphrasing the code for easier readability. Consult the above link for full code.) When a graph database in InfoGrid is configured to run with a AclBasedAccessManager, we can do this:

First, we need a MeshObject that is going to be the owner of some access-controlled data object. Any MeshObject will do:

MeshObject owner = createMeshObject();

Then, we associate the owner with the current Thread. (Just like in UNIX, where the ownership of processes determines what the process can do)

theAccessManager.setCaller( owner );

Now here comes the access-controlled data object.

MeshObject data = createMeshObject();

Because the current Thread is associated with the owner MeshObject, InfoGrid automatically sets up an ownership relationship between the data object and the owner object — just like in UNIX, a newly created file automatically has an owner.

Going beyond UNIX, we can now put the data object into something we call a ProtectionDomain. It’s basically a collection of MeshObjects that all have the same access control policy. This is mainly for efficiency and easy of management.

MeshObject protectionDomain = createMeshObject( AclBasedSecuritySubjectArea.PROTECTIONDOMAIN );
domain.relateAndBless( AclBasedSecuritySubjectArea.PROTECTIONDOMAIN_GOVERNS_MESHOBJECT.getSource(), dataObject );

Now, let’s give some another entity some access rights to the data object:

MeshObject actorMayReadNotWrite = createMeshObject();
actorMayReadNotWrite.relateAndBless( AclBasedSecuritySubjectArea.MESHOBJECT_HASREADACCESSTO_PROTECTIONDOMAIN.getSource(), domain );

Note that it is the owner of the object that needs to do that; others can’t.

So now we change ownership on the thread.

theAccessManager.setCaller( actorMayReadNotWrite );

This call will succeed:

dataObject.getPropertyValue( <some property type> );

while this call will throw a NotPermittedException:

dataObject.setPropertyValue( <some property type>, <some property value> )

If the thread was currently associated with the owner, both calls would succeed. Again, I refer you to the follow code linked above. As you can say, it works very similar to how permissions work in UNIX, although of course the underlying ACL information is represented as a MeshObjectGraph.

If you like this, we can do even one better: the whole security mechanism is pluggable in InfoGrid. You don’t like the way we represent and enforce ACLs? Be our guest … write a new subclass of AccessManager, and it will work the way you want. (Did we say that InfoGrid is extremely pluggable?)

P.S. It’s great to see that we aren’t the only ones to think that security-related information is an excellent match for a graph database. There’s also the rather intriguing example for where Microsoft is going with their LDAP directory, which very much looks like evolution in the graph direction. Time to get on board graph databases!

Operations on a Graph Database (Part 2 – Edges)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

In the first post of this series, we looked at creating and deleting Nodes. Today we are looking at Edges.

Unlike simpler NoSQL data stores like key-value stores, graph databases not only manage nodes, but also edges. Edges are things that connect two other data elements, and graph datastores have them as a basic element managed by the store. Think of them as the line between two boxes; that’s exactly what they are.

Edges often take developers a while to get used to, because there isn’t much precedent in the world of software. Even the so-called “relational database” doesn’t actually have “relationships” as a first-class concept: we have to infer them from primary/foreign key declarations; and that only works if developers actually declare them, which is not all that common.

Edges don’t exist in normal code either. Pretty much all mainstream programming languages only have pointers, not relationships aka edges. Edges are bidirectional, managed things, while pointers are one-directional and not managed at all. Let’s take an example (using a simplified version of the InfoGrid API, see the FirstStep example for complete code of a basic URL tagging application):

MeshObject customer = createMeshObject(); // create first node, called MeshObject in InfoGrid
MeshObject order    = createMeshObject(); // create second node
customer.relate( order );

What did we just do?

We created a customer object, and an order object, and then we said the two are related. (The graph database makes sure the objects get persisted automatically when the Transaction is closed; not shown here as we try to stay on topic.)

If we had to do that in straight Java, we’d do something like this:

Customer customer = new Customer();
Order    order    = new Order();
customer.addOrder( order );
order.setCustomer( customer );

and we’d have to write the code to manage the edge ourselves, such as:

class Customer {
    ...
    private List<Order> ordersOfThisCustomer = new ArrayList<Order>();
}
class Order {
    ...
    private Customer customerPlacingThisOrder;
}

The question is: why do we have to do all this work for a simple 1-N relationship between customers and orders? The graph database API is much better: for one, it lets the database worry about how and when to save and restore the involved objects. It could, for example, (as InfoGrid does), decide to restore from disk the Customer object but not the Order object for some time because the application code does not currently need to know the Customer’s orders. And referential integrity is always guaranteed. For example:

customer.traverseToNeighbors().getSingleMember(); // returns the single Order object
order.traverseToNeighbors().getSingleMember(); // returns the single Customer object

// now we delete the edge
customer.unrelate( order );

customer.traverseToNeighbors().getSingleMember(); // returns null
order.traverseToNeighbors().getSingleMember(); // returns null

If there is no graph database involved, we need to do it manually, like this:

customer.removeOrder( order );
order.setCustomer( null );

… and hope that we don’t forget one of those calls, because then referential integrity would be out the window, and the next application crash is a certainty.

Imagine if we wanted to restore the Customer and the Order object at different times from disk. Without help from sophisticated run-time infrastructure like a graph database, band-aid-technologies such as object-relational mapping is most likely going to create a separate instance for, say, the restored Order object, and code such as List.remove( … ) is not going to work because we have two Java objects in memory that represent the same order. (Shudder.)

Of course, code could be written to manage all of this manually, but it’s much better if the platform takes care of it.

[The astute reader will notice that the plain Java example has one advantage: it provides type safety. I'll have to say more about this in an upcoming post about types.]

So: after working with graph databases for a while, many people believe that edges are actually the much more interesting and useful concept than nodes. Just like many data modelers think that the value of a data model is often more in the way the entities are connected by relationships than the details of the entities. Automatic management of relationships make things simple, and that’s what any good database should do. Developers have enough to worry about, and graph databases provide real help here.

In the next post, we’ll look at types.