Operations on a Graph Database (Part 2 – Edges)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

In the first post of this series, we looked at creating and deleting Nodes. Today we are looking at Edges.

Unlike simpler NoSQL data stores such as key-value stores, graph databases manage not only nodes but also edges. Edges are things that connect two other data elements, and graph datastores have them as a basic element managed by the store. Think of them as the line between two boxes; that’s exactly what they are.

Edges often take developers a while to get used to, because there isn’t much precedent in the world of software. Even the so-called “relational database” doesn’t actually have “relationships” as a first-class concept: we have to infer them from primary/foreign key declarations; and that only works if developers actually declare them, which is not all that common.

Edges don’t exist in normal code either. Pretty much all mainstream programming languages only have pointers, not relationships aka edges. Edges are bidirectional, managed things, while pointers are one-directional and not managed at all. Let’s take an example (using a simplified version of the InfoGrid API, see the FirstStep example for complete code of a basic URL tagging application):

MeshObject customer = createMeshObject(); // create first node, called MeshObject in InfoGrid
MeshObject order    = createMeshObject(); // create second node
customer.relate( order );

What did we just do?

We created a customer object, and an order object, and then we said the two are related. (The graph database makes sure the objects get persisted automatically when the Transaction is closed; not shown here as we try to stay on topic.)

If we had to do that in straight Java, we’d do something like this:

Customer customer = new Customer();
Order    order    = new Order();
customer.addOrder( order );
order.setCustomer( customer );

and we’d have to write the code to manage the edge ourselves, such as:

class Customer {
    ...
    private List<Order> ordersOfThisCustomer = new ArrayList<Order>();
}
class Order {
    ...
    private Customer customerPlacingThisOrder;
}

The question is: why do we have to do all this work for a simple 1-N relationship between customers and orders? The graph database API is much better: for one, it lets the database worry about how and when to save and restore the involved objects. It could, for example (as InfoGrid does), decide to restore the Customer object from disk but not the Order object for some time, because the application code does not currently need to know the Customer’s orders. And referential integrity is always guaranteed. For example:

customer.traverseToNeighbors().getSingleMember(); // returns the single Order object
order.traverseToNeighbors().getSingleMember(); // returns the single Customer object

// now we delete the edge
customer.unrelate( order );

customer.traverseToNeighbors().getSingleMember(); // returns null
order.traverseToNeighbors().getSingleMember(); // returns null

If there is no graph database involved, we need to do it manually, like this:

customer.removeOrder( order );
order.setCustomer( null );

… and hope that we don’t forget one of those calls, because then referential integrity would be out the window, and the next application crash is a certainty.
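To make the failure concrete, here is a minimal plain-Java sketch. The Customer and Order classes are hypothetical; only the method names are taken from the snippets above. It forgets the second call on purpose and shows that the two sides then disagree about whether the relationship exists.

```java
import java.util.ArrayList;
import java.util.List;

class ForgottenCallDemo {
    // Hypothetical hand-written classes; only the method names come from the text above.
    static class Customer {
        private final List<Order> ordersOfThisCustomer = new ArrayList<>();

        void addOrder(Order order)    { ordersOfThisCustomer.add(order); }
        void removeOrder(Order order) { ordersOfThisCustomer.remove(order); }
        List<Order> getOrders()       { return ordersOfThisCustomer; }
    }

    static class Order {
        private Customer customerPlacingThisOrder;

        void setCustomer(Customer customer) { customerPlacingThisOrder = customer; }
        Customer getCustomer()              { return customerPlacingThisOrder; }
    }

    public static void main(String[] args) {
        Customer customer = new Customer();
        Order order = new Order();
        customer.addOrder(order);
        order.setCustomer(customer);

        // Remove only one half of the relationship; order.setCustomer( null ) is forgotten.
        customer.removeOrder(order);

        // The customer no longer knows the order, but the order still points at the customer.
        System.out.println(customer.getOrders().contains(order)); // false
        System.out.println(order.getCustomer() == customer);      // true
    }
}
```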

Imagine if we wanted to restore the Customer and the Order object from disk at different times. Without help from sophisticated run-time infrastructure like a graph database, band-aid technologies such as object-relational mapping are most likely going to create a separate instance for, say, the restored Order object, and code such as List.remove( … ) is not going to work because we have two Java objects in memory that represent the same order. (Shudder.)
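The duplicate-instance problem is easy to reproduce without any database at all. In this sketch, the Order class and its orderNumber field are made up for illustration, and creating a second object with the same order number stands in for "restoring the same order from disk again". Because the class does not override equals(), List.remove falls back to object identity and removes nothing.

```java
import java.util.ArrayList;
import java.util.List;

class DuplicateInstanceDemo {
    // Hypothetical Order class that deliberately does not override equals()/hashCode().
    static class Order {
        final String orderNumber;
        Order(String orderNumber) { this.orderNumber = orderNumber; }
    }

    public static void main(String[] args) {
        List<Order> ordersOfThisCustomer = new ArrayList<>();
        ordersOfThisCustomer.add(new Order("A-1"));

        // Stand-in for a second instance of the same order, restored later.
        Order restoredLater = new Order("A-1");

        // remove(Object) compares with equals(), which defaults to identity here.
        boolean removed = ordersOfThisCustomer.remove(restoredLater);
        System.out.println(removed);                     // false
        System.out.println(ordersOfThisCustomer.size()); // 1
    }
}
```

Overriding equals() and hashCode() would fix this particular call, but it is one more thing to remember in every class.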

Of course, code could be written to manage all of this manually, but it’s much better if the platform takes care of it.
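For illustration, this is roughly what "the platform takes care of it" means at its very simplest. The Node class below is a hypothetical in-memory toy, not the InfoGrid implementation, and it ignores persistence and transactions entirely. The point is only that relate and unrelate each update both endpoints in a single call, so the two sides cannot drift apart.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

class ManagedEdgeDemo {
    // Hypothetical toy node; this is NOT how InfoGrid implements MeshObject.
    static class Node {
        private final Set<Node> neighbors = new LinkedHashSet<>();

        // Relating updates both endpoints in one call.
        void relate(Node other) {
            neighbors.add(other);
            other.neighbors.add(this);
        }

        // Unrelating also updates both endpoints in one call.
        void unrelate(Node other) {
            neighbors.remove(other);
            other.neighbors.remove(this);
        }

        // Read-only view, so callers cannot bypass relate/unrelate.
        Set<Node> neighbors() {
            return Collections.unmodifiableSet(neighbors);
        }
    }

    public static void main(String[] args) {
        Node customer = new Node();
        Node order = new Node();

        customer.relate(order);
        System.out.println(order.neighbors().contains(customer)); // true

        customer.unrelate(order);
        System.out.println(customer.neighbors().isEmpty()); // true
        System.out.println(order.neighbors().isEmpty());    // true
    }
}
```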

[The astute reader will notice that the plain Java example has one advantage: it provides type safety. I'll have to say more about this in an upcoming post about types.]

So: after working with graph databases for a while, many people believe that edges are actually a much more interesting and useful concept than nodes, just as many data modelers think that the value of a data model lies more in the way the entities are connected by relationships than in the details of the entities. Automatic management of relationships makes things simple, and that’s what any good database should do. Developers have enough to worry about, and graph databases provide real help here.

In the next post, we’ll look at types.

Operations on a Graph Database (Part 1 – Nodes)

Graph databases are still quite unfamiliar to many developers. This is the first post in a series discussing the operations a graph database makes available to the developer. Just like there are only so many different things you can do on a relational database (like CREATE TABLE or INSERT), there are only so many things you can do on a graph database. It is worth looking at them one at a time, and that’s the goal of this series. This first post is on creating and deleting nodes.

To recap, a graph database contains nodes and edges, or MeshObjects and Relationships (as we call them in InfoGrid), or Instances and Links (as the UML would call them), or Resources and Triples (as the semantic web folks would call them), or boxes and arrows (as we draw them on a white board).

Nodes are those objects in a graph database that can stand on their own; they don’t depend on anything else. Edges are those objects that depend on the existence of (typically two) other objects, their source and their destination; we think of edges as connecting nodes.

To create a node in a graph database is one of its basic operations. For example, in InfoGrid, you can simply say:

MeshObject createMeshObject()

and voila, you have one. Similarly, you can delete a node by saying:

deleteMeshObject( MeshObject toDelete )

There are a few conditions around those operations, such as that you have to have a transaction open, and that you have to have access rights to actually perform this operation, but that goes without saying.

When deleting a node, the graph database may require you to first delete all edges connected to the node before you get to delete it. Or, it may “ripple delete” all connected edges as part of the delete operation. The various graph database products differ on this; either way, it makes little difference to the developer.
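The two delete policies can be sketched side by side. ToyGraph below is a hypothetical in-memory structure that identifies nodes by name; it does not mirror the API of InfoGrid or any other product, and the method names deleteStrict and deleteWithRipple are made up for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class DeletePolicyDemo {
    // Hypothetical toy graph keyed by node name; not the API of any real product.
    static class ToyGraph {
        private final Map<String, Set<String>> adjacency = new HashMap<>();

        void createNode(String name) {
            adjacency.putIfAbsent(name, new HashSet<>());
        }

        // Both nodes are assumed to exist already.
        void relate(String a, String b) {
            adjacency.get(a).add(b);
            adjacency.get(b).add(a);
        }

        boolean contains(String name) { return adjacency.containsKey(name); }

        int edgeCount(String name) { return adjacency.get(name).size(); }

        // Policy 1: refuse to delete a node that still has edges.
        void deleteStrict(String name) {
            if (!adjacency.get(name).isEmpty()) {
                throw new IllegalStateException(name + " still has edges");
            }
            adjacency.remove(name);
        }

        // Policy 2: "ripple delete" all connected edges, then the node itself.
        void deleteWithRipple(String name) {
            // Iterate over a copy so a self-loop cannot modify the set mid-iteration.
            for (String neighbor : new HashSet<>(adjacency.get(name))) {
                adjacency.get(neighbor).remove(name);
            }
            adjacency.remove(name);
        }
    }

    public static void main(String[] args) {
        ToyGraph graph = new ToyGraph();
        graph.createNode("customer");
        graph.createNode("order");
        graph.relate("customer", "order");

        try {
            graph.deleteStrict("customer");
        } catch (IllegalStateException e) {
            System.out.println("refused: " + e.getMessage()); // refused: customer still has edges
        }

        graph.deleteWithRipple("customer");
        System.out.println(graph.contains("customer")); // false
        System.out.println(graph.edgeCount("order"));   // 0
    }
}
```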

If the graph database enforces a model (aka schema), as some graph databases do, you may need to make sure you don’t attempt to delete a node in a way that the schema would be violated. For example, if the schema says “an Order must be placed by exactly one Customer”, and you are attempting to delete the node representing the Customer, the graph database may prevent you from doing that as long as there still are nodes representing Order related to the Customer node. We’ll discuss schemas and graph databases in more detail in a later post.

For now, we learned two basic operations on a graph database:

  • create node
  • delete node.

Stay tuned for the next installment.

Comparing FirstStep in InfoGrid and Neo4j

Alex Popescu has a great comparison of how the InfoGrid FirstStep example would look in Neo4j, another graph database. As I noted in an earlier post, there are far more similarities in our approaches to the basics of graph databases than there are differences.

A couple of comments, addressing some of Alex’s notes. He says:

everything in Neo4j must happen inside a transaction even if it’s a graph traversal operation (this gives a very strong Isolation level). The InfoGrid traversal code seem to happen outside the transaction…

That’s correct. You can do the traversal inside a transaction if you like, but you are not required to. This gives application developers one more option for concurrency control: transactions, critical sections, and no protection.

Re InfoGrid terminology: its ancient roots are in object modeling (think UML); for example, we still talk about InfoGrid Models. However, over time it became clear that InfoGrid’s core ideas are distinct enough to warrant their own terms. So when we moved from InfoGrid V1 to V2 a few years ago, we changed terms. For example, an “instance” (in UML or programming terminology) aka “node” (graph terminology) can rarely have more than one type anywhere other than in InfoGrid. Think of a Java object that has more than one class, where you can dynamically add classes to and remove classes from the instance at run-time. So we call them MeshObjects rather than something people might have the wrong connotations with. The closest thing we are aware of is Perl’s “bless”, which is why we use that term.
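As a rough illustration of an object with more than one type that can change at run-time, consider the following toy. TypedNode and its method names are assumptions made for this sketch, with types reduced to plain strings; it is not the InfoGrid API.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class BlessDemo {
    // Hypothetical toy object; types are plain strings here, not real model types.
    static class TypedNode {
        private final Set<String> types = new LinkedHashSet<>();

        // Add a type to this instance at run-time.
        void bless(String typeName)   { types.add(typeName); }

        // Remove a type from this instance at run-time.
        void unbless(String typeName) { types.remove(typeName); }

        boolean isBlessedBy(String typeName) { return types.contains(typeName); }
    }

    public static void main(String[] args) {
        TypedNode node = new TypedNode();
        node.bless("Customer");
        node.bless("Employee"); // the same object now carries two types

        System.out.println(node.isBlessedBy("Customer")); // true
        System.out.println(node.isBlessedBy("Employee")); // true

        node.unbless("Employee");
        System.out.println(node.isBlessedBy("Employee")); // false
    }
}
```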

the Neo4j uses also the LuceneIndexService for indexing both the tag and web resources nodes, but that’s only because the code there makes sure not to duplicate either tags or web resources (i.e. this functionality is not present in the InfoGrid code and I don’t know how that would look like)

Correction. You are invited to modify the example and attempt to create a second MeshObject with the same identifier. You can’t (it will throw an exception at you).
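The behavior can be pictured with a toy store that refuses a second object under an identifier that is already taken. ToyStore, its create method and the exception type are all assumptions chosen for this sketch; the actual InfoGrid classes and exception are not shown here.

```java
import java.util.HashMap;
import java.util.Map;

class UniqueIdentifierDemo {
    // Hypothetical toy store; class, method and exception type are illustrative only.
    static class ToyStore {
        private final Map<String, Object> objectsById = new HashMap<>();

        // Creates a new object, or throws if the identifier is already in use.
        Object create(String identifier) {
            if (objectsById.containsKey(identifier)) {
                throw new IllegalArgumentException("identifier already in use: " + identifier);
            }
            Object created = new Object();
            objectsById.put(identifier, created);
            return created;
        }

        int size() { return objectsById.size(); }
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        store.create("http://example.com/");

        try {
            store.create("http://example.com/"); // same identifier again
        } catch (IllegalArgumentException e) {
            System.out.println("rejected duplicate");
        }
        System.out.println(store.size()); // 1
    }
}
```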

But never mind the comparatively minor differences between Neo4j and InfoGrid. We should compare this to all the stuff one would have to do with a relational database to build the same thing. Object-relational mapping anybody? No thanks …

FirstStepWithMySQL

Building on the recent InfoGrid FirstStep example, here is another: FirstStepWithMySQL.

Using the same bookmarking/tagging application as FirstStep, it shows how to persist the same MeshObjectGraph using MySQL as a key-value store. It consists of two apps:

  • the first app initializes the store, and creates a graph of objects
  • the second app retrieves the graph from the store, and traverses the graph to retrieve information.

It’s of course a trivial example, but it illustrates:

  • how easy it is in InfoGrid to keep the same application running against different storage backends with minimal code changes (in the initialization only)
  • some of the advantages of graph databases compared to other types of storage technologies: note how simple it is to traverse the graph in all directions.

Annotated source code is here.

InfoGrid 2.9.3 Released

Available for download here. This is mainly an incremental improvement/bug fix release, except:

  1. new capabilities in the ig-lid project related to what Randy Farmer called the tripartite identity pattern.
  2. new example application: FirstStepWithMySQL (see separate post).

Summary of changes:

  • fixed endless loop when Transaction open at MeshBase die time
  • Moved (Net)MeshObjectIdentifierFactory to .mesh.net packages. That allows us to tighten permissions a bit.
  • Check that localIds are at least 4 chars long. Not having this check created confusing results for users of MeshWorld where the GUI assumes this, but not the graph db.
  • Replaced StringRepresentationParseException with java.text.ParseException in most places.
  • More specific subclasses of LidInvalidCredentialException for better error reporting
  • More resilient when Gpg home dir cannot be created
  • Better TimeStampValue.toString()
  • Make it easier to create “su” Transactions with elevated privileges
  • collect all outgoing data into the same XprisoMessage; this prevents sending more than one message in response to a single incoming message
  • improved abilities to freshen Replicas
  • added ForwardReferenceTest9
  • added DelegatingNetMeshObjectIdentifierFactory as a convenience class
  • Major refactoring in module ig-lid to implement what Randy Farmer called the tripartite identity pattern. Nickname is what he calls Public ID. HasIdentifier is used to represent Login ID. LidPersona represents what he calls Account and also manages the relationships to the other items. There is a corresponding Model.
  • Added identifier-as-entered to LidAuthenticationStatus for better error reporting
  • Re-introduced LidPersona as a major concept
  • TransactionAction now carries a few member variables (MeshBase, MeshBaseLifecycleManager, MeshObjectIdentifierFactory) in order to make the writing of transactional code more concise
  • Split org.infogrid.probe.test into several test modules; makes them more manageable
  • removed a bunch of unnecessary files from ig-vendors; they only take up space and bandwidth
  • fixed multiple ModelBase bug
  • failed to load model under some circumstances when not running under the Module framework
  • added isIdentifiedBy method to org.infogrid.util.HasIdentifier
  • changed MeshObjectSet.contains( MeshObjectIdentifier ) to MeshObjectSet.contains( Identifier )
  • added FirstStepWithMySQL
  • localization for LidAbortProcessingPipelineException

InfoGrid Slides On Slideshare

Slideshare sent this last night:

“InfoGrid Core Ideas” is being tweeted more than any other document on SlideShare right now. So we’ve put it on the homepage of SlideShare.net (in the “Hot on Twitter” section).

Well done, you!

- SlideShare Team

Must have been a slow evening ;-)

Graph Databases vs. Object Databases — What’s the Difference?

Great question on Stackoverflow.com about the difference between Graph Databases and Object Databases. I answered it there, and decided to post it here as well:

Object and graph databases operate on two different levels of abstraction.

An object database’s main data elements are objects, the way we know them from an object-oriented programming language.

A graph database’s main data elements are nodes and edges.

An object database does not have the notion of a (bidirectional) edge between two things with automatic referential integrity etc. A graph database does not have the notion of a pointer that can be NULL. (Of course one can imagine hybrids.)

In terms of schema, an object database’s schema is whatever the set of classes is in the application. A graph database’s schema (whether implicit, by convention of what String labels mean, or explicit, by declaration as models as we do it in InfoGrid for example) is independent of the application. This makes it much simpler, for example, to write multiple applications against the same data using a graph database instead of an object database, because the schema is application-independent. On the other hand, using a graph database you can’t simply take an arbitrary object and persist it.

Different tools for different jobs, I would think.