Operations on a Graph Database (Part 5 – Identifiers)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Well, “identifiers” aren’t much of an “operation”, but there are some operations related to identifiers, thus the title.

All first-class objects in a graph database typically have a unique identifier. This means nodes have unique identifiers, and for those graph databases that represent edges as distinct objects (see previous discussion on the pros and cons), they have unique identifiers, too.

This means we can ask a node for their identifier, remember the identifier, and later find the node again by looking it up in the graph database. In InfoGrid, this looks as follows:

MeshObject someNode = ...; // some MeshObject aka Node
MeshObjectIdentifier id = someNode.getIdentifier();

and later we can do this:

MeshBase mb = ...; // some MeshBase
MeshObject nodeFoundAgain = mb.findMeshObjectByIdentifier( id );

As you can see, InfoGrid uses an abstract data type called MeshObjectIdentifier, which you can think of as String for a second. (see below.) In InfoGrid, all identifiers are long-lasting. This means, your object will still have the same MeshObjectIdentifier after you rebooted your database. This has some advantages, e.g. you can define well-known objects in your graph database to which you can easily return even weeks later.

Other graph databases may use different data types as identifiers (e.g. int or long), but the use of identifiers is common with the above operations. They may or may not be the same after rebooting of the database.

Why does the type of identifier matter? Well, it depends on the application you have in mind. For InfoGrid applications, we primarily care about web applications, specifically REST-ful web applications. And so InfoGrid web applications generally use MeshObjectIdentifiers that identical to the URLs of the application. Let’s make an example:

Assume you have a URL bookmarking application which runs at http://example.com/. Let’s say a user creates tag “books”, which can be found at URL http://example.com/books/. It would be most straightforward to create a MeshObject with MeshObjectIdentifier http://example.com/books/. Which is exactly what InfoGrid does by default. No impedance mismatch between URLs that the user sees, the objects in the application layer, and the database! This leads to dramatic simplification of development and debugging.

Operations on a Graph Database (Part 4 – Properties)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Today we’re looking at properties. There are a few different philosophies that a graph database might employ.

1. The purists often argue that properties aren’t needed at all: all properties can be modeled as edges to separate nodes, each of which represents a value. That’s of course true at some level: instead of a node representing a person, for example, that “contains” the person’s FirstName, LastName and DateOfBirth, one could create two String nodes and a TimeStamp node, and connect the with edges representing “first name”, “last name” and “date of birth”.

The non-purists counter that for practical purposes, it is much simpler to think of these data elements as properties instead of as independent things that are related. For example, it makes deletion of the Person much simpler (and we don’t need to implement cascading delete rules). Also, there are performance tradeoffs: if the three properties are stored with their owning node, for example, a single read is required to restore from disk the node and all of its properties. This would require at least 4 (perhaps 7, depending on how edges are stored in the graph database) reads if stored independently.

In InfoGrid, we believe in properties. We don’t prevent anybody from creating as many edges as they like, of course, but think that properties definitely have their uses.

2. Properties have to be named in some fashion, and the simplest approach — used by a number of graph database projects — is to give them a String label as a name. Correspondingly, the essence of the property API using Strings as labels would look like this:

public Object getPropertyValue( String name );
public void setPropertyValue( String name, Object value );

The advantage of this model is obviously that it is very simple. The disadvantage is that for complex schemas or models created by multiple development teams, name conflicts and spelling errors for property names occur more frequently than one would like. At least that is our experience when building InfoGrid applications, which is why we prefer the next alternative:

3. Properties are identified by true meta-data objects. We call them PropertyTypes, and they are part of what developers define when defining an InfoGrid Model. So the InfoGrid property API looks like this:

public Object getPropertyValue( PropertyType name );
public void setPropertyValue( PropertyType name, Object value );

We’ll have more to say on the subject of meta-data and Models in a future post.

Finally, we need to discuss what in a graph database can carry properties. Everybody other than the purists (see above) agree that nodes (called MeshObjects in InfoGrid) can carry properties. Some graph database projects (like the now-obsolete InfoGrid V1) also allow properties on edges (called Relationships in InfoGrid). Others (InfoGrid today) do not allow that.

It may sound peculiar that we had what looks like a more powerful approach in an earlier InfoGrid version but not any more. Here is what we observed in our practice with InfoGrid:

  • Properties on edges are fairly rare compared to Properties on nodes. We’ve been involved in several projects over the years where the Models were substantial and not a single property was found on any edge; nor did anybody ask for one.
  • If a property is needed on an edge, there is an easy workaround known as “associative entity” in data modeling circles: simply create an intermediary node that carries the property.
  • The deciding factor was performance: if properties are rarely needed on edges, it is possible to traverse from one node to a neighbor node in a single step. If properties are needed on edges, the edge needs to be represented as a separate object, and a traversal from one node to its neighbor requires two steps: from the start node to the connecting edge, and from the edge to the destination node. So not having properties on edges can improve performance by a 100%. Which is why we got rid of them for InfoGrid V2.

In the next post, we will look at data types for properties.

Operations on a Graph Database (Part 3 – Types)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

We need to talk about properties, but before we can do that, we need to talk about types. In a previous post we looked at the various alternatives for typing and their pros and cons. Graph databases may take the two following approaches to typing, and anything in between:

  • No types at all: nodes, edges and properties have no types; anything goes.
  • Fully typed: nodes, edges and properties are all strongly typed; the graph database refuses to perform an operation that is not consistent with the type system.

There are some graph databases on the market that require types for some elements (edges, for example) but don’t allow types for others (nodes, or properties), so hybrids are possible.

A totally untyped graph database has it easy: it allows an arbitrary number of properties, distinguished only by name and with arbitrary values. So the operations are:

Defined on Node and/or Edge:

setProperty( String label, Object value );  // returns nothing
getProperty( String label );                // returns the value
getAllProperties();                         // returns a set of all current property labels

A fully typed graph database needs to offer a much larger set of operations. They are:

Defined on Node:

bless( NodeType type );   // returns nothing
unbless( NodeType type ); // returns nothing
getTypes();               // returns a set of all current NodeTypes
setProperty( PropertyType type, Object value ); // returns nothing
getProperty( PropertyType type );               // returns the value of the Property

Defined on Edge:

bless( EdgeType type );   // returns nothing
unbless( EdgeType type ); // returns nothing
getTypes();               // returns a set of all current EdgeTypes
setProperty( PropertyType type, Object value ); // returns nothing
getProperty( PropertyType type );               // returns the value of the Property

To determine the available properties, one examines the meta-data available in the type system:

Defined on NodeType and on EdgeType

getPropertyTypes();   // returns a set of all possible properties defined by the type

The abstract API I’m listing here is the general case for a fully dynamically-typed hypothetical graph database. No product that I’m aware of implements them all, but InfoGrid comes close. If you read the JavaDoc for MeshObject, for example (MeshObject is InfoGrid’s name for Node), you will find all of the above Node operations. The operations are also there for Relationships (InfoGrid’s name for Edge) except that for efficiency reasons 1) they are defined on MeshObject, and 2) InfoGrid does not support properties on edges. (InfoGrid V1 used to, but we decided to eliminate that support; will talk about that some other time).

If a graph database was fully typed but using a static type system, we obviously could not bless either nodes or edges with new types at run-time as this API defines. (InfoGrid V1 used to be that way; InfoGrid V2 is fully dynamic).

In the next part, we’ll talk about properties then.

Strong and Weak Typing With Graph Databases

Whether programming systems should be strongly typed or weakly typed has been one of the longest-running controversies in the history of computer science going back something like 50 years. Generally speaking, strongly typed systems tend to require more programmer effort up-front, in exchange for earlier or more definite error reports.

We also need to distinguish between static typing and dynamic typing: a dynamically typed system enables changes of types at run-time, while a statically typed system can’t do that.

Not surprisingly, typing for graph databases (or any other kind of NoSQL database) can be implemented in different ways, too:

Weakly typed Strongly typed
Dynamically typed

At development time: types may be declared but are not checked except perhaps rudimentarily.

At run-time: errors may occur, which may or may not be discovered; mis-interpretations of data are possible; data corruption is likely in case of programming errors.

At development time: types are declared and checked as well as possible.

At run-time: all operations are checked for type safety; types can be discovered dynamically; type mis-interpretations are not possible.

Statically typed

At development time: only rudimentary checking, if at all

At run-time: errors may occur, which may or may not be discovered; mis-interpretations of data are possible; data corruption is likely in case of programming errors.

At development time: all type errors are caught; additional developer effort is required; some types of data are hard to represent

At run-time: no checking required due to “correctness by construction”.

Let’s insert some systems into this table:

Weakly typed Strongly typed
Dynamically typed Most NoSQL systems InfoGrid
Statically typed SQL database (if used as intended)

Side note: when NoSQL proponents argue that weakly typed systems are much better than stronger-typed SQL, they sometimes throw out the baby with the bath water: there are four choices, not two. We agree that statically, strongly typed systems like a typical SQL database has considerable disadvantages in a fast-moving world, but so do weakly typed systems; the only difference is the type of disadvantage. In our view, a strong but dynamic type system is the best compromise for most applications with a non-trivial schema, which is why InfoGrid V2 implements it. (There are some applications that do not require a non-trivial; web caching for example.)

In a graph database like InfoGrid, the following items can be typed:

  • Nodes
  • Edges
  • Properties

In other graph databases, only a subset of these items may be typed. More in the next post on types.

Access Control the InfoGrid Way

We do it very similarly in InfoGrid.

But we can go one big step further: have InfoGrid automatically enforce the access control rules that were set up. If we have the ACL information, why not use it and have the graph database do the enforcement for us? That functionality has been part of InfoGrid for a couple of years.

For some detailed examples how this works, consult the security tests that are part of InfoGrid’s automated test suite (particularly MeshBaseSecurityTest5).

Here’s the basic idea. (again, paraphrasing the code for easier readability. Consult the above link for full code.) When a graph database in InfoGrid is configured to run with a AclBasedAccessManager, we can do this:

First, we need a MeshObject that is going to be the owner of some access-controlled data object. Any MeshObject will do:

MeshObject owner = createMeshObject();

Then, we associate the owner with the current Thread. (Just like in UNIX, where the ownership of processes determines what the process can do)

theAccessManager.setCaller( owner );

Now here comes the access-controlled data object.

MeshObject data = createMeshObject();

Because the current Thread is associated with the owner MeshObject, InfoGrid automatically sets up an ownership relationship between the data object and the owner object — just like in UNIX, a newly created file automatically has an owner.

Going beyond UNIX, we can now put the data object into something we call a ProtectionDomain. It’s basically a collection of MeshObjects that all have the same access control policy. This is mainly for efficiency and easy of management.

MeshObject protectionDomain = createMeshObject( AclBasedSecuritySubjectArea.PROTECTIONDOMAIN );
domain.relateAndBless( AclBasedSecuritySubjectArea.PROTECTIONDOMAIN_GOVERNS_MESHOBJECT.getSource(), dataObject );

Now, let’s give some another entity some access rights to the data object:

MeshObject actorMayReadNotWrite = createMeshObject();
actorMayReadNotWrite.relateAndBless( AclBasedSecuritySubjectArea.MESHOBJECT_HASREADACCESSTO_PROTECTIONDOMAIN.getSource(), domain );

Note that it is the owner of the object that needs to do that; others can’t.

So now we change ownership on the thread.

theAccessManager.setCaller( actorMayReadNotWrite );

This call will succeed:

dataObject.getPropertyValue( <some property type> );

while this call will throw a NotPermittedException:

dataObject.setPropertyValue( <some property type>, <some property value> )

If the thread was currently associated with the owner, both calls would succeed. Again, I refer you to the follow code linked above. As you can say, it works very similar to how permissions work in UNIX, although of course the underlying ACL information is represented as a MeshObjectGraph.

If you like this, we can do even one better: the whole security mechanism is pluggable in InfoGrid. You don’t like the way we represent and enforce ACLs? Be our guest … write a new subclass of AccessManager, and it will work the way you want. (Did we say that InfoGrid is extremely pluggable?)

P.S. It’s great to see that we aren’t the only ones to think that security-related information is an excellent match for a graph database. There’s also the rather intriguing example for where Microsoft is going with their LDAP directory, which very much looks like evolution in the graph direction. Time to get on board graph databases!

Operations on a Graph Database (Part 2 – Edges)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

In the first post of this series, we looked at creating and deleting Nodes. Today we are looking at Edges.

Unlike simpler NoSQL data stores like key-value stores, graph databases not only manage nodes, but also edges. Edges are things that connect two other data elements, and graph datastores have them as a basic element managed by the store. Think of them as the line between two boxes; that’s exactly what they are.

Edges often take developers a while to get used to, because there isn’t much precedent in the world of software. Even the so-called “relational database” doesn’t actually have “relationships” as a first-class concept: we have to infer them from primary/foreign key declarations; and that only works if developers actually declare them, which is not all that common.

Edges don’t exist in normal code either. Pretty much all mainstream programming languages only have pointers, not relationships aka edges. Edges are bidirectional, managed things, while pointers are one-directional and not managed at all. Let’s take an example (using a simplified version of the InfoGrid API, see the FirstStep example for complete code of a basic URL tagging application):

MeshObject customer = createMeshObject(); // create first node, called MeshObject in InfoGrid
MeshObject order    = createMeshObject(); // create second node
customer.relate( order );

What did we just do?

We created a customer object, and an order object, and then we said the two are related. (The graph database makes sure the objects get persisted automatically when the Transaction is closed; not shown here as we try to stay on topic.)

If we had to do that in straight Java, we’d do something like this:

Customer customer = new Customer();
Order    order    = new Order();
customer.addOrder( order );
order.setCustomer( customer );

and we’d have to write the code to manage the edge ourselves, such as:

class Customer {
    ...
    private List<Order> ordersOfThisCustomer = new ArrayList<Order>();
}
class Order {
    ...
    private Customer customerPlacingThisOrder;
}

The question is: why do we have to do all this work for a simple 1-N relationship between customers and orders? The graph database API is much better: for one, it lets the database worry about how and when to save and restore the involved objects. It could, for example, (as InfoGrid does), decide to restore from disk the Customer object but not the Order object for some time because the application code does not currently need to know the Customer’s orders. And referential integrity is always guaranteed. For example:

customer.traverseToNeighbors().getSingleMember(); // returns the single Order object
order.traverseToNeighbors().getSingleMember(); // returns the single Customer object

// now we delete the edge
customer.unrelate( order );

customer.traverseToNeighbors().getSingleMember(); // returns null
order.traverseToNeighbors().getSingleMember(); // returns null

If there is no graph database involved, we need to do it manually, like this:

customer.removeOrder( order );
order.setCustomer( null );

… and hope that we don’t forget one of those calls, because then referential integrity would be out the window, and the next application crash is a certainty.

Imagine if we wanted to restore the Customer and the Order object at different times from disk. Without help from sophisticated run-time infrastructure like a graph database, band-aid-technologies such as object-relational mapping is most likely going to create a separate instance for, say, the restored Order object, and code such as List.remove( … ) is not going to work because we have two Java objects in memory that represent the same order. (Shudder.)

Of course, code could be written to manage all of this manually, but it’s much better if the platform takes care of it.

[The astute reader will notice that the plain Java example has one advantage: it provides type safety. I'll have to say more about this in an upcoming post about types.]

So: after working with graph databases for a while, many people believe that edges are actually the much more interesting and useful concept than nodes. Just like many data modelers think that the value of a data model is often more in the way the entities are connected by relationships than the details of the entities. Automatic management of relationships make things simple, and that’s what any good database should do. Developers have enough to worry about, and graph databases provide real help here.

In the next post, we’ll look at types.

Operations on a Graph Database (Part 1 – Nodes)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Graph databases are still quite unfamiliar to many developers. This is the first post in a series discussing the operations a graph database makes available to the developer. Just like there are only so many different things you can do on a relational database (like CREATE TABLE or INSERT), there are only so many things you can do on a graph database. It is worth looking at them one at a time, and that’s the goal of this series. This first post is on creating and deleting nodes.

To recap, a graph database contains nodes and edges, or MeshObjects and Relationships (as we call them in InfoGrid), or Instances and Links (as the UML would call them), or Resources and Triples (as the semantic web folks would call them), or boxes and arrows (as we draw them on a white board).

Nodes are those objects in a graph database that can stand on their own, they don’t depend on anything else. Edges are those objects that depend on the existence of (typically two) other objects, their source and their destination; we think of edges as connecting nodes.

To create a node in a graph database is one of its basic operations. For example, in InfoGrid, you can simply say:

MeshObject createMeshObject()

and voila, you have one. Similarly, you can delete a node by saying:

deleteMeshObject( MeshObject toDelete )

There are few conditions around those operations, such as that you have to have a transaction open, and that you have to have access rights to actually perform this operation, but that goes without saying.

When deleting a node, the graph database may require you to first delete all edges connected to the node before you get to delete it. Or, it may “ripple delete” all connected edges as part of the delete operation. There are some differences in the various graph database products on this; neither will make much of a difference to the developer.

If the graph database enforces a model (aka schema), as some graph databases do, you may need to make sure you don’t attempt to delete a node in a way that the schema would be violated. For example, if the schema says “an Order must be placed by exactly one Customer”, and you are attempting to delete the node representing the Customer, the graph database may prevent you from doing that as long as there still are nodes representing Order related to the Customer node. We’ll discuss schemas and graph databases in more detail in a later post.

For now, we learned two basic operations on a graph database:

  • create node
  • delete node.

Stay tuned for the next installment.

FirstStepWithMySQL

Building on the recent InfoGrid FirstStep example, here is another: FirstStepWithMySQL.

Using the same bookmarking/tagging application as FirstStep, it shows how to persist the same MeshObjectGraph using MySQL as a key-value store. It consists of two apps:

  • the first app initializes the store, and creates a graph of objects
  • the second app retrieves the graph from the store, and traverses the graph to retrieve information.

It’s of course a trivial example, but it illustrates:

  • how easy it is in InfoGrid to keep the same application running against different storage backends with minimal code changes (in the initialization only)
  • some of the advantages of graph databases compared to other types of storage technologies: note how simple it is to traverse the graph in all directions.

Annotated source code is here.

Web-based log4j configuration

Usually a reconfiguration of log4j requires an application restart. That’s annoying, particularly if it is necessary to track what an application is doing in a production environment. It would be nice to be able to change logging levels without having to re-deploy the application.

So I created a Viewlet that allows the user to reconfigure log4j logging levels of a running InfoGrid application from the web browser. Screen shot to the right. Nothing particularly fancy, but useful. Obviously, you can set any access control for it that you like in the application’s ViewletFactory.

The Viewlet has been added to MeshWorld and NetMeshWorld in trunk. The code is in a new project called org.infogrid.jee.viewlet.log4j.

Tracking XPRISO Messages

XPRISO is the messaging protocol used by InfoGrid to keep distributed information consistent. The acronym stands for “eXtensible Protocol for the Replication, Integration and Synchronization of distributed Objects”.

The current version of InfoGrid uses XPRISO mostly for communication between an application’s main NetMeshBase, and the ShadowMeshBases that track (“shadow”) external data feeds.

Sometimes it is really useful to track XPRISO messages.The simplest way o track all XPRISO messages in your application is this:

  1. Determine the classes that your application instantiates that implement the NetMeshBases communicating via XPRISO. For example, that might be org.infogrid.meshbase.net.local.m.LocalNetMMeshBase and org.infogrid.probe.shadow.m.MShadowMeshBase, or perhaps org.infogrid.meshbase.net.local.store.IterableLocalNetStoreMeshBase and org.infogrid.probe.shadow.store.StoreShadowMeshBase.
  2. Set their logging level to INFO.

This might be as simple as adding the following two lines to your Log.properties file:

log4j.category.org.infogrid.meshbase.net.local.store.IterableLocalNetStoreMeshBase=INFO
log4j.category.org.infogrid.probe.shadow.store.StoreShadowMeshBase=INFO

Voilá, all XPRISO messages are written to your logger. (Some detail is omitted to make the log readable.)

Here is an actual example:

INFO  2009-10-12 15:32:14,992 [pool-1-thread-1] shadow.store.StoreShadowMeshBase (Log4jLog.java:152)
 @ org.infogrid.util.logging.log4j.Log4jLog.logInfo:152
 - org.infogrid.meshbase.net.xpriso.SimpleXprisoMessage@5baec83c{
    route: http://localhost:8085/app1/ -> http://localhost:8085/app2/, requestId: 1255386734969
    requestedFirstTime:
        #http://localhost:8085/app2/
}

INFO  2009-10-12 15:32:15,018 [pool-1-thread-2] local.store.IterableLocalNetStoreMeshBase (Log4jLog.java:152)
@ org.infogrid.util.logging.log4j.Log4jLog.logInfo:152
 - org.infogrid.meshbase.net.xpriso.ParserFriendlyXprisoMessage@4e940b1e{
    route: http://localhost:8085/app2/ -> http://localhost:8085/app1/, responseId: 1255386734969
    conveyedMeshObjects:
        id: http://localhost:8085/app2/
}

Here, the first NetMeshBase asked the second for an object, and promptly got a replica in return.

If you don’t like the logging format, or want to log them differently, you can use your own XprisoMessageLogger simply by invoking setXprisoMessageLogger on the NetMeshBase whose messages you like to track.