Required vs. Optional Property Values

InfoGrid distinguishes between properties that must have a non-null value and properties that may or may not be null.

When creating an InfoGrid model, a developer has to specify which is which by using the <isoptional/> tag in the model file.

Why?

By way of parallel, consider the following piece of Java code:

class Foo {
    private int max1 = 10;
    private Integer max2 = 20;

    public void doSomething() {
        for( int i=0 ; i<max1 ; ++i ) {
            //...
        }
        for( int i=0 ; i<max2 ; ++i ) {
            //...
        }
    }
}

Spot the problem? max2 of course might be null, which means our code will throw an exception in the innocent-looking second for loop. To get the code right, we will have to protect that section with an if-then-else section that checks for null first.

Of course, such a protection is often the right thing to do. But in this example, a “max” should hardly ever be null, so using an “int” as the data type, as for max1 (which can’t be null), is much better than using an “Integer”, as for max2 (which may be null).
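The null check described above might look like the following sketch. The class name NullGuardSketch and the methods countIterations and setMax2 are made up for illustration; they only exist to make the effect of the check visible.

```java
class NullGuardSketch {
    private int max1 = 10;        // primitive: can never be null
    private Integer max2 = null;  // wrapper type: may be null

    void setMax2( Integer value ) {
        max2 = value;
    }

    // Returns the total number of loop iterations that were performed.
    int countIterations() {
        int count = 0;
        for( int i=0 ; i<max1 ; ++i ) {
            ++count;
        }
        // Without this check, unboxing a null max2 in the loop condition
        // would throw a NullPointerException.
        if( max2 != null ) {
            for( int i=0 ; i<max2 ; ++i ) {
                ++count;
            }
        } else {
            // Assumption for this sketch: a missing maximum means "skip the loop".
        }
        return count;
    }

    public static void main( String [] args ) {
        NullGuardSketch sketch = new NullGuardSketch();
        System.out.println( sketch.countIterations() ); // max2 is null: only the first loop runs
        sketch.setMax2( 5 );
        System.out.println( sketch.countIterations() ); // both loops run
    }
}
```

Every place that reads max2 needs the same kind of check, which is exactly the clutter that a non-nullable type avoids.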

It’s the same thing for properties in InfoGrid models. Some properties simply should never be null. For example, consider a time stamp indicating when a MeshObject was created. Given that the MeshObject was created, the time stamp must exist, and therefore a null value makes no sense. Such a property would be specified as “mandatory”. By contrast, a time stamp indicating when a MeshObject is likely to become obsolete is very likely optional: we might not know that time (yet), or the MeshObject might never become obsolete, so null values are fine.
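To make the idea concrete, here is a minimal sketch of how a store could enforce the distinction. It is not InfoGrid code: PropertyDef, NodeSketch, MandatoryOptionalDemo and the property names are hypothetical stand-ins, and a real implementation would presumably also give mandatory properties a default value, which this sketch leaves out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a property definition in a model.
final class PropertyDef {
    final String  name;
    final boolean optional;

    PropertyDef( String name, boolean optional ) {
        this.name     = name;
        this.optional = optional;
    }
}

// Hypothetical node that refuses null values for mandatory properties.
class NodeSketch {
    private final Map<PropertyDef,Object> values = new HashMap<>();

    void set( PropertyDef def, Object value ) {
        if( value == null && !def.optional ) {
            throw new IllegalArgumentException(
                    "Property " + def.name + " is mandatory and cannot be null" );
        }
        values.put( def, value );
    }

    Object get( PropertyDef def ) {
        return values.get( def );
    }
}

class MandatoryOptionalDemo {
    static final PropertyDef TIME_CREATED  = new PropertyDef( "TimeCreated",  false );
    static final PropertyDef TIME_OBSOLETE = new PropertyDef( "TimeObsolete", true );

    // Returns true if the node refuses a null value for the given property.
    static boolean rejectsNull( NodeSketch node, PropertyDef def ) {
        try {
            node.set( def, null );
            return false;
        } catch( IllegalArgumentException ex ) {
            return true;
        }
    }

    public static void main( String [] args ) {
        NodeSketch node = new NodeSketch();
        node.set( TIME_CREATED, System.currentTimeMillis() );

        System.out.println( rejectsNull( node, TIME_OBSOLETE )); // optional: null is accepted
        System.out.println( rejectsNull( node, TIME_CREATED ));  // mandatory: null is refused
    }
}
```

Because the check lives in one place, code that reads a mandatory property does not need its own null test.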

If InfoGrid did not distinguish between required and optional values, application code would be littered with unnecessary tests for null values (or, failing that, with unexpected NullPointerExceptions). We think being specific is better when creating the model; higher-quality and less cluttered application code is the reward.

Also check out the following related posts:

Operations on a Graph Database (Part 7 – Sets)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Sets are a core concept of most databases. For example, any SQL SELECT statement in a relational database produces a set. Sets apply to Graph Databases just as well and are just as useful:

The most frequently encountered set of nodes in a Graph Database is the result of a traversal. For example, in InfoGrid, all traversal operations result in a set like this:

MeshObject    startNode     = ...; // some start node
MeshObjectSet neighborNodes = startNode.traverseToNeighbors();

We might as well have returned an array, or an Iterator over the members of the set, were it not for the fact that there are well-understood set operations that often make our jobs as developers much simpler, such as union, intersection and so forth.

For example, in a social bookmarking application we might want to find out which sites both you and I have bookmarked. Code might look like this:

MeshObject me  = ...; // node representing me
MeshObject you = ...; // node representing you

TraversalSpecification ME_TO_BOOKMARKS_SPEC = ...;
    // how to get from a person to their bookmarks, see post on traversals
MeshObjectSet myBookmarks   = me.traverse( ME_TO_BOOKMARKS_SPEC );
MeshObjectSet yourBookmarks = you.traverse( ME_TO_BOOKMARKS_SPEC );

// Bookmarks that you and I share
MeshObjectSet sharedBookmarks = myBookmarks.intersect( yourBookmarks );

Notice how simple this code is to understand? One of the powers of sets. Or, if you know what a “minus” operation is on a set, this is immediately obvious:

// Bookmarks unique to me
MeshObjectSet myUniqueBookmarks = myBookmarks.minus( yourBookmarks );

This is clearly much simpler than writing imperative code which would have lots of loops and if/then/else’s and comparisons and perhaps indexes in it. (And seeing this might put some concerns to rest that NoSQL databases are primitive because they don’t have a SQL-like query language. I’d argue it’s less the language but the power of sets, and if you have sets you have a lot of power at your fingertips.)
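For readers without InfoGrid at hand, the same two operations can be sketched with plain java.util sets. The helper names intersect and minus mirror the calls above, but the class SetOperationsSketch and the example URLs are illustrative only; this is not how MeshObjectSet is implemented.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class SetOperationsSketch {
    // Elements contained in both a and b; neither input is modified.
    static <T> Set<T> intersect( Set<T> a, Set<T> b ) {
        Set<T> result = new LinkedHashSet<>( a );
        result.retainAll( b );
        return result;
    }

    // Elements of a that are not contained in b; neither input is modified.
    static <T> Set<T> minus( Set<T> a, Set<T> b ) {
        Set<T> result = new LinkedHashSet<>( a );
        result.removeAll( b );
        return result;
    }

    public static void main( String [] args ) {
        Set<String> myBookmarks = new LinkedHashSet<>( Arrays.asList(
                "http://example.com/a", "http://example.com/b" ));
        Set<String> yourBookmarks = new LinkedHashSet<>( Arrays.asList(
                "http://example.com/b", "http://example.com/c" ));

        System.out.println( intersect( myBookmarks, yourBookmarks )); // bookmarks we share
        System.out.println( minus( myBookmarks, yourBookmarks ));     // bookmarks unique to me
    }
}
```

Both helpers copy the first set before changing it, so the original sets can be reused for further operations.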

To check out sets in InfoGrid, try package org.infogrid.mesh.set. Clearly much more can be done than we have so far in InfoGrid, but it’s a very useful start in our experience.

Operations on a Graph Database (Part 5 – Identifiers)

Well, “identifiers” aren’t much of an “operation”, but there are some operations related to identifiers, thus the title.

All first-class objects in a graph database typically have a unique identifier. This means nodes have unique identifiers, and for those graph databases that represent edges as distinct objects (see previous discussion on the pros and cons), they have unique identifiers, too.

This means we can ask a node for its identifier, remember the identifier, and later find the node again by looking it up in the graph database. In InfoGrid, this looks as follows:

MeshObject someNode = ...; // some MeshObject aka Node
MeshObjectIdentifier id = someNode.getIdentifier();

and later we can do this:

MeshBase mb = ...; // some MeshBase
MeshObject nodeFoundAgain = mb.findMeshObjectByIdentifier( id );

As you can see, InfoGrid uses an abstract data type called MeshObjectIdentifier, which you can think of as a String for now (see below). In InfoGrid, all identifiers are long-lasting. This means your object will still have the same MeshObjectIdentifier after you have rebooted your database. This has some advantages, e.g. you can define well-known objects in your graph database to which you can easily return even weeks later.

Other graph databases may use different data types as identifiers (e.g. int or long), but looking up nodes by identifier, as in the operations above, is common to them. Their identifiers may or may not be the same after the database has been rebooted.

Why does the type of identifier matter? Well, it depends on the application you have in mind. For InfoGrid applications, we primarily care about web applications, specifically REST-ful web applications. And so InfoGrid web applications generally use MeshObjectIdentifiers that are identical to the URLs of the application. Let’s take an example:

Assume you have a URL bookmarking application which runs at http://example.com/. Let’s say a user creates tag “books”, which can be found at URL http://example.com/books/. It would be most straightforward to create a MeshObject with MeshObjectIdentifier http://example.com/books/. Which is exactly what InfoGrid does by default. No impedance mismatch between URLs that the user sees, the objects in the application layer, and the database! This leads to dramatic simplification of development and debugging.
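The lookup-by-identifier pattern, with a URL serving directly as the identifier, can be sketched without any graph database at all. The Node and Store classes below are hypothetical in-memory stand-ins, not MeshObject or MeshBase, and the identifier is a plain String here rather than a MeshObjectIdentifier.

```java
import java.util.HashMap;
import java.util.Map;

class IdentifierLookupSketch {
    // Hypothetical node type that knows its own identifier.
    static final class Node {
        private final String identifier;

        Node( String identifier ) {
            this.identifier = identifier;
        }

        String getIdentifier() {
            return identifier;
        }
    }

    // Hypothetical in-memory store, standing in for a graph database.
    static final class Store {
        private final Map<String,Node> nodes = new HashMap<>();

        // Identifiers must be unique, so a second node with the same one is refused.
        Node create( String identifier ) {
            if( nodes.containsKey( identifier )) {
                throw new IllegalArgumentException( "Identifier already in use: " + identifier );
            }
            Node node = new Node( identifier );
            nodes.put( identifier, node );
            return node;
        }

        // Returns null if no node has this identifier.
        Node findByIdentifier( String identifier ) {
            return nodes.get( identifier );
        }
    }

    public static void main( String [] args ) {
        Store store = new Store();

        // The URL the user sees doubles as the identifier of the node.
        Node   tag = store.create( "http://example.com/books/" );
        String id  = tag.getIdentifier();

        Node foundAgain = store.findByIdentifier( id );
        System.out.println( foundAgain == tag );
    }
}
```

With this arrangement, a web request for a URL can be resolved with a single lookup and no translation table between URLs and database keys.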

Operations on a Graph Database (Part 4 – Properties)

Today we’re looking at properties. There are a few different philosophies that a graph database might employ.

1. The purists often argue that properties aren’t needed at all: all properties can be modeled as edges to separate nodes, each of which represents a value. That’s of course true at some level: instead of a node representing a person, for example, that “contains” the person’s FirstName, LastName and DateOfBirth, one could create two String nodes and a TimeStamp node, and connect them with edges representing “first name”, “last name” and “date of birth”.

The non-purists counter that for practical purposes, it is much simpler to think of these data elements as properties instead of as independent things that are related. For example, it makes deletion of the Person much simpler (and we don’t need to implement cascading delete rules). Also, there are performance tradeoffs: if the three properties are stored with their owning node, for example, a single read is required to restore from disk the node and all of its properties. This would require at least 4 (perhaps 7, depending on how edges are stored in the graph database) reads if stored independently.

In InfoGrid, we believe in properties. We don’t prevent anybody from creating as many edges as they like, of course, but think that properties definitely have their uses.

2. Properties have to be named in some fashion, and the simplest approach — used by a number of graph database projects — is to give them a String label as a name. Correspondingly, the essence of the property API using Strings as labels would look like this:

public Object getPropertyValue( String name );
public void setPropertyValue( String name, Object value );

The advantage of this model is obviously that it is very simple. The disadvantage is that for complex schemas or models created by multiple development teams, name conflicts and spelling errors for property names occur more frequently than one would like. At least that is our experience when building InfoGrid applications, which is why we prefer the next alternative:

3. Properties are identified by true meta-data objects. We call them PropertyTypes, and they are part of what developers define when defining an InfoGrid Model. So the InfoGrid property API looks like this:

public Object getPropertyValue( PropertyType name );
public void setPropertyValue( PropertyType name, Object value );

We’ll have more to say on the subject of meta-data and Models in a future post.
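The practical difference between alternatives 2 and 3 shows up as soon as a property name is misspelled. The sketch below contrasts the two; LabeledNode, TypedNode and PropType are hypothetical stand-ins written for this example, not InfoGrid classes.

```java
import java.util.HashMap;
import java.util.Map;

class PropertyKeySketch {
    // Alternative 2: properties identified by String labels.
    static final class LabeledNode {
        private final Map<String,Object> values = new HashMap<>();

        Object getPropertyValue( String name ) {
            return values.get( name );
        }
        void setPropertyValue( String name, Object value ) {
            values.put( name, value );
        }
    }

    // Hypothetical stand-in for a property type that is defined once, in the model.
    static final class PropType {
        final String name;

        PropType( String name ) {
            this.name = name;
        }
    }

    // Alternative 3: properties identified by meta-data objects.
    static final class TypedNode {
        private final Map<PropType,Object> values = new HashMap<>();

        Object getPropertyValue( PropType type ) {
            return values.get( type );
        }
        void setPropertyValue( PropType type, Object value ) {
            values.put( type, value );
        }
    }

    static final PropType FIRST_NAME = new PropType( "FirstName" );

    public static void main( String [] args ) {
        LabeledNode a = new LabeledNode();
        a.setPropertyValue( "FirstName", "Ada" );
        // The misspelled label compiles fine and silently finds nothing.
        System.out.println( a.getPropertyValue( "Firstname" ));

        TypedNode b = new TypedNode();
        b.setPropertyValue( FIRST_NAME, "Ada" );
        // A misspelled constant such as FIRST_NAEM would be rejected by the compiler.
        System.out.println( b.getPropertyValue( FIRST_NAME ));
    }
}
```

With String labels the mistake only surfaces at run time, as a missing value; with meta-data objects it surfaces at compile time.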

Finally, we need to discuss what in a graph database can carry properties. Everybody other than the purists (see above) agrees that nodes (called MeshObjects in InfoGrid) can carry properties. Some graph database projects (like the now-obsolete InfoGrid V1) also allow properties on edges (called Relationships in InfoGrid). Others (InfoGrid today) do not allow that.

It may sound peculiar that we had what looks like a more powerful approach in an earlier InfoGrid version but not any more. Here is what we observed in our practice with InfoGrid:

  • Properties on edges are fairly rare compared to Properties on nodes. We’ve been involved in several projects over the years where the Models were substantial and not a single property was found on any edge; nor did anybody ask for one.
  • If a property is needed on an edge, there is an easy workaround known as “associative entity” in data modeling circles: simply create an intermediary node that carries the property.
  • The deciding factor was performance: if properties are rarely needed on edges, it is possible to traverse from one node to a neighbor node in a single step. If properties are needed on edges, the edge needs to be represented as a separate object, and a traversal from one node to its neighbor requires two steps: from the start node to the connecting edge, and from the edge to the destination node. So not having properties on edges can improve performance by 100%. Which is why we got rid of them for InfoGrid V2.
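The “associative entity” workaround from the second bullet can be sketched as follows. The Node class, the relateWithProperty helper and the example data are made up for illustration; edges in this sketch are simple, undirected and carry no data, which is an assumption of the sketch rather than a description of InfoGrid Relationships.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class AssociativeEntitySketch {
    // Hypothetical node: carries properties and plain, property-less edges.
    static final class Node {
        final String             label;
        final Map<String,Object> properties = new HashMap<>();
        final Set<Node>          neighbors  = new LinkedHashSet<>();

        Node( String label ) {
            this.label = label;
        }

        // Edges are undirected here and carry no data of their own.
        void relate( Node other ) {
            neighbors.add( other );
            other.neighbors.add( this );
        }
    }

    // Connects a and b through an intermediary node that holds the "edge" property.
    static Node relateWithProperty( Node a, Node b, String key, Object value ) {
        Node link = new Node( a.label + "-" + b.label );
        link.properties.put( key, value );
        a.relate( link );
        link.relate( b );
        return link;
    }

    public static void main( String [] args ) {
        Node alice = new Node( "alice" );
        Node acme  = new Node( "acme" );

        // "alice works at acme since 2009": the year lives on the intermediary node.
        Node employment = relateWithProperty( alice, acme, "since", 2009 );

        System.out.println( employment.properties.get( "since" ));
        System.out.println( alice.neighbors.contains( acme )); // no direct edge between the two
    }
}
```

Only relationships that actually need a property pay for the extra hop through the intermediary node; all other traversals remain single-step.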

In the next post, we will look at data types for properties.