Operations on a Graph Database (Part 3 - Types)

March 8th, 2010

So far we have looked at the following graph database operations:

We need to talk about properties, but before we can do that, we need to talk about types. In a previous post we looked at the various alternatives for typing and their pros and cons. Graph databases may take the two following approaches to typing, and anything in between:

  • No types at all: nodes, edges and properties have no types; anything goes.
  • Fully typed: nodes, edges and properties are all strongly typed; the graph database refuses to perform an operation that is not consistent with the type system.

There are some graph databases on the market that require types for some elements (edges, for example) but don’t allow types for others (nodes, or properties), so hybrids are possible.

A totally untyped graph database has it easy: it allows an arbitrary number of properties, distinguished only by name and with arbitrary values. So the operations are:

Defined on Node and/or Edge:

setProperty( String label, Object value );  // returns nothing
getProperty( String label );                // returns the value
getAllProperties();                         // returns a set of all current property labels

A fully typed graph database needs to offer a much larger set of operations. They are:

Defined on Node:

bless( NodeType type );   // returns nothing
unbless( NodeType type ); // returns nothing
getTypes();               // returns a set of all current NodeTypes
setProperty( PropertyType type, Object value ); // returns nothing
getProperty( PropertyType type );               // returns the value of the Property

Defined on Edge:

bless( EdgeType type );   // returns nothing
unbless( EdgeType type ); // returns nothing
getTypes();               // returns a set of all current EdgeTypes
setProperty( PropertyType type, Object value ); // returns nothing
getProperty( PropertyType type );               // returns the value of the Property

To determine the available properties, one examines the meta-data available in the type system:

Defined on NodeType and on EdgeType

getPropertyTypes();   // returns a set of all possible properties defined by the type

The abstract API I’m listing here is the general case for a fully dynamically-typed hypothetical graph database. No product that I’m aware of implements them all, but InfoGrid comes close. If you read the JavaDoc for MeshObject, for example (MeshObject is InfoGrid’s name for Node), you will find all of the above Node operations. The operations are also there for Relationships (InfoGrid’s name for Edge) except that for efficiency reasons 1) they are defined on MeshObject, and 2) InfoGrid does not support properties on edges. (InfoGrid V1 used to, but we decided to eliminate that support; will talk about that some other time).

If a graph database was fully typed but using a static type system, we obviously could not bless either nodes or edges with new types at run-time as this API defines. (InfoGrid V1 used to be that way; InfoGrid V2 is fully dynamic).

In the next part, we’ll talk about properties then.

0 comments.

Three’s a Crowd: Neo4j, Sones, Filament all implement InfoGrid’s FirstStep Example

March 4th, 2010

Little did I know when I put up InfoGrid’s FirstStep example. The example creates just a few nodes and a few edges to show, in principle, how to build a URL tagging application based on a graph database like InfoGrid.

Alex Popescu at MyNoSQL challenged the Neo4j folks how they would implement it, and they responded promptly. Then, the guys are Sones implemented the same example themselves, and just now the Filament project did the same. Worth a blog post with the links!

Here they are:

for your reading and comparing pleasure.

I’m tempted to list my own observations, but I’d like to avoid a blogging contest in which — naturally — everybody will claim “but the way we do it is better”. Independent reviews anybody?

0 comments.

Strong and Weak Typing With Graph Databases

March 4th, 2010

Whether programming systems should be strongly typed or weakly typed has been one of the longest-running controversies in the history of computer science going back something like 50 years. Generally speaking, strongly typed systems tend to require more programmer effort up-front, in exchange for earlier or more definite error reports.

We also need to distinguish between static typing and dynamic typing: a dynamically typed system enables changes of types at run-time, while a statically typed system can’t do that.

Not surprisingly, typing for graph databases (or any other kind of NoSQL database) can be implemented in different ways, too:

Weakly typed Strongly typed
Dynamically typed

At development time: types may be declared but are not checked except perhaps rudimentarily.

At run-time: errors may occur, which may or may not be discovered; mis-interpretations of data are possible; data corruption is likely in case of programming errors.

At development time: types are declared and checked as well as possible.

At run-time: all operations are checked for type safety; types can be discovered dynamically; type mis-interpretations are not possible.

Statically typed

At development time: only rudimentary checking, if at all

At run-time: errors may occur, which may or may not be discovered; mis-interpretations of data are possible; data corruption is likely in case of programming errors.

At development time: all type errors are caught; additional developer effort is required; some types of data are hard to represent

At run-time: no checking required due to “correctness by construction”.

Let’s insert some systems into this table:

Weakly typed Strongly typed
Dynamically typed Most NoSQL systems InfoGrid
Statically typed SQL database (if used as intended)

Side note: when NoSQL proponents argue that weakly typed systems are much better than stronger-typed SQL, they sometimes throw out the baby with the bath water: there are four choices, not two. We agree that statically, strongly typed systems like a typical SQL database has considerable disadvantages in a fast-moving world, but so do weakly typed systems; the only difference is the type of disadvantage. In our view, a strong but dynamic type system is the best compromise for most applications with a non-trivial schema, which is why InfoGrid V2 implements it. (There are some applications that do not require a non-trivial; web caching for example.)

In a graph database like InfoGrid, the following items can be typed:

  • Nodes
  • Edges
  • Properties

In other graph databases, only a subset of these items may be typed. More in the next post on types.

0 comments.

Access Control the InfoGrid Way

March 1st, 2010

We do it very similarly in InfoGrid.

But we can go one big step further: have InfoGrid automatically enforce the access control rules that were set up. If we have the ACL information, why not use it and have the graph database do the enforcement for us? That functionality has been part of InfoGrid for a couple of years.

For some detailed examples how this works, consult the security tests that are part of InfoGrid’s automated test suite (particularly MeshBaseSecurityTest5).

Here’s the basic idea. (again, paraphrasing the code for easier readability. Consult the above link for full code.) When a graph database in InfoGrid is configured to run with a AclBasedAccessManager, we can do this:

First, we need a MeshObject that is going to be the owner of some access-controlled data object. Any MeshObject will do:

MeshObject owner = createMeshObject();

Then, we associate the owner with the current Thread. (Just like in UNIX, where the ownership of processes determines what the process can do)

theAccessManager.setCaller( owner );

Now here comes the access-controlled data object.

MeshObject data = createMeshObject();

Because the current Thread is associated with the owner MeshObject, InfoGrid automatically sets up an ownership relationship between the data object and the owner object — just like in UNIX, a newly created file automatically has an owner.

Going beyond UNIX, we can now put the data object into something we call a ProtectionDomain. It’s basically a collection of MeshObjects that all have the same access control policy. This is mainly for efficiency and easy of management.

MeshObject protectionDomain = createMeshObject( AclBasedSecuritySubjectArea.PROTECTIONDOMAIN );
domain.relateAndBless( AclBasedSecuritySubjectArea.PROTECTIONDOMAIN_GOVERNS_MESHOBJECT.getSource(), dataObject );

Now, let’s give some another entity some access rights to the data object:

MeshObject actorMayReadNotWrite = createMeshObject();
actorMayReadNotWrite.relateAndBless( AclBasedSecuritySubjectArea.MESHOBJECT_HASREADACCESSTO_PROTECTIONDOMAIN.getSource(), domain );

Note that it is the owner of the object that needs to do that; others can’t.

So now we change ownership on the thread.

theAccessManager.setCaller( actorMayReadNotWrite );

This call will succeed:

dataObject.getPropertyValue( <some property type> );

while this call will throw a NotPermittedException:

dataObject.setPropertyValue( <some property type>, <some property value> )

If the thread was currently associated with the owner, both calls would succeed. Again, I refer you to the follow code linked above. As you can say, it works very similar to how permissions work in UNIX, although of course the underlying ACL information is represented as a MeshObjectGraph.

If you like this, we can do even one better: the whole security mechanism is pluggable in InfoGrid. You don’t like the way we represent and enforce ACLs? Be our guest … write a new subclass of AccessManager, and it will work the way you want. (Did we say that InfoGrid is extremely pluggable?)

P.S. It’s great to see that we aren’t the only ones to think that security-related information is an excellent match for a graph database. There’s also the rather intriguing example for where Microsoft is going with their LDAP directory, which very much looks like evolution in the graph direction. Time to get on board graph databases!

0 comments.

Operations on a Graph Database (Part 2 - Edges)

February 25th, 2010

In the first post of this series, we looked at creating and deleting Nodes. Today we are looking at Edges.

Unlike simpler NoSQL data stores like key-value stores, graph databases not only manage nodes, but also edges. Edges are things that connect two other data elements, and graph datastores have them as a basic element managed by the store. Think of them as the line between two boxes; that’s exactly what they are.

Edges often take developers a while to get used to, because there isn’t much precedent in the world of software. Even the so-called “relational database” doesn’t actually have “relationships” as a first-class concept: we have to infer them from primary/foreign key declarations; and that only works if developers actually declare them, which is not all that common.

Edges don’t exist in normal code either. Pretty much all mainstream programming languages only have pointers, not relationships aka edges. Edges are bidirectional, managed things, while pointers are one-directional and not managed at all. Let’s take an example (using a simplified version of the InfoGrid API, see the FirstStep example for complete code of a basic URL tagging application):

MeshObject customer = createMeshObject(); // create first node, called MeshObject in InfoGrid
MeshObject order    = createMeshObject(); // create second node
customer.relate( order );

What did we just do?

We created a customer object, and an order object, and then we said the two are related. (The graph database makes sure the objects get persisted automatically when the Transaction is closed; not shown here as we try to stay on topic.)

If we had to do that in straight Java, we’d do something like this:

Customer customer = new Customer();
Order    order    = new Order();
customer.addOrder( order );
order.setCustomer( customer );

and we’d have to write the code to manage the edge ourselves, such as:

class Customer {
    ...
    private List<Order> ordersOfThisCustomer = new ArrayList<Order>();
}
class Order {
    ...
    private Customer customerPlacingThisOrder;
}

The question is: why do we have to do all this work for a simple 1-N relationship between customers and orders? The graph database API is much better: for one, it lets the database worry about how and when to save and restore the involved objects. It could, for example, (as InfoGrid does), decide to restore from disk the Customer object but not the Order object for some time because the application code does not currently need to know the Customer’s orders. And referential integrity is always guaranteed. For example:

customer.traverseToNeighbors().getSingleMember(); // returns the single Order object
order.traverseToNeighbors().getSingleMember(); // returns the single Customer object

// now we delete the edge
customer.unrelate( order );

customer.traverseToNeighbors().getSingleMember(); // returns null
order.traverseToNeighbors().getSingleMember(); // returns null

If there is no graph database involved, we need to do it manually, like this:

customer.removeOrder( order );
order.setCustomer( null );

… and hope that we don’t forget one of those calls, because then referential integrity would be out the window, and the next application crash is a certainty.

Imagine if we wanted to restore the Customer and the Order object at different times from disk. Without help from sophisticated run-time infrastructure like a graph database, band-aid-technologies such as object-relational mapping is most likely going to create a separate instance for, say, the restored Order object, and code such as List.remove( … ) is not going to work because we have two Java objects in memory that represent the same order. (Shudder.)

Of course, code could be written to manage all of this manually, but it’s much better if the platform takes care of it.

[The astute reader will notice that the plain Java example has one advantage: it provides type safety. I'll have to say more about this in an upcoming post about types.]

So: after working with graph databases for a while, many people believe that edges are actually the much more interesting and useful concept than nodes. Just like many data modelers think that the value of a data model is often more in the way the entities are connected by relationships than the details of the entities. Automatic management of relationships make things simple, and that’s what any good database should do. Developers have enough to worry about, and graph databases provide real help here.

In the next post, we’ll look at types.

0 comments.

Operations on a Graph Database (Part 1 - Nodes)

February 23rd, 2010

Graph databases are still quite unfamiliar to many developers. This is the first post in a series discussing the operations a graph database makes available to the developer. Just like there are only so many different things you can do on a relational database (like CREATE TABLE or INSERT), there are only so many things you can do on a graph database. It is worth looking at them one at a time, and that’s the goal of this series. This first post is on creating and deleting nodes.

To recap, a graph database contains nodes and edges, or MeshObjects and Relationships (as we call them in InfoGrid), or Instances and Links (as the UML would call them), or Resources and Triples (as the semantic web folks would call them), or boxes and arrows (as we draw them on a white board).

Nodes are those objects in a graph database that can stand on their own, they don’t depend on anything else. Edges are those objects that depend on the existence of (typically two) other objects, their source and their destination; we think of edges as connecting nodes.

To create a node in a graph database is one of its basic operations. For example, in InfoGrid, you can simply say:

MeshObject createMeshObject()

and voila, you have one. Similarly, you can delete a node by saying:

deleteMeshObject( MeshObject toDelete )

There are few conditions around those operations, such as that you have to have a transaction open, and that you have to have access rights to actually perform this operation, but that goes without saying.

When deleting a node, the graph database may require you to first delete all edges connected to the node before you get to delete it. Or, it may “ripple delete” all connected edges as part of the delete operation. There are some differences in the various graph database products on this; neither will make much of a difference to the developer.

If the graph database enforces a model (aka schema), as some graph databases do, you may need to make sure you don’t attempt to delete a node in a way that the schema would be violated. For example, if the schema says “an Order must be placed by exactly one Customer”, and you are attempting to delete the node representing the Customer, the graph database may prevent you from doing that as long as there still are nodes representing Order related to the Customer node. We’ll discuss schemas and graph databases in more detail in a later post.

For now, we learned two basic operations on a graph database:

  • create node
  • delete node.

Stay tuned for the next installment.

0 comments.

Comparing FirstStep in InfoGrid and Neo4j

February 22nd, 2010

Alex Popescu has a great comparison how the InfoGrid FirstStep example would look like in Neo4j, another graph database. As I noted in an earlier post, there are far more similarities in our approaches to the basics of graph databases than there are differences.

Couple comments, addressing some of Alex’ notes. He says:

everything in Neo4j must happen inside a transaction even if it’s a graph traversal operation (this gives a very strong Isolation level). The InfoGrid traversal code seem to happen outside the transaction…

That’s correct. You can do the traversal inside a transaction if you like, but you are not required to. This gives application developers one more option for concurrency control: transactions, critical sections, and no protection.

Re InfoGrid terminology, it’s ancient roots are in object modeling (think UML) — for example, we still talk about InfoGrid Models. However, over time it became clear that InfoGrid’s core ideas are far distinct enough to warrant their own terms. So when we moved from InfoGrid V1 to V2 a few years ago, we changed terms. For example, an “instance” (in UML or programming terminology) aka “node” (graph terminology) rarely can have more than one type anywhere other than in InfoGrid. Think of a Java object that has more than one class, and you can dynamically add and remove classes to the instance at run-time. So we call them MeshObjects rather than something people might have the wrong connotations with. The closes we are aware of is Perl’s “bless”, which is why we use that term.

the Neo4j uses also the LuceneIndexService for indexing both the tag and web resources nodes, but that’s only because the code there makes sure not to duplicate either tags or web resources (i.e. this functionality is not present in the InfoGrid code and I don’t know how that would look like)

Correction. You are invited to modify the example and attempt to create a second MeshObject with the same identifier. You can’t (it will throw an exception at you).

But never mind the comparatively minor differences between Neo4j and InfoGrid. We should compare this to all the stuff one would have to do with a relational database to build the same thing. Object-relational mapping anybody? No thanks …

0 comments.

FirstStepWithMySQL

February 17th, 2010

Building on the recent InfoGrid FirstStep example, here is another: FirstStepWithMySQL.

Using the same bookmarking/tagging application as FirstStep, it shows how to persist the same MeshObjectGraph using MySQL as a key-value store. It consists of two apps:

  • the first app initializes the store, and creates a graph of objects
  • the second app retrieves the graph from the store, and traverses the graph to retrieve information.

It’s of course a trivial example, but it illustrates:

  • how easy it is in InfoGrid to keep the same application running against different storage backends with minimal code changes (in the initialization only)
  • some of the advantages of graph databases compared to other types of storage technologies: note how simple it is to traverse the graph in all directions.

Annotated source code is here.

0 comments.

InfoGrid 2.9.3 Released

February 17th, 2010

Available for download here. This is mainly an incremental improvement/bug fix release, except:

  1. new capabilities in the ig-lid project related to what Randy Farmer called the tripartite identity pattern.
  2. new example application: FirstStepWithMySQL (see separate post).

Summary of changes:

  • fixed endless loop when Transaction open at MeshBase die time
  • Moved (Net)MeshObjectIdentifierFactory to .mesh.net packages. That allows is to tighten permissions a bit.
  • Check that localIds are at least 4 chars long. Not having this check created confusing results for users of MeshWorld where the GUI assumes this, but not the graph db.
  • Replaced StringRepresentationParseException with java.text.ParseException in most places.
  • More specific subclasses of LidInvalidCredentialException for better error reporting
  • More resilient when Gpg home dir cannot be created
  • Better TimeStampValue.toString()
  • Make it easier to create “su” Transactions with elevated privileges
  • collect all outgoing data into the same XprisoMessage; this prevents sending more than one message in response to a single incoming message
  • improved abilities to freshen Replicas
  • added ForwardReferenceTest9
  • added DelegatingNetMeshObjectIdentifierFactory as a convenience class
  • Major refactoring in module ig-lid to implement what Randy Farmer called the tripartite identity pattern. Nickname is what he calls Public ID. HasIdentifier is used to represent Login ID. LidPersona represents what he calls Account and also manages the relationships to the other items. There is a corresponding Model.
  • Added identifier-as-entered to LidAuthenticationStatus for better error reporting
  • Re-introduced LidPersona as a major concept
  • TransactionAction now carries a few member variables (MeshBase, MeshBaseLifecycleManager, MeshObjectIdentifierFactory) in order to make the writing of transactional code more concise
  • Split org.infogrid.probe.test into several test modules; makes it better manageable
  • removed a bunch of unnecessary files from ig-vendors; they only take up space and bandwidth
  • fixed multiple ModelBase bug
  • failed to load model under some circumstances when not running under the Module framework
  • added isIdentifiedBy method to org.infogrid.util.HasIdentifier
  • changed MeshObjectSet.contains( MeshObjectIdentifier ) to MeshObjectSet.contains( Identifier )
  • added FirstStepWithMySQL
  • localization for LidAbortProcessingPipelineException

0 comments.

InfoGrid Slides On Slideshare

February 11th, 2010

Slideshare sent this last night:

InfoGrid Core Ideas” is being tweeted more than any other document on SlideShare right now. So we’ve put it on the homepage of SlideShare.net (in the “Hot on Twitter” section).

Well done, you!

- SlideShare Team

Must have been a slow evening ;-)

0 comments.