Operations on a Graph Database (Part 7 – Sets)

Graph Database Tutorial

Part 1: Nodes

Part 2: Edges

Part 3: Types

Part 4: Properties

Part 5: Identifiers

Part 6: Traversals

Part 7: Sets

Part 8: Events

Sets are a core concept of most databases. For example, any SQL SELECT statement in a relational database produces a set. Sets apply to Graph Databases just as well and are just as useful:

The most frequently encountered set of nodes in a Graph Database is the result of a traversal. For example, in InfoGrid, all traversal operations result in a set like this:

MeshObject    startNode     = ...; // some start node
MeshObjectSet neighborNodes = startNode.traverseToNeighbors();

We could just as well have returned an array, or an Iterator over the members of the set, were it not for the well-understood set operations that often make our jobs as developers much simpler, such as union, intersection and so forth.

For example, in a social bookmarking application we might want to find out which sites both you and I have bookmarked. Code might look like this:

MeshObject me  = ...; // node representing me
MeshObject you = ...; // node representing you

TraversalSpecification ME_TO_BOOKMARKS_SPEC = ...;
    // how to get from a person to their bookmarks, see post on traversals
MeshObjectSet myBookmarks   = me.traverse( ME_TO_BOOKMARKS_SPEC );
MeshObjectSet yourBookmarks = you.traverse( ME_TO_BOOKMARKS_SPEC );

// Bookmarks that you and I share
MeshObjectSet sharedBookmarks = myBookmarks.intersect( yourBookmarks );

Notice how simple this code is to understand? That is one of the powers of sets. And if you know what a “minus” operation on a set is (everything in the first set that is not also in the second), this is immediately obvious:

// Bookmarks unique to me
MeshObjectSet myUniqueBookmarks = myBookmarks.minus( yourBookmarks );

This is clearly much simpler than writing imperative code, which would have lots of loops, if/then/else statements, comparisons and perhaps indexes in it. (And seeing this might put to rest some concerns that NoSQL databases are primitive because they don’t have a SQL-like query language. I’d argue it’s less the language than the power of sets, and if you have sets you have a lot of power at your fingertips.)
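
To make the comparison concrete without relying on InfoGrid, here is a minimal sketch of the same two operations on plain java.util.Set. The class name BookmarkSets and the bookmark data are made up for illustration; this is not how MeshObjectSet is implemented, only the same idea in the standard library.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only: java.util.Set, not InfoGrid's MeshObjectSet.
class BookmarkSets {

    // Intersection: the elements present in both sets.
    static <T> Set<T> intersect(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a); // copy, so the inputs stay untouched
        result.retainAll(b);
        return result;
    }

    // Difference ("minus"): the elements of a that are not in b.
    static <T> Set<T> minus(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical bookmark data
        Set<String> mine  = new LinkedHashSet<>(List.of("example.com", "example.org", "example.net"));
        Set<String> yours = new LinkedHashSet<>(List.of("example.org", "example.net", "example.edu"));

        System.out.println("shared:       " + intersect(mine, yours));
        System.out.println("unique to me: " + minus(mine, yours));
    }
}
```

The loops and comparisons have not disappeared; they have moved into retainAll and removeAll, which is exactly the point: the caller only states which set is wanted.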

To check out sets in InfoGrid, try package org.infogrid.mesh.set. Clearly much more can be done than we have so far in InfoGrid, but it’s a very useful start in our experience.

Welcome FlockDB

Nick Kallen at Twitter last night released FlockDB, Twitter’s social graph database. Source code is here.

If anybody doubted that graph databases are real, or are useful, this release is yet another good reason to investigate. Welcome FlockDB to the crowd.

I haven’t had time to take a detailed look, but it appears that FlockDB has a hard-coded schema developed specifically for the needs at Twitter. That makes a lot of sense for Twitter but less so as a general-purpose graph database. On the other hand, lots of people could probably benefit from that schema when building social applications. We’ll see.

So, welcome.

Big Data Workshop April 23, Mountain View, CA

I’m planning to be at Big Data Workshop, the first unconference on NoSQL and Big Data. If past events moderated by Kaliya Hamlin are any guide, it will be a great opportunity for everybody:

  • to explore together how the Big Data market will be coming together
  • to understand how the key technologies and projects work
  • to learn what interfaces and interoperability standards are emerging and/or needed
  • to work out how we can grow the overall market and make it easier for everybody to adopt these technologies for interesting new projects.

Arguably, Internet Identity Workshop (also moderated by Kaliya) was the enabler for the stunning adoption rate over the past five years of OpenID, OAuth and related technologies (at last count, more than 1 billion enabled accounts). I hope history repeats itself here.

P.S. Feel free to corner me on InfoGrid, graph databases or any other subject. That’s the whole point of an unconference.

Three’s a Crowd: Neo4j, Sones, Filament all implement InfoGrid’s FirstStep Example

Little did I know when I put up InfoGrid’s FirstStep example. The example creates just a few nodes and a few edges to show, in principle, how to build a URL tagging application based on a graph database like InfoGrid.

Alex Popescu at MyNoSQL challenged the Neo4j folks to show how they would implement it, and they responded promptly. Then the guys at Sones implemented the same example themselves, and just now the Filament project did the same. Worth a blog post with the links!

Here they are:

for your reading and comparing pleasure.

I’m tempted to list my own observations, but I’d like to avoid a blogging contest in which — naturally — everybody will claim “but the way we do it is better”. Independent reviews anybody?

Comparing FirstStep in InfoGrid and Neo4j

Alex Popescu has a great comparison of how the InfoGrid FirstStep example would look in Neo4j, another graph database. As I noted in an earlier post, there are far more similarities in our approaches to the basics of graph databases than there are differences.

A couple of comments, addressing some of Alex’s notes. He says:

everything in Neo4j must happen inside a transaction even if it’s a graph traversal operation (this gives a very strong Isolation level). The InfoGrid traversal code seem to happen outside the transaction…

That’s correct. You can do the traversal inside a transaction if you like, but you are not required to. This gives application developers one more option for concurrency control: transactions, critical sections, and no protection.

Re InfoGrid terminology: its ancient roots are in object modeling (think UML); for example, we still talk about InfoGrid Models. However, over time it became clear that InfoGrid’s core ideas are distinct enough to warrant their own terms. So when we moved from InfoGrid V1 to V2 a few years ago, we changed terms. For example, an “instance” (in UML or programming terminology), aka “node” (graph terminology), can rarely have more than one type anywhere other than in InfoGrid. Think of a Java object that has more than one class, where you can dynamically add and remove classes to and from the instance at run-time. So we call them MeshObjects rather than something that might carry the wrong connotations. The closest concept we are aware of is Perl’s “bless”, which is why we use that term.
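
To illustrate what “more than one type, changeable at run-time” means, here is a toy sketch. TypedNode, its methods and the use of plain strings as types are assumptions made for this illustration; it is not InfoGrid’s MeshObject, only the idea in miniature.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model only, NOT InfoGrid's MeshObject: an instance whose set of
// types can grow and shrink while the program runs.
class TypedNode {
    private final String identifier;
    private final Set<String> types = new LinkedHashSet<>(); // types as plain strings (assumption)

    TypedNode(String identifier) {
        this.identifier = identifier;
    }

    // Add a type to this instance; the name follows Perl's "bless".
    void bless(String type) {
        if (!types.add(type)) {
            throw new IllegalStateException(identifier + " is already blessed with " + type);
        }
    }

    // Remove a type from this instance again.
    void unbless(String type) {
        if (!types.remove(type)) {
            throw new IllegalStateException(identifier + " is not blessed with " + type);
        }
    }

    boolean isBlessedBy(String type) {
        return types.contains(type);
    }

    Set<String> getTypes() {
        return Collections.unmodifiableSet(types);
    }

    public static void main(String[] args) {
        TypedNode node = new TypedNode("example.com/jane"); // hypothetical identifier
        node.bless("Person");
        node.bless("Customer");   // same instance, second type
        node.unbless("Customer"); // and the type is gone again
        System.out.println(node.getTypes());
    }
}
```

A Java object cannot do this with its own class, which is why the sketch has to keep the types in a separate collection.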

the Neo4j uses also the LuceneIndexService for indexing both the tag and web resources nodes, but that’s only because the code there makes sure not to duplicate either tags or web resources (i.e. this functionality is not present in the InfoGrid code and I don’t know how that would look like)

Correction. You are invited to modify the example and attempt to create a second MeshObject with the same identifier. You can’t (it will throw an exception at you).
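
The behavior is easy to picture with a toy in-memory store. The sketch below is only an illustration under assumed names (UniqueNodeStore, createNode, IdentifierNotUniqueException); it is not InfoGrid’s API, and it says nothing about how InfoGrid enforces the rule internally.

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory node store with hypothetical names; not InfoGrid's MeshBase API.
class UniqueNodeStore {

    // Thrown when somebody tries to reuse an identifier.
    static class IdentifierNotUniqueException extends RuntimeException {
        IdentifierNotUniqueException(String identifier) {
            super("Identifier already in use: " + identifier);
        }
    }

    // identifier -> properties of the node
    private final Map<String, Map<String, Object>> nodes = new HashMap<>();

    // Creates a node under the given identifier, or throws if one exists already.
    Map<String, Object> createNode(String identifier) {
        Map<String, Object> node = new HashMap<>();
        if (nodes.putIfAbsent(identifier, node) != null) {
            throw new IdentifierNotUniqueException(identifier);
        }
        return node;
    }

    int size() {
        return nodes.size();
    }

    public static void main(String[] args) {
        UniqueNodeStore store = new UniqueNodeStore();
        store.createNode("http://example.com/"); // hypothetical identifier
        try {
            store.createNode("http://example.com/"); // second attempt, same identifier
        } catch (IdentifierNotUniqueException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
```

With a check like this in the store itself, the application needs no separate index just to avoid duplicates.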

But never mind the comparatively minor differences between Neo4j and InfoGrid. We should compare this to all the stuff one would have to do with a relational database to build the same thing. Object-relational mapping anybody? No thanks …

InfoGrid 2.9.3 Released

Available for download here. This is mainly an incremental improvement/bug fix release, except:

  1. new capabilities in the ig-lid project related to what Randy Farmer called the tripartite identity pattern.
  2. new example application: FirstStepWithMySQL (see separate post).

Summary of changes:

  • fixed endless loop when Transaction open at MeshBase die time
  • Moved (Net)MeshObjectIdentifierFactory to .mesh.net packages. That allows us to tighten permissions a bit.
  • Check that localIds are at least 4 chars long. Not having this check created confusing results for users of MeshWorld where the GUI assumes this, but not the graph db.
  • Replaced StringRepresentationParseException with java.text.ParseException in most places.
  • More specific subclasses of LidInvalidCredentialException for better error reporting
  • More resilient when Gpg home dir cannot be created
  • Better TimeStampValue.toString()
  • Make it easier to create “su” Transactions with elevated privileges
  • collect all outgoing data into the same XprisoMessage; this prevents sending more than one message in response to a single incoming message
  • improved abilities to freshen Replicas
  • added ForwardReferenceTest9
  • added DelegatingNetMeshObjectIdentifierFactory as a convenience class
  • Major refactoring in module ig-lid to implement what Randy Farmer called the tripartite identity pattern. Nickname is what he calls Public ID. HasIdentifier is used to represent Login ID. LidPersona represents what he calls Account and also manages the relationships to the other items. There is a corresponding Model.
  • Added identifier-as-entered to LidAuthenticationStatus for better error reporting
  • Re-introduced LidPersona as a major concept
  • TransactionAction now carries a few member variables (MeshBase, MeshBaseLifecycleManager, MeshObjectIdentifierFactory) in order to make the writing of transactional code more concise
  • Split org.infogrid.probe.test into several test modules; makes it more manageable
  • removed a bunch of unnecessary files from ig-vendors; they only take up space and bandwidth
  • fixed multiple ModelBase bug
  • fixed: model failed to load under some circumstances when not running under the Module framework
  • added isIdentifiedBy method to org.infogrid.util.HasIdentifier
  • changed MeshObjectSet.contains( MeshObjectIdentifier ) to MeshObjectSet.contains( Identifier )
  • added FirstStepWithMySQL
  • localization for LidAbortProcessingPipelineException
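
For readers who have not met the tripartite identity pattern mentioned in the list above, here is a rough sketch of the three roles it separates. The class and field names are assumptions for illustration only; they are not the ig-lid classes (Nickname, HasIdentifier, LidPersona) named in the release notes.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: names are assumptions, not the ig-lid API.
class Account {
    private final String accountId;                              // internal key; never typed in or displayed
    private final Set<String> loginIds = new LinkedHashSet<>();  // private; what the user authenticates with
    private String publicId;                                     // nickname; what other users get to see

    Account(String accountId, String publicId) {
        this.accountId = accountId;
        this.publicId = publicId;
    }

    void addLoginId(String loginId) {
        loginIds.add(loginId);
    }

    boolean canLogInWith(String loginId) {
        return loginIds.contains(loginId);
    }

    void setPublicId(String publicId) {
        this.publicId = publicId;
    }

    String getPublicId() {
        return publicId;
    }

    String getAccountId() {
        return accountId;
    }

    public static void main(String[] args) {
        Account account = new Account("acct-0001", "jane"); // hypothetical values
        account.addLoginId("jane@example.com");
        account.addLoginId("https://example.org/jane");

        account.setPublicId("jane_d"); // changing the nickname leaves account and logins alone
        System.out.println(account.getAccountId() + " is shown as " + account.getPublicId());
    }
}
```

The point of the separation is that a login identifier or a nickname can change without touching the account that everything else refers to.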

The FirstStep Example

The new FirstStep example application allows you to get an InfoGrid application running literally in 60 seconds or less.

FirstStep shows the essence of how a tagging application like delicious would be implemented using InfoGrid.

Instructions and annotated source code are here: http://infogrid.org/wiki/Examples/FirstStep.

First Academic Workshop on Graph Databases: in China

This is remarkable.

Sooner or later, somebody had to organize an “international workshop on graph databases” in an academic setting. It happens with all technologies. So it’s not surprising that there will be one in July this year, in conjunction with a conference and with the proceedings published by Springer, just like you would expect.

It is surprising that the workshop is organized in China, by Chinese researchers. That’s a first, at least from what I have seen so far. Usually you would expect something in, say, Florida, or the south of France or perhaps Spain or Germany. But it is China.

Way to go! Some people are faster to spot a trend than others, and as an entrepreneur, I admire that.

InfoGrid 2.9.1 Released

This is an incremental release focusing on bug fixes and minor enhancements that make life easier for the developer. To download, go to http://infogrid.org/wiki/Docs/Downloads.

Summary of changes:

  • support for reverse proxies (e.g. Apache in front of Tomcat) with corresponding changes of http/s, port, host and path
  • CommandLineBootLoader now deactivates Modules
  • allow ResourceHelper initialization from Module initialization
  • added traceMethodCallExit with return value to Log
  • added identifierSuffix to enable giving an LDAP domain for authentication
  • attempt LDAP reconnect when communication exception
  • many LID/OpenID fixes and improvements, including ability to run behind reverse Proxy
  • all JDBC and database names all-lowercase; too many funny issues on some platforms
  • LidProcessingPipeline sets request attributes org_infogrid_lid_RequestingClient and org_infogrid_lid_RequestedResource instead of something less straightforward
  • support for page-wise scrolling in Viewlets, e.g. the AllMeshObjectsViewlet
  • added InstrumentedThread.advanceTo with a timeout
  • Simplified TransactionAction Exception signatures to make invocation code shorter
  • Split org.infogrid.mesh.TypedMeshObjectFacade into an interface and a class — all generated interfaces now inherit from org.infogrid.mesh.TypedMeshObjectFacade
  • Treat localhost as resolvable global identifier
  • Derive theDefaultMeshBaseIdentifier from first incoming HTTP request if not given in web.xml
  • Less of the mystifying error messages when undeploying on Tomcat
  • added ProbeUpdateSpecification_LastRunUsedProbeClass to Probe model
  • Allow custom HostnameVerifier to deal with non-official SSL certs
  • removed Test model from MeshWorld, NetMeshWorld
  • added Tagging model to MeshWorld, NetMeshWorld
  • changed cookie value encoding
  • clean up cookie values that may have been double-quoted
  • renamed LID cookies to use hyphens not periods, makes interop with PHP easier (which incomprehensibly changes all periods in cookie names to underscores)
  • refactored and expanded session management and related code in LID modules
  • Generic mechanism to add HTTP headers instead of one limited to Yadis
  • moved default gpg working directory to /var/run
  • made SaneRequest more sane: separate URL and POST’d arguments cleanly.
  • added SaneRequest.getAbsoluteFullUriWithoutMatchingArguments
  • moved to NetBeans 2.8
  • various bug fixes, including those found in a static analysis run
  • more tests
  • some API extensions
  • some improved formatting in HTML output

Jonathan Ellis: The NoSQL Ecosystem

Excellent article by Jonathan Ellis on the various approaches to non-relational databases in the market today. He categorizes products and projects along three dimensions:

  1. scalability, in particular how well one can add and remove servers in local or remote data centers
  2. data and query model. He finds a lot of variety there.
  3. persistence design. Alternatives range from in-memory only to smart caching strategies to on-disk.

This categorization is really useful, and more useful than several other categorizations that have been proposed.

Let’s apply this to InfoGrid’s graph database layer:

Re scalability, InfoGrid scales as well as the underlying persistence layer. InfoGrid makes storage pluggable by delegating to the Store abstraction, and Store can be implemented on top of any key-value store. So InfoGrid is just as scalable as the underlying Store.

InfoGrid’s data and query model is based on a graph and an explicit object model. This makes life even easier for the developer than any of the alternatives he discusses in his article. Also, we think our traversal API is a lot simpler than some others that we have seen.

InfoGrid’s persistence design actually gives developers more choices than is typical: InfoGrid can be entirely in memory (if class MMeshBase is instantiated, for example), or smartly cached to an external Store (if class StoreMeshBase is instantiated). Most importantly: the API that developers write to is the same. This allows developers to write application code once, and only later decide how to store their application data. Or, if one kind of Store does not work out (or does not scale once the application becomes popular), move to another without changing the application (other than the initialization).
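
The “same API, different storage” idea can be sketched in a few lines of plain Java. Everything named here (NodeBase, KeyValueBackend, the two implementations, the write-through caching) is a simplified assumption for illustration; InfoGrid’s real counterparts are MeshBase, MMeshBase, StoreMeshBase and Store, and they do considerably more.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; a sketch of the idea, not InfoGrid's API.
class PluggableStorageDemo {

    // The one API the application writes to.
    interface NodeBase {
        void put(String identifier, String value);
        String get(String identifier); // null if absent
    }

    // Minimal contract for an external key-value store.
    interface KeyValueBackend {
        void write(String key, String value);
        String read(String key); // null if absent
    }

    // Variant 1: everything stays in memory.
    static class InMemoryNodeBase implements NodeBase {
        private final Map<String, String> nodes = new HashMap<>();
        public void put(String identifier, String value) { nodes.put(identifier, value); }
        public String get(String identifier) { return nodes.get(identifier); }
    }

    // Variant 2: a simple write-through cache in front of an external backend.
    static class CachingNodeBase implements NodeBase {
        private final Map<String, String> cache = new HashMap<>();
        private final KeyValueBackend backend;

        CachingNodeBase(KeyValueBackend backend) { this.backend = backend; }

        public void put(String identifier, String value) {
            backend.write(identifier, value); // persist first, then cache
            cache.put(identifier, value);
        }

        public String get(String identifier) {
            // On a cache miss, ask the backend and remember a non-null answer.
            return cache.computeIfAbsent(identifier, backend::read);
        }
    }

    // Stand-in backend for the demo; a real one would talk to a database or file system.
    static class MapBackend implements KeyValueBackend {
        private final Map<String, String> data = new HashMap<>();
        public void write(String key, String value) { data.put(key, value); }
        public String read(String key) { return data.get(key); }
    }

    // Application code: depends on NodeBase only, never on a concrete variant.
    static String tagAndReadBack(NodeBase base, String url, String tag) {
        base.put(url, tag);
        return base.get(url);
    }

    public static void main(String[] args) {
        // Only this initialization differs between the two deployments.
        NodeBase inMemory = new InMemoryNodeBase();
        NodeBase cached   = new CachingNodeBase(new MapBackend());

        System.out.println(tagAndReadBack(inMemory, "http://example.com/", "demo"));
        System.out.println(tagAndReadBack(cached,   "http://example.com/", "demo"));
    }
}
```

Swapping the storage means changing the two lines in main that pick an implementation; tagAndReadBack never notices.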