10gen: Why NoSQL is Here to Stay…

Good line of reasoning in 10gen’s blog post:

One reason why NoSQL, or some iteration, is here to stay is that the way computer architectures are heading, having systems that can run across multiple machines is going to be an absolute requirement. The limitations of vertical scaling are going to get worse and worse. You’re going to get new chips that have more and more CPU cores on them, but the speed isn’t much higher. And they’re going to be cheaper too so you can get more computers but you’re not going to be able to get one computer that’s really fast at any price. But you’re going to be able to get 1000 computers that are not terribly fast really cheaply. So the question is, at the data storage layer, can you leverage that? The traditional approach is no, not without a lot of manual effort.  But changing computer architectures, as well as the growth of cloud computing, necessitates a better set of database systems built to achieve scale. These new solutions are going to solve that and it’s going to be critical. We want a new set of tools for the data storage layer that work well with those cloud principles, which are things like infinite scalability, low to 0 configuration, and ease of development without friction.

What Makes It NoSQL?

Alex Popescu picks up my post to the NoSQL mailing list and seems to agree:

An interesting post on the NOSQL Group about what takes a storage to be considered NoSQL:

  1. SQL-the-language vs. alternate query languages
  2. A tabular model for data as opposed to one that is not (e.g. key-value, object, graph, …)
  3. ACID vs. non-ACID
  4. Centralized vs. distributed/decentralized

In case we agree with the author, Johannes Ernst, then we might be tempted to conclude as he does:

It’s interesting to observe that any “NoSQL” product could be “NoSQL” in any number of these dimensions. […]

Which would also explain why so many “NoSQL” products are so dissimilar to each other.

So, what makes it NoSQL?

Applying this to InfoGrid:

  • InfoGrid does not use SQL-the-language. Otherwise, how could we do things such as blessing the same object with multiple types at run-time? (Example)
  • InfoGrid uses a graph database model, not a tabular model, for the same reason Tim Berners-Lee decided that the entire world-wide-web could not fit into a relational database either.
  • InfoGrid relaxes ACID. No truly distributed system that I know of has ever had ACID properties, nor wanted them. Too many things can go wrong.
  • InfoGrid uses the “small pieces loosely joined” paradigm, not a top-down paradigm.

So if there are degrees of NoSQL-ness, InfoGrid scores high. (all while being able to run on top of a SQL database, if you wish — and no, that’s not a contradiction)

Jonathan Ellis: The NoSQL Ecosystem

Excellent article by Jonathan Ellis on the various approaches to non-relational databases in the market today. He categorizes products and projects along three dimensions:

  1. scalability, in particular how well one can add and remove servers in local or remote data centers
  2. data and query model. He finds a lot of variety there.
  3. persistence design. Alternatives range from in-memory only to smart caching strategies to on-disk.

This categorization is really useful, and more useful than several other categorizations that have been proposed.

Let’s apply this to InfoGrid’s graph database layer:

Re scalability, InfoGrid scales as well as the underlying persistence layer. InfoGrid makes storage pluggable by delegating to the Store abstraction, and Store can be implemented on top of any key-value store. So InfoGrid is just as scalable as the underlying Store.

InfoGrid’s data and query model is based on a graph and an explicit object model. This makes life even easier for the developer than any of the alternatives he discusses in his article. Also, we think our traversal API is a lot simpler than some others that we have seen.

InfoGrid’s persistence design actually gives developers more choices than is typical: InfoGrid can be entirely in memory (if class MMeshBase is instantiated, for example), or smartly cached to an external Store (if class StoreMeshBase is instantiated). Most importantly: the API that developers write to is the same. This allows developers to write application code once, and only later decide how to store their application data. Or, if one kind of Store does not work out (or does not scale once the application becomes popular), move to another without changing the application (other than the initialization).
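The idea of writing application code once and choosing the storage later can be illustrated with a small, self-contained sketch. Everything in it is hypothetical and invented for illustration: `KeyValueStore`, `InMemoryStore`, `CachingStore` and `NameRegistry` are not InfoGrid classes, and the real `Store`, `MMeshBase` and `StoreMeshBase` APIs look different. The sketch only shows the shape of the design: the application depends on one interface, and only the initialization decides what sits behind it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical key-value contract; NOT InfoGrid's actual Store interface. */
interface KeyValueStore {
    void put(String key, String value);
    Optional<String> get(String key);
}

/** In-memory variant: everything is lost when the process exits. */
final class InMemoryStore implements KeyValueStore {
    private final Map<String, String> data = new HashMap<>();

    @Override public void put(String key, String value) {
        data.put(key, value);
    }

    @Override public Optional<String> get(String key) {
        return Optional.ofNullable(data.get(key));
    }
}

/** Write-through cache in front of any other store (assumed to be slower). */
final class CachingStore implements KeyValueStore {
    private final Map<String, String> cache = new HashMap<>();
    private final KeyValueStore backing;

    CachingStore(KeyValueStore backing) {
        this.backing = backing;
    }

    @Override public void put(String key, String value) {
        cache.put(key, value);
        backing.put(key, value); // write through so the backing store stays current
    }

    @Override public Optional<String> get(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        Optional<String> loaded = backing.get(key);
        loaded.ifPresent(v -> cache.put(key, v)); // remember for next time
        return loaded;
    }
}

/** Application code: depends only on the interface, never on a concrete store. */
final class NameRegistry {
    private final KeyValueStore store;

    NameRegistry(KeyValueStore store) {
        this.store = store;
    }

    void rename(String id, String name) {
        store.put("name:" + id, name);
    }

    String nameOf(String id) {
        return store.get("name:" + id).orElse("(unknown)");
    }
}
```

Switching from `new NameRegistry(new InMemoryStore())` to `new NameRegistry(new CachingStore(someOtherStore))` changes one line of initialization and nothing else in the application.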

The NoSQL Business and Use Cases

My question about the most important business and use cases for NoSQL technologies on the NoSQL mailing list sparked an interesting discussion. There appears to be widespread agreement on the following three high-level use/business cases:

  1. The amount of data, or bandwidth, required by an application is so massive that a massively distributed architecture is needed.
    This is the original use case for systems such as Google’s BigTable built to index the internet.
  2. The query load or query complexity is too large to be handled by relational “joins”.
    Digg explained this very well as their reason to move to a NoSQL architecture.
  3. The gap between the physical relational data structures required by a SQL database and an application’s schema complexity and flexibility requirements is too large.
    This encompasses the entire range from needing very loose, weakly typed storage to needing very expressive, strongly typed systems (e.g. graph databases with explicit object models).

Each of these is, of course, a big category, and more detail can be added. But it is good to see that the community seems to be able to agree on the top three. It should also put to rest the argument that “NoSQL is not needed”.

Carsonified: Why Graph Databases

Martin Kleppmann summarizes the case for Graph Databases at carsonified.com. This is exactly why InfoGrid is built around a graph of MeshObjects:

… graph databases focus on the relationships between items — a better fit for highly interconnected data models.

Standard SQL cannot query transitive relationships, i.e. variable-length chains of joins which continue until some condition is reached. Graph databases, on the other hand, are optimised precisely for this kind of data. Look out for these symptoms indicating that your data would better fit into a graph model:

  • you find yourself writing long chains of joins (join table A to B, B to C, C to D) in your queries;
  • you are writing loops of queries in your application in order to follow a chain of relationships (particularly when you don’t know in advance how long that chain is going to be);
  • you have lots of many-to-many joins or tree-like data structures;
  • your data is already in a graph form (e.g. information about who is friends with whom in a social network).

Graph databases are often associated with the semantic web and RDF datastores, which is one of the applications they are used for. I actually believe that many other applications’ data would also be well represented in graphs. However, as before, don’t try to force data into a graph if it fits better into tables or documents.

In our experience, particularly social applications or applications that deal with complex interrelated data are much easier to build using a graph of typed objects in InfoGrid than to shoehorn into relational tables. But then, InfoGrid can use relational databases as storage engines, so we have the best of both worlds: graphs on the front, and enterprise-friendly SQL on the back.
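The "chain of relationships of unknown length" symptom from the list above is easy to see in code. The following is a minimal sketch, assuming a tiny in-memory friendship graph; `FriendGraph` and its methods are hypothetical names made up for this example and are not part of InfoGrid or any graph database. It follows friend-of-friend links breadth-first until nothing new turns up, which is the traversal an application would otherwise emulate with a loop of queries.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory graph of symmetric "is friends with" relationships. */
final class FriendGraph {
    private final Map<String, Set<String>> edges = new HashMap<>();

    /** Records the friendship in both directions. */
    void addFriendship(String a, String b) {
        edges.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        edges.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    /**
     * Returns everyone reachable from start over any number of hops,
     * in breadth-first order, excluding start itself.
     */
    Set<String> reachableFrom(String start) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            Set<String> neighbors =
                edges.getOrDefault(current, Collections.<String>emptySet());
            for (String next : neighbors) {
                // Set.add returns false for already-seen nodes, so cycles terminate.
                if (visited.add(next)) {
                    queue.add(next);
                }
            }
        }
        visited.remove(start);
        return visited;
    }
}
```

The length of the chain never appears in the code: the loop simply stops when the frontier is empty, whether that takes one hop or a hundred.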

Neo4j and InfoGrid

Just came across Neo4j, an “open source NoSQL graph database”. Neo4j is clearly very close in philosophy and API to InfoGrid, in fact closer than anything else that I’ve come across so far.

Compare this:

Neo4j:

Node firstNode
    = neo.createNode();
Node secondNode
    = neo.createNode();
firstNode.createRelationshipTo(
    secondNode,
    MyRelationshipTypes.KNOWS );

InfoGrid:

MeshObject firstObject
    = life.createMeshObject();
MeshObject secondObject
    = life.createMeshObject();
firstObject.relateAndBless(
    secondObject,
    MySubjectArea.MESHOBJECT_KNOWS_MESHOBJECT.getSource() );

If this isn’t similar, what is?

Transactions:

Neo4j:

Transaction tx = neo.beginTx();
try {
    // do something
    tx.success();
} finally {
    tx.finish();
}

InfoGrid:

Transaction tx = null;
try {
    tx = mb.createTransactionNow();
    // do something
} finally {
    if( tx != null ) {
        tx.commitTransaction();
    }
}

Properties:

Neo4j:

firstNode.setProperty(
    "Name",
    "Neo4j" );

InfoGrid:

firstObject.setPropertyValue(
    MySubjectArea.PERSON_NAME,
    StringValue.create( "InfoGrid" ));

Regarding differences, it seems the Neo4j folks have spent a lot more time than we have on making it a “database” (while InfoGrid delegates to other storage engines like MySQL or Hadoop).
On the other hand, InfoGrid.fnd is type-safe, and instead of a command-line shell, uses a set of web Viewlets to access the graph of objects. (Which then can be incrementally refactored into an application.)
And then of course there is InfoGrid.net, which does not seem to have an equivalent in Neo4j. (see also InfoGrid Core features and Neo4j wiki)

Worth digging into more deeply …

Linux Magazine Article on MongoDB

Good introduction to MongoDB at Linux Magazine. It makes the case for post-relational web application architectures quite well:

Web applications and traditional relational databases are nearing an end to their tumultuous relationship. For over a decade now, most Web applications have been built on top of relational databases, with various layers of indirection to simplify coding and boost the productivity of developers. For every Web programming language, there are any number of object-relational mapping (ORM) choices, each with pros and cons, yet none good enough that a developer can forget about SQL or ignore protecting the database. Moreover, as Web applications grow more complicated and sites need to be created faster, adapt instantly, and scale massively, these old solutions are no longer satisfying the demands of the Web.

There are a number of different projects working on new database technologies, all of which forego the stalwart relational model. Relational databases are difficult to scale, largely because distributed joins are difficult to perform efficiently. Further, mapping from the many popular dynamically-typed languages to SQL is complicated, inefficient, and time consuming. While often called the “NoSQL” movement, the need for new technologies is caused by the relational model, rather than SQL.

Beyond the relational model, there are a number of data model choices: key-value stores, tabular databases, graph databases, and document databases…

InfoGrid is built around a graph database in this terminology, except of course that, by virtue of the Store abstraction, InfoGrid can delegate to virtually any kind of database. InfoGrid does not contain a database itself. We believe this separation of concerns makes it easiest for developers to benefit from the high-level services that InfoGrid offers, while still being able to choose whatever storage technology they prefer.

InfoGrid and Relational Databases

The other day I was asked:

Your pitch for InfoGrid really disses relational databases. But then, InfoGrid applications usually use MySQL (or PostgreSQL) to store their data. What gives?

To which I responded:

All the database vendors want you to store your data in their database, instead of files in the file system. But then, the databases themselves store their data as files in the file system. What gives?

This does not sound as contradictory. A database may or may not store its data as files in a file system; as an application developer, you really don’t care much. You care about the high-level facilities (such as SQL) that the database provides, because writing code against them is much easier and faster than writing against files (for many applications).

The InfoGrid argument is the same one, just one level up: It is much better to develop against the InfoGrid APIs than against SQL directly, because of all the high-level facilities that InfoGrid gives you. Here’s an example:

Try this with a relational database:

MeshObject employee = ...;
employee.bless( CustomerSubjectArea.CUSTOMER );

Your employee has just also become a customer, with all that this entails (e.g. participating in the relationship Customer_Places_Order, which you can’t as a mere employee). For more on blessing objects, see the documentation.

With raw SQL, you wouldn’t even know where exactly to start, but chances are you would have to redesign your schema, and write and update a whole lot of application code.
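To show what adding a type at run time means, independent of any particular product, here is a minimal sketch. `BlessableObject`, the string type names and `placeOrder` are all hypothetical and invented for this example; this is not how InfoGrid implements `bless`, which works with generated, type-safe subject areas rather than strings. The sketch only shows the behavior: the same object instance gains a capability the moment it gains a type, with no schema migration.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for an object that can gain types at run time. */
final class BlessableObject {
    private final Set<String> types = new HashSet<>();
    private final Set<String> orders = new HashSet<>();

    /** Adds a type to this object; existing types are kept. */
    void bless(String type) {
        types.add(type);
    }

    boolean isBlessedBy(String type) {
        return types.contains(type);
    }

    /** Only objects currently typed as CUSTOMER may place orders. */
    void placeOrder(String orderId) {
        if (!isBlessedBy("CUSTOMER")) {
            throw new IllegalStateException("object is not a customer");
        }
        orders.add(orderId);
    }

    int orderCount() {
        return orders.size();
    }
}
```

An object blessed only as `"EMPLOYEE"` is rejected by `placeOrder`; after `bless("CUSTOMER")` the very same instance can place orders and is still an employee.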

Adam Keys: Post-Relational, not NoSQL

Adam Keys makes a good argument why the budding NoSQL movement should instead be called post-relational. Among other things, he says:

Right now, the best we have is NoSQL. The problem with that name is that it only defines what it is not…

What we’re seeing is the end of the assumption that valuable data should go in some kind of relational database. The end of the assumption that SQL and ACID are the only tools for solving our problems. The end of the viability of master/slave scaling. The end of weaving the relational model through our application code.

We agree, which is why we use the term post-relational when talking about InfoGrid.

NoSQL East Conference in Atlanta

The aspiring NoSQL movement has a conference:

no:sql(east)

October 28-30, 2009. Atlanta, GA.

You have to love their motto:

select fun, profit from real_world where relational=false;

Reportedly there will be talks on:

and perhaps others (e.g. Project Voldemort, Tokyo *, Neo4J, Riak, Kai, Hypertable, Dryad/Cosmos).

Should be interesting.