March 4th, 2010
Whether programming systems should be strongly typed or weakly typed has been one of the longest-running controversies in the history of computer science, going back something like 50 years. Generally speaking, strongly typed systems tend to require more programmer effort up-front, in exchange for earlier or more definite error reports.
We also need to distinguish between static typing and dynamic typing: a dynamically typed system enables changes of types at run-time, while a statically typed system can’t do that.
Not surprisingly, typing for graph databases (or any other kind of NoSQL database) can be implemented in different ways, too:
| | Weakly typed | Strongly typed |
| --- | --- | --- |
| Dynamically typed | At development time: types may be declared but are not checked except perhaps rudimentarily. At run-time: errors may occur, which may or may not be discovered; mis-interpretations of data are possible; data corruption is likely in case of programming errors. | At development time: types are declared and checked as well as possible. At run-time: all operations are checked for type safety; types can be discovered dynamically; type mis-interpretations are not possible. |
| Statically typed | At development time: only rudimentary checking, if at all. At run-time: errors may occur, which may or may not be discovered; mis-interpretations of data are possible; data corruption is likely in case of programming errors. | At development time: all type errors are caught; additional developer effort is required; some types of data are hard to represent. At run-time: no checking required due to “correctness by construction”. |
Let’s insert some systems into this table:
| | Weakly typed | Strongly typed |
| --- | --- | --- |
| Dynamically typed | Most NoSQL systems | InfoGrid |
| Statically typed | | SQL database (if used as intended) |
Side note: when NoSQL proponents argue that weakly typed systems are much better than stronger-typed SQL, they sometimes throw out the baby with the bath water: there are four choices, not two. We agree that statically, strongly typed systems like a typical SQL database have considerable disadvantages in a fast-moving world, but so do weakly typed systems; the only difference is the type of disadvantage. In our view, a strong but dynamic type system is the best compromise for most applications with a non-trivial schema, which is why InfoGrid V2 implements it. (There are some applications that do not require a non-trivial schema; web caching, for example.)
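To make the "strong but dynamic" quadrant more concrete, here is a minimal Java sketch. It is not InfoGrid's API: the names `TypedNodeSketch`, `NodeType`, `Node`, `bless` and `set` are invented for illustration. The idea it tries to show is that an object can gain types at run-time (dynamic), while every property write is still checked against the declared types (strong).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a strongly but dynamically typed graph node; not InfoGrid's API. */
class TypedNodeSketch {

    /** A type declares which properties exist and which Java class each value must have. */
    static final class NodeType {
        final String name;
        final Map<String, Class<?>> properties = new HashMap<>();

        NodeType(String name) {
            this.name = name;
        }

        NodeType property(String propertyName, Class<?> valueClass) {
            properties.put(propertyName, valueClass);
            return this;
        }
    }

    /** A node whose set of types can grow at run-time, with every write checked. */
    static final class Node {
        private final Set<NodeType> types = new HashSet<>();
        private final Map<String, Object> values = new HashMap<>();

        /** Dynamic: add a type to an existing object at run-time. */
        void bless(NodeType type) {
            types.add(type);
        }

        /** Types can be discovered at run-time. */
        Set<NodeType> getTypes() {
            return Collections.unmodifiableSet(types);
        }

        /** Strong: a write succeeds only if a current type declares the property and the value fits. */
        void set(String propertyName, Object value) {
            for (NodeType type : types) {
                Class<?> expected = type.properties.get(propertyName);
                if (expected == null) {
                    continue; // this type does not declare the property
                }
                if (!expected.isInstance(value)) {
                    throw new IllegalArgumentException(
                            "Property '" + propertyName + "' requires " + expected.getSimpleName());
                }
                values.put(propertyName, value);
                return;
            }
            throw new IllegalStateException("No current type declares property '" + propertyName + "'");
        }

        Object get(String propertyName) {
            return values.get(propertyName);
        }
    }

    public static void main(String[] args) {
        NodeType bookmark = new NodeType("Bookmark").property("url", String.class);
        NodeType rated = new NodeType("Rated").property("stars", Integer.class);

        Node node = new Node();
        node.bless(bookmark);
        node.set("url", "http://infogrid.org/");

        node.bless(rated); // a second type, added while the program runs
        node.set("stars", 5);

        try {
            node.set("stars", "five"); // wrong value type: rejected at run-time
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

A weakly typed store would accept the last write silently and leave the mis-interpretation to be found later, if at all. A statically typed one would not allow the second `bless` call without a schema change.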
In a graph database like InfoGrid, the following items can be typed:
In other graph databases, only a subset of these items may be typed. More in the next post on types.
February 17th, 2010
Building on the recent InfoGrid FirstStep example, here is another: FirstStepWithMySQL.
Using the same bookmarking/tagging application as FirstStep, it shows how to persist the same MeshObjectGraph using MySQL as a key-value store. It consists of two apps:
- the first app initializes the store, and creates a graph of objects
- the second app retrieves the graph from the store, and traverses the graph to retrieve information.
It’s of course a trivial example, but it illustrates:
- how easy it is in InfoGrid to keep the same application running against different storage backends with minimal code changes (in the initialization only)
- some of the advantages of graph databases compared to other types of storage technologies: note how simple it is to traverse the graph in all directions.
Annotated source code is here.
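As a rough illustration of the idea, and not the FirstStepWithMySQL code itself, the following Java sketch stores a small bookmark/tag graph in a key-value store and then reads it back to traverse in both directions. Everything here is an assumption made for the example: the `KeyValueStore` interface, the in-memory stand-in for a MySQL table, and the comma-separated neighbor encoding (which means identifiers must not contain commas).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a graph persisted as key-value records, traversable in both directions. */
class KeyValueGraphSketch {

    /** Assumed stand-in for a key-value store, e.g. a two-column table in MySQL. */
    interface KeyValueStore {
        void put(String key, String value);
        String get(String key);
    }

    /** In-memory implementation so the sketch runs without a database. */
    static final class InMemoryStore implements KeyValueStore {
        private final Map<String, String> rows = new HashMap<>();

        @Override public void put(String key, String value) { rows.put(key, value); }
        @Override public String get(String key) { return rows.get(key); }
    }

    /** "First app": record a relationship by storing it on both ends. */
    static void relate(KeyValueStore store, String a, String b) {
        addNeighbor(store, a, b);
        addNeighbor(store, b, a);
    }

    private static void addNeighbor(KeyValueStore store, String node, String neighbor) {
        String existing = store.get(node);
        boolean empty = existing == null || existing.isEmpty();
        store.put(node, empty ? neighbor : existing + "," + neighbor);
    }

    /** "Second app": read one record and return the identifiers of its neighbors. */
    static List<String> neighbors(KeyValueStore store, String node) {
        String value = store.get(node);
        if (value == null || value.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(value.split(",")));
    }

    public static void main(String[] args) {
        KeyValueStore store = new InMemoryStore();

        // Initialize the store with a small graph.
        relate(store, "bookmark:a", "tag:graphdb");
        relate(store, "bookmark:b", "tag:graphdb");
        relate(store, "bookmark:a", "tag:java");

        // Traverse from a bookmark to its tags ...
        System.out.println(neighbors(store, "bookmark:a"));
        // ... and from a tag back to its bookmarks, with the same call.
        System.out.println(neighbors(store, "tag:graphdb"));
    }
}
```

Because each relationship is recorded on both participating objects, neither direction of traversal needs a join or a secondary index. The cost is that every relationship is written twice.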
February 17th, 2010
Available for download here. This is mainly an incremental improvement/bug fix release, except:
- new capabilities in the ig-lid project related to what Randy Farmer called the tripartite identity pattern.
- new example application: FirstStepWithMySQL (see separate post).
Summary of changes:
- fixed endless loop when Transaction open at MeshBase die time
- Moved (Net)MeshObjectIdentifierFactory to .mesh.net packages. That allows us to tighten permissions a bit.
- Check that localIds are at least 4 chars long. Not having this check created confusing results for users of MeshWorld where the GUI assumes this, but not the graph db.
- Replaced StringRepresentationParseException with java.text.ParseException in most places.
- More specific subclasses of LidInvalidCredentialException for better error reporting
- More resilient when Gpg home dir cannot be created
- Better TimeStampValue.toString()
- Make it easier to create “su” Transactions with elevated privileges
- collect all outgoing data into the same XprisoMessage; this prevents sending more than one message in response to a single incoming message
- improved abilities to freshen Replicas
- added ForwardReferenceTest9
- added DelegatingNetMeshObjectIdentifierFactory as a convenience class
- Major refactoring in module ig-lid to implement what Randy Farmer called the tripartite identity pattern. Nickname is what he calls Public ID. HasIdentifier is used to represent Login ID. LidPersona represents what he calls Account and also manages the relationships to the other items. There is a corresponding Model.
- Added identifier-as-entered to LidAuthenticationStatus for better error reporting
- Re-introduced LidPersona as a major concept
- TransactionAction now carries a few member variables (MeshBase, MeshBaseLifecycleManager, MeshObjectIdentifierFactory) in order to make the writing of transactional code more concise
- Split org.infogrid.probe.test into several test modules; makes them more manageable
- removed a bunch of unnecessary files from ig-vendors; they only take up space and bandwidth
- fixed multiple ModelBase bug
- failed to load model under some circumstances when not running under the Module framework
- added isIdentifiedBy method to org.infogrid.util.HasIdentifier
- changed MeshObjectSet.contains( MeshObjectIdentifier ) to MeshObjectSet.contains( Identifier )
- added FirstStepWithMySQL
- localization for LidAbortProcessingPipelineException
January 26th, 2010
The new FirstStep example application allows you to get an InfoGrid application running literally in 60 seconds or less.
FirstStep shows the essence of how a tagging application like delicious would be implemented using InfoGrid.
Instructions and annotated source code are here: http://infogrid.org/wiki/Examples/FirstStep.
January 26th, 2010
InfoGrid 2.9.2 is focused on the new project layout of the code base. This new layout has also been documented on the wiki, starting with the front page and continuing to the projects page.
The new layout will make it easier for newcomers to find their way around InfoGrid, and to selectively include only those parts of InfoGrid required for a given application. Its top-level structure is as follows:
Below, you find directories such as:
- modules: contains the functionality of the project
- tests: automated tests for the project
- testapps: web applications testing the project
- etc.
Enjoy!
January 21st, 2010
Good line of reasoning in 10gen’s blog post:
One reason why NoSQL, or some iteration, is here to stay is that the way computer architectures are heading, having systems that can run across multiple machines is going to be an absolute requirement. The limitations of vertical scaling are going to get worse and worse. You’re going to get new chips that have more and more CPU cores on them, but the speed isn’t much higher. And they’re going to be cheaper too so you can get more computers but you’re not going to be able to get one computer that’s really fast at any price. But you’re going to be able to get 1000 computers that are not terribly fast really cheaply. So the question is, at the data storage layer, can you leverage that? The traditional approach is no, not without a lot of manual effort. But changing computer architectures, as well as the growth of cloud computing, necessitates a better set of database systems built to achieve scale. These new solutions are going to solve that and it’s going to be critical. We want a new set of tools for the data storage layer that work well with those cloud principles, which are things like infinite scalability, low to 0 configuration, and ease of development without friction.
December 13th, 2009
Alex Popescu picks up my post to the NoSQL mailing list and seems to agree:
An interesting post on the NOSQL Group about what takes a storage to be considered NoSQL:
- SQL-the-language vs. alternate query languages
- A tabular model for data as opposed to one that is not (e.g. key-value, object, graph, …)
- ACID vs. non-ACID
- Centralized vs. distributed/decentralized
In case we agree with the author, Johannes Ernst, then we might be tempted to conclude as he does:
It’s interesting to observe that any “NoSQL” product could be “NoSQL” in any number of these dimensions. […]
Which would also explain why so many “NoSQL” products are so dissimilar to each other.
So, what makes it NoSQL?
Applying this to InfoGrid:
- InfoGrid does not use SQL-the-language. Otherwise, how could we do things such as blessing the same object with multiple types at run-time? (Example)
- InfoGrid uses a graph database model, not a tabular model, for the same reason Tim Berners-Lee decided that the entire world-wide-web could not fit into a relational database either.
- InfoGrid relaxes ACID. No truly distributed system that I know of has ever had ACID properties, nor wanted them. Too many things can go wrong.
- InfoGrid uses the “small pieces loosely joined” paradigm, not a top-down paradigm.
So if there are degrees of NoSQL-ness, InfoGrid scores high. (all while being able to run on top of a SQL database, if you wish — and no, that’s not a contradiction)
November 9th, 2009
Excellent article by Jonathan Ellis on the various approaches to non-relational databases in the market today. He categorizes products and projects along three dimensions:
- scalability, in particular how well one can add and remove servers in local or remote data centers
- data and query model. He finds a lot of variety there.
- persistence design. Alternatives range from in-memory only to smart caching strategies to on-disk.
This categorization is really useful, and more useful than several other categorizations that have been proposed.
Let’s apply this to InfoGrid’s graph database layer:
Re scalability, InfoGrid scales as well as the underlying persistence layer. InfoGrid makes storage pluggable by delegating to the Store abstraction, and Store can be implemented on top of any key-value store. So InfoGrid is just as scalable as the underlying Store.
InfoGrid’s data and query model is based on a graph and an explicit object model. This makes life even easier for the developer than any of the alternatives he discusses in his article. Also, we think our traversal API is a lot simpler than some others that we have seen.
InfoGrid’s persistence design actually gives developers more choices than is typical: InfoGrid can be entirely in memory (if class MMeshBase is instantiated, for example), or smartly cached to an external Store (if class StoreMeshBase is instantiated). Most importantly: the API that developers write to is the same. This allows developers to write application code once, and only later decide how to store their application data. Or, if one kind of Store does not work out (or does not scale once the application becomes popular), move to another without changing the application (other than the initialization).
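The "write once, decide on storage later" point can be sketched in Java as follows. This is an illustration under assumptions, not InfoGrid's actual classes: `ObjectBase`, `Store`, `MemoryObjectBase`, `StoreObjectBase` and `MapStore` are hypothetical names that only loosely mirror the roles of a MeshBase and a Store.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of one API over two storage strategies; not InfoGrid's actual classes. */
class PluggableBackendSketch {

    /** The single API the application is written against. */
    interface ObjectBase {
        void put(String id, String payload);
        String get(String id);
    }

    /** Assumed key-value Store abstraction that a real database could implement. */
    interface Store {
        void write(String key, String value);
        String read(String key);
    }

    /** Strategy 1: everything lives in memory. */
    static final class MemoryObjectBase implements ObjectBase {
        private final Map<String, String> objects = new HashMap<>();

        @Override public void put(String id, String payload) { objects.put(id, payload); }
        @Override public String get(String id) { return objects.get(id); }
    }

    /** Strategy 2: a write-through cache in front of an external Store. */
    static final class StoreObjectBase implements ObjectBase {
        private final Store store;
        private final Map<String, String> cache = new HashMap<>();

        StoreObjectBase(Store store) {
            this.store = store;
        }

        @Override public void put(String id, String payload) {
            store.write(id, payload);
            cache.put(id, payload);
        }

        @Override public String get(String id) {
            // Load from the Store on a cache miss; absent keys are not cached.
            return cache.computeIfAbsent(id, store::read);
        }
    }

    /** Map-backed Store, used only so the sketch runs without a database. */
    static final class MapStore implements Store {
        private final Map<String, String> rows = new HashMap<>();

        @Override public void write(String key, String value) { rows.put(key, value); }
        @Override public String read(String key) { return rows.get(key); }
    }

    /** Application code: identical regardless of which ObjectBase it is handed. */
    static String runApplication(ObjectBase base) {
        base.put("bookmark-1", "http://infogrid.org/");
        return base.get("bookmark-1");
    }

    public static void main(String[] args) {
        // Only this initialization differs between the two configurations.
        ObjectBase base = args.length > 0 && args[0].equals("store")
                ? new StoreObjectBase(new MapStore())
                : new MemoryObjectBase();
        System.out.println(runApplication(base));
    }
}
```

Moving to a different kind of Store would mean writing another `Store` implementation and changing the one line in `main` that constructs it; `runApplication` stays untouched.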
November 2nd, 2009
My question about the most important business and use cases for NoSQL technologies on the NoSQL mailing list sparked an interesting discussion. There appears to be widespread agreement on the following three high-level use/business cases:
- The amount of data, or bandwidth, required by an application is so massive that a massively distributed architecture is needed.
This is the original use case for systems such as Google’s BigTable built to index the internet.
- The query load or query complexity is too large to be handled by relational “joins”.
Digg explained this very well as their reason to move to a NoSQL architecture.
- The gap between the physical relational data structures required by a SQL database and an application’s schema complexity and flexibility requirements is too large.
This encompasses the entire range from needing very loose, weakly typed storage to needing very expressive, strongly typed systems (e.g. graph databases with explicit object models).
Each of these is of course a big category, and more detail can be added. But it is good to see that the community seems to be able to agree on the top three. It should also put to rest the argument that “NoSQL is not needed”.
October 28th, 2009
Martin Kleppmann summarizes the case for Graph Databases at carsonified.com. This is exactly why InfoGrid is built around a graph of MeshObjects:
… graph databases focus on the relationships between items — a better fit for highly interconnected data models.
Standard SQL cannot query transitive relationships, i.e. variable-length chains of joins which continue until some condition is reached. Graph databases, on the other hand, are optimised precisely for this kind of data. Look out for these symptoms indicating that your data would better fit into a graph model:
- you find yourself writing long chains of joins (join table A to B, B to C, C to D) in your queries;
- you are writing loops of queries in your application in order to follow a chain of relationships (particularly when you don’t know in advance how long that chain is going to be);
- you have lots of many-to-many joins or tree-like data structures;
- your data is already in a graph form (e.g. information about who is friends with whom in a social network).
…
Graph databases are often associated with the semantic web and RDF datastores, which is one of the applications they are used for. I actually believe that many other applications’ data would also be well represented in graphs. However, as before, don’t try to force data into a graph if it fits better into tables or documents.
In our experience, particularly social applications or applications that deal with complex interrelated data are much easier to build using a graph of typed objects in InfoGrid than to shoehorn into relational tables. But then, InfoGrid can use relational databases as storage engines, so we have the best of both worlds: graphs on the front, and enterprise-friendly SQL on the back.
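The "variable-length chain of relationships" symptom from the quote can be illustrated with a short Java sketch. It uses a plain adjacency map and a breadth-first traversal rather than InfoGrid's traversal API; the class name, the `befriend` helper and the sample names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: following a friendship chain of unknown length in a graph. */
class TransitiveTraversalSketch {

    /** Record a symmetric "is friends with" relationship. */
    static void befriend(Map<String, Set<String>> friends, String a, String b) {
        friends.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        friends.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    /** Everyone reachable from start through any number of hops (breadth-first), excluding start. */
    static Set<String> reachable(Map<String, Set<String>> friends, String start) {
        Set<String> visited = new HashSet<>();
        Set<String> result = new LinkedHashSet<>(); // keeps discovery order
        Deque<String> queue = new ArrayDeque<>();

        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String next : friends.getOrDefault(current, Collections.emptySet())) {
                if (visited.add(next)) { // true only the first time we see this person
                    result.add(next);
                    queue.add(next);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> friends = new HashMap<>();
        befriend(friends, "alice", "bob");
        befriend(friends, "bob", "carol");
        befriend(friends, "carol", "dave");
        befriend(friends, "erin", "frank"); // a separate component

        System.out.println(reachable(friends, "alice")); // prints [bob, carol, dave]
    }
}
```

The loop does not need to know in advance how long the chain is. With fixed joins, each additional hop would require another join in the query or another round trip from the application.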