Turn your application into a graph computation engine

Published 10/19/2015 6:42:00 AM
Filed under Other

Sometimes I come across a subject that doesn't really fancy my interest. Either because I absolutely don't see any
application for a particular piece of technology or because I simply don't have enough time to take a closer look.

Graphs is one of of those things that just didn't grab my attention until a few weeks ago. But now that I have seen
the amount of fuss around this subject I think I should have looked into it a bit earlier.

Like a lot of IT related things, graph computation isn't something new. I looked it up on Wikipedia and it's actually
quite old. The theory dates from 1736. I'd say that's plenty old for something
that is hot in 2015. But looking at a few more Wikipedia articles I've come to the conclusion that graph databases are not
quite that old. Most of them were invented in the last few years.

Why do we need graphs?

So why all the fuss? I think that we're finally reaching a point where we have found a problem that requires a graph.
Tables are cool, but very limiting if you're trying to determine things based on the relationship between two objects.

For example, if you want to know the measure of similarity between users you're not talking about the users directly.
Rather you're talking about a relationship between them. Also, if you want to calculate a measure of interest based
on user actions, you're not talking about articles and users, you're talking about the connection between them.

A table oriented application can't tell you that. You can try to model a relation as an object, but ultimately the grammar you're
using is limited to talking about objects. Our brain can get us very far in translating between relations and objects, but
there is a limit. So if you're planning on doing something with a focus on relations you best grab something that lets you
talk about relationships in a way that is natural.

It's not that you can't do things in SQL when you want to perform graph oriented computations, but it's quite a bit harder
than doing the same thing in Cypher or Gremlin.

Getting started with graph computation

As I was looking around the web I discovered that there are quite a few tools out there that let you compute things in a graph.
There's Neo4J, a commercial database and there's a lot of open-source out there too. One of them is TinkerPop.

Tinkerpop is a typical Apache Foundation project. Loads and loads (Did I mention there's a lot) of stuff available and as of the moment
of writing not a whole lot of documentation that is easy to digest. But don't let that keep you back. It's great stuff once you know how
to read the docs.

The easiest way to get started with graph computation using TinkerPop is by grabbing the gremlin console.
This will let you work with graphs from the comfort of the terminal without having to learn a programming language.
Well another programming language on top of Gremlin that is. As gremlin has its own syntax you are required to learn that.

The gremlin console is made up of several components of which the main ones are:

  • Gremlin: The query language to traverse the graph.
    You can ask Gremlin things like "Give me all the nodes that have certain properties, that are connected
    through a particular type of relationship to another that has a certain property.". The cool thing here is that there's a Gremlin implementation for C#, Java, Ruby, Python and Scala. All of them are strong-typed (more or less, depending on the implementation). And most of them follow the same syntax, which makes it easy to learn Gremlin in your
    favorite development environment.

  • Pipes: A lower level API that translates the queries you write into an executable set of steps.

    This API uses a combination of a visitor and the steps to build a resultset. It basically takes the steps and executes them
    one by one against the current vertex or edge.

  • Blueprints: The driver that knows how to talk to a graph database.

    This API knows only about vertices and edges. It doesn't know how to traverse them, but you can ask Blueprints for anything you like.
    Very low level, but a necessary part of the Graph computation engine.

In practice you don't have to know all of these APIs to work with a graph, except for Gremlin. But I think it's good to know what to expect, should you
have to drop down to a lower level to get the results you want.

To get a sense for what you can expect from Gremlin, let's start the console app. When you've downloaded the Gremlin console, extract it somewhere on disk.
Open up a terminal and navigate to the folder where you extracted the Gremlin console and execute the following command:

bin/gremlin.sh

This will fire up the console and show some fancy ASCII art. Once started you need something to work on.
Normally if you want to start a graph computation you start by creating a new empty graph. To do this you need to
query the graph like this:

g = TinkerGraph.open()

With the empty graph in place you can start to add vertices and edges to it using the following commands:

// Create a country to live in
netherlands = g.addVertex("The Netherlands")

// Create the medals that can be won
gold = g.addVertex("Gold")
silver = g.addVertex("Silver")
bronze = g.addVertex("Bronze")

// Create a few participants who won something
epke = g.addVertex("Epke Zonderland")
marianne = g.addVertex("Marianne Vos")
marit = g.addVertex("Marit Bouwmeester")

// All of the people live in the netherlands
epke.addEdge("lives", netherlands)
marianne.addEdge("lives", netherlands)
marit.addEdge("lives", netherlands)

// The participants won a few medals
epke.addEdge("won", gold)
marianne.addEdge("won", gold)
marit.addEdge("won", silver)

addVertex creates a new vertex with a label. You can attach properties to them using the property(key,value) method.
To connect to nodes in the graph call the addEdge method with a label and target vertex. Edges can also have properties.

Now that you have some data, let's do something with it. Let's find out who won a medal:

g.traversal().V().hasLabel("Gold","Silver","Bronze").in("won").label()

This returns all nodes that are connected through the won relationship to the nodes that have the label Gold, Silver or Bronze.
Notice that I used the in() command to find the participant vertices in the graph. I'm using the in() command, because the edge is
pointing from participant to medal instead of the other way around. Would I use out() here I wouldn't get anything back.

The results in itself is not very useful, because you don't know who won which medal. Let's see if we can make this better:

g.traversal().V().hasLabel("Gold","Silver","Bronze").as("medal").in("won").as("participant").select("medal","participant")

The difference here is that we tell the graph traversal that we want to label the first set of vertices as medals and the last set of vertices as participants.
Finally you can use the select step to select the labeled vertices to return them in the resultset.

Please note that while I use g.traversal() here I recommend that you call g.traversal() only once and save it in a variable.
Calling the traversal() method is quite expensive (Thanks @twarko for pointing me in the right direction).

As you can see Gremlin can get complicated very quickly. But imagine doing this in SQL with a bigger graph that has properties on the edges and vertices.
Something like the taxonomy of wikipedia, but with metadata.

Moving beyond the Gremlin terminal

Building graphs in the Gremlin terminal is fun when you're trying something out, but you can't run any production system on that.
For production use you need to have a backend that can store your data and scale beyond one machine.

You are in luck, there are quite a few options available to run a Graph in production. There's Neo4J which is a commercial product.
It supports Tinkerpop3, but it also has its own Graph computation language called Cypher. I personally really like Cypher, so if you're
going for Neo4J I suggest you try Cypher first.

However if you want an open-source option I suggest you take a look at TitanDB. It's a full implementation of Tinkerpop3 and if you use
the TitanGraph API directly it supports even more. TitanDB supports storing and indexing graphs using several technologies.
One combination that I found is very scalable is Cassandra and ElasticSearch. When you configure TitanDB to use this combination it will
store the vertices and edges in Cassandra and index them in ElasticSearch. Both of these products can be scaled across several hundreds of nodes
so you don't have to worry about that.

To give you an idea of what graph computation using TitanDB would look like, take a look at the following example:

graph = TitanFactory.build()
        .set("storage.backend", "cassandra")
        .set("storage.hostname","127.0.0.1")
        .set("index.search.hostname", "127.0.0.1")
        .set("index.search.elasticsearch.client-only", "true")
        .open();

Vertex epke = graph.addVertex("Epke Zonderland");
Vertex marianne = graph.addVertex("Marianne Vos");
Vertex marit = graph.addVertex("Marit Bouwmeester");

Vertex gold = graph.addVertex("Gold");
Vertex silver = graph.addVertex("Silver");
Vertex bronze = graph.addVertex("Bronze");

epke.addEdge("won", gold);
marianne.addEdge("won", gold);
marit.addEdge("won", silver);

GraphTraversal<Vertex, Map<String, Object>> medalsWonByParticipants =
        graph.traversal().V().hasLabel("Gold", "Silver", "Bronze").as("medal")
                .in("won").as("participant").select("medal", "participant");

The first lines are especially interesting. Using the titan factory I created a new configuration
that connects to Cassandra on localhost and ElasticSearch also running on my local machine.

With the connection made to the graph I can add vertices and edges. And finally I can run gremlin queries against the graph.
The code is all Java. Gremlin isn't your average grammar. It is a set of patterns that a Gremlin implementation should follow.
Because of that, the API in Ruby is the same as in Java, Scala and Python.

Keep this in mind when you are looking up things in the manual. All samples are actually portable to your own favorite programming
language without much effort. The methods are the same and the kinds of resultsets are also equal across all programming languages.

Wnat to learn more?

If you're interested in more please go grab TitanDB and give it a spin. You can find the software right here: http://thinkaurelius.github.io/titan/
The documentation on the website is limited, but like I said before: The Gremlin API is the same as for all other TinkerPop implementations.
For the Gremlin API description check the following website: http://tinkerpop.incubator.apache.org/docs/3.0.1-incubating/#traversal