I need some help to confirm my choice, and I would like to learn more if you can give me some information. My storage database is Titan with Cassandra. I have a very large graph. My goal is to use MLlib on the graph later.
My first idea was to use Titan with GraphX, but I did not find anything available or even in development, and TinkerPop is not ready yet. So I had a look at Giraph. Titan can communicate through Rexster, which is part of TinkerPop.
My question is: what are the benefits of using Giraph? Gremlin seems to do the same thing and is distributed.
Thank you very much for explaining. I think I don't really understand the difference between Gremlin and Giraph (or GraphX).
Have a nice day.
Interesting question. I am on the same track.
First, your question about MLlib. I assume that you mean Apache Spark MLlib, the machine learning (ML) library on top of Apache Spark. So my conclusion is: you want to run ML algorithms for purposes such as clustering and classification on the data in your Titan/Cassandra-based graph database. Please note that you could also use graph processing algorithms, like PageRank as mentioned by spidy, to do things like clustering on top of your Titan/Cassandra graph database. In other words: you don't need ML to do clustering when your starting point is a graph database.
Apache Spark MLlib seems to be future-proof and widely supported; its most recent announcements were about new ML algorithms, although Apache Mahout, another Apache ML project, is more mature in terms of the number of supported ML algorithms. Apache Mahout has also adopted Apache Spark as an execution backend, which is why I mention it in this post. In addition to in-memory computing, Apache Spark offers the mentioned MLlib for machine learning, Spark SQL (which is like Hive on Spark), GraphX (a graph processing system, as explained by spidy) and Spark Streaming for processing streaming data.
I consider Apache Spark itself a logical data layer, represented as RDDs (Resilient Distributed Datasets), on top of storage layers such as Cassandra, Hadoop/HCatalog and HBase. A connector between Spark and Cassandra is available. Note that RDDs are immutable: you do not alter an RDD in place, every transformation produces a new RDD, so Spark is meant for processing and analyzing the data rather than updating it record by record. Regarding the RDD as Spark's logical storage layer: you could compare an RDD to a view in the good old SQL times; an RDD gives you a view on, for example, a table in Cassandra or HBase. Note also that Apache Spark offers an API for three development environments: Scala, Java and Python.
Apache Giraph is also a graph processing toolset, functionally equivalent to Apache Spark GraphX. Apache Giraph uses Hadoop as the data storage layer. You are using Titan/Cassandra, so you will probably face data migration tasks if you select Apache Giraph as your solution. Secondly, you started your post with a question regarding ML using MLlib, and Apache Giraph is not an ML solution.
Your conclusion regarding Giraph and Gremlin is not correct: they are not the same, although both work with graph data. Giraph is a solution for graph processing, as spidy explained. Using Giraph you can execute graph analysis algorithms such as PageRank (e.g. to rank who is followed the most), whilst Gremlin is meant for traversing, i.e. querying the graph database using the complex relationships (edges) between entities (vertices) and obtaining result sets of vertex and edge properties.
I believe you're asking about the difference between GraphX or Giraph and Titan. To be more specific: why should you use a graph processing system when you already have your data in a graph database?
So it is essentially the difference between a graph database and a graph processing system.
A graph database is the right choice when your application frequently queries the data. E.g. for a Facebook-like application: given a user, return all of his/her friends. This is a good fit for a graph database, and you can use Gremlin to query it.
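To make the "query" case concrete, here is a minimal sketch in plain Python (not Gremlin and not the Titan API). The FRIENDS dictionary and the user names are made up for illustration. The point is only that such a query touches the neighbourhood of one vertex, not the whole graph.

```python
# Toy in-memory graph. The user names and the adjacency dict are
# hypothetical; a real application would ask the graph database instead.
FRIENDS = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}


def friends_of(graph, user):
    """Return the direct friends of one user (a one-hop traversal)."""
    return set(graph.get(user, ()))


def friends_of_friends(graph, user):
    """Return users exactly two hops away, excluding the user and direct friends."""
    direct = friends_of(graph, user)
    two_hops = set()
    for friend in direct:
        two_hops |= friends_of(graph, friend)
    return two_hops - direct - {user}


if __name__ == "__main__":
    print(sorted(friends_of(FRIENDS, "alice")))
    print(sorted(friends_of_friends(FRIENDS, "alice")))
```

Both functions only read the adjacency lists of a handful of vertices, which is the access pattern a graph database is built for.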
Now, if you want to compute the rank of each user in Facebook, you need to run the PageRank algorithm over the whole graph. In other words, the PageRank algorithm processes your whole graph and returns a map from each user to their rank. This is a suitable application for a graph processing system. Yes, you can write queries using the Gremlin framework to do this, but 1. it won't be as user-friendly as the underlying Pregel model used by Giraph or GraphX, and 2. it won't be efficient.
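For comparison, here is a rough single-machine sketch of PageRank written in the vertex-centric, superstep style that the Pregel model uses. It is plain Python, not the Giraph or GraphX API. The function name, the damping factor of 0.85, the fixed number of supersteps and the uniform handling of vertices without outgoing edges are all assumptions made for this illustration.

```python
def pagerank_supersteps(out_edges, num_supersteps=20, damping=0.85):
    """Pregel-style PageRank on an adjacency dict {vertex: [targets]}.

    In every superstep each vertex sends rank / out_degree to its
    neighbours, then recomputes its own rank from the messages received.
    """
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)
    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}

    for _ in range(num_supersteps):
        inbox = {v: 0.0 for v in vertices}
        dangling = 0.0
        # "Send" phase: every vertex messages all of its out-neighbours.
        for v in vertices:
            targets = out_edges.get(v, [])
            if targets:
                share = ranks[v] / len(targets)
                for t in targets:
                    inbox[t] += share
            else:
                # Assumption: rank of vertices without out-edges is
                # spread evenly over all vertices.
                dangling += ranks[v]
        # "Compute" phase: every vertex updates its rank from its inbox.
        ranks = {
            v: (1.0 - damping) / n + damping * (inbox[v] + dangling / n)
            for v in vertices
        }
    return ranks


if __name__ == "__main__":
    follows = {"a": ["hub"], "b": ["hub"], "c": ["hub"]}
    for user, rank in sorted(pagerank_supersteps(follows).items()):
        print(user, round(rank, 4))
```

Note that every superstep visits every vertex and every edge. That is the whole-graph access pattern that systems like Giraph and GraphX distribute across machines, and it is very different from a query that starts at one user.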
To summarize, it really depends on your application. If you think your application is query-like, don't bother loading and unloading your data into any graph processing system. If you think your application is more like PageRank (which requires processing the whole graph) and you have a large graph (at least 1M edges), go for Giraph or GraphX.
Giraph and GraphX each have their own graph input formats. You can dump your data into such a format in a file and feed it into one of these systems, or you can write your own input format.
P.S. It would be good to have an input format added to Giraph and GraphX that accepts data stored in Titan.
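As a sketch of the "dump to a file" route, the Python below turns a list of (source, target, weight) edges into two text layouts: a whitespace-separated edge list, which is what GraphX's GraphLoader.edgeListFile reads, and one JSON array per vertex of the form [id, value, [[target, weight], ...]], which is the layout read by Giraph's JsonLongDoubleFloatDoubleVertexInputFormat. How the edges are exported from Titan is not shown. The function names, the sample edges and the default vertex value of 0.0 are assumptions for illustration, and vertex ids are assumed to be integers.

```python
import json


def to_graphx_edge_list(edges):
    """One 'src dst' pair per line; edge weights are dropped."""
    return [f"{src} {dst}" for src, dst, _weight in edges]


def to_giraph_json_lines(edges, default_vertex_value=0.0):
    """One JSON array per vertex: [id, value, [[target, weight], ...]]."""
    adjacency = {}
    for src, dst, weight in edges:
        adjacency.setdefault(src, []).append([dst, weight])
        # Make sure vertices with only incoming edges get a line too.
        adjacency.setdefault(dst, [])
    return [
        json.dumps([vertex, default_vertex_value, adjacency[vertex]])
        for vertex in sorted(adjacency)
    ]


def write_lines(path, lines):
    """Write the lines to a text file, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Hypothetical sample edges: (source id, target id, weight).
    sample_edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 2.0)]
    write_lines("edges_graphx.txt", to_graphx_edge_list(sample_edges))
    write_lines("vertices_giraph.txt", to_giraph_json_lines(sample_edges))
```

For a very large graph you would stream the export instead of building the adjacency dict in memory; the sketch only shows the target layouts.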