I have a multi-node Giraph cluster working properly on my PC. I ran the SimpleShortestPathExample from Giraph and it executed fine.
The algorithm was run with this file (tiny_graph.txt):
[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
This file has the following input format:
[source_id,source_value,[[dest_id, edge_value],...]]
Now I'm trying to run the same algorithm on the same cluster, but with a different input file. My own file looks like this:
[Portada,0,[[Sugerencias para la cita del día,1]]]
[Proverbios españoles,0,[]]
[Neil Armstrong,0,[[Luna,1][ideal,1][verdad,1][Categoria:Ingenieros,2,[Categoria:Estadounidenses,2][Categoria:Astronautas,2]]]
[Categoria:Ingenieros,1,[[Neil Armstrong,2]]]
[Categoria:Estadounidenses,1,[[Neil Armstrong,2]]]
[Categoria:Astronautas,1,[[Neil Armstrong,2]]]
It's very similar to the original, but the IDs are strings and the vertex and edge values are longs. My question is which TextInputFormat I should use for this. I have already tried org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat and org.apache.giraph.io.formats.TextDoubleDoubleAdjacencyListVertexInputFormat, and I couldn't get either of them working.
Once this is solved, I can adapt the original shortest path example to work with my file, but I can't get to that point until I have a working input format.
If this file format is not a good choice, I could change it, but I don't know what my best option is. My knowledge of text input and output formats in Giraph is very limited, which is why I'm here asking for advice.
I solved this by adapting my own file to fit org.apache.giraph.io.formats.TextDoubleDoubleAdjacencyListVertexInputFormat. My original file should be like this:
The spaces between the fields are tab characters ('\t'), because this format uses the tab as its default token for splitting each input line into several strings.
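As a rough illustration of the tab-separated layout described above, here is a minimal Java sketch that builds one such line. It assumes each line is the vertex id, the vertex value, and then alternating neighbor id and edge value, all separated by tabs; that is my reading of the adjacency-list format, so check it against your Giraph version. The class and method names (AdjacencyLineSketch, toLine) are made up for this example and are not part of Giraph.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AdjacencyLineSketch {

    /**
     * Builds one tab-separated line: vertex id, vertex value, then
     * (neighbor id, edge value) pairs. Assumes ids contain no tab characters.
     */
    public static String toLine(String id, long value, Map<String, Long> edges) {
        StringBuilder sb = new StringBuilder();
        sb.append(id).append('\t').append(value);
        for (Map.Entry<String, Long> e : edges.entrySet()) {
            sb.append('\t').append(e.getKey()).append('\t').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the neighbors in insertion order.
        Map<String, Long> edges = new LinkedHashMap<>();
        edges.put("Neil Armstrong", 2L);
        System.out.println(toLine("Categoria:Ingenieros", 1L, edges));
    }
}
```

The values are written as plain integers here; since the format name suggests it reads vertex and edge values as doubles, they would presumably be parsed as 1.0, 2.0 and so on.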
Thanks @masoud-sagharichian for your help anyway!! :D
It's better to write your own input format. I suggest using the hash codes of your strings. I wrote sample code in which each line consists of: [vertex_id (integer, e.g. the hash code of your string), vertex_val (long), [[neighbor_id (integer), neighbor_val (long)], ...]]
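The hash-code idea can be sketched roughly as follows. This is not the sample code mentioned above and not a Giraph input format; it is only a hypothetical preprocessing helper (HashedIdLineSketch, idFor and toLine are names invented here) that turns string ids into integers with String.hashCode() and prints a line in the bracketed layout. Keep in mind that hash codes can be negative and that two different strings can collide, so a real pipeline would probably want a collision check or an explicit id dictionary.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HashedIdLineSketch {

    /** Maps a string id to an integer id. Assumption: collisions are acceptable or checked elsewhere. */
    public static int idFor(String name) {
        return name.hashCode();
    }

    /** Builds "[vertex_id,vertex_val,[[neighbor_id,edge_val],...]]" using hashed ids. */
    public static String toLine(String id, long value, Map<String, Long> edges) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, Long> e : edges.entrySet()) {
            pairs.add("[" + idFor(e.getKey()) + "," + e.getValue() + "]");
        }
        return "[" + idFor(id) + "," + value + ",[" + String.join(",", pairs) + "]]";
    }

    public static void main(String[] args) {
        Map<String, Long> edges = new LinkedHashMap<>();
        edges.put("Neil Armstrong", 2L);
        System.out.println(toLine("Categoria:Astronautas", 1L, edges));
    }
}
```

If you need the original names back in the output, you would also have to store the mapping from integer id to string somewhere, since a hash code cannot be reversed.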