I want to read a huge MongoDB collection from Spark, create a persistent RDD, and do further data analysis on it.
Is there any way I can read the data from MongoDB faster?
I have tried the approach of using the MongoDB Java driver with Casbah.
Can I use the workers/slaves to read the data from MongoDB in parallel, and then save it as a persistent RDD and use it?
There are two ways of getting the data from MongoDB to Apache Spark.
Method 1:
Using Casbah (a layer on top of the MongoDB Java driver)
import com.mongodb.casbah.Imports._

// Connect to the remote MongoDB instance and select the collection
val uriRemote = MongoClientURI("mongodb://RemoteURL:27017/")
val mongoClientRemote = MongoClient(uriRemote)
val dbRemote = mongoClientRemote("dbName")
val collectionRemote = dbRemote("collectionName")

// Pull every document through the driver, parallelize into an RDD, and write to HDFS
val ipMongo = collectionRemote.find
val ipRDD = sc.makeRDD(ipMongo.toList)
ipRDD.saveAsTextFile("hdfs://path/to/hdfs")
Here we use Scala and Casbah to fetch the data first and then save it to HDFS. Note that the whole collection is materialized on the driver (ipMongo.toList) before it becomes an RDD, so this approach is single-threaded and limited by driver memory.
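If the goal is a persistent RDD for further analysis, one option is to read the saved file back and cache it. A minimal sketch, assuming the text file was written as above and sc is the same SparkContext:

import org.apache.spark.storage.StorageLevel

// Read the documents back from HDFS and keep them resident for repeated analysis
val docsRDD = sc.textFile("hdfs://path/to/hdfs")
docsRDD.persist(StorageLevel.MEMORY_AND_DISK)
println(docsRDD.count())   // the first action materializes the cache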
Method 2: Putting the Spark workers to use
A better version of the code: the Spark workers and multiple cores are used to fetch the data in parallel, so it takes much less time.
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

// Configure the mongo-hadoop connector; each worker reads its own input split in parallel
val config = new Configuration()
config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat")
config.set("mongo.input.uri", "mongodb://RemoteURL:27017/dbName.collectionName")

val keyClass = classOf[Object]
val valueClass = classOf[BSONObject]
val inputFormatClass = classOf[MongoInputFormat]

// RDD of (document _id, BSON document) pairs, read directly by the workers
val ipRDD = sc.newAPIHadoopRDD(config, inputFormatClass, keyClass, valueClass)
ipRDD.saveAsTextFile("hdfs://path/to/hdfs")
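Instead of (or in addition to) writing the data out to HDFS, the same RDD can be cached and analysed directly. A minimal sketch; "someField" is a hypothetical field name used only for illustration, and the rest assumes the ipRDD built above:

import org.apache.spark.storage.StorageLevel

// Keep the fetched documents resident across multiple actions
ipRDD.persist(StorageLevel.MEMORY_AND_DISK)

// Example analysis: pull one field out of each BSON document and count the distinct values
// "someField" is a placeholder; substitute a field that exists in your collection
val fieldRDD = ipRDD.map { case (_, doc) => doc.get("someField") }
println(fieldRDD.distinct().count())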