Reading Huge MongoDB collection from Spark with he

2019-02-19 09:40发布

I want to read a huge MongoDB collection from Spark create an persistent RDD and do further data analysis on it.

Is there any way I can read the data from MongoDB faster. Have tried with the approach of MongoDB Java + Casbah

Can I use the worker/slave to read data in parallel from MongoDB and then save it as persistent data and use it.

标签： mongodb scala apache-spark casbah

1条回答

神经病院院长

2楼-- · 2019-02-19 10:06

There are two ways of getting the data from MongoDB to Apache Spark.

Method 1: Using Casbah (Layer on MongDB Java Driver)

val uriRemote = MongoClientURI("mongodb://RemoteURL:27017/")
val mongoClientRemote =  MongoClient(uriRemote)
val dbRemote = mongoClientRemote("dbName")
val collectionRemote = dbRemote("collectionName")
val ipMongo = collectionRemote.find
val ipRDD = sc.makeRDD(ipMongo.toList)
ipRDD.saveAsTextFile("hdfs://path/to/hdfs")

Over here we are using Scala and Casbah to get the data first and then save it to HDFS.

Method 2: Spark Worker at our use

Better version of code: Using Spark worker and multiple core to use to get the data in short time.

val config = new Configuration()
config.set("mongo.job.input.format","com.mongodb.hadoop.MongoInputFormat")
config.set("mongo.input.uri", "mongodb://RemoteURL:27017/dbName.collectionName")
val keyClassName = classOf[Object]
val valueClassName = classOf[BSONObject]
val inputFormatClassName = classOf[com.mongodb.hadoop.MongoInputFormat]
val ipRDD = sc.newAPIHadoopRDD(config,inputFormatClassName,keyClassName,valueClassName)
ipRDD.saveAsTextFile("hdfs://path/to/hdfs")

0人赞添加讨论(0) 举报

Reading Huge MongoDB collection from Spark with he

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间