Transfer large MongoDB collections to data.frame in R

Published 2020-02-26 13:17

Question:

I get some strange results with huge collections when trying to transfer them as data frames from MongoDB to R with the rmongodb and plyr packages. I picked up this code from various GitHub repositories and forums on the subject and adapted it for my purposes:

## load both packages
library(rmongodb)
library(plyr)
## connect to MongoDB
mongo <- mongo.create(host="localhost")
# [1] TRUE
## get the list of the databases
mongo.get.databases(mongo)
# list of databases (with mydatabase)
## get the list of the collections of mydatabase
mongo.get.collections(mongo, db = "mydatabase")
# list of all the collections of my database
## Verify the size of mycollection
DBNS = "mydatabase.mycollection" # rmongodb namespaces take the form "database.collection"
mongo.count(mongo, ns = DBNS)
# [1] 845923 documents inside "mycollection"
## transform mycollection (in BSON MongoDB format) to a data frame (adapted for R)
export = data.frame(stringsAsFactors = FALSE)
cursor = mongo.find(mongo, DBNS)
i = 1
while(mongo.cursor.next(cursor))
{
  ## convert the current BSON document to a list, flatten it to a one-row data frame
  tmp = mongo.bson.to.list(mongo.cursor.value(cursor))
  tmp.df = as.data.frame(t(unlist(tmp)), stringsAsFactors = FALSE)
  export = rbind.fill(export, tmp.df)
  i = i + 1
}
## show the size of the database "export"
dim(export)
# [1] 20585 23
## check more information on the database "export"
str(export)
# 'data.frame': 20585 obs. of 23 variables
# etc…

The transfer does not complete correctly: there is a huge difference between the 845923 documents counted inside "mycollection" in MongoDB and the 20585 observations in R.

I also have doubts about the code above. I'm not sure that the i = 1 and i = i + 1 statements are useful here (they may come from example code that ran queries with rmongodb), since I attach no specific value to the counter. I also find the "t(unlist(tmp))" construct strange: where does the t come from?
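
As far as I understand it, unlist(tmp) flattens the nested BSON list into a named vector, and t() turns that vector into a 1-row matrix, so as.data.frame() yields a one-row data frame. If I test the idiom on a made-up document (the field names here are hypothetical), I can at least see what it does:

## hypothetical document, as mongo.bson.to.list() might return it
tmp = list(name = "Alice", address = list(city = "Paris", zip = "75001"))
unlist(tmp)
# named character vector: name, address.city, address.zip
t(unlist(tmp))
# 1-row matrix with those names as column names
as.data.frame(t(unlist(tmp)), stringsAsFactors = FALSE)
# a one-row data frame, ready for rbind.fill()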

The problem is that I encounter big differences between the collection size in MongoDB and the data frame size in R for large collections (more than several thousand documents). My PC has plenty of RAM, and R seems to work fine during the process (no freeze, no crash; it takes time, but that seems normal given the large conversion from BSON to list to data frame).

I have succeeded in transferring a MongoDB collection of 36100 documents from MongoDB to R for data analysis with no problem.

So I'm not sure where the problem is coming from.
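
One check I can do after the loop ends (a sketch reusing the objects defined above) is to compare the loop counter with the server-side count, to see where the rows get lost:

## sketch: where do the rows get lost?
mongo.cursor.destroy(cursor) # release the cursor
i - 1 # documents actually delivered by the cursor (i starts at 1)
mongo.count(mongo, ns = DBNS) # documents counted server-side
nrow(export) # rows that ended up in the data frame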

Thanks in advance for any help on this subject.

Answer 1:

I would say none of this is needed; you can proceed in a much simpler way. This requires the "rmongodb" package in R; make sure you have the latest version, since the function used below is not present in earlier releases. This package talks to MongoDB directly. There are other packages as well, such as "RMongo".

To install rmongodb in R:

install.packages("rmongodb")

To convert a large MongoDB collection into a data frame in R:

library(rmongodb)
mongo <- mongo.create() # create a connection to mongodb localhost
mongo.is.connected(mongo) # check whether mongodb is connected
mongo.get.databases(mongo) # shows all databases present in mongodb
mongo.get.database.collections(mongo, "mydb") # displays all collections present in database mydb
data <- mongo.find.all(mongo, "mydb.collection", data.frame = TRUE) # this suffices: it retrieves the whole collection and converts it into a data frame in R
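
If pulling everything in one call runs out of memory, a possible variant (just a sketch: the namespace "mydb.collection" and the chunk size are placeholders, and it relies on the skip and limit arguments of mongo.find.all) is to page through the collection and bind the chunks with plyr::rbind.fill:

library(rmongodb)
library(plyr) # for rbind.fill

chunk = 10000 # arbitrary chunk size
total = mongo.count(mongo, ns = "mydb.collection")
pieces = list()
for (skip in seq(0, total - 1, by = chunk)) {
  ## fetch one chunk of documents as a data frame
  pieces[[length(pieces) + 1]] = mongo.find.all(mongo, "mydb.collection",
                                                skip = skip, limit = chunk,
                                                data.frame = TRUE)
}
data = rbind.fill(pieces)
nrow(data) == total # TRUE if every document was transferred

Note that a server-side skip gets slower as the offset grows, so for very large collections paging on an indexed field is usually faster.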