s3fs on Amazon EMR: Will it scale for approx 100mi

2019-05-02 15:54发布

问题:

Please refer to the following questions already asked: Write 100 million files to s3 and Too many open files in EMR

The size of data being handled here is atleast around 4-5TB. To be precise - 300GB with gzip compression.

The size of input will grow gradually as this step aggregates the data over time.

For example, the logs till December 2012 will contain:

UDID-1, DateTime, Lat, Lng, Location
UDID-2, DateTime, Lat, Lng, Location
UDID-3, DateTime, Lat, Lng, Location
UDID-1, DateTime, Lat, Lng, Location

For this we would have to generate separate files with UDID(Unique device identifier) as filenames and records belonging to that UDID in the file in sorted order.

Ex:

UDID-1.dat => File Contents
DateTime1, Lat1, Lng1, Location1
DateTime2, Lat2, Lng2, Location2
DateTime3, Lat3, Lng3, Location3

Now when we have the logs for the month of Jan, 2013, this step will read both the old data, the files already generated for the older months by this step, and the newer logs to aggregate the data of UDIDs.

Ex:

If the logs for month of Jan has a record as: UDID-1, DateTime4, Lat4, Lng4, Location4, the file UDID-1.dat would need to be updated with this data. Each UDID's file should be chronologically sorted.

For this step, we thought of writing the data to an EBS volume and keep it as-is for later use. But EBS volumes have a limit of 1TB. As already mentioned in the referenced questions, generating the files on s3 directly or generating on HDFS and then moving to s3 is not a viable option for this use case as there are around 100 million small files which needs to be moved. And moving such large number of files is way too slow even by using s3distcp.

So, next we are going to try s3fs - FUSE-based file system backed by Amazon S3. Does anybody have any idea that how scalable is s3fs? Will it be able to handle 100 million small files? How much time will it take to move 3-5TB of data, spread across 100 million files, from s3 to local filesystem so that it can be used by the MR job? And how much time will it take to move the data back to s3? Will it have the same problem as was faced while using s3distcp?

Thanks in advance !

回答1:

I would recommend against using s3fs to copy large amounts of small files.

I've had tried on a few occasions to move large amounts of small files from HDFS and the s3fs daemon kept on crashing. I was using both cp and rsync. This is gets even more aggravating if you are doing incremental updates. One alternative is to use the use_cache option and see how it behaves.

We have resorted to using s3cmd and iterating through each one of the files say using the Unix find command. Something like this:

find <hdfs fuse mounted dir> -type f -exec s3cmd put {} s3://bucketname \;

You can also try the s3cmd sync with something like this:

s3cmd sync /<local-dir>/ s3://bucketname