Please refer to the following questions already asked: Write 100 million files to s3 and Too many open files in EMR
The size of data being handled here is atleast around 4-5TB. To be precise - 300GB with gzip compression.
The size of input will grow gradually as this step aggregates the data over time.
For example, the logs till December 2012 will contain:
UDID-1, DateTime, Lat, Lng, Location
UDID-2, DateTime, Lat, Lng, Location
UDID-3, DateTime, Lat, Lng, Location
UDID-1, DateTime, Lat, Lng, Location
For this we would have to generate separate files with UDID(Unique device identifier) as filenames and records belonging to that UDID in the file in sorted order.
Ex:
UDID-1.dat => File Contents
DateTime1, Lat1, Lng1, Location1
DateTime2, Lat2, Lng2, Location2
DateTime3, Lat3, Lng3, Location3
Now when we have the logs for the month of Jan, 2013, this step will read both the old data, the files already generated for the older months by this step, and the newer logs to aggregate the data of UDIDs.
Ex:
If the logs for month of Jan has a record as: UDID-1, DateTime4, Lat4, Lng4, Location4, the file UDID-1.dat would need to be updated with this data. Each UDID's file should be chronologically sorted.
For this step, we thought of writing the data to an EBS volume and keep it as-is for later use. But EBS volumes have a limit of 1TB. As already mentioned in the referenced questions, generating the files on s3 directly or generating on HDFS and then moving to s3 is not a viable option for this use case as there are around 100 million small files which needs to be moved. And moving such large number of files is way too slow even by using s3distcp.
So, next we are going to try s3fs - FUSE-based file system backed by Amazon S3. Does anybody have any idea that how scalable is s3fs? Will it be able to handle 100 million small files? How much time will it take to move 3-5TB of data, spread across 100 million files, from s3 to local filesystem so that it can be used by the MR job? And how much time will it take to move the data back to s3? Will it have the same problem as was faced while using s3distcp?
Thanks in advance !