I'm having an issue where the output of my Hadoop job on AWS EMR is not being saved to S3. When I run the job on a smaller sample, the output is stored just fine. When I run the same command on my full dataset, the job again completes, but nothing exists at the S3 location I specified for the output.
Apparently there was a bug with AWS EMR in 2009, but it was "fixed".
Anyone else ever have this problem? I still have my cluster online, hoping that the data is buried on the servers somewhere. If anyone has an idea where I can find this data, please let me know!
Update: When I look at the logs from one of the reducers, everything looks fine:
2012-06-23 11:09:04,437 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Creating new file 's3://myS3Bucket/output/myOutputDirFinal/part-00000' in S3
2012-06-23 11:09:04,439 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Outputstream for key 'output/myOutputDirFinal/part-00000' writing to tempfile '/mnt1/var/lib/hadoop/s3/output-3834156726628058755.tmp'
2012-06-23 11:50:26,706 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Outputstream for key 'output/myOutputDirFinal/part-00000' is being closed, beginning upload.
2012-06-23 11:50:26,958 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Outputstream for key 'output/myOutputDirFinal/part-00000' upload complete
2012-06-23 11:50:27,328 INFO org.apache.hadoop.mapred.Task (main): Task:attempt_201206230638_0001_r_000000_0 is done. And is in the process of commiting
2012-06-23 11:50:29,927 INFO org.apache.hadoop.mapred.Task (main): Task 'attempt_201206230638_0001_r_000000_0' done.
When I connect to this task's node, the temp directory mentioned is empty.
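One way to sanity-check the log is to measure the gap between the "beginning upload" and "upload complete" messages for each key. Below is a minimal sketch, assuming the log format shown in the excerpt above; `upload_durations` is a hypothetical helper name, not part of Hadoop or EMR.

```python
import re
from datetime import datetime

# Matches the two NativeS3FileSystem upload messages in the format of the
# excerpt above. Assumption: other EMR/Hadoop versions may word these differently.
_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"Outputstream for key '(?P<key>[^']+)' "
    r"(?P<event>is being closed, beginning upload|upload complete)"
)


def upload_durations(log_lines):
    """Return {key: seconds between 'beginning upload' and 'upload complete'}."""
    started = {}
    durations = {}
    for line in log_lines:
        m = _LINE.match(line)
        if not m:
            continue  # not an upload start/finish message
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        key = m.group("key")
        if m.group("event").startswith("is being closed"):
            started[key] = ts
        elif key in started:
            durations[key] = (ts - started.pop(key)).total_seconds()
    return durations
```

Run against the excerpt above, this reports about 0.25 seconds between the start and the reported completion of the upload for part-00000.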
Update 2: After reading "Difference between Amazon S3 and S3n in Hadoop", I'm wondering if my problem is using "s3://" instead of "s3n://" as my output path. In both my small sample (which stored fine) and my full job, I used "s3://". Any thoughts on whether this could be my problem?
Update 3: I see now that on AWS's EMR, s3:// and s3n:// both map to the S3 native file system (AWS EMR documentation).
Update 4: I re-ran this job two more times, each time increasing the number of servers and reducers. The first of these two runs finished with 89 of 90 reducer outputs copied to S3. According to the logs, the 90th was copied successfully, but AWS Support says the file is not there. They've escalated this problem to their engineering team. My second run, with even more reducers and servers, actually finished with all data copied to S3 (thankfully!). One oddity, though, is that some reducers take FOREVER to copy their data to S3: in both of these new runs, there was a reducer whose output took 1 or 2 hours to copy to S3, whereas the other reducers took 10 minutes at most (files are 3GB or so). I think this relates to something wrong with the NativeS3FileSystem used by EMR (e.g. the long hangs, which I'm getting billed for of course, and the allegedly successful uploads that never show up). I'd write to local HDFS first and then copy to S3, but I was having issues on this front as well (pending the AWS engineering team's review).
TL;DR: Using AWS EMR to store output directly on S3 seems buggy; their engineering team is looking into it.
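Since the logs can claim success for a file that is not actually there, it helps to compare the keys really present under the output prefix against the part files the job should have produced. Here is a rough sketch, assuming the `part-NNNNN` naming seen in the log above (jobs written against the newer MapReduce API may name files differently) and assuming you have already obtained the list of keys from whatever S3 listing tool you use; `missing_part_files` is a hypothetical helper.

```python
def missing_part_files(listed_keys, output_prefix, num_reducers):
    """Return the expected part-NNNNN keys that are absent from listed_keys.

    listed_keys:   keys actually found under the output prefix on S3
    output_prefix: e.g. "output/myOutputDirFinal"
    num_reducers:  number of reducers the job ran with
    """
    prefix = output_prefix.rstrip("/") + "/"
    # One output file per reducer, numbered from 0 (assumed naming scheme).
    expected = [f"{prefix}part-{i:05d}" for i in range(num_reducers)]
    present = set(listed_keys)
    return [key for key in expected if key not in present]
```

For the run described above, calling this with `num_reducers=90` would be expected to return the single key that AWS Support could not find.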
This turned out to be a bug on AWS's part, and they've fixed it in the latest AMI version 2.2.1, briefly described in these release notes.
The long explanation I got from AWS is that when a reducer's output file exceeds the block limit for S3 (i.e. 5GB?), a multipart upload is used, but proper error checking was missing, which is why it would sometimes work and other times not.
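To illustrate what per-part error checking in a multipart upload could look like, here is a minimal sketch. It is not AWS's fix and not the EMR code: `upload_part` is a hypothetical callable supplied by the caller that sends one chunk and returns the hex MD5 the store reports back, and the retry count is an arbitrary assumption.

```python
import hashlib


class UploadError(Exception):
    """Raised when a part cannot be uploaded and verified."""


def split_parts(data, part_size):
    """Split bytes into consecutive chunks of at most part_size bytes."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def upload_with_checks(data, part_size, upload_part, max_attempts=3):
    """Upload every part, verifying the checksum reported for each one.

    upload_part(part_number, chunk) is hypothetical: it should return the
    hex MD5 reported by the store, or raise OSError on a transport failure.
    """
    confirmed = []
    for number, chunk in enumerate(split_parts(data, part_size), start=1):
        expected = hashlib.md5(chunk).hexdigest()
        for _ in range(max_attempts):
            try:
                reported = upload_part(number, chunk)
            except OSError:
                continue  # transient failure: retry this part
            if reported == expected:
                confirmed.append((number, reported))
                break
        else:
            # No attempt succeeded: fail loudly instead of reporting success.
            raise UploadError(f"part {number} failed after {max_attempts} attempts")
    return confirmed
```

The point of the design is that the upload is only reported as complete once every part has been confirmed; a part that never verifies raises an error rather than being silently dropped.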
In case this continues for anyone else, refer to my case number, 62849531.