S3AFileSystem - FileAlreadyExistsException when pr

Posted 2019-08-23 04:54

Question:

We are running Apache Spark jobs with aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.5.jar to write Parquet files to an S3 bucket.

We have the key 's3://mybucket/d1/d2/d3/d4/d5/d6/d7' in S3 (d7 being a text file). We also have keys such as 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180615/a.parquet' (a.parquet being a file).

When we run a Spark job to write a b.parquet file under 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/' (i.e. we would like 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/b.parquet' to be created in S3), we get the error below:

org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory for path 's3a://mybucket/d1/d2/d3/d4/d5/d6/d7' since it is a file.
at org.apache.hadoop.fs.s3a.S3AFileSystem.mkdirs(S3AFileSystem.java:861)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)

Answer 1:

As discussed in HADOOP-15542, you can't have files or directories under a file in a "normal" FS, and you don't get them in the S3A connector either, at least where it does enough due diligence.
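The check behind the error in the question can be modelled roughly as follows. This is a toy sketch and not the S3AFileSystem code: the bucket is a hypothetical in-memory set of keys, and `FileAlreadyExistsError`, `ancestors` and `mkdirs` are names made up for illustration. It only assumes that a mkdirs call walks up the parent paths and refuses if one of them is a file.

```python
class FileAlreadyExistsError(Exception):
    """Stand-in for org.apache.hadoop.fs.FileAlreadyExistsException."""


def ancestors(path):
    """Yield each parent of 'a/b/c' from nearest to furthest: 'a/b', then 'a'."""
    parts = path.strip("/").split("/")
    for end in range(len(parts) - 1, 0, -1):
        yield "/".join(parts[:end])


def mkdirs(file_keys, path):
    """Toy model of mkdirs: refuse if the path or any ancestor is a file.

    file_keys is a set of object keys that hold file data. It is an
    in-memory stand-in for the bucket; no S3 calls are made.
    """
    path = path.strip("/")
    if path in file_keys:
        raise FileAlreadyExistsError(f"Not a directory: '{path}'")
    for parent in ancestors(path):
        if parent in file_keys:
            raise FileAlreadyExistsError(
                f"Can't make directory for path '{parent}' since it is a file."
            )
    return True


bucket = {
    "d1/d2/d3/d4/d5/d6/d7",  # the text file from the question
    "d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180615/a.parquet",
}

try:
    mkdirs(bucket, "d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616")
except FileAlreadyExistsError as err:
    print(err)
```

In this model the walk passes d9 and d8 (neither is a file key) and stops at d7, which matches the path named in the stack trace.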

A file with children confuses every tree-walking algorithm: renames, deletes, anything that scans for files. That includes the Spark partitioning logic. The new directory tree you are trying to create would probably be invisible to callers. (You could test this by creating the tree first, then PUT-ing that text file into place, and seeing what happens.)

We try to define what an FS should do in the Hadoop Filesystem Specification, including things "so obvious" that nobody bothered to write them down or write tests for, such as:

  • Only directories can have children
  • All children must have a parent
  • Only files can have data (exception: ReiserFS)
  • Files are as long as they say they are (this is why S3A doesn't support client-side encryption, BTW).
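As a rough illustration of the first rule, here is a sketch that scans a flat listing of object keys for files that also act as a parent of other keys. The function name `files_with_children` is hypothetical, and the sketch assumes that keys ending in "/" are directory markers rather than files.

```python
def files_with_children(keys):
    """Return file keys that are also a path prefix of some other key."""
    # Assumption: keys ending in "/" are directory markers, not files.
    files = sorted({k for k in keys if not k.endswith("/")})
    offenders = []
    for key in files:
        prefix = key + "/"
        # A file "x" has children if any other key starts with "x/".
        if any(other.startswith(prefix) for other in files):
            offenders.append(key)
    return offenders


listing = [
    "d1/d2/d3/d4/d5/d6/d7",
    "d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180615/a.parquet",
]
print(files_with_children(listing))
```

The scan is quadratic in the number of keys, which is fine for a spot check on one prefix but not for a whole bucket.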

Every so often we discover some new thing we forgot to consider, which "real" filesystems enforce out of the box but which object stores don't. Then we add tests and try our best to maintain the metaphor, except when the performance impact would make it unusable; in that case we opt not to fix things and hope nobody notices. Generally, because people working with data in the Hadoop/Hive/Spark space have those same preconceptions of what a filesystem does, those ambiguities don't actually cause problems in production.

The exception, of course, is eventual consistency, which is why you shouldn't be writing data straight to S3 from Spark without a consistency service (S3Guard, consistent EMRFS) or a commit protocol designed for this world (S3A Committer, Databricks DBIO).
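For the S3A committer option, a minimal sketch of the Spark settings involved is below. It assumes Hadoop 3.1 or later and the spark-hadoop-cloud module on the classpath, so it does not apply to the hadoop-aws-2.7.5 setup in the question. The helper name `s3a_committer_conf` is hypothetical, and the choice of the "directory" committer is only an example.

```python
def s3a_committer_conf(committer="directory"):
    """Build Spark settings that route Parquet writes through an S3A committer.

    Assumption: Hadoop 3.1+ and the spark-hadoop-cloud classes are available.
    """
    return {
        "spark.hadoop.fs.s3a.committer.name": committer,
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }


# Print the settings in the form spark-submit accepts.
for key, value in s3a_committer_conf().items():
    print(f"--conf {key}={value}")
```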