Listing directories at a given level in Amazon S3

Posted 2020-02-29 04:26

Question:

I am storing two million files in an Amazon S3 bucket. There is a given root (l1), a list of directories under l1, and each directory contains files. My bucket looks something like this:

l1/a1/file1-1.jpg
l1/a1/file1-2.jpg
l1/a1/... another 500 files
l1/a2/file2-1.jpg
l1/a2/file2-2.jpg
l1/a2/... another 500 files
....

l1/a5000/file5000-1.jpg

I would like to list the second-level entries as fast as possible, so I would like to get a1, a2, ..., a5000. I do not want to list all the keys, as that would take much longer.

I am open to using the AWS API directly; so far, however, I have been playing with the right_aws gem in Ruby: http://rdoc.info/projects/rightscale/right_aws

There are at least two APIs in that gem. I tried using bucket.keys() in the S3 module and incrementally_list_bucket() in the S3Interface module. I can set the prefix and delimiter to list all of l1/a1/*, for example, but I cannot figure out how to list just the first level in l1. There is a :common_prefixes entry in the hash returned by incrementally_list_bucket(), but in my test sample it is not filled in.

Is this operation possible with the S3 API?

Thanks!

Answer 1:

right_aws lets you do this through its underlying S3Interface class, but you can create your own method to make it easier (and nicer) to use. Put this at the top of your code:

module RightAws
  class S3
    class Bucket
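      # Returns the common prefixes ("directories") directly under prefix.
      def common_prefixes(prefix, delimiter = '/')
        common_prefixes = []
        # The block is called once per page of results returned by S3.
        @s3.interface.incrementally_list_bucket(@name, { 'prefix' => prefix, 'delimiter' => delimiter }) do |thislist|
          common_prefixes += thislist[:common_prefixes]
        end
        common_prefixes
      end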
    end
  end
end

This adds the common_prefixes method to the RightAws::S3::Bucket class. Now, instead of calling mybucket.keys to fetch the list of keys in your bucket, you can use mybucket.common_prefixes to get an array of common prefixes. In your case:

mybucket.common_prefixes("l1/")
# => ["l1/a1/", "l1/a2/", ... "l1/a5000/"]
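
To see what the prefix and delimiter actually do, here is a minimal pure-Ruby sketch that mimics the roll-up locally on an in-memory list of keys. The helper name rollup_common_prefixes is made up for illustration and is not part of right_aws or the S3 API. It only models the behaviour that each key matching the prefix is cut at the first delimiter after the prefix, and that the distinct results come back as common prefixes (with the trailing delimiter included).

```ruby
# Hypothetical helper (not part of any gem): mimics how S3 groups keys
# into common prefixes for a given prefix and delimiter.
def rollup_common_prefixes(keys, prefix, delimiter = '/')
  keys.each_with_object([]) do |key, prefixes|
    next unless key.start_with?(prefix)

    # Look for the first delimiter that appears after the prefix.
    cut = key.index(delimiter, prefix.length)
    next if cut.nil? # no delimiter left: this would be listed as a plain key

    common = key[0, cut + delimiter.length]
    prefixes << common unless prefixes.include?(common)
  end
end

keys = [
  'l1/a1/file1-1.jpg',
  'l1/a1/file1-2.jpg',
  'l1/a2/file2-1.jpg',
  'l1/a5000/file5000-1.jpg',
  'other/b1/file.jpg'
]

p rollup_common_prefixes(keys, 'l1/')
# prints ["l1/a1/", "l1/a2/", "l1/a5000/"]
```

S3 does this grouping on the server, which is why asking for common prefixes is much cheaper than fetching all two million keys and grouping them yourself.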

I must say I have only tested it with a small number of common prefixes; you should check that it works with more than 1000 common prefixes.
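
On the 1000 figure: S3 returns at most 1000 entries per listing request, so 5000 directories will span several pages. Below is a sketch of the paging loop, assuming a client object that behaves like Aws::S3::Client from the official aws-sdk-s3 gem (a list_objects_v2 call taking :bucket, :prefix, :delimiter and :continuation_token, and a response exposing common_prefixes, is_truncated and next_continuation_token). The method name list_directories is made up for this example, and the sketch has not been run against a real bucket or against right_aws.

```ruby
# Sketch: collect every common prefix under `prefix`, following pagination.
# `client` is assumed to respond to list_objects_v2 the way
# Aws::S3::Client (aws-sdk-s3 gem) does; list_directories is a made-up name.
def list_directories(client, bucket, prefix, delimiter = '/')
  directories = []
  token = nil

  loop do
    params = { bucket: bucket, prefix: prefix, delimiter: delimiter }
    params[:continuation_token] = token if token

    page = client.list_objects_v2(params)
    directories.concat(page.common_prefixes.map(&:prefix))

    # Stop once S3 reports that there are no more pages.
    break unless page.is_truncated

    token = page.next_continuation_token
  end

  directories
end
```

With the real gem this would be called as list_directories(Aws::S3::Client.new, 'mybucket', 'l1/'), where mybucket is a placeholder bucket name.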



Answer 2:

This thread is quite old, but I ran into this issue recently and wanted to add my two cents...

It seems to be a hassle and a half to cleanly list the folders under a given path in an S3 bucket. Most of the current gem wrappers around the S3 API (the official AWS-SDK, S3) don't correctly parse the returned object (specifically the CommonPrefixes), so it is difficult to get back a list of folders (delimiter nightmares).

Here is a quick fix for those using the S3 gem... Sorry it isn't one-size-fits-all, but it's the best I was willing to do.

https://github.com/qoobaa/s3/issues/61

Code snippet:

module S3
  class Bucket
    # this method recurses if the response coming back
    # from S3 includes a truncation flag (IsTruncated == 'true')
    # then parses the combined response(s) XML body
    # for CommonPrefixes/Prefix AKA directories
    def directory_list(options = {}, responses = [])
      options = {:delimiter => "/"}.merge(options)
      response = bucket_request(:get, :params => options)

      if is_truncated?(response.body)
        directory_list(options.merge({:marker => next_marker(response.body)}), responses << response.body)
      else
        parse_xml_array(responses + [response.body], options)
      end
    end

    private

    def parse_xml_array(xml_array, options = {}, clean_path = true)
      names = []
      xml_array.each do |xml|
        rexml_document(xml).elements.each("ListBucketResult/CommonPrefixes/Prefix") do |e|
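          if clean_path
            # strip only the leading prefix and the trailing delimiter,
            # not every occurrence of them inside the name
            names << e.text.delete_prefix(options[:prefix].to_s).delete_suffix(options[:delimiter].to_s)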
          else
            names << e.text
          end
        end
      end
      names
    end

    def next_marker(xml)
      marker = nil
      rexml_document(xml).elements.each("ListBucketResult/NextMarker") {|e| marker ||= e.text }
      if marker.nil?
        raise StandardError, 'response is truncated but contains no NextMarker'
      else
        marker
      end
    end

    def is_truncated?(xml)
      is_truncated = nil
      rexml_document(xml).elements.each("ListBucketResult/IsTruncated") {|e| is_truncated ||= e.text }
      is_truncated == 'true'
    end
  end
end
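
For anyone who wants to try the parsing step without the S3 gem, here is a standalone sketch using REXML directly. The XML is a trimmed, hand-written sample of a listing response (the real one carries an xmlns attribute and more fields, both left out here), and directory_names is a made-up helper name, not something the gem provides.

```ruby
require 'rexml/document'

# Trimmed, hand-written sample of a bucket listing response.
# Assumption: the xmlns attribute and other fields are omitted for brevity.
SAMPLE_LISTING = <<~XML
  <ListBucketResult>
    <Name>mybucket</Name>
    <Prefix>l1/</Prefix>
    <Delimiter>/</Delimiter>
    <IsTruncated>false</IsTruncated>
    <CommonPrefixes><Prefix>l1/a1/</Prefix></CommonPrefixes>
    <CommonPrefixes><Prefix>l1/a2/</Prefix></CommonPrefixes>
  </ListBucketResult>
XML

# Hypothetical standalone helper: pull directory names out of one response
# body, removing the leading prefix and the trailing delimiter.
def directory_names(xml, prefix: '', delimiter: '/')
  doc = REXML::Document.new(xml)
  names = []
  doc.elements.each('ListBucketResult/CommonPrefixes/Prefix') do |e|
    names << e.text.delete_prefix(prefix).delete_suffix(delimiter)
  end
  names
end

p directory_names(SAMPLE_LISTING, prefix: 'l1/')
# prints ["a1", "a2"]
```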