Extract RAR files from Google Cloud Storage

Published 2019-06-25 10:30

Question:

I have a large CSV file that was compressed into a multipart RAR archive (100GB uncompressed, 20GB compressed). The 100 RAR parts have been uploaded to Google Cloud Storage, and I need to extract the archive into Google Cloud Storage as well. Ideally I would do this with Python on GAE. Any ideas? I don't want to download, extract, and upload; I want to do it all in the cloud.

Answer 1:

There's no way to directly decompress/extract your RAR file in the cloud. Are you aware of the gsutil -m (multithreading/multiprocessing) option? It speeds up transfers by running them in parallel. I'd suggest this sequence:

  • download compressed archive file
  • unpack locally
  • upload unpacked files in parallel using gsutil -m cp file-pattern dest-bucket
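The three steps above can be sketched as a small Python helper that only builds and prints the commands, without running anything. The bucket names, directory names, and the part file name `data.part001.rar` are hypothetical placeholders, and the sketch assumes the `unrar` command-line tool is installed and that pointing it at the first volume lets it pick up the remaining parts.

```python
import shlex


def build_commands(src_pattern, work_dir, first_part, out_dir, dest_bucket):
    """Return the download/unpack/upload commands as argument lists.

    Nothing is executed here; the caller decides how to run them.
    """
    return [
        # 1. Download all archive parts in parallel.
        ["gsutil", "-m", "cp", src_pattern, work_dir],
        # 2. Unpack locally, starting from the first volume (assumed name).
        ["unrar", "x", f"{work_dir}/{first_part}", f"{out_dir}/"],
        # 3. Upload the unpacked files in parallel.
        ["gsutil", "-m", "cp", f"{out_dir}/*", dest_bucket],
    ]


# All names below are hypothetical placeholders.
commands = build_commands(
    src_pattern="gs://example-source-bucket/data.part*.rar",
    work_dir="rar_parts",
    first_part="data.part001.rar",
    out_dir="extracted",
    dest_bucket="gs://example-dest-bucket/",
)

for cmd in commands:
    print(shlex.join(cmd))
```

Building the commands as lists keeps them easy to inspect before handing each one to something like `subprocess.run(cmd, check=True)`.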

Unless you have a very slow internet connection, downloading 20GB should not take very long (well under an hour, I'd expect), and the same goes for the parallel upload (though that depends on how much parallelism you get, which in turn depends on the size of the archive files).

Btw, you can tune the parallelism used by gsutil -m via the parallel_thread_count and parallel_process_count variables in your $HOME/.boto file.
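As a rough illustration of that tuning, the following sketch renders a `[GSUtil]` section that could be merged by hand into `$HOME/.boto`. The values 4 and 8 are arbitrary examples, not recommendations, and the helper name `render_gsutil_section` is made up for this example.

```python
import configparser
import io


def render_gsutil_section(process_count, thread_count):
    """Render a [GSUtil] config section with the two parallelism settings."""
    config = configparser.ConfigParser()
    config["GSUtil"] = {
        "parallel_process_count": str(process_count),
        "parallel_thread_count": str(thread_count),
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Example values only; suitable numbers depend on the machine and network.
boto_fragment = render_gsutil_section(process_count=4, thread_count=8)
print(boto_fragment)
```

The sketch deliberately prints the fragment instead of writing to `.boto`, so an existing configuration file is never overwritten.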



Answer 2:

This question was already answered (and accepted), but for future similar use cases, I would recommend doing this entirely in the cloud by spinning up a tiny Linux instance on GCE, e.g., f1-micro, and then running the steps as suggested by Marc Cohen in his answer. The instances come with gsutil preinstalled so it's easy to use. When you're done, just shut down and delete your micro-instance, as your resulting file was already stored in Google Cloud Storage.

Step-by-step instructions:

  1. Create a Google Compute Engine VM instance
  2. SSH to the instance
  3. Follow the instructions in the other answer
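A minimal sketch of the VM lifecycle, again only building the `gcloud` commands as argument lists. The instance name, zone, and disk size are hypothetical; the boot disk is assumed to need enough room for both the 20GB of parts and the 100GB of extracted data, and the unpacking tool may need to be installed on the VM first.

```python
import shlex


def build_vm_commands(name, zone, machine_type, disk_size_gb):
    """Return create/ssh/delete gcloud commands for a short-lived VM."""
    return {
        "create": [
            "gcloud", "compute", "instances", "create", name,
            f"--zone={zone}",
            f"--machine-type={machine_type}",
            # Assumed size: must hold the archive parts plus the extracted CSV.
            f"--boot-disk-size={disk_size_gb}GB",
        ],
        "ssh": ["gcloud", "compute", "ssh", name, f"--zone={zone}"],
        "delete": [
            "gcloud", "compute", "instances", "delete", name,
            f"--zone={zone}",
        ],
    }


# Hypothetical name, zone, and disk size; f1-micro is the type from the answer.
vm_commands = build_vm_commands(
    name="rar-extract-vm",
    zone="us-central1-a",
    machine_type="f1-micro",
    disk_size_gb=200,
)

for step in ("create", "ssh", "delete"):
    print(shlex.join(vm_commands[step]))
```

Running the `delete` command once the upload has finished is what keeps the cost limited to the minutes actually used.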

The benefit here is that instead of downloading to your own computer, you're transferring all the data within Google Cloud itself, so the transfers should be very fast, and do not depend on your own Internet connection speed or consume any of your bandwidth.


Note: network bandwidth is proportional to the size of the VM (in vCPUs), so for faster transfers, consider creating a larger VM. At the time of writing, Google Compute Engine bills VM instances as follows:

  1. a minimum charge of 10 minutes
  2. beyond that, usage rounded up to the nearest minute

So, for example, given that an n1-standard-1 costs USD $0.05 / hr (as of 8 Oct 2016), 15 minutes of usage will cost USD $0.0125 in total.
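The arithmetic behind that example can be written out as a short function. This is only a sketch of the two billing rules quoted in this answer (10-minute minimum, rounding up to whole minutes); the function name `billed_cost` is made up for illustration.

```python
import math


def billed_cost(usage_minutes, hourly_rate_usd, minimum_minutes=10):
    """Estimate the charge under the billing rules described above."""
    # Round partial minutes up, then apply the minimum charge.
    billed_minutes = max(minimum_minutes, math.ceil(usage_minutes))
    return billed_minutes * hourly_rate_usd / 60


# 15 minutes on an instance priced at USD $0.05 per hour.
cost = billed_cost(usage_minutes=15, hourly_rate_usd=0.05)
print(f"{cost:.4f}")  # prints 0.0125
```

A run shorter than 10 minutes is still charged for 10 minutes under these rules, so `billed_cost(3, 0.05)` gives the same result as `billed_cost(10, 0.05)`.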