I have a big CSV file that was compressed into a multipart archive with the RAR utility (100GB uncompressed, 20GB compressed), so I have 100 RAR file parts, which were uploaded to Google Cloud Storage. I need to extract it into Google Cloud Storage. It would be best if I could use Python on GAE. Any ideas? I don't want to download, extract, and re-upload; I want to do it all in the cloud.
There's no way to directly decompress/extract your RAR file in the cloud. Are you aware of the `gsutil -m` (multithreading/multiprocessing) option? It speeds up transfers by running them in parallel. I'd suggest this sequence:

1. Download the archive parts: `gsutil -m cp gs://src-bucket/file-pattern .`
2. Extract the archive locally.
3. Upload the extracted data: `gsutil -m cp file-pattern gs://dest-bucket`
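The sequence above can be sketched in Python by shelling out to `gsutil` and `unrar`. The bucket names, archive filename, and working directory below are placeholders, not details from the original question:

```python
import subprocess

def transfer_commands(src_bucket, dest_bucket, workdir="/tmp/rar-job"):
    """Build the three-step sequence: parallel download, extract, parallel upload.

    All names here are placeholders -- adjust the pattern and paths to your data.
    """
    return [
        # 1. Download all RAR parts in parallel (-m).
        ["gsutil", "-m", "cp", f"gs://{src_bucket}/archive.part*.rar", workdir],
        # 2. Extract locally; given the first volume, unrar follows the
        #    remaining .partNN volumes automatically.
        ["unrar", "x", f"{workdir}/archive.part001.rar", f"{workdir}/"],
        # 3. Upload the extracted CSV back, again in parallel.
        ["gsutil", "-m", "cp", f"{workdir}/*.csv", f"gs://{dest_bucket}/"],
    ]

def run_sequence(src_bucket, dest_bucket):
    for cmd in transfer_commands(src_bucket, dest_bucket):
        subprocess.run(cmd, check=True)  # stop on the first failure
```

Note this runs wherever `gsutil` and `unrar` are installed; it won't work inside the GAE sandbox itself.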
Unless you have a very slow internet connection, downloading 20GB should not take very long (well under an hour, I'd expect), and likewise for the parallel upload (though that's a function of how much parallelism you get, which in turn depends on the size of the archive files).
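As a back-of-the-envelope check on that estimate, assuming (hypothetically) a 100 Mbit/s connection:

```python
def transfer_minutes(size_gb, mbit_per_s):
    """Estimated transfer time in minutes for size_gb at a given link speed."""
    megabits = size_gb * 8 * 1000  # GB -> megabits (decimal units)
    return megabits / mbit_per_s / 60

# 20 GB compressed archive over a 100 Mbit/s link:
print(round(transfer_minutes(20, 100)))  # ~27 minutes
```

So even a modest home connection stays "well under an hour" for the 20GB download; the 100GB upload from a fast cloud VM is what benefits most from `-m`.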
Btw, you can tune the parallelism used by `gsutil -m` via the `parallel_thread_count` and `parallel_process_count` variables in your `$HOME/.boto` file.

---

This question was already answered (and accepted), but for future similar use cases, I would recommend doing this entirely in the cloud by spinning up a tiny Linux instance on GCE, e.g., an `f1-micro`
, and then running the steps as suggested by Marc Cohen in his answer. The instances come with `gsutil` preinstalled, so it's easy to use. When you're done, just shut down and delete your micro-instance, since the resulting file is already stored in Google Cloud Storage.

Step-by-step instructions:

1. Create a small Linux GCE instance (e.g., `f1-micro`) and SSH into it.
2. Copy the RAR parts to the instance: `gsutil -m cp gs://src-bucket/file-pattern .`
3. Extract the archive locally (e.g., with `unrar`).
4. Copy the extracted data back: `gsutil -m cp file-pattern gs://dest-bucket`
5. Shut down and delete the instance.
The benefit here is that instead of downloading to your own computer, you're transferring all the data within Google Cloud itself, so the transfers should be very fast, and do not depend on your own Internet connection speed or consume any of your bandwidth.
Note: network bandwidth is proportional to the size of the VM (in vCPUs), so for faster performance, consider creating a larger VM. See the Google Compute Engine pricing page for per-instance VM rates.
So, for example, given that an `n1-standard-1` costs USD $0.05 / hr (as of 8 Oct 2016), 15 minutes of usage will cost USD $0.0125 in total.
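Spelling that arithmetic out (using the rate quoted above, current as of the answer's date):

```python
# Check the quoted total: an n1-standard-1 at USD $0.05/hr for 15 minutes.
hourly_rate_usd = 0.05        # price quoted as of 8 Oct 2016
minutes_used = 15
cost_usd = hourly_rate_usd * minutes_used / 60
print(f"${cost_usd:.4f}")  # $0.0125
```

In other words, the compute cost of an in-cloud extract job like this is effectively negligible next to the time saved by not moving 120GB over your own connection.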