AWS, CUDA, TensorFlow

Published: 2019-07-13 03:20

Question:

When I'm running my Python code on the most powerful AWS GPU instances (with 1 or 8 x Tesla V100 16GB, i.e. p3.2xlarge or p3.16xlarge), why are they both only 2-3 times faster than my Dell XPS laptop with a GeForce 1050 Ti?

I'm using Windows, Keras, CUDA 9, TensorFlow 1.12 and the newest Nvidia drivers.

When I check the GPU load via GPU-Z, the GPU peaks at only 43% load for a very short period each time, while the controller load runs at up to 100%...

The dataset I use consists of matrices in JSON format, and the files are located on a 10TB Nitro-based drive with a maximum of 64,000 IOPS. No matter whether the folder contains 10TB, 1TB or 100MB, the training is still very, very slow per iteration. Why?

All advice is more than welcome!

UPDATE 1:

From the Tensorflow docs:

"To start an input pipeline, you must define a source. For example, to construct a Dataset from some tensors in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data are on disk in the recommended TFRecord format, you can construct a tf.data.TFRecordDataset."

Previously I stored the matrices in JSON format (generated by Node). My TF code runs in Python. I will now only save the coordinates in Node and store them in JSON format. The question is now: in Python, what is the best way to load the data? Can TF use the coordinates only, or do I have to turn the coordinates back into matrices again?
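For reference, the two entry points mentioned in the quoted docs look roughly like this; the array shapes and file names below are placeholders, not the actual data:

```python
import numpy as np
import tensorflow as tf

# Option 1: build a Dataset from matrices already loaded into memory,
# e.g. after parsing the JSON files with Python's json module.
matrices = np.random.rand(1000, 64, 64).astype(np.float32)  # placeholder data
labels = np.random.randint(0, 10, size=1000)                 # placeholder labels
in_memory = tf.data.Dataset.from_tensor_slices((matrices, labels))

# Option 2: stream records from the recommended TFRecord format on disk.
on_disk = tf.data.TFRecordDataset(["train-0.tfrecord", "train-1.tfrecord"])
```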

Answer 1:

The performance of any machine learning model depends on many things, including but not limited to: how much pre-processing you do, how much data you copy from CPU to GPU, op bottlenecks, and many more. Check out the TensorFlow performance guide as a first step. There are also a few videos from the TensorFlow Dev Summit 2018 that talk about performance; "How to properly use tf.data" and "how to debug performance" are two that I recommend.

The only thing I can say for sure is that JSON is a bad format for this purpose. You should switch to the TFRecord format, which uses protobuf (a far more efficient serialization than JSON).
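A minimal sketch of what that switch could look like for 2-D float matrices, using the TF 1.x API; the feature names and shapes are illustrative only:

```python
import numpy as np
import tensorflow as tf

def write_tfrecord(matrices, path):
    # Serialize each matrix into a tf.train.Example protobuf record.
    with tf.python_io.TFRecordWriter(path) as writer:
        for m in matrices:
            feature = {
                "matrix": tf.train.Feature(
                    float_list=tf.train.FloatList(value=m.reshape(-1))),
                "shape": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=m.shape)),
            }
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())

def parse_fn(record):
    # Parse one record back into a dense matrix of its original shape.
    parsed = tf.parse_single_example(record, {
        "matrix": tf.VarLenFeature(tf.float32),
        "shape": tf.FixedLenFeature([2], tf.int64),
    })
    matrix = tf.sparse_tensor_to_dense(parsed["matrix"])
    return tf.reshape(matrix, tf.cast(parsed["shape"], tf.int32))

write_tfrecord([np.random.rand(64, 64).astype(np.float32)], "train.tfrecord")
dataset = tf.data.TFRecordDataset("train.tfrecord").map(parse_fn).batch(32)
```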

Unfortunately, performance tuning and optimisation of any system take a lot of effort and time, and can be a rabbit hole that just keeps going down.



Answer 2:

First off, you should have a really good reason for taking on the increased computational overhead of a Windows-based AMI.

If your CPU is at ~100% while the GPU is below 100%, then the CPU is likely the bottleneck. If you are in the cloud, consider moving to instances with a larger CPU count (CPU is cheap, GPU is scarce). If you can't increase the CPU count, moving some parts of your graph to the GPU is an option. However, a tf.data-based input pipeline runs entirely on the CPU (but is highly scalable thanks to its C++ implementation). Prefetching to the GPU might also help here, but the cost of spawning another background thread to populate the buffer for downstream ops might dampen this effect. Another option is to do some or all pre-processing steps offline (i.e. prior to training).
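A rough sketch of such a pipeline (TF 1.x; the parser and file pattern are placeholders, and GPU prefetching lived in tf.contrib.data in TF 1.12):

```python
import tensorflow as tf

def parse_fn(record):
    # Placeholder parser; replace with a real tf.parse_single_example call.
    return tf.decode_raw(record, tf.float32)

files = tf.data.Dataset.list_files("data/*.tfrecord")
dataset = (tf.data.TFRecordDataset(files)
           .map(parse_fn, num_parallel_calls=8)  # parse on several CPU threads
           .batch(64)
           .prefetch(1))                         # prepare the next batch during the current step

# Optionally stage batches directly into GPU memory:
# dataset = dataset.apply(tf.contrib.data.prefetch_to_device("/gpu:0"))
```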

A word of caution on using Keras as the input pipeline: Keras relies on Python's multithreading (and optionally multiprocessing) libraries, which may lack both performance (when doing heavy I/O or on-the-fly augmentation) and scalability (when running on multiple CPUs) compared to GIL-free implementations. Consider performing preprocessing offline, pre-loading input data, or using alternative input pipelines (such as the aforementioned native tf.data, or third-party ones like Tensorpack).
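As an illustration of that last point, tf.keras can consume a tf.data.Dataset directly in TF 1.12, so input handling stays inside TensorFlow's runtime rather than GIL-bound Python threads; the model architecture and shapes below are made up for the sketch:

```python
import numpy as np
import tensorflow as tf

# Placeholder in-memory data standing in for the real matrices.
x = np.random.rand(1024, 64, 64).astype(np.float32)
y = np.random.randint(0, 10, size=1024)

dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .shuffle(1024)
           .batch(32)
           .repeat()
           .prefetch(1))

# Toy model; replace with the actual network.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# With a repeating dataset, TF 1.x Keras needs an explicit steps_per_epoch.
model.fit(dataset, epochs=5, steps_per_epoch=1024 // 32)
```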