First off I wasn't sure if I should post this as a Ubuntu question or here. But I'm guessing it's more of an Python question than a OS one.
My Python application is running on top of Ubuntu on a 64 core AMD server.
It pulls images from 5 GigE cameras over the network by calling out to a .so through ctypes
and then processes them.
I am seeing frequent pauses in my application causing frames from the cameras to be dropped by the external camera library.
To debug this I've used the popular psutil
Python package with which I log out CPU stats every 0.2 seconds in a separate thread.
I sleep for 0.2 seconds in that thread and when that sleep takes substantially longer I also see camera frames being dropped.
I have seen pauses up to 17 seconds long!
Most of my processing is either in OpenCV or Numpy (both of which release the GIL) or in one part of the app a multiprocessing.Pool
with 59 processes (this it to get around the Python GIL).
My debug logging shows very high 'system' (i.e. kernel) CPU time on many of my process' threads when the pauses happen.
For example. I see CPU times as follows (usually every 0.2 seconds) and then suddenly a big jump
('Process' numbers are in CPU utilization, i.e. 1 CPU fully used would be 1, Linux top
showing 123% would be 1.2):
Process user | Process system | OS system % | OS idle %
19.9 | 10.5 | 6 | 74
5.6 | 2.3 | 4 | 87
6.8 | 1.7 | 11 | 75
4.6 | 5.5 | 43 | 52
0.5 | 26.4 | 4 | 90
I don't know why the high OS system usage is reported one line before matching high process system usage.
The two match up since 26.4 of 64 cores = 41%. At that point my application experienced an approximately 3.5 second pause
(as determined by my CPU info logging thread using OpenCV's cv2.getTickCount()
and also the jump in time stamps in the Python logging output)
causing multiple camera frames to be dropped.
When this happens I have also logged the CPU info for each thread of my process. For the example above 25 threads were running at a 'system' CPU utilization of 0.9 and a few more at 0.6, which matches the total for the process of 26.4 above. At that point there were about 183 threads running.
This pause usually seems to happen close after the multiprocessing pool is used (it's used for short bursts) but by no means happens every time the pool is used. Also, if I halve the amount of processing that needs to happen outside the pool then no camera skipping happens.
Question: how can I determine why OS 'system' / kernel time suddenly goes through the roof? Why would that happen in a Python app?
And more importantly: any ideas why this is happening and how to avoid it?
Notes:
- This runs as root (it has to for the camera library unfortunately) from
upstart
- When the cameras are turned off the app restarts (using
respawn
in upstart) and this happens multiple times a day so it's not due to being long running, I have also seen this happen very soon after the process starts - It is the same code being run over and over, it's not due to running a different branch of my code
- Currently has a
nice
of -2, I have tried removing thenice
with no affect - Ubuntu 12.04.5 LTS
- Python 2.7
- Machine has 128GB of memory which I am no where near using