I'm attempting to use caffe and Python to do real-time image classification. I'm using OpenCV to stream from my webcam in one process and, in a separate process, using caffe to classify the frames pulled from the webcam. I then pass the classification result back to the main thread to caption the webcam stream.
The problem is that even though I have an NVIDIA GPU and am performing the caffe predictions on the GPU, the main thread gets slowed down. Normally, without doing any predictions, my webcam stream runs at 30 fps; with the predictions, it reaches 15 fps at best.
I've verified that caffe is indeed using the GPU when performing the predictions, and that neither my GPU nor my GPU memory is maxing out. I've also verified that my CPU cores are not getting maxed out at any point during the program. I'm wondering if I am doing something wrong or if there is no way to keep these 2 processes truly separate. Any advice is appreciated. Here is my code for reference:
import multiprocessing

import caffe
import cv2


class Consumer(multiprocessing.Process):

    def __init__(self, task_queue, result_queue):
        multiprocessing.Process.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue
        #other initialization stuff

    def run(self):
        caffe.set_mode_gpu()
        caffe.set_device(0)
        #Load caffe net -- code omitted
        while True:
            image = self.task_queue.get()
            #crop image -- code omitted
            text = net.predict(image)
            self.result_queue.put(text)
        return


tasks = multiprocessing.Queue()
results = multiprocessing.Queue()
consumer = Consumer(tasks, results)
consumer.start()

#Creating window and starting video capturer from camera
cv2.namedWindow("preview")
vc = cv2.VideoCapture(0)

#Try to get the first frame
if vc.isOpened():
    rval, frame = vc.read()
else:
    rval = False

#Separate buffer for the frame handed to the consumer
if rval:
    frame_copy = frame.copy()
task_empty = True

while rval:
    if task_empty:
        tasks.put(frame_copy)
        task_empty = False
    if not results.empty():
        text = results.get()
        #Add text to frame (position, font, scale and color are placeholder values)
        cv2.putText(frame, text, (5, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0))
        task_empty = True
    #Showing the frame with all the applied modifications
    cv2.imshow("preview", frame)
    #Getting next frame from camera
    rval, frame = vc.read()
    if rval:
        frame_copy[:] = frame
    #Getting keyboard input
    key = cv2.waitKey(1)
    #exit on ESC
    if key == 27:
        break
I am pretty sure it is the caffe prediction slowing everything down, because when I comment out the prediction and pass dummy text back and forth between the processes, I get 30 fps again.
import multiprocessing

import caffe
import cv2


class Consumer(multiprocessing.Process):

    def __init__(self, task_queue, result_queue):
        multiprocessing.Process.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue
        #other initialization stuff

    def run(self):
        caffe.set_mode_gpu()
        caffe.set_device(0)
        #Load caffe net -- code omitted
        while True:
            image = self.task_queue.get()
            #crop image -- code omitted
            #text = net.predict(image)
            text = "dummy text"
            self.result_queue.put(text)
        return


tasks = multiprocessing.Queue()
results = multiprocessing.Queue()
consumer = Consumer(tasks, results)
consumer.start()

#Creating window and starting video capturer from camera
cv2.namedWindow("preview")
vc = cv2.VideoCapture(0)

#Try to get the first frame
if vc.isOpened():
    rval, frame = vc.read()
else:
    rval = False

#Separate buffer for the frame handed to the consumer
if rval:
    frame_copy = frame.copy()
task_empty = True

while rval:
    if task_empty:
        tasks.put(frame_copy)
        task_empty = False
    if not results.empty():
        text = results.get()
        #Add text to frame (position, font, scale and color are placeholder values)
        cv2.putText(frame, text, (5, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0))
        task_empty = True
    #Showing the frame with all the applied modifications
    cv2.imshow("preview", frame)
    #Getting next frame from camera
    rval, frame = vc.read()
    if rval:
        frame_copy[:] = frame
    #Getting keyboard input
    key = cv2.waitKey(1)
    #exit on ESC
    if key == 27:
        break
Some Explanations and Some Rethinks:
I ran my code below on a laptop with an Intel Core i5-6300HQ @2.3GHz CPU, 8 GB RAM and an NVIDIA GeForce GTX 960M GPU (2 GB memory), and the result was:
Whether or not caffe was running (toggled by commenting out net_output = this->net_->Forward() and some necessary stuff in void Consumer::entry()), I could always get around 30 fps in the main thread.
I got a similar result on a PC with an Intel Core i5-4440 CPU, 8 GB RAM and an NVIDIA GeForce GT 630 GPU (1 GB memory).
I ran the code of @user3543300 from the question on the same laptop, and the result was:
Whether caffe was running (on the GPU) or not, I could also get around 30 fps.
According to @user3543300's feedback, with the 2 versions of code mentioned above, @user3543300 could get only around 15 fps when running caffe (on a laptop with an Nvidia GeForce 940MX GPU and an Intel® Core™ i7-6500U CPU @ 2.50GHz × 4).
The webcam frame rate also slows down there when caffe runs on the GPU as an independent program.
So I still think that the problem most likely lies in hardware I/O limitations such as DMA bandwidth (this thread about DMA may give a hint) or RAM bandwidth.
I hope @user3543300 can check this or find the true problem that I haven't thought of.
If the problem is indeed what I think it is, then a sensible approach would be to reduce the memory I/O overhead introduced by the CNN. In fact, to solve similar problems on embedded systems with limited hardware resources, there has been some research on this topic, e.g. Quantization, Structurally Sparse Deep Neural Networks, SqueezeNet, Deep-Compression. So hopefully, applying such techniques will also help improve the webcam frame rate in the question.
Original Answer:
Try this C++ solution. It uses threads for the I/O overhead in your task. I tested it using bvlc_alexnet.caffemodel and deploy.prototxt to do image classification and didn't see any obvious slowdown of the main thread (webcam stream) while caffe was running (on the GPU):
#include <stdio.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <boost/thread.hpp>
#include <boost/shared_ptr.hpp>
#include "caffe/caffe.hpp"
#include "caffe/util/blocking_queue.hpp"
#include "caffe/data_transformer.hpp"
#include "opencv2/opencv.hpp"
using namespace cv;

//Queue pair for sharing image/results between webcam and caffe threads
template<typename T>
class QueuePair {
public:
    explicit QueuePair(int size);
    ~QueuePair();

    caffe::BlockingQueue<T*> free_;
    caffe::BlockingQueue<T*> full_;

    DISABLE_COPY_AND_ASSIGN(QueuePair);
};

template<typename T>
QueuePair<T>::QueuePair(int size) {
    // Initialize the free queue
    for (int i = 0; i < size; ++i) {
        free_.push(new T);
    }
}

template<typename T>
QueuePair<T>::~QueuePair(){
    T *data;
    while (free_.try_pop(&data)){
        delete data;
    }
    while (full_.try_pop(&data)){
        delete data;
    }
}

template class QueuePair<Mat>;
template class QueuePair<std::string>;

//Do image classification(caffe predict) using a subthread
class Consumer{
public:
    Consumer(boost::shared_ptr<QueuePair<Mat>> task
        , boost::shared_ptr<QueuePair<std::string>> result);
    ~Consumer();
    void Run();
    void Stop();
    void entry(boost::shared_ptr<QueuePair<Mat>> task
        , boost::shared_ptr<QueuePair<std::string>> result);

private:
    bool must_stop();

    boost::shared_ptr<QueuePair<Mat> > task_q_;
    boost::shared_ptr<QueuePair<std::string> > result_q_;

    //caffe::Blob<float> *net_input_blob_;
    boost::shared_ptr<caffe::DataTransformer<float> > data_transformer_;
    boost::shared_ptr<caffe::Net<float> > net_;
    std::vector<std::string> synset_words_;
    boost::shared_ptr<boost::thread> thread_;
};

Consumer::Consumer(boost::shared_ptr<QueuePair<Mat>> task
    , boost::shared_ptr<QueuePair<std::string>> result) :
    task_q_(task), result_q_(result), thread_(){

    //for data preprocess
    caffe::TransformationParameter trans_para;
    //set mean
    trans_para.set_mean_file("/path/to/imagenet_mean.binaryproto");
    //set crop size, here is cropping 227x227 from 256x256
    trans_para.set_crop_size(227);
    //instantiate a DataTransformer using trans_para for image preprocess
    data_transformer_.reset(new caffe::DataTransformer<float>(trans_para
        , caffe::TEST));

    //initialize a caffe net
    net_.reset(new caffe::Net<float>(std::string("/path/to/deploy.prototxt")
        , caffe::TEST));
    //net parameter
    net_->CopyTrainedLayersFrom(std::string("/path/to/bvlc_alexnet.caffemodel"));

    std::fstream synset_word("path/to/caffe/data/ilsvrc12/synset_words.txt");
    std::string line;
    if (!synset_word.good()){
        std::cerr << "synset words open failed!" << std::endl;
    }
    while (std::getline(synset_word, line)){
        synset_words_.push_back(line.substr(line.find_first_of(' '), line.length()));
    }
    //a container for net input, holds data converted from cv::Mat
    //net_input_blob_ = new caffe::Blob<float>(1, 3, 227, 227);
}

Consumer::~Consumer(){
    Stop();
    //delete net_input_blob_;
}

void Consumer::entry(boost::shared_ptr<QueuePair<Mat>> task
    , boost::shared_ptr<QueuePair<std::string>> result){

    caffe::Caffe::set_mode(caffe::Caffe::GPU);
    caffe::Caffe::SetDevice(0);

    cv::Mat *frame;
    cv::Mat resized_image(256, 256, CV_8UC3);
    cv::Size re_size(resized_image.cols, resized_image.rows);

    //for caffe input and output
    const std::vector<caffe::Blob<float> *> net_input = this->net_->input_blobs();
    std::vector<caffe::Blob<float> *> net_output;

    //net_input.push_back(net_input_blob_);
    std::string *res;

    int pre_num = 1;
    while (!must_stop()){
        std::stringstream result_strm;
        frame = task->full_.pop();
        cv::resize(*frame, resized_image, re_size, 0, 0, CV_INTER_LINEAR);
        this->data_transformer_->Transform(resized_image, *net_input[0]);
        net_output = this->net_->Forward();
        task->free_.push(frame);

        res = result->free_.pop();
        //Process results here
        for (int i = 0; i < pre_num; ++i){
            result_strm << synset_words_[net_output[0]->cpu_data()[i]] << " "
                        << net_output[0]->cpu_data()[i + pre_num] << "\n";
        }
        *res = result_strm.str();
        result->full_.push(res);
    }
}

void Consumer::Run(){
    if (!thread_){
        try{
            thread_.reset(new boost::thread(&Consumer::entry, this, task_q_, result_q_));
        }
        catch (std::exception& e) {
            std::cerr << "Thread exception: " << e.what() << std::endl;
        }
    }
    else
        std::cout << "Consumer thread may have been running!" << std::endl;
}

void Consumer::Stop(){
    if (thread_ && thread_->joinable()){
        thread_->interrupt();
        try {
            thread_->join();
        }
        catch (boost::thread_interrupted&) {
        }
        catch (std::exception& e) {
            std::cerr << "Thread exception: " << e.what() << std::endl;
        }
    }
}

bool Consumer::must_stop(){
    return thread_ && thread_->interruption_requested();
}

int main(void)
{
    int max_queue_size = 1000;
    boost::shared_ptr<QueuePair<Mat>> tasks(new QueuePair<Mat>(max_queue_size));
    boost::shared_ptr<QueuePair<std::string>> results(new QueuePair<std::string>(max_queue_size));

    char str[100], info_str[100] = " results: ";
    VideoCapture vc(0);
    if (!vc.isOpened())
        return -1;

    Consumer consumer(tasks, results);
    consumer.Run();

    Mat frame, *frame_copy;
    namedWindow("preview");
    double t, fps;

    while (true){
        t = (double)getTickCount();
        vc.read(frame);

        if (waitKey(1) >= 0){
            consumer.Stop();
            break;
        }

        //Hand a copy of the frame to the consumer only if a free buffer exists
        if (tasks->free_.try_peek(&frame_copy)){
            frame_copy = tasks->free_.pop();
            *frame_copy = frame.clone();
            tasks->full_.push(frame_copy);
        }

        //Pick up a finished result without blocking the webcam loop
        std::string *res;
        std::string frame_info("");
        if (results->full_.try_peek(&res)){
            res = results->full_.pop();
            frame_info = frame_info + info_str;
            frame_info = frame_info + *res;
            results->free_.push(res);
        }

        t = ((double)getTickCount() - t) / getTickFrequency();
        fps = 1.0 / t;

        sprintf(str, " fps: %.2f", fps);
        frame_info = frame_info + str;

        putText(frame, frame_info, Point(5, 20)
            , FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));
        imshow("preview", frame);
    }
}
And in src/caffe/util/blocking_queue.cpp, make a little change below and rebuild caffe:
...//Other stuff
template class BlockingQueue<Batch<float>*>;
template class BlockingQueue<Batch<double>*>;
template class BlockingQueue<Datum*>;
template class BlockingQueue<shared_ptr<DataReader::QueuePair> >;
template class BlockingQueue<P2PSync<float>*>;
template class BlockingQueue<P2PSync<double>*>;
//add these 2 lines below
template class BlockingQueue<cv::Mat*>;
template class BlockingQueue<std::string*>;
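For readers following the Python code in the question, the free/full queue pair used in the C++ code above can be sketched with the standard library alone. This is only a rough illustration, not part of the tested C++ solution: fake_classify is a hypothetical stand-in for the caffe forward pass, STOP and run_demo are names made up here, plain integers play the role of frames, and results go through an ordinary queue instead of a second recycled pair.

```python
import queue
import threading

STOP = None  # sentinel telling the consumer thread to exit (assumption of this sketch)


class QueuePair:
    """Recycles a fixed pool of buffers between a producer and a consumer."""

    def __init__(self, size, make_buffer):
        self.free = queue.Queue()  # buffers the producer may fill
        self.full = queue.Queue()  # filled buffers waiting for the consumer
        for _ in range(size):
            self.free.put(make_buffer())


def fake_classify(frame):
    # Hypothetical stand-in for the network's forward pass.
    return "label-for-frame-%d" % frame


def consumer(tasks, results, classify):
    while True:
        buf = tasks.full.get()        # blocks only this worker thread
        if buf is STOP:
            break
        label = classify(buf["frame"])
        tasks.free.put(buf)           # hand the buffer back for reuse
        results.put(label)


def run_demo(num_frames=10, pool_size=2):
    tasks = QueuePair(pool_size, dict)
    results = queue.Queue()
    worker = threading.Thread(target=consumer,
                              args=(tasks, results, fake_classify))
    worker.start()
    for frame in range(num_frames):
        try:
            buf = tasks.free.get_nowait()  # never block the "webcam" loop
        except queue.Empty:
            continue                       # all buffers in flight: drop this frame
        buf["frame"] = frame
        tasks.full.put(buf)
    tasks.full.put(STOP)
    worker.join()
    labels = []
    while not results.empty():
        labels.append(results.get())
    return labels, tasks.free.qsize()
```

The point of the design is that the producer loop only ever uses non-blocking calls, so a slow classifier costs dropped frames rather than a slower display loop.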
It seems that caffe's Python wrapper holds the Global Interpreter Lock (GIL) while it runs. Thus, calling any caffe Python command blocks ALL Python threads.
A workaround (at your own risk) would be to release the GIL for specific caffe functions. For instance, if you want to be able to run forward without the lock, you can edit $CAFFE_ROOT/python/caffe/_caffe.cpp. Add this function:
void Net_Forward(Net<Dtype>& net, int start, int end) {
    Py_BEGIN_ALLOW_THREADS;   // <-- release GIL
    net.ForwardFromTo(start, end);
    Py_END_ALLOW_THREADS;     // <-- reacquire GIL
}
And replace .def("_forward", &Net<Dtype>::ForwardFromTo) with:
.def("_forward", &Net_Forward)
Don't forget to run make pycaffe after the change.
See this for more details.
One thing that might be happening in your code is that it runs in GPU mode for the first call, and on later calls it computes the classification in CPU mode, as that is the default mode. In older versions of caffe, setting GPU mode once was enough; newer versions need the mode set every time. You can try the following change:
def run(self):
    #Load caffe net -- code omitted
    while True:
        caffe.set_mode_gpu()
        caffe.set_device(0)
        image = self.task_queue.get()
        #crop image -- code omitted
        text = net.predict(image)
        self.result_queue.put(text)
    return
Also, please have a look at the GPU utilization while the consumer thread is running. You can use the following command for NVIDIA:
nvidia-smi
The above command will show you the GPU utilization at runtime.
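If you would rather log the numbers from Python while the program runs, a small wrapper around nvidia-smi's CSV query mode could look like the sketch below. It is only an illustration: parse_gpu_line and sample_gpu are names made up here, and the parser assumes each line holds two plain integers, which will not be the case on GPUs where nvidia-smi reports a field as unsupported.

```python
import subprocess

# Query mode of nvidia-smi: one CSV line per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Turn one CSV line into (utilization in percent, memory used in MiB)."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def sample_gpu():
    """Run nvidia-smi once and return one (utilization, memory) tuple per GPU."""
    out = subprocess.check_output(QUERY, universal_newlines=True)
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Calling sample_gpu() every second or so from a separate terminal or thread gives a rough utilization trace to compare against the frame rate.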
If that doesn't solve it, another solution is to move the OpenCV frame extraction code into its own thread. Since it involves I/O and device access, you might benefit from running it on a thread separate from the GUI/main thread. That thread will push frames into a queue, and the current consumer thread will predict. In that case, handle the queue carefully with a critical section.
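The capture-thread idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a drop-in fix: make_fake_camera and fake_predict are hypothetical stand-ins for cv2.VideoCapture.read and the caffe prediction, and a queue.Queue of size 1 supplies the locking (the critical section), so no explicit lock is needed.

```python
import queue
import threading


def put_latest(q, item):
    """Put item into a size-1 queue, discarding an older unread item if present."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()   # drop the stale frame
            except queue.Empty:
                pass             # the consumer took it in the meantime


def capture_loop(read_frame, frames, stop):
    """Grab frames until the source ends or stop is set; keep only the newest."""
    while not stop.is_set():
        ok, frame = read_frame()
        if not ok:
            break
        put_latest(frames, frame)
    frames.put(None)             # end-of-stream marker; waits for a free slot


def predict_loop(frames, results, predict):
    while True:
        frame = frames.get()
        if frame is None:
            break
        results.append(predict(frame))


def make_fake_camera(num_frames):
    # Hypothetical stand-in for cv2.VideoCapture: yields 0..num_frames-1.
    source = iter(range(num_frames))

    def read_frame():
        frame = next(source, None)
        return frame is not None, frame

    return read_frame


def fake_predict(frame):
    # Hypothetical stand-in for the caffe prediction.
    return "pred-%d" % frame


def run_capture_demo(num_frames=20):
    frames = queue.Queue(maxsize=1)
    stop = threading.Event()
    results = []
    grabber = threading.Thread(
        target=capture_loop, args=(make_fake_camera(num_frames), frames, stop))
    worker = threading.Thread(
        target=predict_loop, args=(frames, results, fake_predict))
    grabber.start()
    worker.start()
    grabber.join()
    worker.join()
    return results
```

Keeping only the newest frame means the predictor never works through a backlog of stale frames; how many frames get skipped depends on how slow the prediction is.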
Try a multithreading approach instead of multiprocessing. Spawning processes is slower than spawning threads; once they are running, there is not much difference. In your case, I think the threading approach will help because so much frame data is involved.
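To get a feel for the amount of frame data involved, the sketch below compares the two hand-off styles. It relies on the documented fact that multiprocessing.Queue pickles every object put on it, and uses pickle.dumps to approximate that payload; the 640x480, 3-channel frame size is an assumption, and pickled_size and passes_by_reference are names made up for this illustration.

```python
import pickle
import queue

# Assumed frame size: 640x480 pixels, 3 channels, 8 bits per channel.
FRAME_BYTES = 640 * 480 * 3


def pickled_size(frame):
    """Bytes that would be serialized to send this frame to another process."""
    return len(pickle.dumps(frame, protocol=pickle.HIGHEST_PROTOCOL))


def passes_by_reference(frame):
    """Check that a thread queue hands back the very same object, with no copy."""
    q = queue.Queue()
    q.put(frame)
    return q.get() is frame


frame = bytes(FRAME_BYTES)  # stand-in for one webcam frame
```

With these assumptions, every frame sent between processes costs at least FRAME_BYTES of serialization and copying, while a queue shared between threads only passes a reference.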