I'm using sklearn's `RandomForestClassifier` for a classification problem. I would like to train the trees of the forest individually, as I am grabbing a subset of a (VERY) large data set for each tree. However, when I fit the trees manually, memory consumption bloats.

Here's a line-by-line memory profile, obtained with `memory_profiler`, of a custom fit versus `RandomForestClassifier`'s own `fit` function. As far as I can tell, the source `fit` performs the same steps as the custom fit. So what gives with all the extra memory?
normal fit:
```
Line #    Mem usage    Increment   Line Contents
================================================
    17   28.004 MiB    0.000 MiB   @profile
    18                             def normal_fit():
    19   28.777 MiB    0.773 MiB       X = random.random((1000,100))
    20   28.781 MiB    0.004 MiB       Y = random.random(1000) < 0.5
    21   28.785 MiB    0.004 MiB       rfc = RFC(n_estimators=100,n_jobs=1)
    22   28.785 MiB    0.000 MiB       rfc.n_classes_ = 2
    23   28.785 MiB    0.000 MiB       rfc.classes_ = array([False, True],dtype=bool)
    24   28.785 MiB    0.000 MiB       rfc.n_outputs_ = 1
    25   28.785 MiB    0.000 MiB       rfc.n_features_ = 100
    26   28.785 MiB    0.000 MiB       rfc.bootstrap = False
    27   37.668 MiB    8.883 MiB       rfc.fit(X,Y)
```
custom fit:
```
Line #    Mem usage    Increment   Line Contents
================================================
     4   28.004 MiB    0.000 MiB   @profile
     5                             def custom_fit():
     6   28.777 MiB    0.773 MiB       X = random.random((1000,100))
     7   28.781 MiB    0.004 MiB       Y = random.random(1000) < 0.5
     8   28.785 MiB    0.004 MiB       rfc = RFC(n_estimators=100,n_jobs=1)
     9   28.785 MiB    0.000 MiB       rfc.n_classes_ = 2
    10   28.785 MiB    0.000 MiB       rfc.classes_ = array([False, True],dtype=bool)
    11   28.785 MiB    0.000 MiB       rfc.n_outputs_ = 1
    12   28.785 MiB    0.000 MiB       rfc.n_features_ = 100
    13   73.266 MiB   44.480 MiB       for i in range(rfc.n_estimators):
    14   72.820 MiB   -0.445 MiB           rfc._make_estimator()
    15   73.262 MiB    0.441 MiB           rfc.estimators_[-1].fit(X,Y,check_input=False)
```
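For readers who want to reproduce the comparison without `memory_profiler`, here is a sketch of the same experiment using only public sklearn APIs and the stdlib `tracemalloc` module. It replaces the private `_make_estimator` loop with a plain `DecisionTreeClassifier` loop, so the peak figures will not match the profiles above, but the fit-vs-manual-loop contrast is the same idea.

```python
# Hypothetical reproduction of the experiment with stdlib tracemalloc
# instead of memory_profiler, and public APIs instead of _make_estimator.
import tracemalloc
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 100))
y = rng.random(1000) < 0.5

def peak_mib(fit_fn):
    """Run fit_fn and return the peak traced allocation in MiB."""
    tracemalloc.start()
    fit_fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 2**20

# The "normal fit" path: one call to RandomForestClassifier.fit.
normal_peak = peak_mib(
    lambda: RandomForestClassifier(n_estimators=100, n_jobs=1,
                                   bootstrap=False).fit(X, y)
)

# The "custom fit" path: fit 100 trees one by one in a loop.
trees = []
manual_peak = peak_mib(
    lambda: trees.extend(DecisionTreeClassifier().fit(X, y) for _ in range(100))
)

print(f"normal fit peak: {normal_peak:.1f} MiB, "
      f"manual loop peak: {manual_peak:.1f} MiB")
```

Note that `tracemalloc` reports Python-level allocations rather than process RSS, so the absolute numbers differ from `memory_profiler`'s.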
Follow-up:

Instead, I create a Python script that builds a single tree and dumps it via `pickle`. I then glue everything together with some shell scripting and a final Python script that loads the pickled trees and dumps the assembled RF model. This way memory is returned after each tree is built, since each one runs in its own process.
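A minimal sketch of that workaround is below. The file names and the assembly step are illustrative rather than the exact scripts described above, and for brevity the per-tree runs are shown as a loop in one process, whereas the point of the workaround is to launch each one as a separate process (e.g. from a shell loop) so its memory is released on exit.

```python
# Sketch: fit one tree per run, pickle it, then assemble the forest later.
import os
import pickle
import tempfile
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 100))
y = rng.random(1000) < 0.5

outdir = tempfile.mkdtemp()

# build_tree.py: invoked once per tree; each invocation could grab its own
# subset of the large data set, and the process exit frees the memory.
def build_tree(i):
    tree = DecisionTreeClassifier().fit(X, y)
    with open(os.path.join(outdir, f"tree_{i}.pkl"), "wb") as f:
        pickle.dump(tree, f)

for i in range(10):  # the shell script would launch these as separate processes
    build_tree(i)

# assemble.py: load the pickled trees and average their class probabilities,
# which is how a random forest combines its trees' votes.
trees = []
for i in range(10):
    with open(os.path.join(outdir, f"tree_{i}.pkl"), "rb") as f:
        trees.append(pickle.load(f))

proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
pred = proba.argmax(axis=1).astype(bool)
```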
The `sklearn` implementation gets around the memory issue in a way that I believe has to do with the `_parallel_build_trees` function, as that is the only respect in which the custom implementation differs. I'm posting my workaround as an answer, but if someone could enlighten me on the above in the future, I'd appreciate it.