I'm using sklearn's RandomForestClassifier for a classification problem. I would like to train the trees of the forest individually, since I'm grabbing a different subset of a (VERY) large dataset for each tree. However, when I fit the trees manually, memory consumption balloons. Below are line-by-line memory profiles, generated with memory_profiler, of a custom fit versus RandomForestClassifier's own fit method. As far as I can tell, the library's fit performs the same steps as my custom fit. So what gives with all the extra memory?
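For reference, here's the complete script being profiled (run with python -m memory_profiler script.py). The imports reflect my setup: numpy's random and array, and RandomForestClassifier aliased as RFC:

from numpy import random, array
from sklearn.ensemble import RandomForestClassifier as RFC

# @profile is injected by memory_profiler at runtime
@profile
def normal_fit():
    # same data and (unfitted) forest setup as custom_fit below
    X = random.random((1000, 100))
    Y = random.random(1000) < 0.5
    rfc = RFC(n_estimators=100, n_jobs=1)
    rfc.n_classes_ = 2
    rfc.classes_ = array([False, True], dtype=bool)
    rfc.n_outputs_ = 1
    rfc.n_features_ = 100
    rfc.bootstrap = False
    # let the library build and fit all 100 trees itself
    rfc.fit(X, Y)

@profile
def custom_fit():
    X = random.random((1000, 100))
    Y = random.random(1000) < 0.5
    rfc = RFC(n_estimators=100, n_jobs=1)
    rfc.n_classes_ = 2
    rfc.classes_ = array([False, True], dtype=bool)
    rfc.n_outputs_ = 1
    rfc.n_features_ = 100
    # build and fit each tree by hand via the (private) _make_estimator helper
    for i in range(rfc.n_estimators):
        rfc._make_estimator()
        rfc.estimators_[-1].fit(X, Y, check_input=False)

if __name__ == '__main__':
    normal_fit()
    custom_fit()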
normal fit:
Line #    Mem usage    Increment   Line Contents
================================================
    17   28.004 MiB    0.000 MiB   @profile
    18                             def normal_fit():
    19   28.777 MiB    0.773 MiB       X = random.random((1000,100))
    20   28.781 MiB    0.004 MiB       Y = random.random(1000) < 0.5
    21   28.785 MiB    0.004 MiB       rfc = RFC(n_estimators=100,n_jobs=1)
    22   28.785 MiB    0.000 MiB       rfc.n_classes_ = 2
    23   28.785 MiB    0.000 MiB       rfc.classes_ = array([False, True],dtype=bool)
    24   28.785 MiB    0.000 MiB       rfc.n_outputs_ = 1
    25   28.785 MiB    0.000 MiB       rfc.n_features_ = 100
    26   28.785 MiB    0.000 MiB       rfc.bootstrap = False
    27   37.668 MiB    8.883 MiB       rfc.fit(X,Y)
custom fit:
Line #    Mem usage    Increment   Line Contents
================================================
     4   28.004 MiB    0.000 MiB   @profile
     5                             def custom_fit():
     6   28.777 MiB    0.773 MiB       X = random.random((1000,100))
     7   28.781 MiB    0.004 MiB       Y = random.random(1000) < 0.5
     8   28.785 MiB    0.004 MiB       rfc = RFC(n_estimators=100,n_jobs=1)
     9   28.785 MiB    0.000 MiB       rfc.n_classes_ = 2
    10   28.785 MiB    0.000 MiB       rfc.classes_ = array([False, True],dtype=bool)
    11   28.785 MiB    0.000 MiB       rfc.n_outputs_ = 1
    12   28.785 MiB    0.000 MiB       rfc.n_features_ = 100
    13   73.266 MiB   44.480 MiB       for i in range(rfc.n_estimators):
    14   72.820 MiB   -0.445 MiB           rfc._make_estimator()
    15   73.262 MiB    0.441 MiB           rfc.estimators_[-1].fit(X,Y,check_input=False)
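For context, the end goal looks roughly like the sketch below: a different subset of the large dataset per tree. load_subset is a hypothetical stand-in for however I pull each tree's chunk off disk, not real code:

# intended per-tree training loop (load_subset is a hypothetical helper)
rfc = RFC(n_estimators=100, n_jobs=1)
rfc.n_classes_ = 2
rfc.classes_ = array([False, True], dtype=bool)
rfc.n_outputs_ = 1
rfc.n_features_ = 100
for i in range(rfc.n_estimators):
    # grab a fresh subset of the (very) large dataset for this tree
    X_sub, Y_sub = load_subset(i)
    rfc._make_estimator()
    rfc.estimators_[-1].fit(X_sub, Y_sub, check_input=False)

That's why I can't just call rfc.fit on the whole dataset, and why the per-tree memory overhead above matters.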