I'm trying to perform clustering in Python using Random Forests. In the R implementation of Random Forests, there is a flag you can set to get the proximity matrix. I can't seem to find anything similar in the Python scikit-learn version of Random Forest. Does anyone know if there is an equivalent calculation for the Python version?
We don't implement the proximity matrix in Scikit-Learn (yet).
However, this could be done by relying on the `apply` function provided in our implementation of decision trees. That is, for all pairs of samples in your dataset, iterate over the decision trees in the forest (through `forest.estimators_`) and count the number of times they fall in the same leaf, i.e., the number of times `apply` gives the same node id for both samples in the pair.
Hope this helps.
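The counting procedure described above can be sketched roughly as follows. This is only an illustration under some assumptions, not something shipped with scikit-learn: the function name `proximity_matrix` and its `normalize` parameter are made up for this example, and it calls `apply` on the fitted forest itself (which returns one leaf index per tree for every sample) rather than looping over `forest.estimators_` by hand.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def proximity_matrix(forest, X, normalize=True):
    """Count how often each pair of samples lands in the same leaf.

    Hypothetical helper, not a scikit-learn function.
    """
    # leaves[i, t] is the index of the leaf that sample i reaches in tree t.
    leaves = forest.apply(X)
    n_samples, n_trees = leaves.shape

    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        col = leaves[:, t]
        # Pairwise comparison: True where two samples share a leaf in tree t.
        prox += col[:, None] == col[None, :]

    if normalize:
        # Turn counts into the fraction of trees in which the pair co-occurs.
        prox /= n_trees
    return prox


# Small demonstration on a bundled dataset.
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
prox = proximity_matrix(forest, X)
```

The result is a symmetric `n_samples` by `n_samples` array with ones on the diagonal, so memory grows quadratically with the number of samples. For clustering, `1 - prox` can serve as a dissimilarity for any algorithm that accepts a precomputed distance matrix.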
There is nothing currently implemented for this in Python. I took a first try at it here. It would be great if somebody were interested in adding these methods to scikit.
Based on Gilles Louppe's answer, I have written a function. I don't know if it is efficient, but it works. Best regards.