I'm writing a piece of code to evaluate my clustering algorithm, and I find that every kind of evaluation method needs the same basic data: an $m \times n$ matrix $A = \{a_{ij}\}$, where $a_{ij}$ is the number of data points that are members of class $c_i$ and elements of cluster $k_j$.
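To make the matrix concrete, here is a minimal sketch of how it could be built in Python from two label lists. The function name `class_cluster_matrix` and the toy labels are made up for illustration; they are not from the book.

```python
from collections import Counter


def class_cluster_matrix(classes, clusters):
    """Return (A, class_labels, cluster_labels), where A[i][j] is the number
    of points with class class_labels[i] placed in cluster cluster_labels[j]."""
    if len(classes) != len(clusters):
        raise ValueError("classes and clusters must have the same length")
    class_labels = sorted(set(classes))
    cluster_labels = sorted(set(clusters))
    # Count how often each (class, cluster) pair occurs.
    counts = Counter(zip(classes, clusters))
    A = [[counts[(c, k)] for k in cluster_labels] for c in class_labels]
    return A, class_labels, cluster_labels


# Hypothetical toy data: true class and assigned cluster for six points.
classes = ["c1", "c1", "c1", "c2", "c2", "c2"]
clusters = ["k1", "k1", "k2", "k2", "k2", "k1"]

A, class_labels, cluster_labels = class_cluster_matrix(classes, clusters)
print(class_labels, cluster_labels)
print(A)
```

Rows follow the sorted class labels and columns follow the sorted cluster labels, so the matrix does not have to be square.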
But there appear to be two matrices of this type in INTRODUCTION TO DATA MINING (Pang-Ning Tan et al.): one is the confusion matrix, the other is the contingency table. I do not fully understand the difference between the two. Which one best describes the matrix I want to use?
Thank you very much for your answer!
In short, a contingency table is used to describe data, and a confusion matrix is, as others have pointed out, often used when comparing two hypotheses. One can think of predicted vs. actual classification/categorization as two hypotheses, with the ground truth being the null and the model output being the alternative.
Wikipedia's definition:
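The confusion matrix should be clear: it tells how many of the predicted results match the actual results. For example, take a confusion matrix for two classes, c1 and c2, with one row per actual class and one column per predicted class, whose first row is (15, 3) and whose second row is (0, 2).
It tells that: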
Row 1, column 1: the classifier predicted 15 items as belonging to class c1, and those 15 items actually belong to class c1 (correct predictions).
Row 1, column 2: the classifier predicted that 3 items belong to class c2, but they actually belong to class c1 (wrong predictions).
Row 2, column 1: none of the items that actually belong to class c2 were predicted to belong to class c1 (this cell counts wrong predictions, and here there are none).
Row 2, column 2: 2 items that actually belong to class c2 were predicted to belong to class c2 (correct predictions).
Now look at the formulas for accuracy and error rate in your book (Chapter 4, 4.2), and you should be able to see clearly what a confusion matrix is. It is used to test the accuracy of a classifier on data with known results. The k-fold cross-validation method, also mentioned in the book, is one way to estimate that accuracy.
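
As a quick check of the numbers above, here is a minimal Python sketch that computes accuracy (correct predictions divided by all predictions) and error rate (one minus accuracy). It assumes rows are actual classes and columns are predicted classes, as in the description above; the function name `accuracy` is just illustrative.

```python
# Rows: actual class (c1, c2). Columns: predicted class (c1, c2).
confusion = [
    [15, 3],
    [0, 2],
]


def accuracy(matrix):
    """Fraction of items on the diagonal, i.e. predicted class == actual class."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


acc = accuracy(confusion)
err = 1 - acc
print(f"accuracy = {acc:.2f}, error rate = {err:.2f}")
# prints: accuracy = 0.85, error rate = 0.15
```

Here 17 of the 20 items lie on the diagonal, so the accuracy is 17/20.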
Now, for Contingency table: Wikipedia's definition:
In data mining, contingency tables are used to show which items appear together in a record, such as a transaction or a shopping cart in a sales analysis. For example (this is the example from the book you have mentioned), a contingency table can summarize 1000 survey responses about whether each respondent likes coffee, tea, both, or only one of them.
Contingency tables are used to find the support and confidence of association rules, that is, to evaluate association rules (read Chapter 6, 6.7.1).
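
To show how support and confidence come out of such a table, here is a minimal Python sketch for the rule tea -> coffee. The four counts are assumed for illustration (chosen only so that they sum to 1000 responses); the original table is not reproduced here, so check the book for the actual figures.

```python
# Assumed 2x2 contingency table of survey responses (illustrative counts).
# Keys are (tea preference, coffee preference).
table = {
    ("tea", "coffee"): 150,
    ("tea", "no coffee"): 50,
    ("no tea", "coffee"): 650,
    ("no tea", "no coffee"): 150,
}

total = sum(table.values())

# Support of {tea, coffee}: fraction of all responses that contain both.
support = table[("tea", "coffee")] / total

# Confidence of tea -> coffee: among tea drinkers, the fraction who also like coffee.
tea_total = table[("tea", "coffee")] + table[("tea", "no coffee")]
confidence = table[("tea", "coffee")] / tea_total

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# prints: support = 0.15, confidence = 0.75
```

With these assumed counts, 150 of 1000 responses contain both items, and 150 of the 200 tea drinkers also like coffee.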
So the difference is that a confusion matrix is used to evaluate the performance of a classifier: it tells how accurate the classifier is in its predictions. A contingency table is used to evaluate association rules.
After reading this answer, search the web a bit (always do that while you are reading your book), read what is in the book, look at a few examples, and don't forget to solve a few of the exercises given in the book. You should then have a clear idea of both concepts, and of which one to use in a given situation and why.
Hope this helps.