I implemented an authorship attribution project in which I trained a KNN model on articles from two authors. I then classify a new article as written by either author A or author B.
I use the knn() function to generate the model.
The output of the model is the table below.
   Word1 Word2 Word3 Author
11     1    48     8      A
2      2     0     0      B
29     1    45     9      A
1      2     0     0      B
4      0     0     0      B
28     3     1     1      B
From this table, Word2 and Word3 appear to be the variables that best separate Author A from Author B.
My question is: how can I identify this using R?
Basically, your question boils down to having some variables (Word1, Word2, and Word3 in your example) and a binary outcome (Author in your example), and wanting to know how important each variable is in determining that outcome. A natural approach is to train a classification model that predicts the outcome from the variables and then check the variable importance in that model. I'll include two approaches here (logistic regression and random forest), but many others could be used.
Let's start with a slightly larger example, in which the outcome only depends on Word2 and Word3, and Word2 has a much larger effect than Word3:
set.seed(144)
dat <- data.frame(Word1=rnorm(10000), Word2=rnorm(10000), Word3=rnorm(10000))
dat$Author <- ifelse(runif(10000) < 1/(1+exp(-10*dat$Word2+dat$Word3)), "A", "B")
We can use the summary of the logistic regression model predicting Author to determine the most important variables:
summary(glm(I(Author=="A")~., data=dat, family="binomial"))
# [snip]
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  0.05117    0.04935   1.037    0.300
# Word1       -0.02123    0.04926  -0.431    0.666
# Word2        9.52679    0.26895  35.422   <2e-16 ***
# Word3       -0.97022    0.05629 -17.236   <2e-16 ***
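From the p-values, we can see that Word2 and Word3 are both significant predictors of the outcome, while Word1 is not. The signs of the coefficients show that Word2 has a positive effect and Word3 a negative effect on the probability of Author A. From the magnitudes of the coefficients, we can see that Word2 has a much larger effect on the outcome than Word3 (the magnitudes are directly comparable here because, by construction, all the variables are on the same scale).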
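With real word counts the variables will usually not share a scale, so the raw coefficients are not directly comparable. A minimal sketch of one common workaround is to standardize each predictor before fitting and then rank the predictors by the absolute value of their coefficients. The names predictors, dat.std, mod.std, and ranking are introduced here for illustration only, and the data is the same simulated data as above:

```r
# Recreate the simulated data so this block runs on its own
set.seed(144)
dat <- data.frame(Word1=rnorm(10000), Word2=rnorm(10000), Word3=rnorm(10000))
dat$Author <- ifelse(runif(10000) < 1/(1+exp(-10*dat$Word2+dat$Word3)), "A", "B")

# Standardize each predictor to mean 0 and standard deviation 1
predictors <- c("Word1", "Word2", "Word3")
dat.std <- dat
dat.std[predictors] <- lapply(dat[predictors], function(x) (x - mean(x)) / sd(x))

# Fit the same logistic regression on the standardized predictors
mod.std <- glm(I(Author=="A")~., data=dat.std, family="binomial")

# Rank predictors by absolute standardized coefficient (intercept excluded)
ranking <- sort(abs(coef(mod.std)[predictors]), decreasing=TRUE)
ranking
```

On this simulated data the predictors are already on the same scale, so the ranking should agree with the one read off the unstandardized coefficients.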
We can use the variable importance from a random forest predicting the Author outcome similarly:
library(randomForest)
rf <- randomForest(as.factor(Author)~., data=dat)
rf$importance
#       MeanDecreaseGini
# Word1         294.9039
# Word2        4353.2107
# Word3         351.3268
We can identify Word2 as by far the most important variable. This also tells us something else that's interesting: once we know Word2, Word3 isn't much more useful than Word1 in predicting the outcome (and Word1 shouldn't be useful at all, because it wasn't used to compute the outcome).
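MeanDecreaseGini is not the only importance measure the randomForest package provides. As a sketch, fitting with importance=TRUE also records a permutation-based measure (mean decrease in accuracy), which can be read with importance(..., type=1). The name rf.perm and the choice ntree=100 are assumptions made here to keep the example short, not part of the approach above:

```r
library(randomForest)

# Recreate the simulated data so this block runs on its own
set.seed(144)
dat <- data.frame(Word1=rnorm(10000), Word2=rnorm(10000), Word3=rnorm(10000))
dat$Author <- ifelse(runif(10000) < 1/(1+exp(-10*dat$Word2+dat$Word3)), "A", "B")

# importance=TRUE asks randomForest to compute permutation importance as well
rf.perm <- randomForest(as.factor(Author)~., data=dat, importance=TRUE, ntree=100)

# type=1 selects the mean decrease in accuracy when a variable is permuted
perm.imp <- importance(rf.perm, type=1)
perm.imp
```

Comparing this measure with MeanDecreaseGini can be a useful sanity check, since the two do not always rank weak variables in the same order.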