There are actually two questions here, one more advanced than the other.
Q1: I am looking for a method that is similar to corrplot() but can deal with factors.
I originally tried to use chisq.test() and then calculate the p-value and Cramér's V as the correlation, but there are too many columns to work through one pair at a time.
So could anyone tell me if there is a quick way to create a "corrplot" in which each cell contains the value of Cramér's V, while the colour is determined by the p-value? Any other similar kind of plot would also help.
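Regarding Cramér's V, let's say tbl is a two-column data frame of factors.
# chisq.test() needs a contingency table, so cross-tabulate the two columns first
chi2 <- chisq.test(table(tbl), correct = FALSE)
Cramer_V <- sqrt(chi2$statistic / nrow(tbl))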
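For reference, the usual definition of Cramér's V also divides by min(rows, cols) - 1, so dividing by the number of observations alone matches it only when one of the two factors has exactly two levels. A minimal base R sketch of the general form follows; cramers_v is a hypothetical helper name, not a function from any package, and the small vectors are illustrative only.

```r
# Hypothetical helper: Cramér's V for two categorical vectors (no bias correction).
cramers_v <- function(x, y) {
  tab <- table(x, y)                      # contingency table of the two factors
  # Small expected counts trigger an approximation warning; silenced for the sketch.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)                           # number of observations
  k <- min(dim(tab)) - 1                  # min(rows, cols) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Usage with two of the columns from the test data in the question
group     <- c('A', 'A', 'A', 'A', 'A', 'B', 'C')
exam_pass <- c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N')
cramers_v(group, exam_pass)
```

The sketch assumes each vector has at least two observed levels; with a single level the denominator is zero.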
I prepared a test data frame with factors:
df <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  student = c('01', '01', '01', '02', '02', '01', '02'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)
Q2: I would then like to compute a correlation/association matrix on a mixed-type data frame, e.g.:
df <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  student = c('01', '01', '01', '02', '02', '01', '02'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)
df$group <- factor(df$group, levels = c('A', 'B', 'C'), ordered = TRUE)
df$student <- as.integer(df$student)
If you want a genuine correlation plot for factors or mixed-type data, you can also use model.matrix to one-hot encode all non-numeric variables. This is quite different from calculating Cramér's V, because it treats each level of a factor as a separate variable, as many regression models do.

You can then use your favorite correlation-plot library. I personally like ggcorrplot for its ggplot2 compatibility.

Here is an example with your dataset:
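A minimal sketch of the one-hot approach, assuming the all-character test data frame from the question. The encoding and correlation steps use base R only; the plotting call is guarded so the sketch still runs when ggcorrplot is not installed, and the specific plot arguments are illustrative choices rather than anything prescribed above.

```r
df <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  student = c('01', '01', '01', '02', '02', '01', '02'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)

# One-hot encode every column; "~ 0 + ." drops the intercept so the first
# variable keeps an indicator column for each of its levels.
dummies <- model.matrix(~ 0 + ., data = df)

# Ordinary Pearson correlation between the indicator columns
corr_mat <- cor(dummies, use = "pairwise.complete.obs")

# Plot only if ggcorrplot is available
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  print(ggcorrplot::ggcorrplot(corr_mat, type = "lower", lab = TRUE, lab_size = 2))
}
```

Note that every variable after the first loses one level to the reference category, so not every level appears as a row or column in the plot.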
Here's a tidyverse solution:

Note that I'm using the lsr package to calculate Cramér's V with its cramersV function.

The solution from @AntoniosK can be improved, as suggested by @J.D., to also allow for mixed data frames that include both nominal and numerical attributes. Strength of association is calculated for nominal vs nominal with a bias-corrected Cramér's V, numeric vs numeric with Spearman (default) or Pearson correlation, and nominal vs numeric with ANOVA.
Using the method, we can analyse a wide range of mixed variable data-frames easily:
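A dependency-free sketch of the scheme described above, under some loudly stated assumptions: is_nominal, cramers_v, assoc_pair and mixed_assoc_matrix are hypothetical names, Cramér's V is the plain version without the bias correction, and the nominal vs numeric case uses the square root of R² from a one-way ANOVA fit. It returns a plain matrix rather than a long data frame.

```r
# Hypothetical helpers; none of these names come from a package.
is_nominal <- function(x) is.factor(x) || is.character(x) || is.logical(x)

cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))  # no bias correction
}

assoc_pair <- function(x, y, cor_method = "spearman") {
  if (is_nominal(x) && is_nominal(y)) {
    cramers_v(x, y)                                       # nominal vs nominal
  } else if (is.numeric(x) && is.numeric(y)) {
    cor(x, y, method = cor_method, use = "complete.obs")  # numeric vs numeric
  } else {
    # nominal vs numeric: sqrt(R^2) of a one-way ANOVA (linear model) fit
    num <- if (is.numeric(x)) x else y
    nom <- if (is.numeric(x)) y else x
    sqrt(summary(lm(num ~ factor(nom)))$r.squared)
  }
}

mixed_assoc_matrix <- function(df, cor_method = "spearman") {
  vars <- names(df)
  m <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
  for (i in vars) for (j in vars) {
    m[i, j] <- assoc_pair(df[[i]], df[[j]], cor_method)
  }
  m
}

# Usage with the mixed-type data frame from the question
df <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  student = c('01', '01', '01', '02', '02', '01', '02'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)
df$group <- factor(df$group, levels = c('A', 'B', 'C'), ordered = TRUE)
df$student <- as.integer(df$student)
assoc <- mixed_assoc_matrix(df)
```

The sketch assumes every column is either numeric or nominal with at least two observed levels; other types (dates, constant columns, factors with unused levels) would need extra handling.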
This can also be used along with the excellent corrr package, e.g. to draw a correlation network graph:

Regarding Q1, you can use ?pairs.table from the vcd package, if you first convert your data frame with ?structable (from the same package). This will give you a plot matrix of mosaic plots. That isn't quite the same as what corrplot() does, but I suspect it would be a more useful visualization.

There are a variety of other plots that are appropriate for categorical-categorical data, such as sieve plots, association plots, and pressure plots (see my question on Cross Validated here: Alternative to sieve / mosaic plots for contingency tables). If you don't like mosaic plots, you could write your own pairs-based function to put whatever you want in the upper or lower triangle panels (see my question here: Pairs matrix with qq-plots). Just remember that while plot matrices are very useful, they only ever display marginal projections (to understand this more fully, see my answers on CV here: Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?, and here: Alternatives to three dimensional scatter plot).
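As a rough illustration of the mosaic plot matrix for Q1, here is a minimal sketch assuming three of the categorical columns from the question's test data. The cross-tabulation is base R; the structable and pairs calls run only if vcd is installed, and df_cat and tab are hypothetical names.

```r
df_cat <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)

# Cross-tabulate all columns into a multi-way contingency table
tab <- table(df_cat)

if (requireNamespace("vcd", quietly = TRUE)) {
  st <- vcd::structable(tab)  # flat representation used by vcd
  pairs(st)                   # plot matrix of pairwise mosaic plots
}
```

With only seven observations most cells are empty, so the panels will be sparse; the layout is easier to read on a larger table.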
Regarding Q2, you would need to write a custom function.