[I'm new to R...] I have this dataframe:
df1 <- data.frame(c(2,1,2), c(1,2,3,4,5,6), seq(141,170)) #create data.frame
names(df1) <- c('gender', 'age', 'height') #column names
I want df1's summary in a data frame object that looks like this:
count mean std min 25% 50% 75% max
age 30.0000 3.5000 1.7370 1.0000 2.0000 3.5000 5.0000 6.0000
gender 30.0000 1.6667 0.4795 1.0000 1.0000 2.0000 2.0000 2.0000
height 30.0000 155.5000 8.8034 141.0000 148.2500 155.5000 162.7500 170.0000
I've generated this in Python with df1.describe().T. How can I do this in R?
It would be a bonus if my summary data frame also contained "dtype", "null" (number of NULL values), "unique" (number of unique values) and "range" columns, to give a comprehensive set of summary statistics:
count mean std min 25% 50% 75% max null unique range dtype
age 30.0000 3.5000 1.7370 1.0000 2.0000 3.5000 5.0000 6.0000 0 6 5 int64
gender 30.0000 1.6667 0.4795 1.0000 1.0000 2.0000 2.0000 2.0000 0 2 1 int64
height 30.0000 155.5000 8.8034 141.0000 148.2500 155.5000 162.7500 170.0000 0 30 29 int64
The Python code that produces the above result is:
df1.describe().T.join(pd.DataFrame(df1.isnull().sum(), columns=['null']))\
.join(pd.DataFrame.from_dict({i:df1[i].nunique() for i in df1.columns}, orient='index')\
.rename(columns={0:'unique'}))\
.join(pd.DataFrame.from_dict({i:(df1[i].max() - df1[i].min()) for i in df1.columns}, orient='index')\
.rename(columns={0:'range'}))\
.join(pd.DataFrame(df1.dtypes, columns=['dtype']))
Thank you!
I commonly use a little function (adapted from a script found on the net) to do this kind of transformation:
sumstats <- function(x) {
  # Per-column helpers. NAs are always dropped (na.rm=TRUE).
  # mean, sd, var, min and max return "N*N" for non-numeric columns.
  null.k <- function(x) sum(is.na(x))  # number of missing values
  unique.k <- function(x) {if (sum(is.na(x)) > 0) length(unique(x)) - 1  # NA is not counted as a value
    else length(unique(x))}
  range.k <- function(x) max(x, na.rm=TRUE) - min(x, na.rm=TRUE)
  mean.k <- function(x) {if (is.numeric(x)) round(mean(x, na.rm=TRUE), digits=2)
    else "N*N"}
  sd.k <- function(x) {if (is.numeric(x)) round(sd(x, na.rm=TRUE), digits=2)
    else "N*N"}
  # var.k produces the 'var' column shown in the output below
  var.k <- function(x) {if (is.numeric(x)) round(var(x, na.rm=TRUE), digits=2)
    else "N*N"}
  min.k <- function(x) {if (is.numeric(x)) round(min(x, na.rm=TRUE), digits=2)
    else "N*N"}
  q05 <- function(x) quantile(x, probs=.05, na.rm=TRUE)
  q10 <- function(x) quantile(x, probs=.1, na.rm=TRUE)
  q25 <- function(x) quantile(x, probs=.25, na.rm=TRUE)
  q50 <- function(x) quantile(x, probs=.5, na.rm=TRUE)
  q75 <- function(x) quantile(x, probs=.75, na.rm=TRUE)
  q90 <- function(x) quantile(x, probs=.9, na.rm=TRUE)
  q95 <- function(x) quantile(x, probs=.95, na.rm=TRUE)
  max.k <- function(x) {if (is.numeric(x)) round(max(x, na.rm=TRUE), digits=2)
    else "N*N"}
  # One row per column of x, one column per statistic
  sumtable <- cbind(as.matrix(colSums(!is.na(x))), sapply(x, null.k), sapply(x, unique.k),
                    sapply(x, range.k), sapply(x, mean.k), sapply(x, sd.k), sapply(x, var.k),
                    sapply(x, min.k), sapply(x, q05), sapply(x, q10), sapply(x, q25), sapply(x, q50),
                    sapply(x, q75), sapply(x, q90), sapply(x, q95), sapply(x, max.k))
  sumtable <- as.data.frame(sumtable)
  names(sumtable) <- c('count', 'null', 'unique', 'range', 'mean', 'std', 'var', 'min',
                       '5%', '10%', '25%', '50%', '75%', '90%', '95%', 'max')
  return(sumtable)
}
sumstats(df1)
count null unique range mean std var min 5% 10% 25% 50% 75% 90% 95% max
gender 30.00 0.00 2.00 1.00 1.67 0.48 0.23 1.00 1.00 1.00 1.00 2.00 2.00 2.00 2.00 2.00
age 30.00 0.00 6.00 5.00 3.50 1.74 3.02 1.00 1.00 1.00 2.00 3.50 5.00 6.00 6.00 6.00
height 30.00 0.00 30.00 29.00 155.50 8.80 77.50 141.00 142.45 143.90 148.25 155.50 162.75 167.10 168.55 170.00
You can easily adapt it to add more descriptive columns (further quantiles, for example) or to drop the ones you don't need. It returns a data.frame. You might also want to expose the NA handling as a function argument rather than hard-coding na.rm=TRUE in each helper.
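As a rough illustration of making the NA behaviour an argument, here is a minimal base R sketch that also adds the "dtype" column the question asks for. The names describe_df and one_col are hypothetical (not from any package), and the sketch assumes the data frame has at least one numeric column; non-numeric columns are simply dropped.

```r
# Hypothetical helper, not part of any package: a describe()-like summary
# for the numeric columns of a data frame, with NA handling as an argument.
describe_df <- function(df, na.rm = TRUE) {
  stopifnot(is.data.frame(df))
  num <- df[vapply(df, is.numeric, logical(1))]  # keep numeric columns only

  one_col <- function(x) {
    # quantile() raises an error on NAs when na.rm = FALSE, so return NA instead
    if (!na.rm && anyNA(x)) {
      q <- rep(NA_real_, 3)
    } else {
      q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = na.rm, names = FALSE)
    }
    data.frame(
      count  = sum(!is.na(x)),
      mean   = mean(x, na.rm = na.rm),
      std    = sd(x, na.rm = na.rm),
      min    = min(x, na.rm = na.rm),
      `25%`  = q[1],
      `50%`  = q[2],
      `75%`  = q[3],
      max    = max(x, na.rm = na.rm),
      null   = sum(is.na(x)),
      unique = length(unique(x[!is.na(x)])),  # NA is not counted as a value
      range  = max(x, na.rm = na.rm) - min(x, na.rm = na.rm),
      dtype  = class(x)[1],
      check.names = FALSE,       # keep "25%" etc. as column names
      stringsAsFactors = FALSE
    )
  }

  out <- do.call(rbind, lapply(num, one_col))  # one row per column of df
  rownames(out) <- names(num)
  out
}

df1 <- data.frame(gender = c(2, 1, 2), age = c(1, 2, 3, 4, 5, 6), height = seq(141, 170))
res <- describe_df(df1)
print(res)
```

With na.rm = FALSE, any column containing NAs reports NA for its statistics instead of silently ignoring the missing values, which may or may not be what you want.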
Hope it helps.
You can do this quite easily and readably with the tidyr and dplyr libraries:
library("tidyr")
library("dplyr")
df1 <- data.frame(c(2,1,2), c(1,2,3,4,5,6), seq(141,170)) #create data.frame
names(df1) <- c('gender', 'age', 'height') #column names
df2 <- gather(df1, "attributes", "value")  # long format: one row per (column name, value) pair
df2 %>%
  group_by(attributes) %>%
  summarise(count = n(), mean = mean(value), med = median(value),
            sd = sd(value), min = min(value), max = max(value))
# A tibble: 3 x 7
# attributes count mean med sd min max
# <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 age 30 3.500000 3.5 1.7370208 1 6
# 2 gender 30 1.666667 2.0 0.4794633 1 2
# 3 height 30 155.500000 155.5 8.8034084 141 170
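The same idea can be pushed a bit closer to the describe() layout from the question. The following is only a sketch under a few assumptions: it uses pivot_longer(), which supersedes gather() in tidyr 1.0.0 and later, and the column names ("25%", "null", "unique", "range") are chosen to mirror the pandas output. Because all values are stacked into one column, the per-column "dtype" is lost in this approach.

```r
library(tidyr)
library(dplyr)

df1 <- data.frame(gender = c(2, 1, 2), age = c(1, 2, 3, 4, 5, 6), height = seq(141, 170))

# Long format: one row per (column name, value) pair
df_long <- pivot_longer(df1, cols = everything(),
                        names_to = "attributes", values_to = "value")

res <- df_long %>%
  group_by(attributes) %>%
  summarise(
    count  = sum(!is.na(value)),                 # non-missing values
    mean   = mean(value, na.rm = TRUE),
    std    = sd(value, na.rm = TRUE),
    min    = min(value, na.rm = TRUE),
    `25%`  = quantile(value, 0.25, na.rm = TRUE, names = FALSE),
    `50%`  = quantile(value, 0.50, na.rm = TRUE, names = FALSE),
    `75%`  = quantile(value, 0.75, na.rm = TRUE, names = FALSE),
    max    = max(value, na.rm = TRUE),
    null   = sum(is.na(value)),                  # missing values
    unique = n_distinct(value, na.rm = TRUE),
    range  = max(value, na.rm = TRUE) - min(value, na.rm = TRUE),
    .groups = "drop"
  )

print(res)
```

The result is a tibble with one row per original column; as.data.frame(res) converts it to a plain data.frame if that is what you need.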