I have a huge data set with genotypic information from different populations. I would like to sort the data by population, but I don't know how.
I would like to sort by "pedigree_dhl". I was using the following code, but I kept getting error messages.
newdata <- project[pedigree_dhl == CCB133$*1, ]
Another problem is that 'pedigree_dhl' contains the full names of the individual genotypes. Only the first 7 characters in the column 'pedigree_dhl' are the population name. In this example: CCB133. How can I tell R that I want to extract the data for all rows that contain CCB133?
Allele1 Allele2 SNP_name gs_entry pedigree_dhl
1 T T ZM011407_0151 656 CCB133$*1
2 T T ZM009374_0354 656 CCB133$*1
3 C C ZM003499_0591 656 CCB133$*1
4 A A ZM003898_0594 656 CCB133$*1
5 C C ZM004887_0313 656 CCB133$*1
6 G G ZM000583_1096 656 CCB133$*1
You may want to consider grep, as in the answer to "Using regexp to select rows in R dataframe". Adapted to your data:
df <- read.table(text=" Allele1 Allele2 SNP_name gs_entry pedigree_dhl
1 T T ZM011407_0151 656 CCB133$*1
2 T T ZM009374_0354 656 CCB133$*1
3 C C ZM003499_0591 656 CCB133$*1
4 A A ZM003898_0594 656 CCB133$*1
5 C C ZM004887_0313 656 CCB133$*1
6 G G ZM000583_1096 656 CCB133$*1", header=TRUE)
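# Put into df1 all rows where pedigree_dhl starts with CCB133$.
# "$" is a regex metacharacter (end of string), so it must be escaped;
# "^" anchors the match at the start of the value.
p1 <- '^CCB133\\$'
df1 <- subset(df, grepl(p1, pedigree_dhl))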
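Because `$` and `*` are regex metacharacters, a literal comparison may be less error-prone than a pattern. The following is only a sketch of three base R alternatives; the data frame `toy` and the `CCB200$` rows are made up for illustration, and it assumes `pedigree_dhl` is stored as character (`startsWith` needs R 3.3.0 or later).

```r
# Toy data; the CCB200$ rows are hypothetical, added only to show filtering.
toy <- data.frame(
  SNP_name     = c("ZM011407_0151", "ZM009374_0354", "ZM003499_0591", "ZM003898_0594"),
  pedigree_dhl = c("CCB133$*1", "CCB133$*2", "CCB200$*1", "CCB200$*2"),
  stringsAsFactors = FALSE
)

# Option 1: literal prefix test (no regex involved).
sub1 <- toy[startsWith(toy$pedigree_dhl, "CCB133$"), ]

# Option 2: grepl with fixed = TRUE treats the pattern as a plain string.
# Note that it matches anywhere in the value, not only at the start.
sub2 <- toy[grepl("CCB133$", toy$pedigree_dhl, fixed = TRUE), ]

# Option 3: compare the first seven characters directly.
sub3 <- toy[substr(toy$pedigree_dhl, 1, 7) == "CCB133$", ]
```

On this toy data all three options select the same two rows; on real data, option 2 would also pick up values that contain the string somewhere other than the start.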
Your question, however, suggests that you may instead want to extract the seven-character population name, or simply sort the rows by pedigree name, in which case it may be easier to keep all rows together in one sorted data frame. The three operations (subsetting, extracting a new column, and sorting) can be carried out independently of one another.
# If you want to create a new column based
# on the first seven characters of SNP_name (or any other variable)
df$SNP_7 <- substr(df$SNP_name, start=1, stop=7)
# If you want to order by pedigree_dhl
# then you don't need to select out the rows into a new dataframe
df <- df[order(df$pedigree_dhl), ]
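If the aim is one data set per population, the extracted name can also be used to group the rows. This is a rough sketch in base R, not a definitive recipe: the data frame `toy`, the `CCB200$` rows, and the column name `population` are hypothetical, and it assumes the population name is everything before the first `$`.

```r
# Toy data in unsorted order; the CCB200$ rows are hypothetical.
toy <- data.frame(
  SNP_name     = c("ZM011407_0151", "ZM009374_0354", "ZM003499_0591", "ZM003898_0594"),
  pedigree_dhl = c("CCB200$*1", "CCB133$*2", "CCB200$*2", "CCB133$*1"),
  stringsAsFactors = FALSE
)

# Assumed rule: population = everything before the first "$".
toy$population <- sub("\\$.*$", "", toy$pedigree_dhl)

# Sort by population, then by full pedigree name within each population.
toy_sorted <- toy[order(toy$population, toy$pedigree_dhl), ]

# One data frame per population, collected in a named list.
by_pop <- split(toy, toy$population)
ccb133 <- by_pop[["CCB133"]]
```

Keeping the populations in a named list means a single population can be pulled out by name without creating a separate variable for each one.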
All this may be obvious; I add it simply for completeness.