How to show frequencies in discrete categories wit

2019-05-15 12:54发布

站内文章 / 前沿技术

21 0

贪生不怕死

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm trying to sort out a bunch of data that I have about dinosaurs and their age ranges. So far, my data consists of a column of names, and then two columns of maximum and minimum dates in millions of years in the past, as you can see here:

GENUS           ma_max  ma_min  ma_mid    
Abydosaurus     109     94.3    101.65    
Achelousaurus   84.9    70.6    77.75    
Acheroraptor    70.6    66.043  68.3215

Geological time is split into different ages (such as the Jurassic and Cretaceous) and these are also subdivided into stage. These stages have specific age ranges and I have made a dataframe to display these:

Stage          ma_max ma_min ma_mid
Hettangian      201.6  197.0 199.30
Sinemurian      197.0  190.0 193.50
Pliensbachian   190.0  183.0 186.50
Toarcian        183.0  176.0 179.50
Aalenian        176.0  172.0 174.00
Bajocian        172.0  168.0 170.00
Bathonian       168.0  165.0 166.50
Callovian       165.0  161.0 163.00
Oxfordian       161.0  156.0 158.50
Kimmeridgian    156.0  151.0 153.50
Tithonian       151.0  145.5 148.25
Berriasian      145.5  140.0 142.75
Valanginian     140.0  136.0 138.00
Hauterivian     136.0  130.0 133.00
Barremian       130.0  125.0 127.50
Aptian          125.0  112.0 118.50
Albian          112.0   99.6 105.80
Cenomanian      99.6   93.5  96.55
Turonian        93.5   89.3  91.40
Coniacian       89.3   85.8  87.55
Santonian       85.8   83.5  84.65
Campanian       83.5   70.6  77.05
Maastrichtian   70.6   66.5  68.05

I'm trying to find out how many genus' are in each stage. Problem is the range - for example, a genus can have a range that spans 3 or more stages, and I want each of those stages to record the presence of a genus. Is there any simple way to do this? I thought about using 'shingle' from the lattice packages as suggested in a similar discussion on here, but I'm very new to R and not sure if it can be implemented in a way where data has range.

回答1:

Assuming your data frames are called genus and stage, first create a list that contains, for each Stage, the names of the genera that lived during that Stage. Then we'll add that to the stage data frame and also add another column that counts the number of genera living during each Stage.

In the code below, sapply takes each value of Stage in turn and tests what values of GENUS fall within that Stage's time range by comparing the Stage's ma_max and ma_min with the ma_max and ma_min for each GENUS.

# List of genera that lived during each Stage
stages.genus = sapply(stage$Stage, function(x){
  genus$GENUS[which((stage$ma_max[stage$Stage==x] <= genus$ma_max & 
                       stage$ma_max[stage$Stage==x] >= genus$ma_min) |
                      (stage$ma_min[stage$Stage==x] >= genus$ma_min & 
                         stage$ma_min[stage$Stage==x] <= genus$ma_max))]
})

For each element of stages.genus, paste together all values of GENUS that apply to that Stage, separated by a comma, giving us vector containing the genera that go with each value of Stage. Assign that vector as a new column of stage that we'll call genera.

# Add list of genera by stage to the stage data frame
stage$genera = lapply(stages.genus, paste, sep=", ")

To get a count of the number of genera in each Stage, just count the number of genera in each element of stages.genus and assign that to a new column of stage that we'll call Ngenera:

# Add count of genera for each Stage to the stage data frame
stage$Ngenera = lapply(stages.genus, length)

And here's the result:

> stage

           Stage ma_max ma_min ma_mid                      genera Ngenera
1     Hettangian  201.6  197.0 199.30                                   0
2     Sinemurian  197.0  190.0 193.50                                   0
...
16        Aptian  125.0  112.0 118.50                                   0
17        Albian  112.0   99.6 105.80                 Abydosaurus       1
18    Cenomanian   99.6   93.5  96.55                 Abydosaurus       1
19      Turonian   93.5   89.3  91.40                                   0
20     Coniacian   89.3   85.8  87.55                                   0
21     Santonian   85.8   83.5  84.65               Achelousaurus       1
22     Campanian   83.5   70.6  77.05 Achelousaurus, Acheroraptor       2
23 Maastrichtian   70.6   66.5  68.05 Achelousaurus, Acheroraptor       2

An additional option is to create a column in stage for each GENUS and set the value to 1 if the GENUS lived during that stage or zero otherwise:

stage[, genus$GENUS] = lapply(genus$GENUS, function(x) {
  ifelse(grepl(x, stages.genus), 1, 0)
})

Here are the additional columns we just added:

> stage[ , c(1,7:9)]   # Just show the Stage plus the three new GENUS columns

           Stage Abydosaurus Achelousaurus Acheroraptor
1     Hettangian           0             0            0
2     Sinemurian           0             0            0
...
16        Aptian           0             0            0
17        Albian           1             0            0
18    Cenomanian           1             0            0
19      Turonian           0             0            0
20     Coniacian           0             0            0
21     Santonian           0             1            0
22     Campanian           0             1            1
23 Maastrichtian           0             1            1

The last step will also set you up for a nice visualization of genera by stage. For example:

library(reshape2)
library(ggplot2)

# Melt data into long format
stage.m = melt(stage[,c(1:4,7:9)], id.var=1:4)

# Tile plot where height of each Stage is proportional to how long it lasted
ggplot(stage.m, aes(variable, ma_mid, fill=factor(value))) +
  geom_tile(aes(height=ma_max - ma_min), colour="grey20", lwd=0.2) +
  scale_fill_manual(values=c("white","blue")) +
  scale_y_continuous(breaks=stage$ma_mid, labels=stage$Stage) +
  xlab("Genus") + ylab("Stage") +
  theme_bw(base_size=15) +
  guides(fill=FALSE)

The previous code can also be modified to use time ranges from both the stage and genus data frames if you want the blue coloring to cover only the time-range when each GENUS lived, rather than the full range of each Stage in which they lived.

回答2:

I would recommend sqldf package.

library(sqldf)

Lets assume that your GENUS data is located in genus data frame and Stage located in stage data frame.

res <- sqldf("select count(*) as countDinos , s.Stage, GROUP_CONCAT(g.GENUS) as names from genus g,stage s where (g.ma_max>=s.ma_min AND g.ma_max<=s.ma_max)  OR  (g.ma_min>=s.ma_min AND g.ma_min<=s.ma_max) OR (g.ma_max>s.ma_max AND g.ma_min<s.ma_min)   group by s.Stage order by s.ma_mid DESC  ")

Should give you response like this:

countDinos  Stage         names
   1        Albian                         Abydosaurus   
   1        Cenomanian                     Abydosaurus   
   1        Santonian                      Achelousaurus 
   2        Campanian       Achelousaurus ,Acheroraptor  
   2        Maastrichtian   Achelousaurus ,Acheroraptor

回答3:

You could also consider using the function foverlaps from the latest data.table package.

# the setup is straight forward
library(data.table)               # need version 1.9.5+ **
                                  # can only download the latest version from  
                                  # https://github.com/Rdatatable/data.table  
                                  # at the time of posting this

setDT(genus); setDT(stage)        # This line sets your data frames to data tables
setkey(genus, ma_min, ma_max)     # This keys the start and end time of your time frame
setkey(stage, ma_min, ma_max)     # This keys the start and end time of your time frame

# the opration
matches <- foverlaps(genus, stage, type="any", nomatch=0L)
matches
#            Stage ma_max ma_min ma_mid         GENUS i.ma_max i.ma_min i.ma_mid
# 1: Maastrichtian   70.6   66.5  68.05  Acheroraptor     70.6   66.043  68.3215
# 2:     Campanian   83.5   70.6  77.05  Acheroraptor     70.6   66.043  68.3215
# 3: Maastrichtian   70.6   66.5  68.05 Achelousaurus     84.9   70.600  77.7500
# 4:     Campanian   83.5   70.6  77.05 Achelousaurus     84.9   70.600  77.7500
# 5:     Santonian   85.8   83.5  84.65 Achelousaurus     84.9   70.600  77.7500
# 6:    Cenomanian   99.6   93.5  96.55   Abydosaurus    109.0   94.300 101.6500
# 7:        Albian  112.0   99.6 105.80   Abydosaurus    109.0   94.300 101.6500

# This line below gives the frequency count (see column N)
matches[, N := length(GENUS), by=Stage][]

#            Stage ma_max ma_min ma_mid         GENUS i.ma_max i.ma_min i.ma_mid N
# 1: Maastrichtian   70.6   66.5  68.05  Acheroraptor     70.6   66.043  68.3215 2
# 2:     Campanian   83.5   70.6  77.05  Acheroraptor     70.6   66.043  68.3215 2
# 3: Maastrichtian   70.6   66.5  68.05 Achelousaurus     84.9   70.600  77.7500 2
# 4:     Campanian   83.5   70.6  77.05 Achelousaurus     84.9   70.600  77.7500 2
# 5:     Santonian   85.8   83.5  84.65 Achelousaurus     84.9   70.600  77.7500 1
# 6:    Cenomanian   99.6   93.5  96.55   Abydosaurus    109.0   94.300 101.6500 1
# 7:        Albian  112.0   99.6 105.80   Abydosaurus    109.0   94.300 101.6500 1

# Of course you could also chain the two lines of code into one:
foverlaps(genus, stage, type="any", nomatch=0L)[, N := length(GENUS), by=Stage][]

# If you prefer simplify the output by removing a few columns (6th to 8th), you could
foverlaps(genus, stage, type="any", nomatch=0L)[, N := length(GENUS), by=Stage][,6:8 := NULL][]

#            Stage ma_max ma_min ma_mid         GENUS N
# 1: Maastrichtian   70.6   66.5  68.05  Acheroraptor 2
# 2:     Campanian   83.5   70.6  77.05  Acheroraptor 2
# 3: Maastrichtian   70.6   66.5  68.05 Achelousaurus 2
# 4:     Campanian   83.5   70.6  77.05 Achelousaurus 2
# 5:     Santonian   85.8   83.5  84.65 Achelousaurus 1
# 6:    Cenomanian   99.6   93.5  96.55   Abydosaurus 1
# 7:        Albian  112.0   99.6 105.80   Abydosaurus 1

Recommend to check this out ?foverlaps

回答4:

How about looking for max and min ranges of exceedance? Here's a small dataset which I hope mimics your real data well enough-- the "epochs" are contiguous and my two "dinosaurs" are of random time-range.

 dino
            dmax dmin
[1,] 1 0.5500000 0.11
[2,] 2 0.3721239 0.05
> epoch
      [,1] [,2]
 [1,]  0.1  0.0
 [2,]  0.2  0.1
 [3,]  0.3  0.2
 [4,]  0.4  0.3
 [5,]  0.5  0.4
 [6,]  0.6  0.5
 [7,]  0.7  0.6
 [8,]  0.8  0.7
 [9,]  0.9  0.8
[10,]  1.0  0.9
[11,]  1.1  1.0
> max(which(epoch[,2]<dino[1,3])):min(which(epoch[,1]>dino[1,2]))
[1] 2 3 4 5 6
> max(which(epoch[,2]<dino[2,3])):min(which(epoch[,1]>dino[2,2]))
[1] 1 2 3 4

So those last two lines identify the row numbers in the epoch matrix for which the selected dinosaur (row from the dino matrix) existed. If you loop over all rows, i.e. for (j in 1:nrow(GENUS_matrix) ) {do that "max(which...)) stuff}

回答5:

You can use "findInterval" to determine the stage:

> stages <- read.table(text = "Stage          ma_max ma_min ma_mid
+ Hettangian      201.6  197.0 199.30
+ Sinemurian      197.0  190.0 193.50
+ Pliensbachian   190.0  183.0 186.50
+ Toarcian        183.0  176.0 179.50
+ Aalenian        176.0  172.0 174.00
+ Bajocian        172.0  168.0 170.00
+ Bathonian       168.0  165.0 166.50
+ Callovian       165.0  161.0 163.00
+ Oxfordian       161.0  156.0 158.50
+ Kimmeridgian    156.0  151.0 153.50
+ Tithonian       151.0  145.5 148.25
+ Berriasian      145.5  140.0 142.75
+ Valanginian     140.0  136.0 138.00
+ Hauterivian     136.0  130.0 133.00
+ Barremian       130.0  125.0 127.50
+ Aptian          125.0  112.0 118.50
+ Albian          112.0   99.6 105.80
+ Cenomanian      99.6   93.5  96.55
+ Turonian        93.5   89.3  91.40
+ Coniacian       89.3   85.8  87.55
+ Santonian       85.8   83.5  84.65
+ Campanian       83.5   70.6  77.05
+ Maastrichtian   70.6   66.5  68.05", as.is = TRUE, header = TRUE)
> 
> myData <- read.table(text = "GENUS           ma_max  ma_min  ma_mid    
+ Abydosaurus     109     94.3    101.65    
+ Achelousaurus   84.9    70.6    77.75    
+ Acheroraptor    70.6    66.043  68.3215    ", as.is = TRUE, header = TRUE)
> 
> # flip around to create the intervals
> stages <- stages[rev(seq(nrow(stages))), ]
> interval <- c(stages$ma_min, tail(stages$ma_max, 1))  # create interval
> 
> # for each item get the start/end stages
> start <- findInterval(myData$ma_max,interval, all.inside = TRUE)
> end <- findInterval(myData$ma_min, interval, all.inside = TRUE)
> 
> myData$stages <- apply(cbind(start, end), 1L, function(.row){
+     paste(stages$Stage[.row[1L]:.row[2L]], collapse = ', ')
+ })
> myData
          GENUS ma_max ma_min   ma_mid                   stages
1   Abydosaurus  109.0 94.300 101.6500       Albian, Cenomanian
2 Achelousaurus   84.9 70.600  77.7500     Santonian, Campanian
3  Acheroraptor   70.6 66.043  68.3215 Campanian, Maastrichtian