可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I'm trying to sort out a bunch of data that I have about dinosaurs and their age ranges. So far, my data consists of a column of names, and then two columns of maximum and minimum dates in millions of years in the past, as you can see here:
GENUS ma_max ma_min ma_mid
Abydosaurus 109 94.3 101.65
Achelousaurus 84.9 70.6 77.75
Acheroraptor 70.6 66.043 68.3215
Geological time is split into different ages (such as the Jurassic and Cretaceous) and these are also subdivided into stage. These stages have specific age ranges and I have made a dataframe to display these:
Stage ma_max ma_min ma_mid
Hettangian 201.6 197.0 199.30
Sinemurian 197.0 190.0 193.50
Pliensbachian 190.0 183.0 186.50
Toarcian 183.0 176.0 179.50
Aalenian 176.0 172.0 174.00
Bajocian 172.0 168.0 170.00
Bathonian 168.0 165.0 166.50
Callovian 165.0 161.0 163.00
Oxfordian 161.0 156.0 158.50
Kimmeridgian 156.0 151.0 153.50
Tithonian 151.0 145.5 148.25
Berriasian 145.5 140.0 142.75
Valanginian 140.0 136.0 138.00
Hauterivian 136.0 130.0 133.00
Barremian 130.0 125.0 127.50
Aptian 125.0 112.0 118.50
Albian 112.0 99.6 105.80
Cenomanian 99.6 93.5 96.55
Turonian 93.5 89.3 91.40
Coniacian 89.3 85.8 87.55
Santonian 85.8 83.5 84.65
Campanian 83.5 70.6 77.05
Maastrichtian 70.6 66.5 68.05
I'm trying to find out how many genus' are in each stage. Problem is the range - for example, a genus can have a range that spans 3 or more stages, and I want each of those stages to record the presence of a genus. Is there any simple way to do this? I thought about using 'shingle' from the lattice packages as suggested in a similar discussion on here, but I'm very new to R and not sure if it can be implemented in a way where data has range.
回答1:
Assuming your data frames are called genus
and stage
, first create a list that contains, for each Stage
, the names of the genera that lived during that Stage
. Then we'll add that to the stage
data frame and also add another column that counts the number of genera living during each Stage
.
In the code below, sapply
takes each value of Stage
in turn and tests what values of GENUS
fall within that Stage
's time range by comparing the Stage
's ma_max
and ma_min
with the ma_max
and ma_min
for each GENUS
.
# List of genera that lived during each Stage
stages.genus = sapply(stage$Stage, function(x){
genus$GENUS[which((stage$ma_max[stage$Stage==x] <= genus$ma_max &
stage$ma_max[stage$Stage==x] >= genus$ma_min) |
(stage$ma_min[stage$Stage==x] >= genus$ma_min &
stage$ma_min[stage$Stage==x] <= genus$ma_max))]
})
For each element of stages.genus
, paste together all values of GENUS
that apply to that Stage
, separated by a comma, giving us vector containing the genera that go with each value of Stage
. Assign that vector as a new column of stage
that we'll call genera
.
# Add list of genera by stage to the stage data frame
stage$genera = lapply(stages.genus, paste, sep=", ")
To get a count of the number of genera in each Stage
, just count the number of genera in each element of stages.genus
and assign that to a new column of stage
that we'll call Ngenera
:
# Add count of genera for each Stage to the stage data frame
stage$Ngenera = lapply(stages.genus, length)
And here's the result:
> stage
Stage ma_max ma_min ma_mid genera Ngenera
1 Hettangian 201.6 197.0 199.30 0
2 Sinemurian 197.0 190.0 193.50 0
...
16 Aptian 125.0 112.0 118.50 0
17 Albian 112.0 99.6 105.80 Abydosaurus 1
18 Cenomanian 99.6 93.5 96.55 Abydosaurus 1
19 Turonian 93.5 89.3 91.40 0
20 Coniacian 89.3 85.8 87.55 0
21 Santonian 85.8 83.5 84.65 Achelousaurus 1
22 Campanian 83.5 70.6 77.05 Achelousaurus, Acheroraptor 2
23 Maastrichtian 70.6 66.5 68.05 Achelousaurus, Acheroraptor 2
An additional option is to create a column in stage
for each GENUS
and set the value to 1 if the GENUS
lived during that stage or zero otherwise:
stage[, genus$GENUS] = lapply(genus$GENUS, function(x) {
ifelse(grepl(x, stages.genus), 1, 0)
})
Here are the additional columns we just added:
> stage[ , c(1,7:9)] # Just show the Stage plus the three new GENUS columns
Stage Abydosaurus Achelousaurus Acheroraptor
1 Hettangian 0 0 0
2 Sinemurian 0 0 0
...
16 Aptian 0 0 0
17 Albian 1 0 0
18 Cenomanian 1 0 0
19 Turonian 0 0 0
20 Coniacian 0 0 0
21 Santonian 0 1 0
22 Campanian 0 1 1
23 Maastrichtian 0 1 1
The last step will also set you up for a nice visualization of genera by stage. For example:
library(reshape2)
library(ggplot2)
# Melt data into long format
stage.m = melt(stage[,c(1:4,7:9)], id.var=1:4)
# Tile plot where height of each Stage is proportional to how long it lasted
ggplot(stage.m, aes(variable, ma_mid, fill=factor(value))) +
geom_tile(aes(height=ma_max - ma_min), colour="grey20", lwd=0.2) +
scale_fill_manual(values=c("white","blue")) +
scale_y_continuous(breaks=stage$ma_mid, labels=stage$Stage) +
xlab("Genus") + ylab("Stage") +
theme_bw(base_size=15) +
guides(fill=FALSE)
The previous code can also be modified to use time ranges from both the stage
and genus
data frames if you want the blue coloring to cover only the time-range when each GENUS
lived, rather than the full range of each Stage
in which they lived.
回答2:
I would recommend sqldf package.
library(sqldf)
Lets assume that your GENUS data is located in genus data frame and Stage located in stage data frame.
res <- sqldf("select count(*) as countDinos , s.Stage, GROUP_CONCAT(g.GENUS) as names from genus g,stage s where (g.ma_max>=s.ma_min AND g.ma_max<=s.ma_max) OR (g.ma_min>=s.ma_min AND g.ma_min<=s.ma_max) OR (g.ma_max>s.ma_max AND g.ma_min<s.ma_min) group by s.Stage order by s.ma_mid DESC ")
Should give you response like this:
countDinos Stage names
1 Albian Abydosaurus
1 Cenomanian Abydosaurus
1 Santonian Achelousaurus
2 Campanian Achelousaurus ,Acheroraptor
2 Maastrichtian Achelousaurus ,Acheroraptor
回答3:
You could also consider using the function foverlaps
from the latest data.table
package.
# the setup is straight forward
library(data.table) # need version 1.9.5+ **
# can only download the latest version from
# https://github.com/Rdatatable/data.table
# at the time of posting this
setDT(genus); setDT(stage) # This line sets your data frames to data tables
setkey(genus, ma_min, ma_max) # This keys the start and end time of your time frame
setkey(stage, ma_min, ma_max) # This keys the start and end time of your time frame
# the opration
matches <- foverlaps(genus, stage, type="any", nomatch=0L)
matches
# Stage ma_max ma_min ma_mid GENUS i.ma_max i.ma_min i.ma_mid
# 1: Maastrichtian 70.6 66.5 68.05 Acheroraptor 70.6 66.043 68.3215
# 2: Campanian 83.5 70.6 77.05 Acheroraptor 70.6 66.043 68.3215
# 3: Maastrichtian 70.6 66.5 68.05 Achelousaurus 84.9 70.600 77.7500
# 4: Campanian 83.5 70.6 77.05 Achelousaurus 84.9 70.600 77.7500
# 5: Santonian 85.8 83.5 84.65 Achelousaurus 84.9 70.600 77.7500
# 6: Cenomanian 99.6 93.5 96.55 Abydosaurus 109.0 94.300 101.6500
# 7: Albian 112.0 99.6 105.80 Abydosaurus 109.0 94.300 101.6500
# This line below gives the frequency count (see column N)
matches[, N := length(GENUS), by=Stage][]
# Stage ma_max ma_min ma_mid GENUS i.ma_max i.ma_min i.ma_mid N
# 1: Maastrichtian 70.6 66.5 68.05 Acheroraptor 70.6 66.043 68.3215 2
# 2: Campanian 83.5 70.6 77.05 Acheroraptor 70.6 66.043 68.3215 2
# 3: Maastrichtian 70.6 66.5 68.05 Achelousaurus 84.9 70.600 77.7500 2
# 4: Campanian 83.5 70.6 77.05 Achelousaurus 84.9 70.600 77.7500 2
# 5: Santonian 85.8 83.5 84.65 Achelousaurus 84.9 70.600 77.7500 1
# 6: Cenomanian 99.6 93.5 96.55 Abydosaurus 109.0 94.300 101.6500 1
# 7: Albian 112.0 99.6 105.80 Abydosaurus 109.0 94.300 101.6500 1
# Of course you could also chain the two lines of code into one:
foverlaps(genus, stage, type="any", nomatch=0L)[, N := length(GENUS), by=Stage][]
# If you prefer simplify the output by removing a few columns (6th to 8th), you could
foverlaps(genus, stage, type="any", nomatch=0L)[, N := length(GENUS), by=Stage][,6:8 := NULL][]
# Stage ma_max ma_min ma_mid GENUS N
# 1: Maastrichtian 70.6 66.5 68.05 Acheroraptor 2
# 2: Campanian 83.5 70.6 77.05 Acheroraptor 2
# 3: Maastrichtian 70.6 66.5 68.05 Achelousaurus 2
# 4: Campanian 83.5 70.6 77.05 Achelousaurus 2
# 5: Santonian 85.8 83.5 84.65 Achelousaurus 1
# 6: Cenomanian 99.6 93.5 96.55 Abydosaurus 1
# 7: Albian 112.0 99.6 105.80 Abydosaurus 1
Recommend to check this out ?foverlaps
回答4:
How about looking for max and min ranges of exceedance? Here's a small dataset which I hope mimics your real data well enough-- the "epochs" are contiguous and my two "dinosaurs" are of random time-range.
dino
dmax dmin
[1,] 1 0.5500000 0.11
[2,] 2 0.3721239 0.05
> epoch
[,1] [,2]
[1,] 0.1 0.0
[2,] 0.2 0.1
[3,] 0.3 0.2
[4,] 0.4 0.3
[5,] 0.5 0.4
[6,] 0.6 0.5
[7,] 0.7 0.6
[8,] 0.8 0.7
[9,] 0.9 0.8
[10,] 1.0 0.9
[11,] 1.1 1.0
> max(which(epoch[,2]<dino[1,3])):min(which(epoch[,1]>dino[1,2]))
[1] 2 3 4 5 6
> max(which(epoch[,2]<dino[2,3])):min(which(epoch[,1]>dino[2,2]))
[1] 1 2 3 4
So those last two lines identify the row numbers in the epoch
matrix for which the selected dinosaur (row from the dino matrix) existed. If you loop over all rows, i.e. for (j in 1:nrow(GENUS_matrix) ) {do that "max(which...)) stuff}
回答5:
You can use "findInterval" to determine the stage:
> stages <- read.table(text = "Stage ma_max ma_min ma_mid
+ Hettangian 201.6 197.0 199.30
+ Sinemurian 197.0 190.0 193.50
+ Pliensbachian 190.0 183.0 186.50
+ Toarcian 183.0 176.0 179.50
+ Aalenian 176.0 172.0 174.00
+ Bajocian 172.0 168.0 170.00
+ Bathonian 168.0 165.0 166.50
+ Callovian 165.0 161.0 163.00
+ Oxfordian 161.0 156.0 158.50
+ Kimmeridgian 156.0 151.0 153.50
+ Tithonian 151.0 145.5 148.25
+ Berriasian 145.5 140.0 142.75
+ Valanginian 140.0 136.0 138.00
+ Hauterivian 136.0 130.0 133.00
+ Barremian 130.0 125.0 127.50
+ Aptian 125.0 112.0 118.50
+ Albian 112.0 99.6 105.80
+ Cenomanian 99.6 93.5 96.55
+ Turonian 93.5 89.3 91.40
+ Coniacian 89.3 85.8 87.55
+ Santonian 85.8 83.5 84.65
+ Campanian 83.5 70.6 77.05
+ Maastrichtian 70.6 66.5 68.05", as.is = TRUE, header = TRUE)
>
> myData <- read.table(text = "GENUS ma_max ma_min ma_mid
+ Abydosaurus 109 94.3 101.65
+ Achelousaurus 84.9 70.6 77.75
+ Acheroraptor 70.6 66.043 68.3215 ", as.is = TRUE, header = TRUE)
>
> # flip around to create the intervals
> stages <- stages[rev(seq(nrow(stages))), ]
> interval <- c(stages$ma_min, tail(stages$ma_max, 1)) # create interval
>
> # for each item get the start/end stages
> start <- findInterval(myData$ma_max,interval, all.inside = TRUE)
> end <- findInterval(myData$ma_min, interval, all.inside = TRUE)
>
> myData$stages <- apply(cbind(start, end), 1L, function(.row){
+ paste(stages$Stage[.row[1L]:.row[2L]], collapse = ', ')
+ })
> myData
GENUS ma_max ma_min ma_mid stages
1 Abydosaurus 109.0 94.300 101.6500 Albian, Cenomanian
2 Achelousaurus 84.9 70.600 77.7500 Santonian, Campanian
3 Acheroraptor 70.6 66.043 68.3215 Campanian, Maastrichtian