Selecting rows where a column has a string like 'hsa..' (partial string match)

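I have a 371MB text file containing microRNA data. Essentially, I would like to select only those rows that have information about human microRNAs.

I have read in the file using read.table. Usually, I'd accomplish what I want with sqldf, if it had a "like" syntax (select * from <> where miRNA like 'hsa'). Unfortunately, sqldf does not support that syntax.

How can I do this in R? I have looked around Stack Overflow and do not see an example of how I can do a partial string match. I even installed the stringr package, but it does not quite have what I need.

What I would like to do is something like this, where all rows with hsa-* are selected.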

selectedRows <- conservedData[, conservedData$miRNA %like% "hsa-"]

This of course is not correct syntax.

Can someone please help me with this? Thanks a lot for reading.

Asda

Answer 1:

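I notice that you mention a function %like% in your current approach. I don't know if that's a reference to the %like% from "data.table", but if it is, you can definitely use it as follows.

Note that the object does not have to be a data.table (but also remember that subsetting approaches for data.frames and data.tables are not identical):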

library(data.table)
mtcars[rownames(mtcars) %like% "Merc", ]
iris[iris$Species %like% "osa", ]

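If that is what you had, then perhaps you had just mixed up the row and column positions for subsetting data.

If you don't want to load a package, you can try using grep() to search for the string you're matching. Here's an example with the mtcars dataset, where we are matching all rows whose row names include "Merc":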
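To illustrate the row/column mix-up, here is a minimal sketch in base R. The data frame below is a made-up stand-in for the asker's conservedData (the gene and score columns are hypothetical). A logical vector placed before the comma selects rows; the same vector placed after the comma is treated as a column selector.

```r
# Hypothetical stand-in for the asker's data
conservedData <- data.frame(
  miRNA = c("hsa-let-7a", "mmu-miR-1", "hsa-miR-21"),
  gene  = c("A", "B", "C"),
  score = c(0.9, 0.4, 0.7),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row: does the ID contain the literal text "hsa-"?
isHuman <- grepl("hsa-", conservedData$miRNA, fixed = TRUE)

# Before the comma: keeps the matching rows (what the question wants)
selectedRows <- conservedData[isHuman, ]

# After the comma: the same vector now picks columns 1 and 3 and keeps
# every row, which is the mistake in the question's attempt
wrongPosition <- conservedData[, isHuman]
```

Here selectedRows holds the two hsa- rows with all columns, while wrongPosition holds all three rows but only the miRNA and score columns.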

mtcars[grep("Merc", rownames(mtcars)), ]
#              mpg cyl  disp  hp drat   wt qsec vs am gear carb
# Merc 240D   24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
# Merc 230    22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
# Merc 280    19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
# Merc 280C   17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
# Merc 450SE  16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
# Merc 450SL  17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
# Merc 450SLC 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3

And, another example, using the iris dataset and searching for the string osa:

irisSubset <- iris[grep("osa", iris$Species), ]
head(irisSubset)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

For your problem, try:

selectedRows <- conservedData[grep("hsa-", conservedData$miRNA), ]
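One caveat worth sketching: the pattern "hsa-" matches anywhere in the string. If only IDs that start with hsa- are wanted, the pattern can be anchored with ^. The example below uses a made-up data frame (the IDs are hypothetical) and grepl(), which returns a logical vector instead of row indices.

```r
# Hypothetical IDs; the last one contains "hsa-" but does not start with it
conservedData <- data.frame(
  miRNA = c("hsa-let-7a", "mmu-miR-1", "hsa-miR-21", "x-hsa-like"),
  stringsAsFactors = FALSE
)

# Unanchored: matches "hsa-" anywhere in the ID
anywhere <- conservedData[grepl("hsa-", conservedData$miRNA), , drop = FALSE]

# Anchored: matches only IDs that begin with "hsa-"
prefixOnly <- conservedData[grepl("^hsa-", conservedData$miRNA), , drop = FALSE]

# startsWith() (base R 3.3.0 or later) does the same prefix test
# without a regular expression; it needs a character column, not a factor
prefixOnly2 <- conservedData[startsWith(conservedData$miRNA, "hsa-"), , drop = FALSE]
```

The drop = FALSE argument is only there because this toy data frame has a single column; without it the result would collapse to a plain vector.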

Answer 2:

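Try str_detect() from the stringr package, which detects the presence or absence of a pattern in a string.

Here is an approach that also incorporates the %>% pipe and filter() from the dplyr package: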

library(stringr)
library(dplyr)

CO2 %>%
  filter(str_detect(Treatment, "non"))

   Plant        Type  Treatment conc uptake
1    Qn1      Quebec nonchilled   95   16.0
2    Qn1      Quebec nonchilled  175   30.4
3    Qn1      Quebec nonchilled  250   34.8
4    Qn1      Quebec nonchilled  350   37.2
5    Qn1      Quebec nonchilled  500   35.3
...

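This filters the sample CO2 data set (which comes with R) for rows where the Treatment variable contains the substring "non". You can adjust whether str_detect finds fixed matches or uses a regex; see the documentation for the stringr package.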
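As a rough sketch of the fixed-versus-regex distinction, the snippet below runs str_detect() on a small made-up vector of IDs (the values are hypothetical). It assumes stringr is installed and uses its fixed() and regex() pattern modifiers.

```r
library(stringr)

# Hypothetical IDs chosen to show the difference between the modes
ids <- c("hsa-let-7a", "mmu-miR-1", "HSA-miR-21", "hsa.miR.99")

# Default: the pattern is a regular expression, so "." matches any character
asRegex <- str_detect(ids, "hsa.")

# fixed(): the pattern is compared literally, so "." only matches a dot
asFixed <- str_detect(ids, fixed("hsa."))

# regex() with ignore_case = TRUE: case-insensitive prefix match
anyCase <- str_detect(ids, regex("^hsa-", ignore_case = TRUE))
```

Each result is a logical vector the same length as ids, so it can be passed straight to filter() or used inside square brackets.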

Answer 3:

LIKE should work in SQLite:

require(sqldf)
df <- data.frame(name = c('bob','robert','peter'),id=c(1,2,3))
sqldf("select * from df where name LIKE '%er%'")
    name id
1 robert  2
2  peter  3
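Applied to the question, the same idea might look like the sketch below. The data frame is a made-up stand-in for conservedData (the score column is hypothetical), and the snippet assumes sqldf is installed with its default SQLite backend. In LIKE patterns % matches any run of characters, so 'hsa-%' is a prefix match.

```r
library(sqldf)

# Hypothetical stand-in for the asker's data
conservedData <- data.frame(
  miRNA = c("hsa-let-7a", "mmu-miR-1", "hsa-miR-21"),
  score = c(0.9, 0.4, 0.7),
  stringsAsFactors = FALSE
)

# Keep rows whose miRNA value starts with "hsa-"
selectedRows <- sqldf("select * from conservedData where miRNA like 'hsa-%'")
```

Note that SQLite's LIKE is case-insensitive for ASCII letters by default, so this would also match an ID written as HSA-.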

Source: Selecting rows where a column has a string like 'hsa..' (partial string match)