How to search PubMed or other databases using R

Posted 2019-03-16 01:48

Question:

I have recently been using the excellent rplos package, which makes it very easy to search through papers hosted on the Public Library of Science (PLOS) API. I've hit a snag, in that the API itself seems to have some missing information - a major one being that there are at least 2012 papers on the API for which there is no information in the "journal" field. I have the DOIs of each paper, so it is simple to Google the DOI and show that these are real papers published in real journals, usually PLoS ONE. Obviously it would be silly to do that 2000 times.

I was wondering if anyone knows how to find the source journal, if I have the list of DOIs? I looked into the RISmed package, which can apparently search PubMed from within R, but I could not work out how to make it give useful information (just the number of search hits, and some PubMed IDs that probably lead to the info I want).

Anyone know how to turn the list of DOIs into source journal names?

EDIT: I just thought of another easy solution. DOIs contain an abbreviation of the journal name, and for a case like this where there are only a handful of journals, one can just use regular expressions to read the DOIs and pick which journal they are from. Example: 10.1371/journal.pone.0046711 is from PLoS ONE.
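The regular-expression idea in the edit above could be sketched roughly as follows. The helper name doi_to_journal and the lookup table are illustrative assumptions, not part of rplos; the table covers only a few PLOS journal keys and would need extending for others. DOIs that do not match the pattern come back as NA.

```r
# Hypothetical helper: map PLOS DOIs to journal names via the key embedded in the DOI.
doi_to_journal <- function(dois) {
  # Pull out the journal key, e.g. "pone" from "10.1371/journal.pone.0046711".
  key <- sub("^10\\.1371/journal\\.([a-z]+)\\..*$", "\\1", dois)
  # Assumed lookup table; extend as needed.
  lookup <- c(pone = "PLoS ONE",
              pbio = "PLoS Biology",
              pmed = "PLoS Medicine",
              pgen = "PLoS Genetics",
              pcbi = "PLoS Computational Biology",
              ppat = "PLoS Pathogens",
              pntd = "PLoS Neglected Tropical Diseases")
  # Unknown keys (and non-matching DOIs) index to NA.
  unname(lookup[key])
}

doi_to_journal(c("10.1371/journal.pone.0046711", "10.1371/journal.pbio.0020122"))
```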

Answer 1:

Here's an answer based on Thomas' suggestion to try rpubmed. It starts with a list of the problematic DOIs, finds the matching PubMed ID numbers using the EUtilsSummary function in RISmed, and then gets the journal data associated with these using code modified from the GitHub repository for rpubmed and reproduced below. Sorry for editing the rpubmed code, but the objects on line 44 do not seem to be defined or essential, so I took them out.

library(RCurl); library(XML); library(RISmed); library(parallel) # mclapply() lives in parallel (formerly multicore)

# dummy list of 5 DOIs. I actually have 2012, hence all the multicoring below
dois <- c("10.1371/journal.pone.0046711", "10.1371/journal.pone.0046681", "10.1371/journal.pone.0046643", "10.1371/journal.pone.0041465", "10.1371/journal.pone.0044562")

# Get the PubMed IDs
res <- mclapply(dois, EUtilsSummary) # one PubMed search per DOI
ids <- sapply(res, QueryId)


######## rpubmed functions from https://github.com/rOpenHealth/rpubmed/blob/master/R/rpubmed_fetch.R
fetch_in_chunks <- function(ids, chunk_size = 500, delay = 0, ...){
  Sys.sleep(delay * 3600) # Wait for appropriate time for the server.
  chunks <- chunker(ids, chunk_size)
  Reduce(append, lapply(chunks, function(x) pubmed_fetch(x, ...)))
}

pubmed_fetch <- function(ids, file_format = "xml", as_r_object = TRUE, ...){

  args <- c(id = paste(ids, collapse = ","), db = "pubmed", rettype = file_format, ...)

  url_args <- paste(paste(names(args), args, sep="="), collapse = "&")
  base_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmode=full"
  url_string <- paste(base_url, url_args, sep = "&")
  records <- getURL(url_string)
  #NCBI limits requests to three per second
  Sys.sleep(0.33)
  if(as_r_object){
    return(xmlToList(xmlTreeParse(records, useInternalNodes = TRUE)))
  } else return(records)
}

chunker <- function(v, chunk_size){
  split(v, ceiling(seq_along(v)/chunk_size))
}
###### End of rpubmed functions

d <- fetch_in_chunks(ids)
j <- character(length(d))
# loop over however many records came back, rather than a hard-coded 2012
for (i in seq_along(d)) j[i] <- as.character(d[[i]][[1]][[5]][[1]][[3]]) # the tortuous path to the journal name
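
The positional path to the journal name breaks if a record lacks one of the earlier elements. A more defensive sketch looks the field up by name instead, assuming the parsed records use PubMed's element names (MedlineCitation, Article, Journal, Title). The record below is a made-up mock so the example runs offline, and get_journal is a hypothetical helper, not part of rpubmed or RISmed.

```r
# Mock of one parsed record, shaped like xmlToList() output for a PubmedArticle
# element. Element names follow PubMed's XML; the values are invented.
record <- list(
  MedlineCitation = list(
    PMID = "00000000",
    Article = list(
      Journal = list(ISSN = "0000-0000", Title = "Example Journal"),
      ArticleTitle = "Example title"
    )
  )
)

# Hypothetical helper: fetch the journal title by name, NA if it is absent.
get_journal <- function(rec) {
  title <- tryCatch(
    rec[["MedlineCitation"]][["Article"]][["Journal"]][["Title"]],
    error = function(e) NULL
  )
  if (is.null(title)) NA_character_ else as.character(title)
}

# One complete record and one empty record, to show the NA fallback.
journals <- vapply(list(record, list()), get_journal, character(1))
```

With real data this would be applied as vapply(d, get_journal, character(1)) in place of the loop.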


Answer 2:

This is the rplos creator.

Check out plosfields, a dataset that comes with the package. It lists the fields that can be searched and returned:

library(rplos)
head(plosfields)

            field                     description                           note
1              id DOI (Digital Object Identifier) Extended for partial documents
2      everything         All text in the article      Includes Meta information
3           title                   Article Title                        no note
4   title_display                   Article Title      For display purposes only
5 alternate_title               Alternative Title                        no note
6          author                          Author       Can have multiple values

Two fields of interest for journal name are journal and cross_published_journal_key. For example,

searchplos('science', 'id,publication_date,cross_published_journal_key,journal', limit = 2)

                            id cross_published_journal_key      journal     publication_date
1 10.1371/journal.pbio.0020122                 PLoSBiology PLoS Biology 2004-04-13T00:00:00Z
2 10.1371/journal.pbio.1001166                 PLoSBiology PLoS Biology 2011-10-04T00:00:00Z

Does this do what you want?

For getting more information from DOIs, rmetadata is in development but could be of use. We're also working on a package for Crossref, rcrossref (https://github.com/ropensci/rcrossref). Still, the approach above seems like the easier way to get the journal name.
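
As a rough illustration of the Crossref route, the sketch below calls Crossref's public REST API directly rather than going through rcrossref, whose interface is not shown here. The helper names crossref_url and journal_from_crossref are hypothetical, and the second one assumes the jsonlite package is installed and the API is reachable.

```r
# Build the Crossref REST API URL for one DOI (no network access needed).
crossref_url <- function(doi) {
  paste0("https://api.crossref.org/works/", utils::URLencode(doi))
}

# Hypothetical helper: fetch the record and return the journal name.
# Crossref stores it in the "container-title" field of the response message.
journal_from_crossref <- function(doi) {
  rec <- jsonlite::fromJSON(crossref_url(doi))
  title <- rec$message[["container-title"]]
  if (length(title) == 0) NA_character_ else as.character(title[[1]])
}

crossref_url("10.1371/journal.pone.0046711")
```

For a list of about 2000 DOIs, journal_from_crossref would be called once per DOI, ideally with a short pause between requests.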



Answer 3:

Here is my solution, which can be used in a for loop or with other approaches to extract journal titles from DOIs:

library(RISmed)
data(myeloma)
ArticleId(myeloma)
res <- EUtilsSummary(ArticleId(myeloma)[10])
fetch <- EUtilsGet(res, type = "efetch", db = "pubmed")
fetch@Title

Hope it helps!