
Want to pull a journal title from an RCSB Page usi

Published 2019-09-17 20:05

Question:

I am trying to get specific information about the original citing paper in the Protein Data Bank given only the four-character PDBID of the protein.

To do this I am using the Python libraries requests and BeautifulSoup. To try and build the code, I went to the page for a particular protein, in this case 1K48, and also saved the HTML for the page (by hitting command+s and saving the HTML to my desktop).

First things to note:

1) The url for this page is: http://www.rcsb.org/pdb/explore.do?structureId=1K48

2) You can get to the page for any protein by replacing the last four characters with the appropriate PDBID.

3) I am going to want to perform this procedure on many PDBIDs, in order to sort a large list by the Journal they originally appeared in.

4) Searching through the HTML, one finds the journal title located inside a form here:

<form action="http://www.rcsb.org/pdb/search/smartSubquery.do" method="post" name="queryForm">  
    <p><span id="se_abstractTitle"><a onclick="c(0);">Refined</a> <a onclick="c(1);">structure</a> <a onclick="c(2);">and</a> <a onclick="c(3);">metal</a> <a onclick="c(4);">binding</a> <a onclick="c(5);">site</a> of the <a onclick="c(8);">kalata</a> <a onclick="c(9);">B1</a> <a onclick="c(10);">peptide.</a></span></p>                                                        
    <p><a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Skjeldal, L.&#39;);">Skjeldal, L.</a>,&nbsp;&nbsp;<a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Gran, L.&#39;);">Gran, L.</a>,&nbsp;&nbsp;<a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Sletten, K.&#39;);">Sletten, K.</a>,&nbsp;&nbsp;<a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Volkman, B.F.&#39;);">Volkman, B.F.</a></p> 
    <p>
        <b>Journal:</b>     
        (2002)
        <span class="se_journal">Arch.Biochem.Biophys.</span>
        <span class="se_journal"><b>399: </b>142-148</span>         
    </p>

A lot more is in the form but it is not relevant. What I do know is that my journal title, "Arch.Biochem.Biophys", is located within a span tag with class "se_journal".
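As a sanity check on the selector itself, here is a minimal sketch that runs the same lookup against the saved markup rather than a live response. The `saved_html` string is an abbreviated copy of the form above, and passing `"html.parser"` explicitly is an assumption about which parser to use.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the saved page markup shown above (not a live response).
saved_html = """
<form action="http://www.rcsb.org/pdb/search/smartSubquery.do" method="post" name="queryForm">
    <p>
        <b>Journal:</b>
        (2002)
        <span class="se_journal">Arch.Biochem.Biophys.</span>
        <span class="se_journal"><b>399: </b>142-148</span>
    </p>
</form>
"""

# Parse the static markup and collect every span with class "se_journal".
doc = BeautifulSoup(saved_html, "html.parser")
journal_spans = doc.find_all("span", class_="se_journal")

# The first span holds the journal title; the second holds volume and pages.
journal_title = journal_spans[0].get_text(strip=True)
print(len(journal_spans), journal_title)
```

On the saved markup this finds both spans, so the selector is not the problem when the markup is actually present in the document being parsed.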

And so I wrote the following code:

def JournalLookup():
    PDBID= '1K48'

    import requests
    from bs4 import BeautifulSoup

    session = requests.session()

    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' %PDBID)

    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")

Ideally I'd be able to use find instead of findAll as these are the only two in the document, but I used findAll to at least verify I'm getting an empty list. I assumed that it would return a list containing the two span tags with class "se_journal", but it instead returns an empty list.

After spending several hours going through possible solutions, including a piece of code that printed every span in doc, I have concluded that the requests doc does not include the lines I want at all.

Does anybody know why this is the case, and what I could possibly do to fix it?

Thanks.

Answer 1:

The content you are interested in is generated by JavaScript. This is easy to check: visit the same URL in a browser with JavaScript disabled and you will not see that specific info. The page also displays a friendly message:

"This browser is either not Javascript enabled or has it turned off. This site will not function correctly without Javascript."

For JavaScript-driven pages, Python Requests alone is not enough, because it only downloads the HTML and does not execute any JavaScript. There are some alternatives, one being dryscrape.

PS: Do not import libraries/modules within a function. It is not recommended in Python, and PEP 8 says:

Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.

This SO question explains why it's not the recommended way to do it.
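A minimal sketch of the question's function with the imports moved to module level, as PEP 8 suggests. The names `build_url`, `extract_journal`, and `journal_lookup` are hypothetical, and the explicit `"html.parser"` argument and the timeout value are assumptions. This only changes the layout: since the citation is filled in by JavaScript, `journal_lookup` would still be expected to return `None` for the live page.

```python
import requests
from bs4 import BeautifulSoup

# URL pattern taken from the question.
BASE_URL = "http://www.rcsb.org/pdb/explore.do?structureId=%s"


def build_url(pdb_id):
    """Return the explore page URL for a four-character PDB ID."""
    return BASE_URL % pdb_id


def extract_journal(html):
    """Return the text of the first span.se_journal, or None if it is absent."""
    doc = BeautifulSoup(html, "html.parser")
    span = doc.find("span", class_="se_journal")
    return span.get_text(strip=True) if span is not None else None


def journal_lookup(pdb_id):
    """Fetch the page for pdb_id and try to extract the journal title."""
    response = requests.get(build_url(pdb_id), timeout=30)
    return extract_journal(response.content)
```

Keeping the parsing step in its own function means it can be exercised on saved HTML without touching the network, and the PDB ID becomes a parameter instead of a hard-coded value.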



Answer 2:

I just published a Python package called PyPDB that can do exactly this task. The repository can be found here, but it is also available on PyPI:

pip install pypdb

For your application, I'd try the function describe_pdb, which takes a four-character PDB ID as an input and returns a dictionary containing the metadata associated with the entry:

my_desc = describe_pdb('4lza')

There are fields in my_desc for 'citation_authors', 'structure_authors', and 'title', but not all entries appear to have journal titles associated with them. The other options are to use the broader function get_all_info('4lza') or to get (and parse) the entire raw .pdb file using get_pdb_file('4lza', filetype='cif', compression=True).