Parsing XML file with UTF-8 encoding and bytestring

This question already has an answer here:

python… encoding issue when using linux > [duplicate] 3 answers

I have the following complete XML file (actual file downloadable here):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE MedlineCitationSet PUBLIC "-//NLM//DTD Medline Citation, 1st January, 2014//EN"
                                    "http://www.nlm.nih.gov/databases/dtd/nlmmedlinecitationset_140101.dtd">
<MedlineCitationSet>
<MedlineCitation Owner="NLM" Status="In-Data-Review">
<PMID Version="1">24560200</PMID>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Print">1166-7087</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>24</Volume>
<Issue>3</Issue>
<PubDate>
<Year>2014</Year>
<Month>Mar</Month>
</PubDate>
</JournalIssue>
<Title>Progrès en urologie : journal de l'Association française d'urologie et de la Société française d'urologie</Title>
<ISOAbbreviation>Prog. Urol.</ISOAbbreviation>
</Journal>
<ArticleTitle>[Multiparametric 3T MRI in the routine staging of prostate cancer].</ArticleTitle>
<Pagination>
<MedlinePgn>145-53</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText Label="RESULTS" NlmCategory="RESULTS">Five hundred and ninety-two octants were considered with 124 significant tumors (volume≥0.1cm(3)). The general ability of tumor detection had a sensitivity, specificity, PPV and NPV respectively to 72.3%, 87.4%, 83.2% and 78.5%. The estimate of the CC and ECE had a high negative predictive power with specificities and VPN respectively to 96.4% and 95.4% for CC, and 97.5 and 97.7% for ECE.</AbstractText>
<CopyrightInformation>Copyright © 2013 Elsevier Masson SAS. All rights reserved.</CopyrightInformation>
</Abstract>
</Article>
</MedlineCitation>
</MedlineCitationSet>

What I want to do is simply to parse the data and print the PMID and title. This is the code that I have:

#!/usr/bin/env python
import xml.etree.ElementTree as ET

def parse_xml(xmlfile):
    """Print the PMID and journal title of each MedlineCitation in xmlfile."""
    tree = ET.parse(xmlfile)
    root = tree.getroot()
    for medcit in root.findall('MedlineCitation'):
        pmid = medcit.find('PMID').text
        title = medcit.find('Article/Journal/Title').text
        #year = medcit.find('Article/Journal/JournalIssue/PubDate/Year')
        #medlinedate = medcit.find('Article/Journal/JournalIssue/MedlineDate')
        print pmid, title

if __name__ == '__main__':
    filename = "myxmlfile.xml"
    parse_xml(filename)

However, it gives me the following error message:

24560200 Traceback (most recent call last):
  File "./parse_xml.py", line 41, in <module>
    parse_xml(fvar)
  File "./parse_xml.py", line 29, in parse_xml
    print pmid, title
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 5: ordinal not in range(128)

What's the correct way to parse and print it?

Tags: python xml utf-8

1 answer

萌系小妹纸

#2 · 2019-08-31 05:42

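This is already covered by the linked duplicate. The parsing itself is fine; the error comes from print, which has to encode the unicode title for stdout and, judging by the traceback, falls back to the ASCII codec, which cannot represent 'è' (u'\xe8'). Encode explicitly to UTF-8 before printing, that is, use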

    print pmid.encode('utf8'), title.encode('utf8')

instead of

    print pmid, title
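
To make the idea a bit more concrete, here is a minimal sketch of the same fix written for Python 3: keep the parsed values as text, build the output line, and encode it to UTF-8 once at the point where it is written. The names SAMPLE, iter_citations, format_line and can_encode are hypothetical helpers introduced only for this illustration, and SAMPLE is a trimmed-down stand-in for the file in the question, not the real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the XML file in the question.
# \u00e8 is the 'è' in "Progrès" that triggers the error.
SAMPLE = u"""<MedlineCitationSet>
<MedlineCitation Owner="NLM" Status="In-Data-Review">
<PMID Version="1">24560200</PMID>
<Article PubModel="Print-Electronic">
<Journal>
<Title>Progr\u00e8s en urologie</Title>
</Journal>
</Article>
</MedlineCitation>
</MedlineCitationSet>""".encode("utf-8")


def iter_citations(xml_bytes):
    """Yield (pmid, title) as text for each MedlineCitation element."""
    root = ET.fromstring(xml_bytes)
    for medcit in root.findall("MedlineCitation"):
        pmid = medcit.find("PMID").text
        title = medcit.find("Article/Journal/Title").text
        yield pmid, title


def format_line(pmid, title):
    """Build the output line as text, then encode it once to UTF-8 bytes."""
    return u"{0} {1}\n".format(pmid, title).encode("utf-8")


def can_encode(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

With these helpers, can_encode(title, "ascii") is False for the journal title above while can_encode(title, "utf-8") is True, which is the same failure the traceback reports. The bytes returned by format_line can be written to any binary stream, for example sys.stdout.buffer on Python 3, so the result no longer depends on the terminal's default encoding.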
