Parse multiple xml files in Python

2019-07-21 20:31发布

问题:

I am stuck with a problem here. So I want to parse multiple xml files with the same structure within it. I was already able to get all the locations for each file and save them into three different lists, since there are three different types of xml structures. Now I want to create three functions (for each list), which is looping through the lists and parse the information I need. Somehow I am not able to do it. Anybody here who could give me a hint how to do it?

import os
import glob
import xml.etree.ElementTree as ET
import fnmatch
import re
import sys


#### Get the location of each XML file and save them into a list ####

all_xml_list =[]                                                                                                                                       

def locate(pattern,root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files,pattern):
            yield os.path.join(path,filename)

for files in locate('*.xml',r'C:\Users\Lars\Documents\XML-Files'):
    all_xml_list.append(files)


#### Create lists by GameDay Events ####


xml_GameDay_Player   = [x for x in all_xml_list if 'Player' in x]                                                             
xml_GameDay_Team     = [x for x in all_xml_list if 'Team' in x]                                                             
xml_GameDay_Match    = [x for x in all_xml_list if 'Match' in x]  

The XML file looks like this:

<sports-content xmlns:imp="url">
  <sports-metadata date-time="20160912T000000+0200" doc-id="sports_event_" publisher="somepublisher" language="en_EN" document-class="player-statistics">
    <sports-title>player-statistics-165483</sports-title>
  </sports-metadata>
  <sports-event>
    <event-metadata id="E_165483" event-key="165483" event-status="post-event" start-date-time="20160827T183000+0200" start-weekday="saturday" heat-number="1" site-attendance="52183" />
    <team>
      <team-metadata id="O_17" team-key="17">
        <name full="TeamName" nickname="NicknameoftheTeam" imp:dfl-3-letter-code="NOT" official-3-letter-code="" />
      </team-metadata>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" full="Max Mustermann" nickname="Mäxchen" imp:extensive="Name" />
        </player-metadata>
        <player-stats stats-coverage="standard" date-coverage-type="event" minutes-played="90" score="0">
          <rating rating-type="standard" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="5.6" imp:rating-value-mid-fielder="5.8" imp:rating-value-forward="5.0" />
          <rating rating-type="grade" rating-value="2.2" />
          <rating rating-type="index" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="3.7" imp:rating-value-mid-fielder="2.5" imp:rating-value-forward="1.2" />
          <rating rating-type="bemeister" rating-value="16.04086" />
          <player-stats-soccer imp:duels-won="1" imp:duels-won-ground="0" imp:duels-won-header="1" imp:duels-lost-ground="0" imp:duels-lost-header="0" imp:duels-lost="0" imp:duels-won-percentage="100" imp:passes-completed="28" imp:passes-failed="4" imp:passes-completions-percentage="87.5" imp:passes-failed-percentage="12.5" imp:passes="32" imp:passes-short-total="22" imp:balls-touched="50" imp:tracking-distance="5579.80" imp:tracking-average-speed="3.41" imp:tracking-max-speed="23.49" imp:tracking-sprints="0" imp:tracking-sprints-distance="0.00" imp:tracking-fast-runs="3" imp:tracking-fast-runs-distance="37.08" imp:tracking-offensive-runs="0" imp:tracking-offensive-runs-distance="0.00" dfl-distance="5579.80" dfl-average-speed="3.41" dfl-max-speed="23.49">
            <stats-soccer-defensive saves="5" imp:catches-punches-crosses="3" imp:catches-punches-corners="0" goals-against-total="1" imp:penalty-saves="0" imp:clear-cut-chance="0" />
            <stats-soccer-offensive shots-total="0" shots-on-goal-total="0" imp:shots-off-post="0" offsides="0" corner-kicks="0" imp:crosses="0" assists-total="0" imp:shot-assists="0" imp:freekicks="3" imp:miss-chance="0" imp:throw-in="0" imp:punt="2" shots-penalty-shot-scored="0" shots-penalty-shot-missed="0" dfl-assists-total="0" imp:shots-total-outside-box="0" imp:shots-total-inside-box="0" imp:shots-foot-inside-box="0" imp:shots-foot-outside-box="0" imp:shots-total-header="0" />
            <stats-soccer-foul fouls-commited="0" fouls-suffered="0" imp:yellow-red-cards="0" imp:red-cards="0" imp:yellow-cards="0" penalty-caused="0" />
          </player-stats-soccer>
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>

I want to extract everything which is within the "player meta tag" and "player-stats coverage" and "player stats soccer" tag.

回答1:

Improving on @Gnudiff's answer, here is a more resilient approach:

import os
from glob import glob
from lxml import etree

xml_GameDay = {
    'Player': [],
    'Team': [],
    'Match': [],
}

# sort all files into the right buckets
for filename in glob(r'C:\Users\Lars\Documents\XML-Files\*.xml'):
    for key in xml_GameDay.keys():
        if key in os.path.basename(filename):
            xml_GameDay[key].append(filename)
            break

def select_first(context, path):
    result = context.xpath(path)
    if len(result):
        return result[0]
    return None

# extract data from Player files
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)

    for player in tree.xpath('.//player'):        
        player_data = {
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'firstname': select_first(player, './player-metadata/name/@first'),
            'nickname': select_first(player, './player-metadata/name/@nickname'),
        }
        print(player_data)
        # ...

XML files can come in a variety of byte encodings and are prefixed by the XML declaration, which declares the encoding of the rest of the file.

<?xml version="1.0" encoding="UTF-8"?>

UTF-8 is a common encoding for XML files (it also is the default), but in reality it can be anything. It's impossible to predict and it's very bad practice to hard-code your program to expect a certain encoding.

XML parsers are designed to deal with this peculiarity in a transparent way, so you don't really have to worry about it, unless you do it wrong.

This is a good example of doing it wrong:

# BAD CODE, DO NOT USE
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

tree = etree.XML(file_get_contents('some_filename.xml'))

What happens here is this:

  1. Python opens filename as a text file f
  2. f.read() returns a string
  3. etree.XML() parses that string and creates a DOM object tree

Doesn't sound so wrong, does it? But if the XML is like this:

<?xml version="1.0" encoding="UTF-8"?>
<Player nickname="Mäxchen">...</Player>

then the DOM you will end up with will be:

Player
    @nickname="Mäxchen"

You have just destroyed the data. And unless the XML contained an "extended" character like ä, you would not even have noticed that this approach is borked. This can easily slip into production unnoticed.

There is exactly one correct way of opening an XML file (and it's also simpler than the code above): Give the file name to the parser.

tree = etree.parse('some_filename.xml')

This way the parser can figure out the file encoding before it reads the data and you don't have to care about those details.



回答2:

This won't be a complete solution for your particular case, because this is a bit of task to do and also I don't have keyboard, working from tablet.

In general, you can do it several ways, depending on whether you really need all data or extract specific subset, and whether you know all the possible structures in advance.

For example, one way:

from lxml import etree
Playerdata=[] 
for F in xml_Gameday_Player:
                tree=etree.XML(file_get_contents(F)) 
                for player in tree.xpath('.//player'):
                        row=[] 
                        row['player']=player.xpath('./player-metadata/name/@Last/text()')       
                        for plrdata in player.xpath('.//player-stats'):
                               #do stuff with player data
                         Playerdata+=row

This is adapted from my existing script, however, it is more tailored to extracting only a specific subset of xml. If you need all data, it would probably be better to use some xml tree walker.

file_get_contents is a small helper function :

def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

Xpath is a powerful language for finding nodes within xml. Note that depending on Xpath you use, the result may be either an xml node as in "for player in..." statement, or a string, as in "row['player']=" statement.