Parse XML file to fetch required data and store it

Posted 2019-07-04 04:02

Question:

I have an XML file which looks like this:

XML File

I want to fetch the following information from this file for all the events:

Under category event:

  • start_date
  • end_date
  • title

Under category venue:

  • address
  • address_2
  • city
  • latitude
  • longitude
  • name
  • postal_code

and then store this information in a MongoDB database. I don't have much experience with parsing. Can someone please help me with this? Thanks!

Answer 1:

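from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2
import xml.etree.ElementTree as ET

from pymongo import MongoClient

EVENT_FIELDS = ("start_date", "end_date", "title")
VENUE_FIELDS = ("address", "address_2", "city", "latitude",
                "longitude", "name", "postal_code")

cl = MongoClient()
coll = cl["dbname"]["collectionname"]

# ET.parse() accepts a filename or a file object, not a URL, so open the URL first
url = "https://www.eventbrite.com/xml/event_search?app_key=USO53E2ZHT6LM4D5RA&country=DE&max=100&date=Future&page=1"
tree = ET.parse(urlopen(url))
root = tree.getroot()

for event in root.findall("./event"):
    doc = {}
    # iterate over the element directly; getchildren() was removed in Python 3.9
    for c in event:
        if c.tag in EVENT_FIELDS:
            doc[c.tag] = c.text
        elif c.tag == "venue":
            # nest the venue fields in a sub-document
            doc[c.tag] = {}
            for v in c:
                if v.tag in VENUE_FIELDS:
                    doc[c.tag][v.tag] = v.text

    # insert() was removed in PyMongo 4; insert_one() stores a single document
    coll.insert_one(doc)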
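
To check the extraction logic without hitting the network or a running MongoDB, the same loop can be wrapped in a function and run against an inline sample. This is only a minimal sketch: the SAMPLE XML and its values are invented for illustration and merely assume the layout implied by the question (an <event> element with a nested <venue>), and event_to_doc is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

EVENT_FIELDS = ("start_date", "end_date", "title")
VENUE_FIELDS = ("address", "address_2", "city", "latitude",
                "longitude", "name", "postal_code")

# Invented sample; the real feed's layout is assumed, not confirmed.
SAMPLE = """
<events>
  <event>
    <title>Sample meetup</title>
    <start_date>2019-08-01 18:00:00</start_date>
    <end_date>2019-08-01 21:00:00</end_date>
    <venue>
      <name>Sample hall</name>
      <city>Berlin</city>
      <country>Germany</country>
    </venue>
  </event>
</events>
"""


def event_to_doc(event):
    """Build one MongoDB-ready dict from an <event> element."""
    doc = {}
    for child in event:
        if child.tag in EVENT_FIELDS:
            doc[child.tag] = child.text
        elif child.tag == "venue":
            # Keep only the wanted venue fields; others (e.g. country) are dropped.
            doc["venue"] = {v.tag: v.text for v in child if v.tag in VENUE_FIELDS}
    return doc


root = ET.fromstring(SAMPLE.strip())
docs = [event_to_doc(e) for e in root.findall("./event")]
print(docs)
```

Each dict in docs has the shape of one stored document, so a non-empty list could be handed to the collection's insert_many() in a single call.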


Answer 2:

Here's an example that parses the XML from the URL using lxml and inserts the data into MongoDB using pymongo:

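from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2

import pymongo
from lxml import etree


# parse the XML returned by the URL
root = etree.parse(urlopen('https://www.eventbrite.com/xml/event_search?app_key=USO53E2ZHT6LM4D5RA&country=DE&max=100&date=Future&page=1'))

events = []
for event in root.xpath('.//event'):
    # findtext() returns None instead of raising if a child element is missing
    events.append({'start_date': event.findtext('start_date'),
                   'end_date': event.findtext('end_date'),
                   'title': event.findtext('title')})


# insert the data into MongoDB
client = pymongo.MongoClient()
# database "test", collection "events"; both names are placeholders
collection = client.test.events

# insert_many() replaces the removed insert() and rejects an empty list
if events:
    collection.insert_many(events)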

Parsing the venue items is left to you as homework.
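
One possible way to approach the venue part is sketched here with the standard library's ElementTree, so it runs without extra packages (lxml elements offer the same find and findtext methods). The sample XML is invented, the <venue> layout is assumed from the question, venue_to_dict is a hypothetical helper, and converting latitude and longitude to floats is an optional choice that the question does not ask for.

```python
import xml.etree.ElementTree as ET

VENUE_FIELDS = ("address", "address_2", "city", "latitude",
                "longitude", "name", "postal_code")


def venue_to_dict(event):
    """Return the wanted venue fields of an <event>, or None if it has no <venue>."""
    venue = event.find("venue")
    if venue is None:
        return None
    # findtext() yields None for a missing child instead of raising.
    data = {field: venue.findtext(field) for field in VENUE_FIELDS}
    # Optional: store coordinates as numbers rather than strings.
    for key in ("latitude", "longitude"):
        if data[key]:
            data[key] = float(data[key])
    return data


# Invented sample event; the real feed's layout is assumed, not confirmed.
event = ET.fromstring(
    "<event><title>Sample</title>"
    "<venue><name>Sample hall</name><city>Berlin</city>"
    "<latitude>52.52</latitude><longitude>13.405</longitude></venue>"
    "</event>"
)
venue = venue_to_dict(event)
print(venue)
```

The returned dict can be stored under a 'venue' key of the event dict before it is appended to the list of events.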

Note that there are other XML parsers out there, such as:

  • ElementTree from the standard library (xml.etree.ElementTree)
  • BeautifulSoup

Hope that helps.