This is the sample XML document:
<bookstore>
<book category="COOKING">
<title lang="english">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>300.00</price>
</book>
<book category="CHILDREN">
<title lang="english">Harry Potter</title>
<author>J K. Rowling </author>
<year>2005</year>
<price>625.00</price>
</book>
</bookstore>
I want to extract the text without specifying the element names. How can I do this? I have 10 such documents, and they all have different structures. The user enters a word that I don't know in advance, and it has to be searched for in the text portions of all 10 XML documents. For that to work, I need to find where the text is without knowing anything about the elements.
Please Help!!
You could simply strip out any tags:
>>> import re
>>> txt = """<bookstore>
... <book category="COOKING">
... <title lang="english">Everyday Italian</title>
... <author>Giada De Laurentiis</author>
... <year>2005</year>
... <price>300.00</price>
... </book>
...
... <book category="CHILDREN">
... <title lang="english">Harry Potter</title>
... <author>J K. Rowling </author>
... <year>2005</year>
... <price>625.00</price>
... </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n Giada De Laurentiis\n 2005\n 300.00\n \n\n \n Harry Potter\n J K. Rowling \n 2005\n 625.00'
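The result above still carries the newlines and indentation of the source document. Here is a minimal sketch of one way to tidy it, using the same regex idea. The helper name strip_tags is made up for illustration. A regex like this is not a real XML parser: it will not cope with comments, CDATA sections, or a '>' inside an attribute value.

```python
import re

# Same pattern as above; DOTALL lets a single tag span several lines.
TAG = re.compile(r'<.*?>', re.DOTALL)

def strip_tags(xml_text):
    """Remove anything that looks like a tag and collapse runs of whitespace."""
    # Substitute a space (not '') so text from adjacent elements is not glued together.
    without_tags = TAG.sub(' ', xml_text)
    return ' '.join(without_tags.split())

sample = ('<book category="COOKING">\n'
          '  <title lang="english">Everyday Italian</title>\n'
          '  <year>2005</year>\n'
          '</book>')
print(strip_tags(sample))
```

Collapsing whitespace makes a later substring search more predictable, since a phrase split across lines in the file still matches.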
But if you just want to search files for some text in Linux, you can use grep:
burhan@sandbox:~$ grep "Harry Potter" file.xml
<title lang="english">Harry Potter</title>
If you want to search in a file, use the grep command above, or open the file and search for it in Python:
>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists
Using the lxml library with an xpath query is possible:
xml="""<bookstore>
<book category="COOKING">
<title lang="english">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>300.00</price>
</book>
<book category="CHILDREN">
<title lang="english">Harry Potter</title>
<author>J K. Rowling </author>
<year>2005</year>
<price>625.00</price>
</book>
</bookstore>
"""
from lxml import etree
root = etree.fromstring(xml)  # fromstring() already returns the root element
root.xpath('/bookstore/book/*/text()')
# ['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']
Note that this does not return attribute values such as category, because text() selects only text nodes.
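Since the question asks for the text without naming any elements, here is a rough sketch of that idea. It uses the standard library's xml.etree.ElementTree instead of lxml, which is a swap on my part so that nothing needs installing. The variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

doc = """<bookstore>
<book category="COOKING">
<title lang="english">Everyday Italian</title>
<author>Giada De Laurentiis</author>
</book>
</bookstore>"""

root = ET.fromstring(doc)

# itertext() walks every element in document order and yields its text,
# so no element names are needed. Whitespace-only chunks are dropped.
texts = [t.strip() for t in root.itertext() if t.strip()]

# Attribute values are not text nodes, so collect them separately.
attrs = [value for el in root.iter() for value in el.attrib.values()]

print(texts)
print(attrs)
```

Because nothing here depends on the tag names, the same few lines should work on documents with different structures, as long as they are well-formed XML.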
If you want to call grep from inside python, see the discussion here, especially this post.
If you want to search through all the files in a directory you could try something like this using the glob module:
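import glob
import os
import re

# Matches '>text<' on a single line ('.' does not match newlines).
p = re.compile('>.*<')
os.chdir("./")  # change this to the directory that holds the XML files
for path in glob.glob("*.xml"):
    with open(path, "r") as f:
        content = f.read()
    # Trim the surrounding '>' and '<' from each match.
    texts = [m.lstrip('>').rstrip('<') for m in p.findall(content)]
    print(texts)
    print()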
This iterates through all the XML files in the directory, opens each one, and extracts the text matching the regexp.
Output:
['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']
EDIT: Updated code to extract only the text elements from the xml.
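To tie this back to the original question (find a user-supplied word in several XML files), here is a hedged sketch that combines the directory loop with a parser instead of a regexp. The function name files_containing and the case-insensitive matching are my own assumptions, not something the question specifies.

```python
import glob
import os
import xml.etree.ElementTree as ET

def files_containing(word, directory="."):
    """Return the names of the *.xml files in `directory` whose text contains `word`."""
    matches = []
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        # Join the text of every element, whatever the elements are called.
        text = " ".join(root.itertext())
        # Assumption: the search should ignore case.
        if word.lower() in text.lower():
            matches.append(os.path.basename(path))
    return matches
```

For example, files_containing("Harry Potter") would list the files in the current directory that mention that title in their text. Attribute values are not searched, since they are not part of the element text.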