Retrieving the first Urban Dictionary result for a term

Posted 2019-08-13 08:09

Question:

I have written some fairly simple code to get the first result for any term on urbandictionary.com. I started with a small function to see how the page markup is formatted.

import urllib

def parseudtest(searchurl):
    # Python 2: fetch the definition page and print its raw HTML line by line
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    url_info = urllib.urlopen(url)
    for line in url_info:
        print line

As a test, I passed 'cats' as searchurl. The output is of course a gigantic page, but here is the part I care about:

<meta content='He set us up the bomb. Also took all our base.' name='Description' />

<meta content='He set us up the bomb. Also took all our base.' property='og:description' />

<meta content='cats' property='og:title' />

<meta content="http://static3.urbandictionary.com/rel-1e0b481/images/og_image.png" property="og:image" />

<meta content='Urban Dictionary' property='og:site_name' />

As you can see, the first occurrence of the element "meta content" on the page holds the first definition for the search term. So I wrote this code to retrieve it:

import urllib
from xml.dom import minidom

def parseud(searchurl):
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    url_info = urllib.urlopen(url)
    if url_info:
        xmldoc = minidom.parse(url_info)
    if xmldoc:
        definition = xmldoc.getElementsByTagName('meta content')[0].firstChild.data
        print definition

For some reason the parsing does not work and raises an error every time. This is especially confusing because the site appears to use basically the same format as other sites from which I have successfully retrieved specific data. If anyone could help me figure out what I am doing wrong, it would be greatly appreciated.

Answer 1:

Since you don't give the traceback for the error, it's hard to be specific, but I assume that although the site claims to be XHTML, it is not actually well-formed XML. You'd be better off using Beautiful Soup, which is designed for parsing HTML and handles broken markup gracefully.
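Beautiful Soup is a third-party package. If installing it is not an option, the same idea (a lenient HTML parser instead of a strict XML one) can be sketched with the standard library's html.parser. This is a rough Python 3 sketch, not a drop-in fix: the class name FirstDescriptionParser, the helper first_description, and the SAMPLE_HTML string are made up for illustration, and the sample is only modeled on the output shown in the question, not fetched from the live site.

```python
from html.parser import HTMLParser


class FirstDescriptionParser(HTMLParser):
    """Remember the content of the first <meta name="Description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but not values.
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def first_description(html_text):
    """Return the first description meta content, or None if absent."""
    parser = FirstDescriptionParser()
    parser.feed(html_text)
    return parser.description


# Hypothetical sample: the unclosed <br> and <p> would make a strict
# XML parser such as minidom raise, but html.parser tolerates them.
SAMPLE_HTML = """<html><head>
<meta content='He set us up the bomb. Also took all our base.' name='Description' />
<meta content='cats' property='og:title' />
</head><body><br><p>unclosed paragraph</body></html>"""

print(first_description(SAMPLE_HTML))
```

Matching on the name attribute rather than on position means the lookup does not depend on the description tag happening to be the first meta tag in the page.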



Answer 2:

I have never used the minidom parser, but I think the problem is that you call:

xmldoc.getElementsByTagName('meta content')

while the tag name is meta; content is just its first attribute (as the syntax highlighting of your HTML shows quite clearly).

Try replacing that bit with:

xmldoc.getElementsByTagName('meta')
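
Note that a meta element has no child nodes, so .firstChild.data would still fail after that change; the definition text sits in the content attribute and can be read with getAttribute. A minimal Python 3 sketch, assuming the input is well-formed XML (the live page may not be): the SAMPLE_XML string and the helper name first_meta_content are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical, hand-written well-formed stand-in for the page's <head>.
SAMPLE_XML = """<head>
<meta content='He set us up the bomb. Also took all our base.' name='Description' />
<meta content='cats' property='og:title' />
</head>"""


def first_meta_content(xml_text):
    """Return the content attribute of the first <meta> element, or None."""
    doc = minidom.parseString(xml_text)
    metas = doc.getElementsByTagName("meta")
    if not metas:
        return None
    # <meta .../> is an empty element, so firstChild is None;
    # the text we want is stored in the content attribute instead.
    return metas[0].getAttribute("content")


print(first_meta_content(SAMPLE_XML))
```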