I am trying to parse Twitter RSS feeds and put the information in a sqlite database, using Python. Here's an example:
MiamiPete: today's "Last Call" is now up http://bit.ly/MGDzu #stocks #stockmarket #finance #money
What I want to do is create one column for the main content (MiamiPete: today's "Last Call" is now up), one column for the URL (http://bit.ly/MGDzu), and four separate columns for the hashtags (stocks, stockmarket, finance, money). I've been playing around with how to do this.
Any advice would be greatly appreciated!
P.S. Some code I've been playing around with is below--you can see I tried initially creating a variable called "tinyurl" and splitting it, which it does seem to do, but this feeble attempt is not anywhere close to solving the problem noted above. :)
def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stores them in the DB """
    for entry in items:
        c.execute('SELECT entry_id FROM RSSEntries WHERE url=?', (entry.link,))
        tinyurl = entry.summary  ### I added this in
        print tinyurl.split('http')  ### I added this in
        if len(c.fetchall()) == 0:
            c.execute('INSERT INTO RSSEntries (id, url, title, content, tinyurl, date, tiny) VALUES (?,?,?,?,?,?,?)',
                      (id, entry.link, entry.title, entry.summary, tinyurl,
                       strftime("%Y-%m-%d %H:%M:%S", entry.updated_parsed), tiny))
It seems like your database design is rather flawed. Unless all your entries have a text part, a URL, and up to four tags, it's not going to work.
You also need to separate saving to the DB from parsing. Parsing can easily be done with a regexp (or even plain string methods):
>>> s = ('MiamiPete: today\'s "Last Call" is now up '
...      'http://bit.ly/MGDzu #stocks #stockmarket #finance #money')
>>> s.split()
['MiamiPete:', "today's", '"Last', 'Call"', 'is', 'now', 'up', 'http://bit.ly/MGDzu', '#stocks', '#stockmarket', '#finance', '#money']
>>> url = [i for i in s.split() if i.startswith('http://')]
>>> url
['http://bit.ly/MGDzu']
>>> tags = [i for i in s.split() if i.startswith('#')]
>>> tags
['#stocks', '#stockmarket', '#finance', '#money']
>>> ' '.join(i for i in s.split() if i not in url+tags)
'MiamiPete: today\'s "Last Call" is now up'
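The interactive steps above can be rolled into a small helper function (the name and the decision to strip the leading "#" from tags are my own):

```python
def parse_tweet(text):
    """Split a tweet summary into its plain-text content, URLs, and hashtags."""
    words = text.split()
    urls = [w for w in words if w.startswith('http://')]
    tags = [w.lstrip('#') for w in words if w.startswith('#')]
    content = ' '.join(w for w in words
                       if not w.startswith('http://') and not w.startswith('#'))
    return content, urls, tags

content, urls, tags = parse_tweet(
    'MiamiPete: today\'s "Last Call" is now up '
    'http://bit.ly/MGDzu #stocks #stockmarket #finance #money')
```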
Single-table db design would probably have to go, though.
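For instance, a two-table layout (table and column names here are just a sketch) lets each entry carry any number of tags instead of exactly four columns:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # use a file path in real code
c = conn.cursor()
c.execute('''CREATE TABLE entries (
                 id INTEGER PRIMARY KEY, content TEXT, url TEXT)''')
c.execute('''CREATE TABLE tags (
                 entry_id INTEGER REFERENCES entries(id), tag TEXT)''')

# One row per entry, one row per (entry, tag) pair.
c.execute('INSERT INTO entries (content, url) VALUES (?, ?)',
          ('MiamiPete: today\'s "Last Call" is now up', 'http://bit.ly/MGDzu'))
entry_id = c.lastrowid
for tag in ('stocks', 'stockmarket', 'finance', 'money'):
    c.execute('INSERT INTO tags (entry_id, tag) VALUES (?, ?)', (entry_id, tag))
conn.commit()
```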
Also, you can parse your strings using regexps:
>>> import re
>>> s = (u'MiamiPete: today\'s "Last Call" is now up http://bit.ly/MGDzu '
...      '#stocks #stockmarket #finance #money')
>>> re.match(r'(.*) (http://[^ ]+)', s).groups()
(u'MiamiPete: today\'s "Last Call" is now up', u'http://bit.ly/MGDzu')
>>> re.findall(r'(#\w+)', s)
[u'#stocks', u'#stockmarket', u'#finance', u'#money']
Twitter has an API that may be easier for you to use here: http://apiwiki.twitter.com/Twitter-API-Documentation.
You can get the results as JSON or XML and use one of the many Python libraries to parse them.
Or, if you must use the RSS feed, there are Python feed parsers such as http://www.feedparser.org/.
I would highly recommend using the Twitter API. There are actually two APIs, one for the main Twitter server and one for the search server; they are used for different things.
You can find sample code in pytwitter on SVN. Add simplejson and you can be doing very powerful things in a matter of minutes.
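Once you have a JSON response from the search API, the standard-library json module (or simplejson on older Pythons) turns it into plain dicts. The payload below is a made-up fragment only illustrating the shape of a search response:

```python
import json  # or: import simplejson as json on older Pythons

# Hypothetical fragment in the shape of a Twitter search response.
payload = '''{"results": [
    {"from_user": "MiamiPete",
     "text": "today's Last Call is now up http://bit.ly/MGDzu #stocks"}
]}'''

data = json.loads(payload)
texts = [result['text'] for result in data['results']]
```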
Good luck