
FeedParser, Removing Special Characters and Writin

Posted 2019-08-18 00:06

Question:

I'm learning Python. I've set myself a wee goal of building an RSS scraper. I'm trying to gather the author, link and title of each post. From there I want to write them to a CSV.

I'm encountering some problems. I've been searching for the answer since last night but can't seem to find a solution. I have a feeling there is a bit of knowledge I'm missing between what feedparser parses and how to move that into a CSV, but I don't have the vocabulary yet to know what to Google.

  1. How do I remove special characters such as '[' and '''?
  2. How do I write author, link and title to a new row when I'm creating the new file?

1) Special Characters

import feedparser

rssurls = 'http://feeds.feedburner.com/TechCrunch/'

techart = feedparser.parse(rssurls)
# feeds = []

# for url in rssurls:
#     feedparser.parse(url)
# for feed in feeds:
#     for post in feed.entries:
#         print(post.title)

# print(feed.entries)

techdeets = [post.author + " , " + post.title + " , " + post.link  for post in techart.entries]
techdeets = [y.strip() for y in techdeets]
techdeets

Output: I get the information I need, but the .strip() call doesn't remove the '[' and ''' characters.

['Darrell Etherington , Spin launches first city-sanctioned dockless bike sharing in Bay Area , http://feedproxy.google.com/~r/Techcrunch/~3/BF74UZWBinI/', 'Ryan Lawler , With $5.3 million in funding, CarDash wants to change how you get your car serviced , http://feedproxy.google.com/~r/Techcrunch/~3/pkamfdPAhhY/', 'Ron Miller , AlienVault plug-in searches for stolen passwords on Dark Web , http://feedproxy.google.com/~r/Techcrunch/~3/VbmdS0ODoSo/', 'Lucas Matney , Firefox for Windows gets native WebVR support, performance bumps in latest update , http://feedproxy.google.com/~r/Techcrunch/~3/j91jQJm-f2E/',...]

2) Writing to CSV

import csv

savedfile = open('/test1.txt', 'w')
savedfile.write(str(techdeets) + "\n")  # writes the whole list as one line of text
savedfile.close()

import pandas as pd
df = pd.read_csv('/test1.txt', encoding='cp1252')
df

Output: The output was a dataframe with only 1 row and multiple columns.

Answer 1:

You are almost there :-)

How about using pandas to create a dataframe first and then saving it? Something like this, continuing from your code:

df = pd.DataFrame(columns=['author', 'title', 'link'])
for i, post in enumerate(techart.entries):
    df.loc[i] = post.author, post.title, post.link

Then you can save it:

df.to_csv('myfilename.csv', index=False)
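
Here is a minimal offline sketch of the same idea, in case it helps to see it run without a network call. The `entries` list is hypothetical stand-in data (plain dicts in place of `techart.entries`, with values copied from the sample output in the question), and the in-memory buffer stands in for a real file. It builds the frame from a list of rows in one go rather than growing it with `df.loc`, which is just one possible design choice.

```python
import io

import pandas as pd

# Hypothetical stand-in for techart.entries; real feedparser entries
# also support dict-style access such as entry.get("author").
entries = [
    {
        "author": "Darrell Etherington",
        "title": "Spin launches first city-sanctioned dockless bike sharing in Bay Area",
        "link": "http://feedproxy.google.com/~r/Techcrunch/~3/BF74UZWBinI/",
    },
    {
        "author": "Ryan Lawler",
        "title": "With $5.3 million in funding, CarDash wants to change how you get your car serviced",
        "link": "http://feedproxy.google.com/~r/Techcrunch/~3/pkamfdPAhhY/",
    },
]

# One tuple per post; .get() falls back to an empty string if a field is missing.
rows = [(e.get("author", ""), e.get("title", ""), e.get("link", "")) for e in entries]
df = pd.DataFrame(rows, columns=["author", "title", "link"])

# Write to an in-memory buffer here; pass a filename such as
# 'myfilename.csv' to to_csv() instead to save to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Each post ends up on its own line, and a title that contains a comma is wrapped in quotes so it still reads back as a single column.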

OR

you can also fill the dataframe column by column, straight from the feedparser entries:

>>> import feedparser
>>> import pandas as pd
>>>
>>> rssurls = 'http://feeds.feedburner.com/TechCrunch/'
>>> techart = feedparser.parse(rssurls)
>>>
>>> df = pd.DataFrame()
>>>
>>> df['author'] = [post.author for post in techart.entries]
>>> df['title'] = [post.title for post in techart.entries]
>>> df['link'] = [post.link for post in techart.entries]
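
On the first question: the '[' and ''' characters are not inside your strings. They show up because the whole list is being printed or written as text (`str(techdeets)`), and `.strip()` only removes leading and trailing whitespace from each string. If you would rather not use pandas, a rough sketch with the standard `csv` module could look like the following. Here `posts` is hypothetical stand-in data for `techart.entries` and the in-memory buffer stands in for a real file; with real entries, `post.author` should work the same way as `post["author"]`.

```python
import csv
import io

# Hypothetical stand-in for techart.entries (values from the sample output).
posts = [
    {
        "author": "Darrell Etherington",
        "title": "Spin launches first city-sanctioned dockless bike sharing in Bay Area",
        "link": "http://feedproxy.google.com/~r/Techcrunch/~3/BF74UZWBinI/",
    },
    {
        "author": "Ryan Lawler",
        "title": "With $5.3 million in funding, CarDash wants to change how you get your car serviced",
        "link": "http://feedproxy.google.com/~r/Techcrunch/~3/pkamfdPAhhY/",
    },
]

# With a real file, use instead:
#   open('myfilename.csv', 'w', newline='', encoding='utf-8')
out = io.StringIO(newline="")
writer = csv.writer(out)

writer.writerow(["author", "title", "link"])  # header row
for post in posts:
    # One list per post becomes one row; csv.writer quotes any field
    # that contains a comma, so titles stay in a single column.
    writer.writerow([post["author"], post["title"], post["link"]])

csv_text = out.getvalue()

# Read the text back to check that each post is its own three-column row.
rows_back = list(csv.reader(io.StringIO(csv_text, newline="")))
print(rows_back)
```

Writing a list per row, instead of one big string, is what gives you a new line for each post.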