I am using urllib to get a string of HTML from a website and need to put each word in the HTML document into a list.
Here is the code I have so far. I keep getting an error, which I have also copied below.
import urllib.request
url = input("Please enter a URL: ")
z=urllib.request.urlopen(url)
z=str(z.read())
removeSpecialChars = str.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")
words = removeSpecialChars.split()
print ("Words list: ", words[0:20])
Here is the error.
Please enter a URL: http://simleyfootball.com
Traceback (most recent call last):
File "C:\Users\jeremy.KLUG\My Documents\LiClipse Workspace\Python Project 2\Module2.py", line 7, in <module>
removeSpecialChars = str.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")
TypeError: replace() takes at least 2 arguments (1 given)
str.replace is the wrong function for what you want to do (apart from it being called incorrectly). You want to replace every character in a set with a space, not replace the whole set, taken as one substring, with a single space (the latter is what replace does).
You can use str.translate for this: build a mapping with str.maketrans that maps every character in your list of special characters to a space, then call translate() on the string, which replaces each of those characters with a space.
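A minimal sketch of the translate approach, assuming the page text is already available as a string. The character list is taken from the question; the names SPECIAL_CHARS, TABLE, to_words and the sample string are illustrative, not from the original post.

```python
# Characters to treat as separators (copied from the question's code;
# the backslash is written as \\ so it is a literal backslash).
SPECIAL_CHARS = "!@#$%^&*()[]{};:,./<>?\\|`~-=_+"

# str.maketrans needs two strings of equal length: each character in the
# first is mapped to the character at the same position in the second.
TABLE = str.maketrans(SPECIAL_CHARS, " " * len(SPECIAL_CHARS))


def to_words(text):
    """Replace every special character with a space, then split on whitespace."""
    return text.translate(TABLE).split()


sample = "<p>Hello, world! Go-team_2014</p>"  # hypothetical sample input
print(to_words(sample))
# ['p', 'Hello', 'world', 'Go', 'team', '2014', 'p']
```

Note that tag names such as p survive as words here, because only the punctuation around them is replaced.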
You need to call replace on z and not on str, since you want to replace characters located in the string variable z.
But this will not work either, because replace looks for a substring. You will most likely need to use the regular expression module re with its sub function. Don't forget the [], which indicates that this is a set of characters to be replaced.
replace operates on a specific string, so you need to call it on that string, as in z.replace(old, new). But this is probably not what you need, since it will look for a single substring containing all of those characters in the same order. You can do it with a regular expression, as Danny Michaud pointed out.
As a side note, you might want to look at BeautifulSoup, a library for parsing messy HTML-formatted text like what you usually get from scraping websites.
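The regular-expression approach mentioned in the answers above could look roughly like this. It is a sketch rather than the original posters' code: the hand-escaped character class, the names SPECIAL and split_words, and the sample string z are all assumptions for illustration.

```python
import re

# Inside [...], every character is matched individually. The brackets,
# the backslash and the hyphen are escaped so they are read literally.
SPECIAL = re.compile(r"[!@#$%^&*()\[\]{};:,./<>?\\|`~\-=_+]")


def split_words(text):
    """Replace each special character with a space, then split on whitespace."""
    return SPECIAL.sub(" ", text).split()


z = "one,two;three/four"  # hypothetical stand-in for the downloaded HTML
print(split_words(z))
# ['one', 'two', 'three', 'four']

# For contrast: str.replace only matches the whole substring ",;/",
# which never occurs in z, so nothing changes.
print(z.replace(",;/", " ") == z)
# True
```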
You can replace the special characters with the desired characters in either of two ways.
One way is to use re.sub with a character class; that's my preferred way.
Another way is to use re.escape to build the pattern from a list of special characters.
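Both ways can be sketched as follows. This is a hedged illustration, not the answer's original code: the sample text and variable names are made up, and using string.punctuation as the list of special characters is an assumption.

```python
import re
import string

text = "Hello, wor~ld! Price: 3.50"  # hypothetical sample input

# Way 1: negated character class. Everything that is NOT a letter, digit,
# space, newline or dot is removed.
kept = re.sub(r"[^a-zA-Z0-9 \n\.]", "", text)
print(kept)
# Hello world Price 3.50

# Way 2: build the class from a list of characters. re.escape makes sure
# characters such as ], ^, - and \ are treated literally inside [...].
punct_class = "[" + re.escape(string.punctuation) + "]"
stripped = re.sub(punct_class, "", text)
print(stripped)
# Hello world Price 350
```

The two differ in what they keep: the first keeps the dot because it is listed in the class, while the second removes it because string.punctuation contains it.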
Just a small tip about naming style in Python: by PEP 8, variable names should be lowercase with underscores, so remove_special_chars and not removeSpecialChars.
Also, the space inside [^a-zA-Z0-9 \n\.] is what keeps spaces untouched. If you want spaces to be matched as well, just change it to [^a-zA-Z0-9\n\.]
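To see what the space inside the character class does, here is a small comparison. The sample string and variable names are illustrative, and the replacement is assumed to be an empty string.

```python
import re

text = "one two,three"  # hypothetical sample input

# Space listed in the negated class: spaces are left alone.
with_space = re.sub(r"[^a-zA-Z0-9 \n\.]", "", text)
print(with_space)
# one twothree

# Space not listed: spaces are matched and removed like any other character.
without_space = re.sub(r"[^a-zA-Z0-9\n\.]", "", text)
print(without_space)
# onetwothree
```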