I am currently trying to get the code from this website: http://netherkingdom.netai.net/pycake.html. My Python script then parses out all the code inside the HTML div tags and writes the text from between the div tags to a file. The problem is that it adds a bunch of \r and \n to the file. How can I either avoid this or remove the \r and \n? Here is my code:
import urllib.request
from html.parser import HTMLParser
import re

page = urllib.request.urlopen('http://netherkingdom.netai.net/pycake.html')
t = page.read()

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print(data)
        f = open('/Users/austinhitt/Desktop/Test.py', 'r')
        t = f.read()
        f = open('/Users/austinhitt/Desktop/Test.py', 'w')
        f.write(t + '\n' + data)
        f.close()

parser = MyHTMLParser()
t = t.decode()
parser.feed(t)
And here is the resulting file it makes:
b'
import time as t\r\n
from os import path\r\n
import os\r\n
\r\n
\r\n
\r\n
\r\n
\r\n'
Preferably I would also like to have the beginning b' and last ' removed. I am using Python 3.5.1 on a Mac.
A simple solution is to strip trailing whitespace:
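As a minimal sketch of that idea, assuming the downloaded text is already a str and is processed one line at a time (the sample string and variable names here are made up for illustration):

```python
# Hypothetical sample that mimics the downloaded text, with Windows line endings.
text = "import time as t\r\nfrom os import path\r\nimport os\r\n"

# rstrip('\r\n') removes any trailing '\r' and '\n' characters from a line.
# Calling rstrip() with no argument would remove all trailing whitespace.
lines = [line.rstrip('\r\n') for line in text.split('\n')]

# Rejoin with plain '\n' line endings.
cleaned = '\n'.join(lines)
print(cleaned)
```

Because rstrip() only removes characters that are actually there, the same loop also works when a line ends in a bare \n.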
The advantage of rstrip() over using an a[:-2] slice is that it is safe for UNIX-style files as well.

However, if you only want to get rid of \r and they might not be at the end of the line, then str.replace() is your friend.

If you have a bytes object (that's the leading b'), then you can convert it to a native Python 3 string by decoding it.

One simple solution is just to strip off the last two characters of each line.
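The str.replace(), decoding, and slicing suggestions above can be sketched together as follows. This is only an illustration: the sample bytes value is invented, and UTF-8 is assumed as the encoding, which may or may not match the original page.

```python
# Hypothetical raw bytes, similar to what page.read() returns.
# The second line deliberately uses a UNIX-style '\n' ending.
raw = b"import os\r\nimport sys\nprint('hi')\r\n"

# bytes -> str: decode() yields a real str, so no b'...' wrapper
# appears when the text is written out. UTF-8 is an assumption here.
text = raw.decode('utf-8')

# Remove every carriage return, wherever it occurs in the text.
no_cr = text.replace('\r', '')

# Slicing off the last two characters assumes every line ends in '\r\n';
# the UNIX-style second line loses a real character.
sliced = [line[:-2] for line in text.splitlines(keepends=True)]

# rstrip() only removes line-ending characters that are actually present.
stripped = [line.rstrip('\r\n') for line in text.splitlines(keepends=True)]

print(no_cr)
print(sliced)
print(stripped)
```

The slice is fine when every line is known to end in \r\n, but on mixed input it silently turns "import sys" into "import sy", which is the case rstrip() avoids.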