This is my code:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io
url = ""
#url = ""
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')
if url.find("") != -1:
    div = soup.find('td', valign='top')
div = soup.find('div',id='content')
f = open('path/file_name.html', 'w')
While scraping those web pages, I found some non-ASCII characters in the HTML file written by this script, and I need to either remove them or convert them into readable characters. Any advice? Thanks.
To remove non-ASCII characters from text, try normalizing the string and then encoding it to ASCII while ignoring errors.
Alternatively: characters in a byte string are 8 bits (0-255) and ASCII characters are 7 bits (0-127), so you can simply drop all characters with an ord value of 128 or above.
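A minimal sketch of the normalize-then-encode approach, assuming the text is already a Unicode string. The helper name `to_ascii` and the sample string are made up for illustration. It is written for Python 3; in Python 2 the input would have to be a `unicode` object rather than a byte `str`.

```python
import unicodedata


def to_ascii(text):
    """Return text with accents stripped and remaining non-ASCII dropped."""
    # NFKD splits an accented letter into its base letter plus a
    # separate combining mark, e.g. U+00E9 becomes "e" + U+0301.
    normalized = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" silently discards every
    # code point that has no ASCII equivalent (the combining marks,
    # dashes, curly quotes, and so on).
    return normalized.encode("ascii", "ignore").decode("ascii")


sample = u"Caf\u00e9 na\u00efve \u2013 ok"
print(to_ascii(sample))
```

Accented letters survive as their base letter, while characters with no decomposition (the en dash in the sample) are removed entirely, so some information is lost.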
chr converts an integer to a character; ord converts a character to an integer.
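The ord-based filter could look roughly like this. The function name `drop_non_ascii` is hypothetical, and the sketch assumes a Python 3 string (or a `unicode` object in Python 2).

```python
def drop_non_ascii(text):
    """Keep only the characters whose code point is in the ASCII range."""
    # ord(ch) gives the integer code point; ASCII covers 0-127.
    return "".join(ch for ch in text if ord(ch) < 128)


# chr and ord are inverses of each other for a single character.
print(ord("A"))
print(chr(65))
print(drop_non_ascii(u"Caf\u00e9 \u2013 ok"))
```

Unlike the normalize-and-encode route, this removes accented letters outright instead of reducing them to their base letter.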
This should be your final code: