This is my first question on this site. I have tried many approaches but have not succeeded. I am trying to extract two types of data from a French website similar to Craigslist. My need is simple, and I do manage to get the information, but my extract still contains tags and other stray characters. I also have encoding issues, even when using .encode('utf-8').
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import csv

csvfile=open("test.csv", 'w+')
html=urlopen("http://www.leboncoin.fr/annonces/offres/ile_de_france/")
bsObj=BeautifulSoup(html)
article= bsObj.findAll("h2",{"class":"title"})
prix=bsObj.findAll("div",{"class":"price"})
for art in article:
    art=art.text.encode('utf-8')
    print(article)
for prix1 in prix:
    prix1=prix1.text.encode('utf-8')
    print(prix1)
# To merge 2 lists (as two columns, not one after the other)
table_2=list(zip(article,prix))
try:
    writer=csv.writer(csvfile)
    writer.writerow(('Article', 'Prix'))
    for i in table_2:
        writer.writerow([i])
finally:
    csvfile.close()
When running this code:
- My output still contains HTML tags, etc., even though I ran:
for art in article: art=art.text.encode('utf-8')
- Sometimes the encoding fails because of "€" or "-" signs in the name of the product
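The first symptom appears to be reproducible offline. The following is a minimal sketch, assuming a hypothetical inline snippet (SAMPLE_HTML) in place of the real page and the built-in "html.parser" backend; the names articles, still_tags and titles are illustrative only. It suggests that assigning to the loop variable leaves the original list untouched, whereas collecting the text into a new list yields plain strings.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page (assumption).
SAMPLE_HTML = """
<h2 class="title">Vélo de course</h2>
<h2 class="title">Table - chêne massif</h2>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
articles = soup.find_all("h2", {"class": "title"})

for art in articles:
    # This rebinds the local name "art" only; the list is not modified.
    art = art.text.encode("utf-8")

# The list still holds Tag objects, so printing it shows the markup.
still_tags = [type(a).__name__ for a in articles]
print(still_tags)

# Building a new list of text values keeps plain strings instead.
titles = [a.get_text(strip=True) for a in articles]
print(titles)
```

If this sketch matches the real situation, printing the original list inside the loop would show the tags regardless of what is assigned to art.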
My questions are:
- Why doesn't ".text.encode()" remove the tags from my article object?
- Why do I still have encoding issues?
I guess I am not using the function as intended, but despite my tests I cannot get the result I want.
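To make the target output concrete, here is a rough sketch of the two-column CSV, assuming hypothetical sample strings (titles, prices) in place of the scraped values and an in-memory io.StringIO buffer in place of test.csv. It also shows what calling .encode('utf-8') on a Python 3 string produces, which may be related to the encoding symptom.

```python
import csv
import io

# Hypothetical sample values standing in for scraped text (assumption).
titles = ["Vélo de course", "Table - chêne massif"]
prices = ["50 €", "120 €"]

# In-memory stand-in for a file opened in text mode with newline="".
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(("Article", "Prix"))
for title, price in zip(titles, prices):
    # One value per column, rather than one tuple in a single cell.
    writer.writerow((title, price))

csv_text = buffer.getvalue()
print(csv_text)

# In Python 3, str.encode() returns a bytes object, which prints with a b'' prefix.
encoded_price = prices[0].encode("utf-8")
print(encoded_price)
```

With a real file, the same writer code would presumably work on a handle opened with open("test.csv", "w", encoding="utf-8", newline="").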
Thank you in advance for your insights.
Cheers
Jo