Extracting data properly with bs4?

Posted 2019-09-16 14:18

Question:

This is my first question on this site; I have tried many ways to get what I want, but I didn't succeed. I am trying to extract 2 types of data from a French website similar to Craigslist. My need is simple, and I manage to get the information, but I still have tags and other signs in my extract. I also have issues with encoding, even when using .encode('utf-8').

# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import csv

csvfile=open("test.csv", 'w+')

html=urlopen("http://www.leboncoin.fr/annonces/offres/ile_de_france/")

bsObj=BeautifulSoup(html)
article= bsObj.findAll("h2",{"class":"title"})
prix=bsObj.findAll("div",{"class":"price"})

for art in article:
    art=art.text.encode('utf-8')

print(article)

for prix1 in prix:
    prix1=prix1.text.encode('utf-8')
    print(prix1)

#Pour merger 2 listes (en deux colonnes, pas a la suite)
table_2=list(zip(article,prix))

try:
    writer=csv.writer(csvfile)
    writer.writerow(('Article', 'Prix'))

    for i in table_2:
        writer.writerow([i])

finally:
    csvfile.close()

When running this code:

  • My output still contains HTML tags, etc., although I have run:

for art in article: art=art.text.encode('utf-8')

  • Sometimes the encoding fails because of "€" or "-" signs in the name of the product

My questions are:

  • Why does ".text.encode()" not remove the tags from my article object?
  • Why do I still have encoding issues?

I guess I am not using the function as expected, but despite my tests I do not get the result I want.

Thank you in advance for your insights.

Cheers

Jo

Answer 1:

You may have already spotted your mistake: you were zipping and writing the Tag objects, not their text. Reassigning art inside the for loop only rebinds the loop variable; the article list still holds the original tags. I corrected your code below:

# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

html=urlopen("http://www.leboncoin.fr/annonces/offres/ile_de_france/")

bsObj=BeautifulSoup(html, "html.parser")
article= bsObj.findAll("h2",{"class":"title"})
prix=bsObj.findAll("div",{"class":"price"})

# Collect the stripped text of each tag in a new list.
# In Python 3, keep it as str: .encode() would produce bytes,
# which csv would write as b'...'.
articles = []
for art in article:
    articles.append(art.text.strip())

print(articles)

prices = []
for prix1 in prix:
    prices.append(prix1.text.strip())

# Merge the 2 lists into two columns (not one after the other)
table_2=list(zip(articles,prices))

# The file object handles the UTF-8 encoding; newline='' is what the csv module expects
with open("test.csv", 'w', newline='', encoding='utf-8') as csvfile:
    writer=csv.writer(csvfile)
    writer.writerow(('Article', 'Prix'))

    for row in table_2:
        # Pass the tuple itself so each value gets its own column
        writer.writerow(row)
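
To illustrate why the original loop had no effect, here is a minimal sketch. The SAMPLE_HTML markup and the variable names are made up for the illustration and are not taken from the real page; it assumes bs4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real listing page.
SAMPLE_HTML = """
<h2 class="title"> Vélo de course </h2>
<h2 class="title"> Table basse - chêne </h2>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tags = soup.find_all("h2", {"class": "title"})

# Rebinding the loop variable does not change what the list contains.
for tag in tags:
    tag = tag.text.strip()

# The list still holds Tag objects, so printing it shows the markup.
still_tags = [type(t).__name__ for t in tags]

# Building a new list is what actually keeps the extracted text.
titles = [t.text.strip() for t in tags]

print(still_tags)
print(titles)
```

The same reasoning applies to the price list: the text has to be stored somewhere, otherwise zip() receives the tags again.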

Also try not to leave comments in French next time ;)
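
On the encoding question, a small sketch of the CSV step on its own may help. The rows are invented sample data, and io.StringIO stands in for a file opened with open("test.csv", "w", newline="", encoding="utf-8"); in Python 3 the text can stay as str until it reaches the file, so "€" and "-" need no special handling.

```python
import csv
import io

# Hypothetical rows, shaped like the output of zip(articles, prices).
rows = [("Vélo de course", "120 €"), ("Table basse - chêne", "45 €")]

# In-memory stand-in for a text file opened with newline="".
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(("Article", "Prix"))
for row in rows:
    # Passing the tuple itself gives one cell per value.
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)

# Encoding happens only at the file boundary, and UTF-8 round-trips the text.
encoded = csv_text.encode("utf-8")
round_trip = encoded.decode("utf-8")
```

Note that writer.writerow([row]) would instead put the whole tuple into a single cell, which is why the original output looked wrong even once the text was extracted.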