Scrapy: tbody tag returns an empty result even though it has text

Posted 2019-07-14 01:34

I am trying to scrape and crawl a website. The data I want (event names) is inside a tbody tag. When I inspect the page in the Chrome console, the tbody tag contains text, but when I try to scrape it the result is empty (I also tested this in the Scrapy shell). I checked for an AJAX call, because that can break the script, but the page does not seem to use one.

Do you have any idea why the result is empty even though the tbody tag has text in the source code?

Here is my code:

import scrapy
from scrapy import Selector
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join
from justrunlah.items import JustrunlahItem  # assuming the item is defined in justrunlah/items.py

nom_robot = 'ListeCAP'
domaine = ['www.justrunlah.com']
base_url = [
    "https://www.justrunlah.com/running-events-calendar-malaysia",
    "https://www.justrunlah.com/running-events-calendar-australia",
]

class ListeCourse_level1(scrapy.Spider):
    name = nom_robot
    allowed_domains = domaine
    start_urls = base_url 

    def parse(self, response):

        selector = Selector(response)

        for unElement in response.xpath('//*[@id="td-outer-wrap"]/div[3]/div/div/div[1]/div/div[2]/div[3]/table/tbody/tr'):
            loader = ItemLoader(JustrunlahItem(), selector=unElement)
            loader.add_xpath('eve_nom_evenement', './/td[2]/div/div[1]/div/a/text()')

            # define processors
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()

            if response.xpath('//a[@class="smallpagination"]'):
                next_page = response.meta.get('page_number', 1) + 1
                next_page_url = '{}?page={}'.format(base_url, next_page)
                yield scrapy.Request(next_page_url, callback=self.parse, meta={'page_number': next_page})

The terminal output:

['https://www.justrunlah.com/running-events-calendar-malaysia/', 'https://www.justrunlah.com/running-events-calendar-australia/']
-----------------------------
2018-03-08 12:34:56 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: justrunlah)
2018-03-08 12:34:56 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'justrunlah', 'NEWSPIDER_MODULE': 'justrunlah.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['justrunlah.spiders']}
2018-03-08 12:34:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-03-08 12:34:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-08 12:34:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
NOM TABLE EN SORTIE :
import_brut_['www.justrunlah.com']
2018-03-08 12:34:57 [scrapy.middleware] INFO: Enabled item pipelines:
['justrunlah.pipelines.JustrunlahPipeline']
2018-03-08 12:34:57 [scrapy.core.engine] INFO: Spider opened
2018-03-08 12:34:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-08 12:34:57 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-03-08 12:34:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/robots.txt> (referer: None)
2018-03-08 12:34:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-malaysia/> (referer: None)
2018-03-08 12:34:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-australia/> (referer: None)
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-03-08 12:34:58 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-08 12:34:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 849,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 76317,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 8, 11, 34, 58, 593309),
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 3, 8, 11, 34, 57, 419191)}
2018-03-08 12:34:58 [scrapy.core.engine] INFO: Spider closed (finished)

And the Scrapy shell, where I also tested, returns the same empty result.

3 Answers
Juvenile、少年° · 2019-07-14 02:26

It's a common problem: sometimes there is no tbody tag in the source HTML of a table (modern browsers add it to the DOM automatically), so always check the HTML source code. Here, the following XPath works:

//*[@class="cal2table"]//tr/td[2]/div/div[1]/div/a/text()
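As a quick way to verify this, here is a minimal Scrapy shell sketch (untested; the cal2table class name is taken from this answer, and .extract() matches the Scrapy 1.4 API shown in your log):

# Start the shell against one of the calendar pages:
#
#   scrapy shell "https://www.justrunlah.com/running-events-calendar-malaysia/"
#
# Then, at the shell prompt:
response.xpath('//*[@class="cal2table"]/tbody/tr').extract()   # empty: tbody is not in the raw HTML
response.xpath('//*[@class="cal2table"]//tr').extract()        # matches the table rows
response.xpath('//*[@class="cal2table"]//tr/td[2]/div/div[1]/div/a/text()').extract()  # event names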
爷的心禁止访问 · 2019-07-14 02:27

Just remove tbody from your XPath or CSS expression and it will work.

Modern browsers are known for adding tbody elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won’t be able to extract any data if you use tbody in your XPath expressions.
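For example, a rough sketch of the loop from the question with tbody dropped from the row XPath (untested; the loader setup is kept minimal here):

for unElement in response.xpath('//*[@id="td-outer-wrap"]/div[3]/div/div/div[1]/div/div[2]/div[3]/table//tr'):
    # Same loader setup as in the question; only the row XPath changed (no tbody step).
    loader = ItemLoader(JustrunlahItem(), selector=unElement)
    loader.add_xpath('eve_nom_evenement', './/td[2]/div/div[1]/div/a/text()')
    loader.default_output_processor = Join()
    yield loader.load_item()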

够拽才男人 · 2019-07-14 02:35

I assume you are trying to select all the event names. If so, you can use this XPath: //*[@class="cal2table"]/tbody/tr/td[2]/div/div[1]/div/a/text()

So I believe the issue you are having is due to your XPath definitions. Without any further information on what you're trying to select, this is the best answer I can give.

A tip: you can use the following command in the Chrome/Firefox console to test your XPath:
$x('//*[@class="cal2table"]/tbody/tr/td[2]/div/div[1]/div/a/text()')

To use this with the item loading you currently have, try the following snippet instead. I haven't tested this, so you may need to make small adjustments.

for unElement in response.xpath('//*[@class="cal2table"]//tr'):
    loader = ItemLoader(JustrunlahItem(), selector=unElement)
    loader.add_xpath('eve_nom_evenement', './/td[2]/div/div[1]/div/a/text()')
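As a rough sketch of a full parse method built around that loop (untested; it also builds the next-page URL from response.url rather than from the base_url list, which your current code formats directly into the URL string, and it assumes the calendar pages accept a ?page=N parameter as in your code):

def parse(self, response):
    # Select the table rows without going through tbody.
    for unElement in response.xpath('//*[@class="cal2table"]//tr'):
        loader = ItemLoader(JustrunlahItem(), selector=unElement)
        loader.add_xpath('eve_nom_evenement', './/td[2]/div/div[1]/div/a/text()')
        loader.default_output_processor = Join()
        yield loader.load_item()

    # Pagination sketch: follow the next page while a pagination link exists.
    if response.xpath('//a[@class="smallpagination"]'):
        next_page = response.meta.get('page_number', 1) + 1
        next_page_url = '{}?page={}'.format(response.url.split('?')[0], next_page)
        yield scrapy.Request(next_page_url, callback=self.parse,
                             meta={'page_number': next_page})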
