I am trying to crawl data from a website with Scrapy (1.5.0) in Python.
Project directory:
stack/
    scrapy.cfg
    stack/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            stack_spider.py
Here is my items.py:
import scrapy


class StackItem(scrapy.Item):
    title = scrapy.Field()
And here is stack_spider.py:
from scrapy import Spider
from scrapy.selector import Selector
from stack.items import StackItem


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["batdongsan.com.vn"]
    start_urls = [
        "https://batdongsan.com.vn/nha-dat-ban",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="p-title"]/h3')
        for question in questions:
            item = StackItem()
            item['title'] = question.xpath(
                'a/text()').extract()[0]
            yield item
I don't know why I can't crawl the data. I really need your help. Thanks.
Set User Agent
Go to your Scrapy project's settings.py
and paste this in:
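As a minimal sketch of what such a setting could look like: `USER_AGENT` is Scrapy's setting for the default User-Agent header, while the browser string below is only an illustrative assumption, not a value taken from this answer.

```python
# settings.py (fragment)
# USER_AGENT is the Scrapy setting for the default User-Agent header.
# The browser string is an example value only; any realistic browser
# User-Agent could be used instead.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/66.0.3359.139 Safari/537.36"
)
```

Running `scrapy crawl stack` again afterwards would show whether the site starts returning the expected page.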
If you just want to crawl the website and get the source code, this might help.
To parse each page, you need to add a little more code.
Replace the domain with your own. Also, I didn't use the Item class in my code.
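For illustration only, a parse callback that skips the Item class might look like the sketch below. It assumes the XPath from the question still matches the page, and `parse_titles` is a hypothetical helper name. Scrapy accepts plain dicts yielded from a callback, so no `StackItem` is needed.

```python
def parse_titles(response):
    """Yield one plain dict per listing title found in the response.

    `response` is expected to behave like a Scrapy response, i.e. to
    offer .xpath(...).extract() returning a list of strings.
    The XPath is taken from the question and may need adjusting.
    """
    titles = response.xpath('//div[@class="p-title"]/h3/a/text()').extract()
    for title in titles:
        title = title.strip()
        if title:  # skip empty or whitespace-only text nodes
            yield {"title": title}
```

Inside a spider, this could be used as `def parse(self, response): return parse_titles(response)`.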
I found the answer: I needed to set a request header (the User-Agent) to get past the site's crawl prevention. See http://edmundmartin.com/random-user-agent-requests-python/
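The linked article rotates User-Agent strings at random. In Scrapy, one way to sketch the same idea is a small downloader middleware. The class name, the list of strings, and the module path in the comment are assumptions made for illustration, not something taken from the article or from this project.

```python
import random

# Example User-Agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/11.1 Safari/605.1.15",
]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: random User-Agent per request."""

    def process_request(self, request, spider):
        # Overwrite the header before the request is downloaded.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue handling the request.
        return None


# To enable it, settings.py would need an entry such as the following
# (the module path is an assumption about where the class is saved):
# DOWNLOADER_MIDDLEWARES = {
#     "stack.middlewares.RandomUserAgentMiddleware": 400,
# }
```

A priority below 500 is chosen here so this middleware runs before Scrapy's built-in User-Agent middleware, which only fills in the header when it is not already set.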