How to find all the jobs listed on a website?


I would like to get all the jobs posted on the website https://www.germanystartupjobs.com using Scrapy. Since the jobs are loaded by a POST request, I set start_urls= ['https://www.germanystartupjobs.com/jm-ajax/get_listings/']. I found this URL for the first page in the Network tab of the Chrome dev tools, by filtering for requests with method: POST.
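Before wiring this into the spider, it can help to confirm what the endpoint actually returns with a quick standalone check. This is only a sketch: it assumes the endpoint answers a plain POST with a JSON body that contains an 'html' key, which is what my spider below relies on.

import requests

# Assumed behaviour: the endpoint returns JSON whose 'html' key holds the
# rendered job listings. Verify the keys against the real response.
resp = requests.post('https://www.germanystartupjobs.com/jm-ajax/get_listings/')
data = resp.json()
print(list(data.keys()))     # expect an 'html' key among the returned fields
print(data['html'][:200])    # first characters of the listings markup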

I expected the second page to use a different URL, but that doesn't seem to be the case here. I also tried

start_urls= ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]

to generate URLs for further pages by appending an index, which doesn't help. The current version of my code is:

import scrapy
import json
import re
import textwrap 


class GermanyStartupJobs(scrapy.Spider):

    name = 'gsjobs'
    start_urls= ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]

    def parse(self, response):

        data = json.loads(response.body)
        selector = scrapy.Selector(text=data['html'], type="html")
        hrefs = selector.xpath('//a/@href').extract()

        print("LENGTH =", len(hrefs))

        for href in hrefs:
            yield scrapy.Request(href, callback=self.parse_detail)


    def parse_detail(self, response):

        try:
            full_d = str(response.xpath(
                '//div[@class="col-sm-5 justify-text"]//*/text()').extract())

            full_des_li = full_d.split(',')
            full_des_lis = []

            for f in full_des_li:
                ff = "".join((f.strip().replace('\n', '')).split())
                if len(ff) < 3:
                    continue 
                full_des_lis.append(f)

            full = 'u' + str(full_des_lis)

            length = len(full)
            full_des_list = textwrap.wrap(full, length // 3)[:-1]

            full_des_list.reverse()


            # get the job title             
            try:
                title = response.css('.job-title').xpath('./text()').extract_first().strip()
            except:
                print "No title"
                title = ''

            # get the company name
            try:
                company_name = response.css('.company-title').xpath('./normal/text()').extract_first().strip()
            except:
                print "No company name"
                company_name = ''


            # get the company location  
            try:
                company_location = response.xpath('//a[@class="google_map_link"]/text()').extract_first().strip()
            except:
                print('No company location')
                company_location = ''

            # get the job poster email (if available)            
            try:
                pattern = re.compile(r"(\w(?:[-.+]?\w+)+@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)

                # take the last e-mail-like match from the first chunk that has one
                email = ''
                for text in full_des_list:
                    matches = pattern.findall(text)
                    if matches:
                        email = matches[-1]
                        break
            except:
                print('No email')
                email = ''

            # get the job poster phone number(if available)                        
            try:
                r = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
                phone = r.findall(full_des_list[0])[-1]

                if phone is not None:
                    phone = '+49-' +phone

            except:
                print('No phone')
                phone = ''

            yield {
                'title': title,
                'company_name': company_name,
                'company_location': company_location, 
                'email': email,
                'phone': phone,
                'source': u"Germany Startup Job" 
            }

        except:
            print('Not valid')
            # raise Exception("Think better!!")

I would like to get similar info from at least the first 17 pages of the website. How could I achieve that and improve my code? After getting the required info, I plan to use multi-threading to speed up the process and nltk to search for the poster's name (if available).
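For the nltk part, this is only a rough sketch of how person names might be pulled out of the description text with nltk's named-entity chunker; it assumes the standard punkt/tagger/chunker models are downloaded and has not been tested against the scraped data.

import nltk

# Assumed one-time downloads for the chunker below:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')

def find_person_names(text):
    # Tokenize, POS-tag and NE-chunk the text, then keep PERSON entities.
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == 'PERSON']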

Tags: python scrapy
1 Answer

To scrape the site that way, you'll have to actually figure out how the data is passed between client and server by looking at the request content. The page of data you want, so to speak, probably can't be expressed in the URL alone.

Have you analyzed the network requests the site makes when you visit it in a browser? It might be pulling content from URLs that you, too, can access to retrieve the data in a machine-readable form. That would be a lot easier than scraping the rendered site.
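For example, if the Network tab shows that the get_listings endpoint takes the page number as POST form data (a common pattern for this kind of AJAX listing, but an assumption you would need to confirm in the dev tools; the field name 'page' below is hypothetical), the spider could request pages explicitly with FormRequest instead of appending a number to the URL:

import json
import scrapy


class GermanyStartupJobsPaginated(scrapy.Spider):
    name = 'gsjobs_paginated'
    base_url = 'https://www.germanystartupjobs.com/jm-ajax/get_listings/'

    def start_requests(self):
        # Assumption: the endpoint accepts the page number as POST form data
        # under a field called 'page'. Check the actual payload in the
        # Network tab before relying on this.
        for page in range(1, 18):  # the first 17 pages
            yield scrapy.FormRequest(
                self.base_url,
                formdata={'page': str(page)},
                callback=self.parse,
            )

    def parse(self, response):
        data = json.loads(response.text)
        selector = scrapy.Selector(text=data['html'], type="html")
        for href in selector.xpath('//a/@href').extract():
            yield scrapy.Request(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Same detail parsing as in the question's spider.
        pass

Each FormRequest carries the page index in the request body rather than the URL, which matches how the browser's own AJAX call would paginate if that is indeed what the dev tools show.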
