I would like to get all the jobs posted on the website https://www.germanystartupjobs.com using Scrapy. Since the jobs are loaded by a POST request, I set start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/']. I found this URL for the first page in the Network tab of the Chrome dev tools, using the filter method:POST.
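To double-check what that endpoint returns, I also fetched it outside Scrapy with requests (just a quick sketch; the html key is the one my spider relies on below):

import requests

# Hit the AJAX endpoint directly and inspect the JSON it returns.
r = requests.post('https://www.germanystartupjobs.com/jm-ajax/get_listings/')
data = r.json()
print(list(data.keys()))  # the rendered listings come back under 'html'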
I expected the second page to use a different URL, but that doesn't seem to be the case here. I also tried

start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]

to generate more pages by index, which didn't help either. The current version of my code is here:
import scrapy
import json
import re
import textwrap


class GermanyStartupJobs(scrapy.Spider):

    name = 'gsjobs'
    start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]

    def parse(self, response):
        # The endpoint returns JSON; the rendered listings sit under 'html'
        data = json.loads(response.body)
        html = data['html']
        selector = scrapy.Selector(text=html, type="html")
        hrefs = selector.xpath('//a/@href').extract()
        print("LENGTH = ", len(hrefs))

        for href in hrefs:
            yield scrapy.Request(href, callback=self.parse_detail)

    def parse_detail(self, response):
        try:
            # collect the text fragments of the job description column
            full_d = str(response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract())
            full_des_li = full_d.split(',')
            full_des_lis = []

            for f in full_des_li:
                ff = "".join((f.strip().replace('\n', '')).split())
                if len(ff) < 3:
                    continue
                full_des_lis.append(f)

            full = 'u' + str(full_des_lis)
            length = len(full)
            full_des_list = textwrap.wrap(full, length // 3)[:-1]
            full_des_list.reverse()

            # get the job title
            try:
                title = response.css('.job-title').xpath('./text()').extract_first().strip()
            except:
                print("No title")
                title = ''

            # get the company name
            try:
                company_name = response.css('.company-title').xpath('./normal/text()').extract_first().strip()
            except:
                print("No company name")
                company_name = ''

            # get the company location
            try:
                company_location = response.xpath('//a[@class="google_map_link"]/text()').extract_first().strip()
            except:
                print("No company location")
                company_location = ''

            # get the job poster email (if available)
            try:
                pattern = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
                for text in full_des_list:
                    email = pattern.findall(text)[-1]
                    if email is not None:
                        break
            except:
                print("No email")
                email = ''

            # get the job poster phone number (if available)
            try:
                r = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
                phone = r.findall(full_des_list[0])[-1]
                if phone is not None:
                    phone = '+49-' + phone
            except:
                print("no phone")
                phone = ''

            yield {
                'title': title,
                'company name': company_name,
                'company_location': company_location,
                'email': email,
                'phone': phone,
                'source': u"Germany Startup Job"
            }
        except:
            print("Not valid")
            # raise Exception("Think better!!")
I would like to get the same info from at least the first 17 pages of the website. How could I achieve that, and how can I improve my code? After getting the required info, I plan to use multi-threading to speed up the process and nltk to search for the poster name (if available); a rough sketch of that follows.
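For the record, this is roughly the nltk pipeline I have in mind for the name search (just a sketch, not wired into the spider yet; it needs the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words data downloaded first):

import nltk

def extract_person_names(text):
    # Return the PERSON entities nltk's named-entity chunker finds in text.
    names = []
    for sent in nltk.sent_tokenize(text):
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
        for subtree in tree.subtrees():
            if subtree.label() == 'PERSON':
                names.append(' '.join(leaf[0] for leaf in subtree.leaves()))
    return names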
You'll have to figure out how the data is actually passed between client and server before you can scrape the site that way. The "page" of data you want, so to speak, probably can't be expressed in the URL alone; it is most likely carried in the body of the POST request.
Have you analyzed the network connections the site makes when you visit it in a browser? It might be pulling content from URLs that you, too, can access to retrieve the data in a machine-readable fashion. That would be a lot easier than scraping the rendered site.
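In this case the Network tab already tells you most of what you need: select the get_listings request and look at the Form Data panel to see which fields the POST body carries, then replicate one such request per page with scrapy.FormRequest. A sketch of that, assuming the endpoint reads the page number from a page form field (verify the real field names in dev tools before relying on this):

import json
import scrapy

class GermanyStartupJobsSpider(scrapy.Spider):
    name = 'gsjobs_paged'

    def start_requests(self):
        # One POST per page; 'page' is an assumed field name taken from
        # the Form Data panel -- substitute whatever the site really sends.
        for page in range(1, 18):  # first 17 pages
            yield scrapy.FormRequest(
                'https://www.germanystartupjobs.com/jm-ajax/get_listings/',
                formdata={'page': str(page)},
                callback=self.parse,
            )

    def parse(self, response):
        # Same JSON-with-embedded-HTML handling as in your spider.
        data = json.loads(response.body)
        selector = scrapy.Selector(text=data['html'], type='html')
        for href in selector.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        pass  # extract title, company, email, etc. as you already do

As an aside, Scrapy already fetches requests concurrently (tunable via the CONCURRENT_REQUESTS setting), so you shouldn't need your own multi-threading on top of it.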