Heroku and Web scraping

Posted 2019-04-28 19:47

Question:

I have a Nokogiri web scraper that writes to a database, and I'm trying to deploy it to Heroku. I also have a Sinatra application as the frontend, which should read from that database. I'm new to Heroku and web development, and don't know the best way to handle something like this.

Do I have to put the scraper script that writes to the database under a Sinatra route (like mywebsite.com/scraper) and just make the URL so obscure that no one visits it? In the end, I'd like the Sinatra part to be a REST API that reads from the database.

Thanks for all input

Answer 1:

There are two approaches you can take.

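The first is to use one-off dynos: run the scraper from the console with heroku run YOURCMD. Just make sure the scraper doesn't write to disk but stores its results in the database, since a dyno's filesystem is ephemeral and anything written to it is lost when the dyno exits.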

More information: https://devcenter.heroku.com/articles/one-off-dynos
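
As a rough sketch of the "uses database" part: Heroku Postgres exposes its connection string in the DATABASE_URL config var, so a scraper started with heroku run can read its database settings from there instead of keeping state on disk. The helper name database_settings and the hash keys below are hypothetical, and the example URL is made up; adapt them to whatever database library you use.

```ruby
require 'uri'

# Turn a DATABASE_URL-style connection string into a settings hash.
# The key names are illustrative, not tied to any particular library.
def database_settings(url = ENV['DATABASE_URL'])
  raise ArgumentError, 'DATABASE_URL is not set' if url.nil? || url.empty?

  uri = URI.parse(url)
  {
    adapter:  uri.scheme,
    host:     uri.host,
    port:     uri.port,
    user:     uri.user,
    password: uri.password,
    database: uri.path.sub(%r{\A/}, '') # drop the leading slash
  }
end

# Made-up URL for illustration only.
settings = database_settings('postgres://user:secret@db.example.com:5432/scraper_db')
puts settings[:database]
```

The scraper script would then open its connection from these settings and insert each scraped record, so nothing needs to survive on the dyno after the command finishes.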

The second is to separate the scraper from the web process: the web process handles normal UI interaction, and a separate scraper process does the scraping, which the web process can spawn or talk to. If you take this route, it's up to you how to protect it from the rest of the world (authentication, URL obfuscation, etc.).

More information: https://devcenter.heroku.com/articles/background-jobs-queueing
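
The web/scraper split can be sketched in plain Ruby as a producer and a consumer around a queue. This is only an illustration under loud assumptions: on Heroku the web and scraper processes run on separate dynos and share no memory, so the in-memory Queue below stands in for an external queue (for example a database table or Redis), and scrape, enqueue_scrape, and run_worker are hypothetical names.

```ruby
require 'thread'

# Hypothetical placeholder for the real scraping logic.
def scrape(company)
  "scraped #{company}"
end

# What a web route would do: hand the job off and return immediately.
def enqueue_scrape(jobs, company)
  jobs << company
  :queued
end

# What the scraper process would do: pull jobs until told to stop.
def run_worker(jobs, results)
  while (company = jobs.pop) != :stop
    results << scrape(company)
  end
end

jobs = Queue.new     # stand-in for an external job queue
results = Queue.new  # stand-in for rows written to the database

worker = Thread.new { run_worker(jobs, results) }
enqueue_scrape(jobs, 'google')
jobs << :stop
worker.join
puts results.pop
```

The point of the design is that the web request never waits for the scrape itself; it only records that work is needed, and the scraper process picks it up on its own schedule.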



Answer 2:

I did it by creating a rake task and running it on a one-off dyno, as mentioned by XLII.

Here is my rake task file:

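require 'bundler/setup'
Bundler.require # loads every gem in the Gemfile, including Mechanize

desc "Scrape Site"
# Assumes an :environment task is defined elsewhere (Rails provides one);
# in a plain Sinatra app, define it yourself or drop the dependency.
task :scrape, [:companyname] => :environment do |t, args|
  # Interpolation avoids a TypeError when no company name is passed
  puts "Company Name is: #{args[:companyname]}"

  # Mechanize agent that identifies itself as Safari on a Mac
  agent = Mechanize.new
  agent.user_agent_alias = 'Mac Safari'
  puts "Agent (Mac Safari) created"
  # MORE SCRAPING CODE
end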

You can run it with:

heroku run rake scrape[google]