I have a Nokogiri web scraper that writes to a database, and I'm trying to deploy it to Heroku. I also have a Sinatra application as the front end that should read from that database. I'm new to Heroku and web development, and don't know the best way to handle something like this.
Do I have to place the scraper script that writes to the database under a Sinatra route (like mywebsite.com/scraper) and just make it so obscure that no one visits it? In the end, I'd like the Sinatra part to be a REST API that pulls from the database.
Thanks for all input
There are two approaches you can take.
The first is to use one-off dynos: run the scraper from the console with heroku run YOURCMD. Just make sure the scraper writes to the database rather than to disk, because a dyno's filesystem is ephemeral.
More information:
https://devcenter.heroku.com/articles/one-off-dynos
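To illustrate the "use the database, not the disk" point: if the database is Heroku Postgres (an assumption, the question does not say which database is used), Heroku exposes the connection string in the DATABASE_URL config var, and both the scraper and the Sinatra app can read it from there. A minimal sketch using only the standard library; database_settings is a hypothetical helper name and the fallback URL is a made-up local default:

```ruby
require 'uri'

# Hypothetical helper: split a DATABASE_URL-style string into
# connection settings that a database library could be given.
# Note: percent-encoded credentials are returned as-is, not decoded.
def database_settings(url)
  uri = URI.parse(url)
  {
    adapter:  uri.scheme,
    host:     uri.host,
    port:     uri.port,
    user:     uri.user,
    password: uri.password,
    database: uri.path.sub(%r{\A/}, '') # drop the leading slash
  }
end

# The fallback URL is only an illustrative local default.
settings = database_settings(
  ENV.fetch('DATABASE_URL', 'postgres://user:secret@localhost:5432/scraper_dev')
)
puts "Connecting to database #{settings[:database]} on #{settings[:host]}"
```

Because the one-off dyno and the web dyno see the same config vars, the scraper run with heroku run and the Sinatra front end end up talking to the same database.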
The second is to separate the scraper from the web process: the web process handles normal UI interaction, and a separate scraper process does the scraping, which the web process can spawn or talk to. If you take this route, it's up to you how to protect it from the rest of the world (auth, URL obfuscation, etc.).
More information:
https://devcenter.heroku.com/articles/background-jobs-queueing
I did it by creating a rake task and running it in a one-off dyno, as mentioned by XLII.
Here is my rake task file:
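require 'bundler/setup'
# Requires the gems listed in the Gemfile (Mechanize is assumed to be among them)
Bundler.require

desc "Scrape Site"
# The :environment prerequisite is not built into Rake; it must be defined elsewhere in the Rakefile
task :scrape, [:companyname] => :environment do |t, args|
  # Interpolation avoids a TypeError when no company name is passed
  puts "Company Name is: #{args[:companyname]}"
  agent = Mechanize.new
  agent.user_agent_alias = 'Mac Safari'
  puts "Agent (Mac Safari) created"
  # MORE SCRAPING CODE
end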
You can run it with:
heroku run rake scrape[google]
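One thing to watch in a Sinatra project: Rake has no built-in :environment task (that name is a Rails convention), so a task that depends on it needs one defined in the Rakefile. A minimal sketch of how that might look; scrape_company is a hypothetical stand-in for the real scraping code, and the 'google' default is only an example:

```ruby
require 'rake'

# Hypothetical helper standing in for the real scraping logic.
def scrape_company(name)
  "Scraping #{name}"
end

# Assumed setup task: load the app and open the database connection here,
# e.g. with require_relative 'app' (file name is an assumption).
task :environment do
  # intentionally empty in this sketch
end

desc 'Scrape Site'
task :scrape, [:companyname] => :environment do |_t, args|
  # Fall back to an example value when no argument is given
  args.with_defaults(companyname: 'google')
  puts scrape_company(args[:companyname])
end
```

With this in the Rakefile, heroku run rake scrape[google] runs :environment first and then the scraper, all inside a one-off dyno.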