I am trying to parse Open Graph (og) meta tags from pages fetched with the HTTParty gem, using this code:
require 'httparty'
link = 'http://www.usatoday.com/story/gameon/2013/01/08/nfl-jets-tony-sparano-fired/1817037/'
# link = 'http://news.yahoo.com/chicago-lottery-winners-death-ruled-homicide-181627271.html'
resp = HTTParty.get(link)
ret_body = resp.body
# title
og_title = ret_body.match(/\<[Mm][Ee][Tt][Aa] property\=\"og:title\"\ content\=\"(.*?)\"\/\>/)
og_title = og_title[1].to_s
The problem is that it works on some sites (Yahoo!) but not on others (USA Today).
Perhaps I can offer an easier solution? Check out the OpenGraph gem.
It's a simple library for parsing Open Graph protocol information from web sites and should solve your problem.
Don't parse HTML with regular expressions, because they're too fragile for anything but the simplest problems. A tiny change to the HTML can break the pattern, leaving you in a slow battle to maintain an ever-expanding pattern. It's a war you won't win.
Instead, use an HTML parser. Ruby has Nokogiri, which is excellent. Here's how I'd do what you want:
Which outputs:
Solution:
Trailing whitespace broke the match, so make sure to check for it. I added an OR clause to the regex so that it accepts the tag both with and without trailing whitespace.
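A sketch of what such a pattern could look like, not necessarily the exact regex used here. It assumes the whitespace in question is a single space before the closing slash; the constant name, the helper method, and the sample tags are hypothetical.

```ruby
# Hypothetical pattern: the alternation accepts both "/> and " /> after the
# content attribute; the i flag replaces the [Mm][Ee][Tt][Aa] character classes.
OG_TITLE = /<meta property="og:title" content="(.*?)"(?:\/>| \/>)/i

# Made-up sample tags, one without and one with the trailing space.
tight = '<meta property="og:title" content="Tight title"/>'
loose = '<META property="og:title" content="Loose title" />'

def og_title(html)
  match = html.match(OG_TITLE)
  match && match[1] # nil when the tag is absent, instead of raising
end

puts og_title(tight)
puts og_title(loose)
```

Replacing the alternation with \s*\/> would also cover tabs or several spaces, but any other change to the markup, such as a different attribute order, would still break the pattern.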