Regex to extract all Starred Items URLs from Googl

Published 2019-06-11 22:48

Question:

Sadly, it was announced that Google Reader will be shut down in the middle of the year. Since I have a large number of starred items in Google Reader, I'd like to back them up. This is possible via Google Reader takeout, which produces a file in JSON format.

Now I would like to extract all of the article URLs from this file, which is several MB in size.

At first I thought it would be best to use a general-purpose URL regex, but it seems better to use a regex that matches only the article URLs. That avoids also extracting other URLs that are not needed.

Here is a short example of what part of the JSON file looks like:

"published" : 1359723602,
"updated" : 1359723602,
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],

I just need the URLs that appear here:

 "canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],

Perhaps someone is in the mood to say what a regex would have to look like to extract all these URLs?

The benefit would be a quick and dirty way to extract the starred item URLs from Google Reader so that, once processed, they can be imported into services like Pocket or Evernote.

Answer 1:

I know you asked about regex, but I think there's a better way to handle this problem. Multi-line regular expressions are a PITA, and in this case there's no need for that kind of brain damage.

I would start with grep, rather than a single multi-line regex. The -A1 parameter says "return the line that matches, and one line after it":

grep -A1 "canonical" <file>

This will return lines like this:

"canonical" : [ {
    "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Then, I'd grep again for the href:

grep -A1 "canonical" <file> | grep "href"

giving

"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Now I can use awk to get just the URL:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' 

which strips everything up to and including the opening quote of the URL, leaving only the trailing quote:

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Now I just need to get rid of the extra quote:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' | tr -d '"'

That's it!

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/
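
For reuse, the steps above could be wrapped in a small shell function. This is only a sketch: the function names and the file name starred.json are made up for illustration, and it assumes, as in the sample, that each "href" sits on the line directly after "canonical". Note that -A is a grep extension (available in GNU and BSD grep) rather than part of POSIX. The second function is a variation, not part of the steps above: it splits each line on the double quote itself, so the URL is the fourth field and no separate tr step is needed.

```shell
#!/bin/sh

# Hypothetical helper: the grep/awk/tr pipeline from the steps above.
# $1 is the path to the exported JSON file.
extract_canonical_urls() {
    grep -A1 '"canonical"' "$1" |        # matching line plus the one after it
        grep '"href"' |                  # keep only the href lines
        awk -F'" : "' '{ print $2 }' |   # drop everything up to the opening quote
        tr -d '"'                        # drop the trailing quote
}

# Hypothetical variant: use the double quote as the field separator.
# For a line like   "href" : "http://example.com/"   the fields are
# $2 = href and $4 = the URL, so awk alone finishes the job.
extract_canonical_urls_awk() {
    grep -A1 '"canonical"' "$1" |
        grep '"href"' |
        awk -F'"' '{ print $4 }'
}

# Example usage (starred.json is an assumed file name):
#   extract_canonical_urls starred.json > urls.txt
```

Both versions print one URL per line, which should be convenient for feeding into an import tool.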