Sadly, it has been announced that Google Reader will be shut down in the middle of the year.
Since I have a large number of starred items in Google Reader, I'd like to back them up.
This is possible via Google Takeout, which produces a file in JSON format.
Now I would like to extract all of the article URLs from this several-MB file.
At first I thought it would be best to use a generic URL regex, but it seems better to use a regex that matches just the article URLs. This prevents also extracting other URLs that are not needed.
Here is a short example of how parts of the JSON file look:
"published" : 1359723602,
"updated" : 1359723602,
"canonical" : [ {
"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
"href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
"type" : "text/html"
} ],
I just need the URLs that appear here:
"canonical" : [ {
"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
Could anyone suggest what a regex to extract all of these URLs would have to look like?
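To show what I have in mind, here is a rough sketch in Python. It is probably fragile: the pattern relies on the pretty-printed layout shown above, and "starred.json" is just a placeholder name for the Takeout export file.

import re

# Rough sketch: match only hrefs that directly follow a "canonical" key,
# relying on the pretty-printed layout of the Takeout export shown above.
# "starred.json" is a placeholder file name.
pattern = re.compile(r'"canonical"\s*:\s*\[\s*\{\s*"href"\s*:\s*"([^"]+)"')

with open("starred.json", encoding="utf-8") as f:
    text = f.read()

for url in pattern.findall(text):
    print(url)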
The benefit would be a quick and dirty way to extract starred-item URLs from Google Reader so that, once processed, they can be imported into services like Pocket or Evernote.
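Of course, since the file is valid JSON, actually parsing it might be more robust than any regex. A minimal sketch of that alternative, assuming the export has a top-level "items" list where each item may carry a "canonical" list of {"href": ...} objects (again, "starred.json" is just a placeholder):

import json

# Alternative sketch: parse the file as JSON instead of using a regex.
# Assumes a top-level "items" list, each item possibly carrying a
# "canonical" list of {"href": ...} objects.
with open("starred.json", encoding="utf-8") as f:
    data = json.load(f)

for item in data.get("items", []):
    for link in item.get("canonical", []):
        if "href" in link:
            print(link["href"])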