How to skip known entries when syncing with Google

for writing an offline client to the Google Reader service I would like to know how to best sync with the service.

There doesn't seem to be official documentation yet and the best source I found so far is this: http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI

Now consider this: With the information from above I can download all unread items, I can specify how many items to download and using the atom-id I can detect duplicate entries that I already downloaded.

What's missing for me is a way to specify that I just want the updates since my last sync. I can say give me the 10 (parameter n=10) latest (parameter r=d) entries. If I specify the parameter r=o (date ascending) then I can also specify parameter ot=[last time of sync], but only then and the ascending order doesn't make any sense when I just want to read some items versus all items.

Any idea how to solve that without downloading all items again and just rejecting duplicates? Not a very economic way of polling.

Someone proposed that I can specify that I only want the unread entries. But to make that solution work in the way that Google Reader will not offer this entries again, I would need to mark them as read. In turn that would mean that I need to keep my own read/unread state on the client and that the entries are already marked as read when the user logs on to the online version of Google Reader. That doesn't work for me.

Cheers, Mariano

回答1:

To get the latest entries, use the standard from-newest-date-descending download, which will start from the latest entries. You will receive a "continuation" token in the XML result, looking something like this:

<gr:continuation>CArhxxjRmNsC</gr:continuation>`

Scan through the results, pulling out anything new to you. You should find that either all results are new, or everything up to a point is new, and all after that are already known to you.

In the latter case, you're done, but in the former you need to find the new stuff older than what you've already retrieved. Do this by using the continuation to get the results starting from just after the last result in the set you just retrieved by passing it in the GET request as the c parameter, e.g.:

http://www.google.com/reader/atom/user/-/state/com.google/reading-list?c=CArhxxjRmNsC

Continue this way until you have everything.

The n parameter, which is a count of the number of items to retrieve, works well with this, and you can change it as you go. If the frequency of checking is user-set, and thus could be very frequent or very rare, you can use an adaptive algorithm to reduce network traffic and your processing load. Initially request a small number of the latest entries, say five (add n=5 to the URL of your GET request). If all are new, in the next request, where you use the continuation, ask for a larger number, say, 20. If those are still all new, either the feed has a lot of updates or it's been a while, so continue on in groups of 100 or whatever.

However, and correct me if I'm wrong here, you also want to know, after you've downloaded an item, whether its state changes from "unread" to "read" due to the person reading it using the Google Reader interface.

One approach to this would be:

Update the status on google of any items that have been read locally.
Check and save the unread count for the feed. (You want to do this before the next step, so that you guarantee that new items have not arrived between your download of the newest items and the time you check the read count.)
Download the latest items.
Calculate your read count, and compare that to google's. If the feed has a higher read count than you calculated, you know that something's been read on google.
If something has been read on google, start downloading read items and comparing them with your database of unread items. You'll find some items that google says are read that your database claims are unread; update these. Continue doing so until you've found a number of these items equal to the difference between your read count and google's, or until the downloads get unreasonable.
If you didn't find all of the read items, c'est la vie; record the number remaining as an "unfound unread" total which you also need to include in your next calculation of the local number you think are unread.

If the user subscribes to a lot of different blogs, it's also likely he labels them extensively, so you can do this whole thing on a per-label basis rather than for the entire feed, which should help keep the amount of data down, since you won't need to do any transfers for labels where the user didn't read anything new on google reader.

This whole scheme can be applied to other statuses, such as starred or unstarred, as well.

Now, as you say, this

...would mean that I need to keep my own read/unread state on the client and that the entries are already marked as read when the user logs on to the online version of Google Reader. That doesn't work for me.

True enough. Neither keeping a local read/unread state (since you're keeping a database of all of the items anyway) nor marking items read in google (which the API supports) seems very difficult, so why doesn't this work for you?

There is one further hitch, however: the user may mark something read as unread on google. This throws a bit of a wrench into the system. My suggestion there, if you really want to try to take care of this, is to assume that the user in general will be touching only more recent stuff, and download the latest couple hundred or so items every time, checking the status on all of them. (This isn't all that bad; downloading 100 items took me anywhere from 0.3s for 300KB, to 2.5s for 2.5MB, albeit on a very fast broadband connection.)

Again, if the user has a large number of subscriptions, he's also probably got a reasonably large number of labels, so doing this on a per-label basis will speed things up. I'd suggest, actually, that not only do you check on a per-label basis, but you also spread out the checks, checking a single label each minute rather than everything once every twenty minutes. You can also do this "big check" for status changes on older items less often than you do a "new stuff" check, perhaps once every few hours, if you want to keep bandwidth down.

This is a bit of bandwidth hog, mainly because you need to download the full article from Google merely to check the status. Unfortunately, I can't see any way around that in the API docs that we have available to us. My only real advice is to minimize the checking of status on non-new items.

回答2:

The Google API hasn't yet been released, at which point this answer may change.

Currently, you would have to call the API and dis-regard items already downloaded, which as you said isn't terribly efficient as you will be re-downloading items every time, even if you already have them.