Apple Push Notifications in Bulk

2019-03-09 01:49发布

问题:

I have an app that involves sending Apple Push Notifications to ~1M users periodically. The setup for doing so has been built and tested for small numbers of notifications. Since there is no way I can test sending at that scale, I am interested in knowing whether there are any gotchas in sending bulk push notifications. I have scripts written in Python that open a single connection to the push server and send all notifications over that connection. Apple recommends keeping it open for as long as possible. But I have also seen that the connection terminates and you need to reestablish it.

All in all, it is disconcerting that successful sends are not acknowledged, only erroneous ones are flagged. From a programmer's standpoint instead of simply checking one thing "if (success)" you now need to watch for numerous things that could go wrong.

My question is: What are the typical set of errors that you need to watch out for to make sure your messages don't silently disappear into oblivion? The connection closing is an easy one. Are there others?

回答1:

I completely agree with you that this API is very frustrating, and if they would have sent a response for each notification it would have been much easier to implement.

That said, here's what Apple say you should do (from Technical Note) :

Push Notification Throughput and Error Checking

There are no caps or batch size limits for using APNs. The iOS 6.1 press release stated that APNs has sent over 4 trillion push notifications since it was established. It was announced at WWDC 2012 that APNs is sending 7 billion notifications daily.

If you're seeing throughput lower than 9,000 notifications per second, your server might benefit from improved error handling logic.

Here's how to check for errors when using the enhanced binary interface. Keep writing until a write fails. If the stream is ready for writing again, resend the notification and keep going. If the stream isn't ready for writing, see if the stream is available for reading.

If it is, read everything available from the stream. If you get zero bytes back, the connection was closed because of an error such as an invalid command byte or other parsing error. If you get six bytes back, that's an error response that you can check for the response code and the ID of the notification that caused the error. You'll need to send every notification following that one again.

Once everything has been sent, do one last check for an error response.

It can take a while for the dropped connection to make its way from APNs back to your server just because of normal latency. It's possible to send over 500 notifications before a write fails because of the connection being dropped. Around 1,700 notifications writes can fail just because the pipe is full, so just retry in that case once the stream is ready for writing again.

Now, here's where the tradeoffs get interesting. You can check for an error response after every write, and you'll catch the error right away. But this causes a huge increase in the time it takes to send a batch of notifications.

Device tokens should almost all be valid if you've captured them correctly and you're sending them to the correct environment. So it makes sense to optimize assuming failures will be rare. You'll get way better performance if you wait for write to fail or the batch to complete before checking for an error response, even counting the time to send the dropped notifications again.

None of this is really specific to APNs, it applies to most socket-level programming.

If your development tool of choice supports multiple threads or interprocess communication, you could have a thread or process waiting for an error response all the time and let the main sending thread or process know when it should give up and retry.



回答2:

Just wanted to chime in with a first person perspective, as we send millions of APNS notifications every day.

The reference @Eran quotes is unfortunately about the best resource we have for how Apple manages APNS sockets. It's fine for low volume, but Apple's documentation overall is very skewed towards the casual, low volume developer. You will see plenty of undocumented behavior once you get to scale.

The part of that document about doing error detection asynchronously is critical for high throughput. If you insist on blocking for errors on every send, then you'll need to heavily parallelize your workers to keep up throughput. The recommended way, however, is to just send as fast as you can send, and whenever you do get and error: repair and replay.

The part of that post I take exception to is:

Device tokens should almost all be valid if you've captured them correctly and you're sending them to the correct environment. So it makes sense to optimize assuming failures will be rare.

To predicate that advice with such a huge "IF" seems hugely misleading. I can almost guarantee that most developers are not capturing tokens and processing Apple's feedback service 100% "correctly". Even if they were, the system is inherently lossy, so drift is going to happen.

We see a non-zero number of error #8 responses (invalid device token) which I attribute to rooted phones, client bugs, or users intentionally spoofing their tokens to us. We have also seen a number of error #7 (invalid payload size) in the past, which we tracked down to improperly encoded messages that a developer added on our end. That was our fault of course, but that's my point--saying "optimize assuming failures will be rare" is the wrong message to send to learning developers. What I would say instead would be:

Assume errors will happen.

Hope that they happen infrequently, but code defensively in case they don't.

If you optimize assuming errors will be rare, you may be putting your infrastructure at risk whenever the APNS service goes down and every message you send returns an error #10.

The trouble comes when trying to figure out how to properly respond to errors. Documentation is ambiguous or absent regarding how to properly handle and recover from different errors. This is left as an exercise for the reader apparently.