I have an app that involves sending Apple Push Notifications to ~1M users periodically. The setup for doing so has been built and tested for small numbers of notifications. Since there is no way I can test sending at that scale, I am interested in knowing whether there are any gotchas in sending bulk push notifications. I have scripts written in Python that open a single connection to the push server and send all notifications over that connection. Apple recommends keeping it open for as long as possible, but I have also seen the connection terminate, after which it has to be reestablished.
All in all, it is disconcerting that successful sends are not acknowledged; only erroneous ones are flagged. From a programmer's standpoint, instead of simply checking one thing ("if (success)"), you now need to watch for numerous things that could go wrong.
My question is: What is the typical set of errors you need to watch out for to make sure your messages don't silently disappear into oblivion? The connection closing is an easy one. Are there others?
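For reference, the sending script is essentially the standard single-connection pattern for the enhanced binary interface. A simplified sketch (certificate path, tokens, and payload are placeholders):

```python
import json
import socket
import ssl
import struct

APNS_GATEWAY = ("gateway.push.apple.com", 2195)   # sandbox: gateway.sandbox.push.apple.com

def connect(cert_path):
    """Open one TLS connection to the APNs gateway and keep it open for the whole run."""
    context = ssl.create_default_context()
    context.load_cert_chain(cert_path)
    return context.wrap_socket(socket.create_connection(APNS_GATEWAY),
                               server_hostname=APNS_GATEWAY[0])

def build_frame(identifier, token_hex, payload_dict, expiry=0):
    """Pack one notification in the enhanced (command 1) binary format."""
    token = bytes.fromhex(token_hex)               # 32-byte device token
    payload = json.dumps(payload_dict).encode()    # limited to 256 bytes on this interface
    return struct.pack("!BIIH32sH", 1, identifier, expiry, 32, token, len(payload)) + payload

def send_all(sock, tokens, payload):
    """Send every notification over the already-open connection."""
    for i, token in enumerate(tokens):
        sock.sendall(build_frame(i, token, payload))
```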
I completely agree with you that this API is very frustrating; had they sent a response for each notification, it would have been much easier to implement.
That said, here's what Apple says you should do (from a Technical Note); a rough Python sketch of the loop it describes follows the quote:
Push Notification Throughput and Error Checking
There are no caps or batch size limits for using APNs. The iOS 6.1
press release stated that APNs has sent over 4 trillion push
notifications since it was established. It was announced at WWDC 2012
that APNs is sending 7 billion notifications daily.
If you're seeing throughput lower than 9,000 notifications per second,
your server might benefit from improved error handling logic.
Here's how to check for errors when using the enhanced binary
interface. Keep writing until a write fails. If the stream is ready
for writing again, resend the notification and keep going. If the
stream isn't ready for writing, see if the stream is available for
reading.
If it is, read everything available from the stream. If you get zero
bytes back, the connection was closed because of an error such as an
invalid command byte or other parsing error. If you get six bytes
back, that's an error response that you can check for the response
code and the ID of the notification that caused the error. You'll need
to send every notification following that one again.
Once everything has been sent, do one last check for an error
response.
It can take a while for the dropped connection to make its way from
APNs back to your server just because of normal latency. It's possible
to send over 500 notifications before a write fails because of the
connection being dropped. Around 1,700 notification writes can fail
just because the pipe is full, so just retry in that case once the
stream is ready for writing again.
Now, here's where the tradeoffs get interesting. You can check for an
error response after every write, and you'll catch the error right
away. But this causes a huge increase in the time it takes to send a
batch of notifications.
Device tokens should almost all be valid if you've captured them
correctly and you're sending them to the correct environment. So it
makes sense to optimize assuming failures will be rare. You'll get way
better performance if you wait for write to fail or the batch to
complete before checking for an error response, even counting the time
to send the dropped notifications again.
None of this is really specific to APNs, it applies to most
socket-level programming.
If your development tool of choice supports multiple threads or
interprocess communication, you could have a thread or process waiting
for an error response all the time and let the main sending thread or
process know when it should give up and retry.
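For concreteness, here is roughly what that loop looks like in Python against the enhanced binary interface. Treat it as a sketch of the Tech Note's advice, not production code: make_connection is a placeholder for however you open the TLS socket, and frames is assumed to be a list of pre-packed notifications whose identifier field equals their index.

```python
import select
import struct

ERROR_RESPONSE_FMT = "!BBI"   # command (always 8), status code, identifier of the rejected notification
ERROR_RESPONSE_LEN = struct.calcsize(ERROR_RESPONSE_FMT)   # 6 bytes

def check_error(sock, timeout=0.0):
    """Return (status, identifier) if an error response is waiting, None if there is nothing to read."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None
    data = sock.recv(ERROR_RESPONSE_LEN)
    if len(data) < ERROR_RESPONSE_LEN:
        # Zero or short read: APNs dropped the connection without a parseable error frame.
        return (None, None)
    _command, status, identifier = struct.unpack(ERROR_RESPONSE_FMT, data)
    return (status, identifier)

def send_batch(make_connection, frames):
    """frames[i] is a packed enhanced-format notification whose identifier field is i."""
    sock = make_connection()
    i = 0
    while i < len(frames):
        try:
            sock.sendall(frames[i])
            i += 1
        except OSError:
            # The write failed: give APNs a moment to say which notification it rejected.
            error = check_error(sock, timeout=1.0)
            sock.close()
            sock = make_connection()           # a real sender would cap retries and back off here
            if error and error[1] is not None:
                i = error[1] + 1               # everything after the rejected ID must be resent
            # otherwise resend from i; frames already written may or may not have been delivered
    # One last check once the whole batch has been written, as the Tech Note recommends.
    error = check_error(sock, timeout=2.0)
    sock.close()
    return error   # if not None, the caller should resend from error[1] + 1
```

The important part is the rewind: after any error, everything sent after the rejected identifier was silently discarded by APNs and has to go out again on a fresh connection.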
Just wanted to chime in with a first-person perspective, as we send millions of APNS notifications every day.
The reference @Eran quotes is unfortunately about the best resource we have for how Apple manages APNS sockets. It's fine for low volume, but Apple's documentation overall is very skewed towards the casual, low-volume developer. You will see plenty of undocumented behavior once you get to scale.
The part of that document about doing error detection asynchronously is critical for high throughput. If you insist on blocking for errors on every send, then you'll need to heavily parallelize your workers to keep throughput up. The recommended way, however, is to just send as fast as you can, and whenever you do get an error: repair and replay.
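To make that concrete, here is a toy sketch of the shape (not our production code): a dedicated thread blocks on the socket waiting for the 6-byte error frame while the main thread keeps writing, and when anything goes wrong you reconnect and replay from just after the rejected identifier.

```python
import struct
import threading

class ErrorWatcher:
    """Background thread that blocks waiting for APNs's 6-byte error frame."""

    def __init__(self, sock):
        self.sock = sock
        self.error = None                    # (status, identifier) once a frame arrives
        self.failed = threading.Event()      # set as soon as the connection is unusable
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        try:
            data = self.sock.recv(6)         # blocks until APNs responds or disconnects
            if len(data) == 6:
                _cmd, status, identifier = struct.unpack("!BBI", data)
                self.error = (status, identifier)
        except OSError:
            pass                             # socket closed under us; treat it as a failure
        self.failed.set()

def send_with_watcher(make_connection, frames):
    """Send as fast as possible; on any error, repair the connection and replay.

    frames[i] must be a packed enhanced-format notification whose identifier is i.
    """
    i = 0
    while i < len(frames):
        sock = make_connection()
        watcher = ErrorWatcher(sock)
        try:
            while i < len(frames) and not watcher.failed.is_set():
                sock.sendall(frames[i])
                i += 1
        except OSError:
            pass                             # the writer noticed the drop before the watcher did
        # Wait briefly for a late error frame, then rewind past the rejected notification.
        # A real sender would also cap reconnect attempts and back off here.
        if watcher.failed.wait(timeout=2.0) and watcher.error:
            i = watcher.error[1] + 1
        sock.close()
```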
The part of that post I take exception to is:
Device tokens should almost all be valid if you've captured them
correctly and you're sending them to the correct environment. So it
makes sense to optimize assuming failures will be rare.
To predicate that advice on such a huge "IF" seems hugely misleading. I can almost guarantee that most developers are not capturing tokens and processing Apple's feedback service 100% "correctly". Even if they were, the system is inherently lossy, so drift is going to happen.
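(For anyone wondering what "processing the feedback service" involves: it means periodically connecting to Apple's feedback endpoint, reading the 38-byte records it streams back, and pruning those tokens from your database. A minimal sketch, assuming the same push certificate:)

```python
import socket
import ssl
import struct

FEEDBACK_GATEWAY = ("feedback.push.apple.com", 2196)   # sandbox: feedback.sandbox.push.apple.com
FEEDBACK_FMT = "!IH32s"                                 # timestamp, token length, device token
FEEDBACK_LEN = struct.calcsize(FEEDBACK_FMT)            # 38 bytes per record

def read_feedback(cert_path):
    """Yield (timestamp, token_hex) for every device APNs reports as no longer reachable."""
    context = ssl.create_default_context()
    context.load_cert_chain(cert_path)
    with context.wrap_socket(socket.create_connection(FEEDBACK_GATEWAY),
                             server_hostname=FEEDBACK_GATEWAY[0]) as sock:
        buf = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break          # APNs closes the connection once the backlog is drained
            buf += chunk
            while len(buf) >= FEEDBACK_LEN:
                timestamp, _length, token = struct.unpack(FEEDBACK_FMT, buf[:FEEDBACK_LEN])
                buf = buf[FEEDBACK_LEN:]
                yield timestamp, token.hex()
```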
We see a non-zero number of error #8 responses (invalid device token), which I attribute to rooted phones, client bugs, or users intentionally spoofing their tokens to us. We have also seen a number of error #7 responses (invalid payload size) in the past, which we tracked down to improperly encoded messages added by a developer on our end. That was our fault, of course, but that's my point: saying "optimize assuming failures will be rare" is the wrong message to send to learning developers. What I would say instead is:
Assume errors will happen.
Hope that they happen infrequently, but
code defensively in case they don't.
If you optimize assuming errors will be rare, you may be putting your infrastructure at risk whenever the APNS service goes down and every message you send returns an error #10.
The trouble comes when trying to figure out how to properly respond to errors. The documentation is ambiguous or absent regarding how to properly handle and recover from the different error codes; this is left as an exercise for the reader, apparently.
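For what it's worth, here is one defensible starting point in Python. The status codes themselves are documented for the binary interface; how to react to each is the judgment call the docs leave to you:

```python
import logging

log = logging.getLogger("apns")

# One reasonable (not authoritative) policy for the binary interface's status codes.
PRUNE_TOKEN = {5, 8}          # invalid token size / invalid token: remove it from the database
SENDER_BUG = {2, 3, 4, 6, 7}  # malformed notification: log loudly, fix the sending code
BACK_OFF = {1, 10, 255}       # processing error / shutdown / unknown: reconnect with backoff

def react(status, identifier, tokens):
    """Return the index to resume sending from, assuming the identifier is the index into tokens."""
    if status in PRUNE_TOKEN:
        log.info("pruning invalid token %s", tokens[identifier])
    elif status in SENDER_BUG:
        log.error("malformed notification for %s (status %d)", tokens[identifier], status)
    elif status in BACK_OFF:
        log.warning("APNs-side trouble (status %d); back off before reconnecting", status)
    # Whatever the status, APNs discarded everything after `identifier`,
    # so the caller must resend from here over a fresh connection.
    return identifier + 1
```

Keeping the policy in one place also makes it obvious where to add handling when APNs surprises you with a code you haven't seen before.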