We have recently setted up continuous integration/deployment/delivery of a nodejs webapp on Google App Engine. The CI server (GitLabCI) runs dependencies installation, build, tests and deployment to integration/prod depending on the branch (develop/master).
At the day of today, the only bugs we've faced to was during the dependencies step, and so we didn't care much about it. But yesterday (21/10/16), there was a wide-scale DNS outage and the pipeline failed in the middle of the deployment step, breaking down the prod. Simply re-run the pipeline has made the job, but the problem can reproduce at any time.
My questions are:
- How can we handle this sort of network issues, in the continuous deployment process ?
- Is the continuous deployment on Google App Engine really a good idea ?
- If so, what is the App Engine deployment methodo ? I don't find any relevant doc about it...
For the moment we have only two versions "dev" and "prod" that are updated after commits, but at random times I could observe strange behaviours.
Any response/suggestions/feedback is very welcome !
Example of stacktrace concerning the networking issues I am talking about:
DEBUG: Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
Traceback (most recent call last):
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 733, in Execute
resources = args.calliope_command.Run(cli=self, args=args)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 1630, in Run
resources = command_instance.Run(args)
File "/google-cloud-sdk/lib/surface/app/deploy.py", line 53, in Run
return deploy_util.RunDeploy(self, args)
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 387, in RunDeploy
all_services)
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 247, in Deploy
manifest = _UploadFiles(service, code_bucket_ref)
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 115, in _UploadFiles
service, code_bucket_ref)
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/app/deploy_app_command_util.py", line 277, in CopyFilesToCodeBucketNoGsUtil
_UploadFiles(files_to_upload, bucket_ref)
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/app/deploy_app_command_util.py", line 219, in _UploadFiles
results = pool.map(_UploadFile, tasks)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
MaybeEncodingError: Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
DEBUG: Exception captured in Error
Traceback (most recent call last):
File "/google-cloud-sdk/lib/googlecloudsdk/core/metrics.py", line 411, in Wrapper
return func(*args, **kwds)
TypeError: Error() takes exactly 3 arguments (1 given)
ERROR: gcloud crashed (MaybeEncodingError): Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
Traceback (most recent call last):
File "/google-cloud-sdk/lib/gcloud.py", line 65, in <module>
main()
File "/google-cloud-sdk/lib/gcloud.py", line 61, in main
sys.exit(googlecloudsdk.gcloud_main.main())
File "/google-cloud-sdk/lib/googlecloudsdk/gcloud_main.py", line 145, in main
crash_handling.HandleGcloudCrash(err)
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/crash_handling.py", line 107, in HandleGcloudCrash
_ReportError(err)
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/crash_handling.py", line 86, in _ReportError
util.ErrorReporting().ReportEvent(error_message=stacktrace,
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/error_reporting/util.py", line 28, in __init__
self._API_NAME, self._API_VERSION)
File "/google-cloud-sdk/lib/googlecloudsdk/core/apis.py", line 254, in GetClientInstance
http_client = http.Http()
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/http.py", line 60, in Http
creds = store.Load()
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 282, in Load
if account in c_gce.Metadata().Accounts():
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 122, in Accounts
gce_read.GOOGLE_GCE_METADATA_ACCOUNTS_URI + '/')
File "/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 160, in TryFunc
return func(*args, **kwargs), None
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 45, in _ReadNoProxyWithCleanFailures
raise MetadataServerException(e)
googlecloudsdk.core.credentials.gce.MetadataServerException: HTTP Error 503: Service Unavailable
DEBUG: Uploading [/builds/apps/webapp/lib/jinja2/defaults.pyc] to [151c77b4e5bdd2c38b6a2bf914fffa3a6ffa71a6]
INFO: Uploading [/builds/apps/webapp/lib/jinja2/defaults.pyc] to [151c77b4e5bdd2c38b6a2bf914fffa3a6ffa71a6]
INFO: Refreshing access_token
Good/bad? Subjective - thus off-topic for SO. Assuming the question is how to make continuous deployment reliable :)
Well, the trouble is that you're using app versions as your CI environments, which means you can't avoid breakages due to a specific version being bad. You can only hope to recover as fast as possible by re-deploying the version (when the outage ends) - this can be automated.
You should not have your production site running directly off the version overwritten by the CI
production
pipeline, otherwise you risk site outage on a bad deployment. Instead you could use a new/unique version for each execution of the CIproduction
pipeline and only after that completes successfully you finally switch site traffic to its version using the flow described below (which can also be used inside the CI pipelines if using different apps instead of app versions as CI environments)From Deploying your program:
So make sure you never deploy a version and make that version the default traffic destination one in the same step (possibly not atomic if driven from the client side). Especially for the production app. Instead:
gcloud app deploy --no-promote --version ...
)gcloud app versions ...
) and check that it worksgcloud app services set-traffic ...
)This way the only critical operation is traffic switching, which (hopefully) is an atomic operation which is either successful or it's completely rolled back on GAE side (if not it's a GAE bug). If this step fails the app should still continue to work with the old version.
Of course, this assumes the networking issues are only in between you and GAE, if they're also affecting GAE's internal ops all bets are off (but those I trust should be fixed rather timely).