Continuous integration/deployment/delivery on Goog

Posted 2019-01-07 00:28

Question:

We recently set up continuous integration/deployment/delivery for a Node.js web app on Google App Engine. The CI server (GitLab CI) installs dependencies, builds, runs the tests and deploys to integration or prod depending on the branch (develop/master).

Until now, the only failures we had hit were in the dependency installation step, so we didn't worry much about them. But yesterday (21/10/16) there was a wide-scale DNS outage and the pipeline failed in the middle of the deployment step, which broke prod. Simply re-running the pipeline fixed it, but the problem can recur at any time.

My questions are:

  • How can we handle this sort of network issue in the continuous deployment process?
  • Is continuous deployment on Google App Engine really a good idea?
  • If so, what is the App Engine deployment methodology? I can't find any relevant documentation about it...

For the moment we have only two versions, "dev" and "prod", which are updated after commits, but at random times I have observed strange behaviours.

Any responses/suggestions/feedback are very welcome!

Example of a stack trace for the networking issues I am talking about:

DEBUG: Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
Traceback (most recent call last):
  File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 733, in Execute
    resources = args.calliope_command.Run(cli=self, args=args)
  File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 1630, in Run
    resources = command_instance.Run(args)
  File "/google-cloud-sdk/lib/surface/app/deploy.py", line 53, in Run
    return deploy_util.RunDeploy(self, args)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 387, in RunDeploy
    all_services)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 247, in Deploy
    manifest = _UploadFiles(service, code_bucket_ref)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 115, in _UploadFiles
    service, code_bucket_ref)
  File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/app/deploy_app_command_util.py", line 277, in CopyFilesToCodeBucketNoGsUtil
    _UploadFiles(files_to_upload, bucket_ref)
  File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/app/deploy_app_command_util.py", line 219, in _UploadFiles
    results = pool.map(_UploadFile, tasks)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
MaybeEncodingError: Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
DEBUG: Exception captured in Error
Traceback (most recent call last):
  File "/google-cloud-sdk/lib/googlecloudsdk/core/metrics.py", line 411, in Wrapper
    return func(*args, **kwds)
TypeError: Error() takes exactly 3 arguments (1 given)
ERROR: gcloud crashed (MaybeEncodingError): Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
Traceback (most recent call last):
  File "/google-cloud-sdk/lib/gcloud.py", line 65, in <module>
    main()
  File "/google-cloud-sdk/lib/gcloud.py", line 61, in main
    sys.exit(googlecloudsdk.gcloud_main.main())
  File "/google-cloud-sdk/lib/googlecloudsdk/gcloud_main.py", line 145, in main
    crash_handling.HandleGcloudCrash(err)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/crash_handling.py", line 107, in HandleGcloudCrash
    _ReportError(err)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/crash_handling.py", line 86, in _ReportError
    util.ErrorReporting().ReportEvent(error_message=stacktrace,
  File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/error_reporting/util.py", line 28, in __init__
    self._API_NAME, self._API_VERSION)
  File "/google-cloud-sdk/lib/googlecloudsdk/core/apis.py", line 254, in GetClientInstance
    http_client = http.Http()
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/http.py", line 60, in Http
    creds = store.Load()
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 282, in Load
    if account in c_gce.Metadata().Accounts():
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 122, in Accounts
    gce_read.GOOGLE_GCE_METADATA_ACCOUNTS_URI + '/')
  File "/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 160, in TryFunc
    return func(*args, **kwargs), None
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 45, in _ReadNoProxyWithCleanFailures
    raise MetadataServerException(e)
googlecloudsdk.core.credentials.gce.MetadataServerException: HTTP Error 503: Service Unavailable
DEBUG: Uploading [/builds/apps/webapp/lib/jinja2/defaults.pyc] to [151c77b4e5bdd2c38b6a2bf914fffa3a6ffa71a6]
INFO: Uploading [/builds/apps/webapp/lib/jinja2/defaults.pyc] to [151c77b4e5bdd2c38b6a2bf914fffa3a6ffa71a6]
INFO: Refreshing access_token

Answer 1:

Good/bad? Subjective, and thus off-topic for SO. I'll assume the question is how to make continuous deployment reliable :)

Well, the trouble is that you're using app versions as your CI environments, which means you can't avoid breakages due to a specific version being bad. You can only hope to recover as fast as possible by re-deploying the version (when the outage ends) - this can be automated.
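
As a rough illustration of automating that recovery, here is a minimal retry wrapper with exponential backoff. The function name retry and its arguments are made up for this sketch; nothing here is part of gcloud or GitLab CI.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is
# reached, doubling the delay after each failed attempt.
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_attempt=1
  while true; do
    if "$@"; then
      return 0
    else
      retry_status=$?   # exit status of the failed command
    fi
    if [ "$retry_attempt" -ge "$retry_max" ]; then
      echo "retry: giving up after $retry_attempt attempts: $*" >&2
      return "$retry_status"
    fi
    echo "retry: attempt $retry_attempt failed (exit $retry_status), sleeping ${retry_delay}s" >&2
    sleep "$retry_delay"
    retry_attempt=$((retry_attempt + 1))
    retry_delay=$((retry_delay * 2))
  done
}
```

In a CI job this could wrap the deploy command, for example: retry 3 30 gcloud app deploy --no-promote --version myID. Retrying only helps with transient failures; a long outage will still fail the job after the last attempt.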

You should not have your production site running directly off the version that the CI production pipeline overwrites; otherwise a bad deployment risks a site outage. Instead, use a new, unique version for each execution of the CI production pipeline, and switch site traffic to that version only after the pipeline completes successfully, using the flow described below. (The same flow can also be used inside the CI pipelines if you use different apps, rather than app versions, as CI environments.)
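
One possible way to get a unique version per pipeline run is to derive the ID from a timestamp and the commit hash. The sketch below is only an assumption about how you might do it: make_version_id is a hypothetical helper, and it restricts the ID to lowercase letters, digits and a hyphen, which as far as I know is what App Engine accepts for version IDs.

```shell
# make_version_id COMMIT_SHA [TIMESTAMP]
# Hypothetical helper: prints an ID such as 20190107t002800-abcdef12.
# TIMESTAMP defaults to the current UTC time.
make_version_id() {
  mv_sha=$1
  mv_stamp=${2:-$(date -u +%Y%m%dt%H%M%S)}
  # Lowercase the hash, drop anything that is not a-z or 0-9, keep 8 chars.
  mv_short=$(printf '%s' "$mv_sha" | tr 'A-Z' 'a-z' | tr -cd 'a-z0-9' | cut -c1-8)
  printf '%s-%s\n' "$mv_stamp" "$mv_short"
}
```

In the pipeline the result would be passed to the deploy command through the --version flag quoted below, with the commit hash taken from whatever variable your CI server exposes for it.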

From Deploying your program:

By default the deploy command automatically generates a new version ID each time that you use it and will route any traffic to the new version.

To override this behavior, you can specify the version ID with the version flag:

gcloud app deploy --version myID

You can also specify not to send all traffic to the new version immediately with the --no-promote flag:

gcloud app deploy --no-promote

So make sure you never deploy a version and make it the default traffic destination in one and the same step (which is possibly not atomic if driven from the client side), especially for the production app. Instead:

  • deploy the new version (gcloud app deploy --no-promote --version ...)
  • start the new version (gcloud app versions ... ) and check that it works
  • if it works fine switch real traffic to it (gcloud app services set-traffic ... )

This way the only critical operation is traffic switching, which (hopefully) is an atomic operation: either it succeeds or it is completely rolled back on the GAE side (if not, it's a GAE bug). If this step fails, the app should continue to work with the old version.

Of course, this assumes the networking issues are only between you and GAE. If they also affect GAE's internal operations, all bets are off (but I trust those would be fixed in a timely manner).
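
The three steps above could be wired together roughly as follows. This is a sketch under assumptions, not a tested pipeline: deploy_and_promote and its HEALTH_URL argument are hypothetical, the check is a plain HTTP request with curl against a URL you supply for the new version, and the explicit "start" step is left out on the assumption that a freshly deployed version is already serving.

```shell
# deploy_and_promote VERSION SERVICE HEALTH_URL
# Hypothetical helper: traffic is switched only if both the upload and the
# check of the new version succeed.
deploy_and_promote() {
  dp_version=$1
  dp_service=$2
  dp_health_url=$3

  # 1. Upload the new version without routing any traffic to it.
  gcloud app deploy --quiet --no-promote --version "$dp_version" || {
    echo "deploy failed; traffic stays on the old version" >&2
    return 1
  }

  # 2. Check the new version directly before users can reach it.
  curl --fail --silent --show-error --max-time 30 "$dp_health_url" >/dev/null || {
    echo "check failed; traffic stays on the old version" >&2
    return 2
  }

  # 3. Only now send all traffic to the verified version.
  gcloud app services set-traffic "$dp_service" --quiet --splits "$dp_version=1" || {
    echo "traffic switch failed" >&2
    return 3
  }
}
```

A failure in step 1 or 2 returns before set-traffic is ever called, so a network outage during the upload leaves the currently serving version untouched.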