Dockerfile.1 executes multiple RUN commands:
FROM busybox
RUN echo This is the A > a
RUN echo This is the B > b
RUN echo This is the C > c
Dockerfile.2 joins them:
FROM busybox
RUN echo This is the A > a && \
    echo This is the B > b && \
    echo This is the C > c
Each RUN creates a layer, so I always assumed that fewer layers is better and thus Dockerfile.2 is better.
This is obviously true when a RUN removes something added by a previous RUN (e.g. yum install nano && yum clean all), but in cases where every RUN adds something, there are a few points we need to consider:
Layers are supposed to just add a diff on top of the previous one, so if a later layer does not remove something added in a previous one, there shouldn't be much of a disk-space advantage between the two methods...
Layers are pulled in parallel from Docker Hub, so Dockerfile.1, although probably slightly bigger, would theoretically get downloaded faster.
If adding a 4th command (e.g. echo This is the D > d) and locally rebuilding, Dockerfile.1 would build faster thanks to the cache, but Dockerfile.2 would have to run all 4 commands again.
So, the question: which is the better way to write a Dockerfile?
When possible, I always merge commands that create files with the commands that delete those same files into a single RUN line. This is because each RUN line adds a layer to the image; the output is quite literally the filesystem changes that you could view with docker diff on the temporary container it creates. If you delete a file that was created in a different layer, all the union filesystem does is register the filesystem change in a new layer; the file still exists in the previous layer and is shipped over the network and stored on disk. So if you download source code, extract it, compile it into a binary, and then delete the tgz and source files at the end, you really want this all done in a single layer to reduce the image size.
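For example, a minimal sketch of that pattern (the URL, tarball name, and make target are placeholders, not a real project):
FROM gcc
# download, extract, compile, and remove the sources all in one layer,
# so the tarball and source tree never persist in any layer
RUN wget https://example.com/app-1.0.tgz && \
    tar -xzf app-1.0.tgz && \
    make -C app-1.0 && \
    cp app-1.0/app /usr/local/bin/ && \
    rm -rf app-1.0 app-1.0.tgz
Splitting that rm into its own RUN line would leave the tarball and sources baked into the earlier layer, so the image would not get any smaller.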
Next, I personally split up layers based on their potential for reuse in other images and on expected caching usage. If I have 4 images, all with the same base image (e.g. debian), I may pull a collection of utilities common to most of those images into the first RUN command so the other images benefit from layer caching.
Order in the Dockerfile is important when looking at image cache reuse. I look at any components that will update very rarely, possibly only when the base image updates, and put those high up in the Dockerfile. Towards the end of the Dockerfile, I include any commands that will run quickly and may change frequently, e.g. adding a user with a host-specific UID or creating folders and changing permissions. If the container includes interpreted code (e.g. JavaScript) that is being actively developed, that gets added as late as possible so that a rebuild only runs that single change.
In each of these groups of changes, I consolidate as best I can to minimize layers. So if there are 4 different source code folders, those get placed inside a single folder so it can be added with a single command. Any package installs from something like apt-get are merged into a single RUN when possible to minimize the amount of package manager overhead (updating and cleaning up).
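A hedged sketch of how I'd order such a Dockerfile (the packages, UID, and folder names are made up for the example):
FROM debian
# rarely changes: common utilities, consolidated into a single RUN with cleanup
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl ca-certificates && \
    rm -rf /var/lib/apt/lists/*
# changes frequently: quick host-specific steps and the actively developed code, kept last
RUN useradd -u 1234 appuser && mkdir -p /app && chown -R appuser /app
COPY src/ /app/src/
Changing something in src/ then only re-runs the final COPY; the layers above are served from the cache.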
Update for multi-stage builds:
I worry much less about reducing image size in the non-final stages of a multi-stage build. When these stages aren't tagged and shipped to other nodes, you can maximize the likelihood of cache reuse by splitting each command into a separate RUN line.
However, this isn't a perfect solution to squashing layers since all you copy between stages are the files, and not the rest of the image meta-data like environment variable settings, entrypoint, and command. And when you install packages in a linux distribution, the libraries and other dependencies may be scattered throughout the filesystem, making a copy of all the dependencies difficult.
Because of this, I use multi-stage builds as a replacement for building binaries on a CI/CD server, so that my CI/CD server only needs the tooling to run docker build, rather than having a jdk, nodejs, go, and any other compile tools installed.
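As a hedged illustration of that workflow, assuming a Go project (the module layout and binary name are placeholders):
# build stage: the Go toolchain only exists here, not on the CI server or in the final image
FROM golang:1.21 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN go build -o /out/app .

# final stage: only the compiled binary is shipped
FROM debian:stable-slim
COPY --from=build /out/app /usr/local/bin/app
CMD ["app"]
The CI/CD server only needs to run docker build; the compile toolchain lives entirely inside the build stage.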
The official answer is listed in Docker's best practices (official images MUST adhere to these):
Minimize the number of layers
You need to find the balance between
readability (and thus long-term maintainability) of the Dockerfile and
minimizing the number of layers it uses. Be strategic and cautious
about the number of layers you use.
Since Docker 1.10 the COPY, ADD and RUN statements add a new layer to your image. Be cautious when using these statements. Try to combine commands into a single RUN statement. Separate this only if it's required for readability.
More info: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#/minimize-the-number-of-layers
Update: multi-stage builds in Docker 17.05 and higher
With multi-stage builds you can use multiple FROM statements in your Dockerfile. Each FROM statement is a stage and can have its own base image. In the final stage you use a minimal base image like alpine, copy the build artefacts from previous stages, and install runtime requirements. The end result of this stage is your image. So this is where you worry about the layers as described earlier.
As usual, docker has great docs on multi-stage builds. Here's a quick excerpt:
With multi-stage builds, you use multiple FROM statements in your
Dockerfile. Each FROM instruction can use a different base, and each
of them begins a new stage of the build. You can selectively copy
artifacts from one stage to another, leaving behind everything you
don’t want in the final image.
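To make that concrete, a minimal sketch (the stage name, build commands, and paths are just assumptions for the example):
FROM node:20 AS builder
WORKDIR /app
COPY . .
RUN npm ci && npm run build

FROM alpine
# install only the runtime requirements, then copy the build artefacts
RUN apk add --no-cache nodejs
COPY --from=builder /app/dist /srv/app
CMD ["node", "/srv/app/index.js"]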
A great blog post about this can be found here: https://blog.alexellis.io/mutli-stage-docker-builds/
To answer your points:
Yes, layers are sort of like diffs. I don't think a layer is added if there are absolutely zero changes. The problem is that once you install or download something in layer #2, you cannot remove it in layer #3. So once something is written in a layer, the image size can no longer be decreased by removing it.
Although layers can be pulled in parallel, making it potentially faster, each additional layer undoubtedly increases the image size, even if it only removes files.
Yes, caching is useful if you're updating your Dockerfile. But it works in one direction. If you have 10 layers and you change layer #6, you'll still have to rebuild everything from layers #6 to #10. So it's not too often that it will speed the build process up, but it's guaranteed to unnecessarily increase the size of your image.
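A quick sketch of that one-way cache behavior, reusing the question's example:
FROM busybox
RUN echo This is the A > a
# editing the next line invalidates its cache and the cache of every RUN below it,
# while the layer above is still reused
RUN echo This is the B > b
RUN echo This is the C > c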
Thanks to @Mohan for reminding me to update this answer.
It seems the answers above are outdated. The docs note:
Prior to Docker 17.05, and even more, prior to Docker 1.10, it was important to minimize the number of layers in your image. The
following improvements have mitigated this need:
[...]
Docker 17.05 and higher add support for multi-stage builds, which
allow you to copy only the artifacts you need into the final image.
This allows you to include tools and debug information in your
intermediate build stages without increasing the size of the final
image.
https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#minimize-the-number-of-layers
and
Notice that this example also artificially compresses two RUN commands
together using the Bash && operator, to avoid creating an additional
layer in the image. This is failure-prone and hard to maintain.
https://docs.docker.com/engine/userguide/eng-image/multistage-build/
Best practice seems to have changed to using multi-stage builds and keeping the Dockerfiles readable.
It depends on what you include in your image layers.
The key point is sharing as many layers as possible:
Bad Example:
Dockerfile.1
RUN yum install -y big-package && yum install -y package1
Dockerfile.2
RUN yum install -y big-package && yum install -y package2
Good Example:
Dockerfile.1
RUN yum install -y big-package
RUN yum install -y package1
Dockerfile.2
RUN yum install -y big-package
RUN yum install -y package2
Another point is that deleting files only helps if it happens in the same layer as the adding/installing action.
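For instance, a minimal sketch of that point (the package name is just a placeholder):
# bad: the files removed by yum clean all still ship in the layer created by the install
RUN yum install -y big-package
RUN yum clean all

# good: the cached files never make it into any layer
RUN yum install -y big-package && yum clean all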