Dockerfile.1 executes multiple RUN:
FROM busybox
RUN echo This is the A > a
RUN echo This is the B > b
RUN echo This is the C > c
Dockerfile.2 joins them:
FROM busybox
RUN echo This is the A > a &&\
echo This is the B > b &&\
echo This is the C > c
Each RUN creates a layer, so I always assumed that fewer layers is better and thus Dockerfile.2 is better.
This is obviously true when a RUN removes something added by a previous RUN (e.g. yum install nano && yum clean all), but in cases where every RUN adds something, there are a few points we need to consider:
Layers are supposed to just add a diff on top of the previous one, so if a later layer does not remove something added in a previous one, there should not be much of a disk space advantage either way.
Layers are pulled in parallel from Docker Hub, so Dockerfile.1, although probably slightly bigger, would theoretically get downloaded faster.
If adding a 4th instruction (i.e. echo This is the D > d) and locally rebuilding, Dockerfile.1 would build faster thanks to the cache, but Dockerfile.2 would have to run all 4 commands again.
So, the question: Which is a better way to do a Dockerfile?
Official answer listed in their best practices (official images MUST adhere to these):
Since docker 1.10 the COPY, ADD and RUN statements add a new layer to your image. Be cautious when using these statements. Try to combine commands into a single RUN statement. Separate this only if it's required for readability.
More info: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#/minimize-the-number-of-layers
Update: Multi-stage builds in Docker 17.05 and later
With multi-stage builds you can use multiple FROM statements in your Dockerfile. Each FROM statement is a stage and can have its own base image. In the final stage you use a minimal base image like alpine, copy the build artefacts from previous stages and install runtime requirements. The end result of this stage is your image. So this is where you worry about the layers as described earlier.
As usual, docker has great docs on multi-stage builds.
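A minimal sketch of such a multi-stage Dockerfile, not taken from the docs. It assumes the build context holds a Go module with its main package at the root; the stage name `build`, the output path `/out/app`, and the image tags are illustrative choices only.

```dockerfile
# Stage 1: full toolchain image, used only to compile.
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
# Static binary so it runs on the minimal final image.
RUN CGO_ENABLED=0 go build -o /out/app .

# Stage 2: minimal runtime image. Only this stage becomes the final image.
FROM alpine:3.20
# Runtime requirement (assumed here: TLS root certificates).
RUN apk add --no-cache ca-certificates
# Copy just the build artefact from the previous stage.
COPY --from=build /out/app /usr/local/bin/app
ENTRYPOINT ["/usr/local/bin/app"]
```

The layers created in the `build` stage are not part of the resulting image, so layer count only matters in the second stage.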
A great blog post about this can be found here: https://blog.alexellis.io/mutli-stage-docker-builds/
To answer your points:
Yes, layers are sort of like diffs. I don't think any layers are added if there are absolutely zero changes. The problem is that once you install or download something in layer #2, you cannot remove it in layer #3. So once something is written in a layer, the image size cannot be decreased by removing it later.
Although layers can be pulled in parallel, making the download potentially faster, each layer undoubtedly increases the image size, even if it is removing files.
Yes, caching is useful if you're updating your Dockerfile. But it works in one direction. If you have 10 layers and you change layer #6, you'll still have to rebuild everything from layer #6 through #10. So it's not too often that it will speed up the build process, but it's guaranteed to unnecessarily increase the size of your image.
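To see the first point for yourself, here is a rough sketch that builds the same content two ways from one file. The stage names `split` and `merged` and the file `/big.bin` are made up for the demonstration.

```dockerfile
# Build and compare (image names are arbitrary):
#   docker build --target split  -t layers-demo:split  .
#   docker build --target merged -t layers-demo:merged .
#   docker image ls layers-demo

# Variant 1: the file is written in one layer and deleted in the next.
# The 50 MB still ships inside the first layer.
FROM busybox AS split
RUN dd if=/dev/zero of=/big.bin bs=1M count=50
RUN rm /big.bin

# Variant 2: written and deleted in the same RUN, so the file never
# appears in any layer.
FROM busybox AS merged
RUN dd if=/dev/zero of=/big.bin bs=1M count=50 && rm /big.bin
```

The `split` image should come out roughly 50 MB larger than `merged`, even though both end up with identical filesystem contents. `docker history` on each tag shows which layer carries the extra size.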
Thanks to @Mohan for reminding me to update this answer.
When possible, I always merge together commands that create files with commands that delete those same files into a single RUN line. This is because each RUN line adds a layer to the image; the output is quite literally the filesystem changes that you could view with docker diff on the temporary container it creates. If you delete a file that was created in a different layer, all the union filesystem does is register the filesystem change in a new layer; the file still exists in the previous layer and is shipped over the network and stored on disk. So if you download source code, extract it, compile it into a binary, and then delete the tgz and source files at the end, you really want this all done in a single layer to reduce image size.
Next, I personally split up layers based on their potential for reuse in other images and expected caching usage. If I have 4 images, all with the same base image (e.g. debian), I may pull a collection of utilities common to most of those images into the first RUN command so the other images benefit from caching.
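The download, extract, compile, delete sequence might look something like the sketch below. The tarball URL is a placeholder, and the sketch assumes the archive unpacks to a single top-level directory containing a Makefile with an `install` target; neither assumption comes from the answer itself.

```dockerfile
FROM debian:bookworm-slim

# Hypothetical source location; pass a real one with --build-arg SRC_URL=...
ARG SRC_URL=https://example.com/tool-1.0.tgz

# Everything from download to cleanup happens in ONE layer, so the
# tarball and the extracted sources never end up in the image.
RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates curl gcc libc6-dev make \
 && curl -fsSL "$SRC_URL" -o /tmp/src.tgz \
 && mkdir /tmp/src \
 && tar -xzf /tmp/src.tgz -C /tmp/src --strip-components=1 \
 && make -C /tmp/src \
 && make -C /tmp/src install \
 && rm -rf /tmp/src /tmp/src.tgz /var/lib/apt/lists/*
```

In this sketch the compiler packages stay installed; removing them would also have to happen inside the same RUN to have any effect on size.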
Order in the Dockerfile is important when looking at image cache reuse. I look at any components that will update very rarely, possibly only when the base image updates, and put those high up in the Dockerfile. Towards the end of the Dockerfile, I include any commands that will run quickly and may change frequently, e.g. adding a user with a host-specific UID or creating folders and changing permissions. If the container includes interpreted code (e.g. JavaScript) that is being actively developed, that gets added as late as possible so that a rebuild only runs that single change.
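As an illustration of that ordering, here is a hedged sketch for a Node.js app. The build argument `APP_UID`, the user name `app`, and the entry file `server.js` are all invented for the example.

```dockerfile
FROM node:20-slim

# Rarely changes: system packages go first so this layer stays cached.
RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates \
 && rm -rf /var/lib/apt/lists/*

# Quick and likely to vary per host: user, folders, permissions.
# Build with e.g.: docker build --build-arg APP_UID=$(id -u) .
ARG APP_UID=1001
RUN useradd --uid "$APP_UID" --create-home app \
 && mkdir -p /app \
 && chown app:app /app

USER app
WORKDIR /app

# Changes most often: application code goes last, so an edit to the
# source only invalidates this final COPY.
COPY --chown=app:app . /app

CMD ["node", "server.js"]
```

Changing the `APP_UID` value invalidates the cache from the `useradd` step onward, but the package install above it is still reused.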
In each of these groups of changes, I consolidate as best I can to minimize layers. So if there are 4 different source code folders, those get placed inside a single folder so it can be added with a single command. Any package installs from something like apt-get are merged into a single RUN when possible to minimize the amount of package manager overhead (updating and cleaning up).
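A small sketch of both consolidations, assuming a Debian base. The package list and the `src/` folder layout are placeholders, not anything prescribed by the answer.

```dockerfile
FROM debian:bookworm-slim

# One RUN for all package installs: the index is fetched once and
# cleaned up once, in the same layer.
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
      curl \
      git \
      nano \
 && rm -rf /var/lib/apt/lists/*

# The four source folders live under a single parent directory in the
# build context (src/a, src/b, src/c, src/d), so one COPY adds them all.
COPY src/ /opt/app/src/
```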
Update for multi-stage builds:
I worry much less about reducing image size in the non-final stages of a multi-stage build. When these stages aren't tagged and shipped to other nodes, you can maximize the likelihood of a cache reuse by splitting each command to a separate RUN line.
However, this isn't a perfect solution to squashing layers since all you copy between stages are the files, and not the rest of the image meta-data like environment variable settings, entrypoint, and command. And when you install packages in a linux distribution, the libraries and other dependencies may be scattered throughout the filesystem, making a copy of all the dependencies difficult.
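Both halves of that trade-off can be sketched in one file. The source file `hello.c` and the variable `APP_HOME` are hypothetical; static linking is used here only to sidestep the scattered-dependencies problem, which a dynamically linked program would not avoid so easily.

```dockerfile
# Non-final stage: one RUN per step, so each step caches independently.
FROM debian:bookworm AS build
ENV APP_HOME=/opt/app
RUN apt-get update && apt-get install -y --no-install-recommends gcc libc6-dev
WORKDIR /src
COPY hello.c .
RUN gcc -static -O2 -o /hello hello.c

# Final stage: only files cross the stage boundary.
FROM busybox
# ENV, ENTRYPOINT and CMD from the build stage are NOT inherited,
# so anything needed at runtime has to be declared again here.
ENV APP_HOME=/opt/app
COPY --from=build /hello /opt/app/hello
ENTRYPOINT ["/opt/app/hello"]
```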
Because of this, I use multi-stage builds as a replacement for building binaries on a CI/CD server, so that my CI/CD server only needs to have the tooling to run docker build, and not have a jdk, nodejs, go, and any other compile tools installed.
It seems the answers above are outdated. The docs note:
https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#minimize-the-number-of-layers
and
https://docs.docker.com/engine/userguide/eng-image/multistage-build/
Best practice seems to have changed to using multistage builds and keeping the Dockerfiles readable.
It depends on what you include in your image layers.
The key point is sharing as many layers as possible:
Bad Example:
Dockerfile.1
Dockerfile.2
Good Example:
Dockerfile.1
Dockerfile.2
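The original example files are not shown above, so the following is only a guess at the kind of pair that shares layers well: two Dockerfiles whose leading instructions are identical, so the builder can reuse the same cached layers for both. The package list and the `/service` file are invented.

```dockerfile
# Dockerfile.1 (hypothetical)
FROM debian:bookworm-slim
RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates curl \
 && rm -rf /var/lib/apt/lists/*
# Only this last layer is specific to the first image.
RUN echo "service one" > /service
```

```dockerfile
# Dockerfile.2 (hypothetical)
FROM debian:bookworm-slim
# Identical to the RUN in Dockerfile.1, so its layer can be reused.
RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates curl \
 && rm -rf /var/lib/apt/lists/*
RUN echo "service two" > /service
```

If the second file instead installed the same packages in a different order or in a differently worded RUN, the instruction text would differ and nothing after the base image would be shared.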
Another suggestion: deleting is only useful if it happens in the same layer as the adding/installing action.