I want to prepare a custom image (based on the official Postgres image) with two tasks:
- Download data (e.g. fetch a CSV file with wget),
- Load the data into the database (create tables, run inserts).
I want to do both steps while building the image, not while running the container, because each of them takes a lot of time, and I want to build the image once and run many containers quickly.
I know how to do step 1 (download data) during the image build, but I don't know how to load the data into the database during the build instead of at container start (step 2).
Example (download during the image build, load at container start):
Dockerfile:
FROM postgres:10.7
# wget is needed by the download script
RUN apt-get update \
 && apt-get install -y wget \
 && rm -rf /var/lib/apt/lists/*
COPY download.sh /download.sh
# step 1: download the data while building the image
RUN /download.sh
download.sh:
#!/bin/bash
set -e
# put the dump where the official image's entrypoint will pick it up at first start
cd /docker-entrypoint-initdb.d/
wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/northwindextended/northwind.postgre.sql
To download the data I run the script myself. To load the data I use the "initialization scripts" mechanism of the official Postgres image.
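For reference, the entrypoint of the official image executes every *.sql and *.sh file it finds in /docker-entrypoint-initdb.d/ (in file-name order) the first time the container starts with an empty data directory, so you can drop additional scripts next to the downloaded dump. A minimal sketch; the file name extra_init.sh and the table it creates are only illustrations:
COPY extra_init.sh /docker-entrypoint-initdb.d/extra_init.sh
extra_init.sh:
#!/bin/bash
# hypothetical example: executed once by the entrypoint at first startup
psql -v ON_ERROR_STOP=1 --username postgres --dbname postgres --command "CREATE TABLE IF NOT EXISTS import_log (imported_at timestamptz DEFAULT now());"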
Building the image:
docker build -t mydbimage .
Running a container:
docker run --name mydbcontainer -p 5432:5432 -e POSTGRES_PASSWORD=postgres -d mydbimage
After the container starts, the logs show how long loading the data takes:
docker logs mydbcontainer
This example dataset is small, but with a bigger one the long container start-up is awkward.
You can dissect the upstream Dockerfile and its docker-entrypoint.sh and pick out just the snippets needed to initialize the database during the build:
FROM postgres:10.7
ENV PGDATA /var/lib/postgresql/datap-in-image
RUN mkdir -p "$PGDATA" && chown -R postgres:postgres "$PGDATA" && chmod 777 "$PGDATA" # this 777 will be replaced by 700 at runtime (allows semi-arbitrary "--user" values)
RUN set -x \
    && apt-get update && apt-get install -y --no-install-recommends ca-certificates wget && rm -rf /var/lib/apt/lists/* \
    && wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/northwindextended/northwind.postgre.sql \
         -O /docker-entrypoint-initdb.d/northwind.postgre.sql \
    && cp ./docker-entrypoint.sh ./docker-entrypoint-init-only.sh \
    && sed -ri '/exec "\$@"/d' ./docker-entrypoint-init-only.sh \
    && ./docker-entrypoint-init-only.sh postgres \
    && rm ./docker-entrypoint-initdb.d/northwind.postgre.sql ./docker-entrypoint-init-only.sh \
    && apt-get purge -y --auto-remove ca-certificates wget
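The sed line removes the trailing exec "$@" from the copied entrypoint, so during the build it only performs the one-time initialization and then exits instead of replacing itself with a long-running postgres server. Roughly, docker-entrypoint-init-only.sh then does the following (a simplified sketch of the upstream script's flow, not its actual code):
#!/bin/bash
# simplified outline only; the real script also handles users, env vars, auth, etc.
initdb -D "$PGDATA"                          # create the cluster
pg_ctl -D "$PGDATA" -w start                 # start a temporary server
for f in /docker-entrypoint-initdb.d/*; do   # run every init file, e.g. the dump
    psql -v ON_ERROR_STOP=1 --username postgres -f "$f"
done
pg_ctl -D "$PGDATA" -m fast -w stop          # stop the temporary server
# the removed final line would have been: exec "$@"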
Build, run and test:
docker build -t mydbimage .
# bring up the database
docker run --rm --name pgtest mydbimage
# run this in another terminal to check for the imported data
docker exec -ti pgtest psql -v ON_ERROR_STOP=1 --username "postgres" --no-password --dbname postgres --command "\d"
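Since the imported data is baked into the image, every additional container starts quickly and skips the init scripts entirely, which is the "build once, run many containers quickly" payoff from the question. For example (container name and host port are arbitrary, and the query assumes the Northwind dump creates a lowercase customers table):
docker run --rm --name pgtest2 -p 5433:5432 -d mydbimage
docker exec -ti pgtest2 psql --username "postgres" --dbname postgres --command "SELECT count(*) FROM customers;"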
Caveats:
- With this setup there is no password set for the database. You can set one during the build (see the sketch at the end of this answer), but then it is persisted in the image, so you would have to make sure nobody untrusted gets access to the image; depending on your setup that may be hard or even impossible.
- The second problem is that writes to your database are ephemeral: you cannot mount a volume while building, and data written during the build to a path declared as a volume would be discarded. That is why PGDATA is changed to a directory that is not declared as a volume, so the imported data ends up inside the image itself.
Basically these are the reasons why the upstream repository handles importing when the container starts instead of during the build. If you have non-secret data that is only read, it can still make sense to import it during the build to save time and to simplify container startup.
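To illustrate the first caveat: one way to bake a password in at build time is a build argument that is turned into POSTGRES_PASSWORD before the RUN that executes the init-only entrypoint. This is only a sketch, and the value remains visible in the image, which is exactly the problem described above:
# hypothetical: place before the RUN that executes docker-entrypoint-init-only.sh
ARG DB_PASSWORD=changeme
ENV POSTGRES_PASSWORD=$DB_PASSWORD   # readable via docker history / docker inspect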