Populating a Postgres Docker image during build

Published 2019-09-03 04:24

Question:

I want to prepare a custom image (based on the official Postgres image) with two tasks:

  1. Download data (e.g. get a CSV file with wget),
  2. Load the data into the database (creating tables, running inserts).

I want to do both steps while building the image, not while running the container, because each of them takes a lot of time, and I want to build the image once and run many containers quickly.

I know how to do step 1 (download data) during the image build, but I don't know how to load the data into the database during the build instead of at container startup (step 2).

Example:

(download happens while building the image, loading happens while running the container)

Dockerfile:

FROM postgres:10.7

RUN  apt-get update \
  && apt-get install -y wget \
  && rm -rf /var/lib/apt/lists/* 

COPY download.sh /download.sh
RUN /download.sh

download.sh:

#!/bin/bash

cd /docker-entrypoint-initdb.d/
wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/northwindextended/northwind.postgre.sql

To download the data I run the script myself. To load it, I use the "initialization scripts" mechanism of the official Postgres image.
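For context, the official image's entrypoint runs every *.sql, *.sql.gz and *.sh file found in /docker-entrypoint-initdb.d/ on the first start, while the data directory is still empty. Roughly, the relevant loop looks like this (a simplified sketch, not the verbatim upstream code):

for f in /docker-entrypoint-initdb.d/*; do
  case "$f" in
    *.sh)     echo "running $f"; . "$f" ;;            # shell scripts are sourced
    *.sql)    echo "running $f"; psql -f "$f" ;;      # SQL files are fed to psql
    *.sql.gz) echo "running $f"; gunzip -c "$f" | psql ;;
    *)        echo "ignoring $f" ;;
  esac
done

This is why simply dropping northwind.postgre.sql into that directory is enough to get it loaded -- but only at container startup.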

Building image:

docker build -t mydbimage .

Running image:

docker run --name mydbcontainer -p 5432:5432 -e POSTGRES_PASSWORD=postgres -d mydbimage 

After it starts, you can see how long loading the data takes:

docker logs mydbcontainer

This example dataset is small, but with a bigger one the long container startup becomes awkward.

Answer 1:

You can dissect the upstream Dockerfile and its docker-entrypoint.sh and just pick the needed snippets to initialize your database:

FROM postgres:10.7

ENV PGDATA /var/lib/postgresql/data-in-image
RUN mkdir -p "$PGDATA" && chown -R postgres:postgres "$PGDATA" && chmod 777 "$PGDATA" # this 777 will be replaced by 700 at runtime (allows semi-arbitrary "--user" values)

RUN set -x \
  && apt-get update && apt-get install -y --no-install-recommends ca-certificates wget && rm -rf /var/lib/apt/lists/* \
  && wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/northwindextended/northwind.postgre.sql \
     -O /docker-entrypoint-initdb.d/northwind.postgre.sql \
  && cp ./docker-entrypoint.sh ./docker-entrypoint-init-only.sh \
  && sed -ri '/exec "\$@"/d' ./docker-entrypoint-init-only.sh \
  && ./docker-entrypoint-init-only.sh postgres \
  && rm ./docker-entrypoint-initdb.d/northwind.postgre.sql ./docker-entrypoint-init-only.sh \
  && apt-get purge -y --auto-remove ca-certificates wget
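The cp/sed pair is the heart of the trick. The stock docker-entrypoint.sh initializes the data directory, runs everything in /docker-entrypoint-initdb.d/, and then hands control to the server with exec "$@". Deleting that line turns it into an init-only script, so the RUN step can finish instead of blocking on a foreground postgres process. Schematically (simplified, not the verbatim upstream code):

# tail of the upstream docker-entrypoint.sh, roughly:
#   ... initdb, start a temporary server, run the init scripts, stop it ...
#   exec "$@"    # normally replaces the shell with "postgres" in the foreground
#
# after: sed -ri '/exec "\$@"/d' ./docker-entrypoint-init-only.sh
#   the script now exits right after initialization, which is what a build step needs

The trailing argument in ./docker-entrypoint-init-only.sh postgres is still required, because the entrypoint only performs initialization when its first argument is postgres.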

Build, run and test:

docker build -t mydbimage .

# bring up the database
docker run --rm --name pgtest mydbimage

# run this in another terminal to check for the imported data 
docker exec -ti pgtest psql -v ON_ERROR_STOP=1 --username "postgres" --no-password --dbname postgres --command "\d"
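If you want to quantify the gain over the original approach, you can time how long the container takes to accept connections (a quick sketch; pgtest is the container started above):

# poll pg_isready until the server accepts connections, and time it
time ( until docker exec pgtest pg_isready -U postgres >/dev/null 2>&1; do sleep 0.5; done )

With the data baked into the image this should return almost immediately, while the original image spends this time importing the dump.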

Caveats:

  • With this setup there is no password set for the database. You can add one during the build, but then it would be persisted in the image, so you would need to take precautions that no one gets access to the image. Depending on your setup this might be hard, or even impossible, to achieve (a run-time workaround is sketched after this list).
  • The second problem is that writes to your database during the build are ephemeral: there is no volume at build time to persist the imported data. That is why PGDATA is changed to a directory that is not declared as a volume in the upstream image (see the second snippet after this list).
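For the password caveat, one workaround (my sketch, not part of the original answer) is to keep the image password-less and set a password once per container, right after startup:

# hypothetical: set the superuser password at run time instead of baking it into the image
docker exec pgtest psql -U postgres -c "ALTER USER postgres WITH PASSWORD 'postgres';"

Replace 'postgres' with a real secret, ideally injected from an environment variable or a secret store rather than typed on the command line.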
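As for the volume caveat, the upstream postgres Dockerfile declares a volume on the default data directory, roughly:

# from the upstream postgres Dockerfile (abridged):
ENV PGDATA /var/lib/postgresql/data
VOLUME /var/lib/postgresql/data

Anything a RUN step writes beneath a declared VOLUME path is discarded when that step is committed to a layer, so a cluster initialized there during the build would silently vanish. Pointing PGDATA at a path that is not a volume keeps the initialized cluster inside the image layers.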

Basically, these are the reasons why the upstream repository handles importing at container start instead of during the build. If you have non-secret data that is used read-only, it can still make sense to import it during the build, both to save time and to simplify container startup.