I have created a Docker image of my application. When I simply run it from the bash script, it works properly. However, when I run it as part of the docker-compose file, the application hangs on the message:
18/06/27 13:17:18 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Even after I wait for a while, nothing happens until the streaming heartbeat times out. What could cause a Spark Streaming + Neo4j application to behave like this under Docker, and how can it be fixed?
The docker-compose file for my application:
version: '3.3'
services:
  consumer-demo:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        - ARG_CLASS=consumer
        - HOST=neo4jdb
    volumes:
      - ./:/workdir
    working_dir: /workdir
    restart: always
The overall docker-compose file for all the applications:
version: '3.3'
services:
  kafka:
    image: spotify/kafka
    ports:
      - "9092:9092"
    networks:
      - docker_elk
    environment:
      - ADVERTISED_HOST=localhost
  neo4jdb:
    image: neo4j:latest
    container_name: neo4jdb
    ports:
      - "7474:7474"
      - "7473:7473"
      - "7687:7687"
    networks:
      - docker_elk
    volumes:
      - /var/lib/neo4j/import:/var/lib/neo4j/import
      - /var/lib/neo4j/data:/data
      - /var/lib/neo4j/conf:/conf
    environment:
      - NEO4J_dbms_active__database=graphImport.db
  elasticsearch:
    image: elasticsearch:latest
    ports:
      - "9200:9200"
      - "9300:9300"
    networks:
      - docker_elk
    volumes:
      - esdata1:/usr/share/elasticsearch/data
  kibana:
    image: kibana:latest
    ports:
      - "5601:5601"
    networks:
      - docker_elk
volumes:
  esdata1:
    driver: local
networks:
  docker_elk:
    driver: bridge
The bash script with which the application works properly:
#!/usr/bin/env bash
if [ "$1" = "consumer" ]
then
    java -cp "jars/spark_consumer.jar" consumer.SparkConsumer
else
    echo "Wrong parameter. It should be consumer or producer, but it is $1"
fi
The application Dockerfile, which may be the reason the application slows down:
FROM java:8
ARG ARG_CLASS
ARG HOST
ENV MAIN_CLASS $ARG_CLASS
ENV SCALA_VERSION 2.11.8
ENV SBT_VERSION 1.1.1
ENV SPARK_VERSION 2.2.0
ENV SPARK_DIST spark-$SPARK_VERSION-bin-hadoop2.6
ENV SPARK_ARCH $SPARK_DIST.tgz
ENV HOSTNAME bolt://$HOST:7687
VOLUME /workdir
WORKDIR /opt
# Install Scala
RUN \
  cd /root && \
  curl -o scala-$SCALA_VERSION.tgz http://downloads.typesafe.com/scala/$SCALA_VERSION/scala-$SCALA_VERSION.tgz && \
  tar -xf scala-$SCALA_VERSION.tgz && \
  rm scala-$SCALA_VERSION.tgz && \
  echo >> /root/.bashrc && \
  echo 'export PATH=~/scala-$SCALA_VERSION/bin:$PATH' >> /root/.bashrc
# Install SBT
RUN \
  curl -L -o sbt-$SBT_VERSION.deb https://dl.bintray.com/sbt/debian/sbt-$SBT_VERSION.deb && \
  dpkg -i sbt-$SBT_VERSION.deb && \
  rm sbt-$SBT_VERSION.deb
# Install Spark
RUN \
  cd /opt && \
  curl -o $SPARK_ARCH http://d3kbcqa49mib13.cloudfront.net/$SPARK_ARCH && \
  tar xvfz $SPARK_ARCH && \
  rm $SPARK_ARCH && \
  echo 'export PATH=$SPARK_DIST/bin:$PATH' >> /root/.bashrc
EXPOSE 9851 9852 4040 9092 9200 9300 5601 7474 7687 7473
CMD /workdir/runDemo.sh "$MAIN_CLASS"
The problem was that another Spark process was already running on the machine and blocking the Spark streaming job. I checked all the processes with
ps aux | grep spark
and found another running process. Simply killing that process and restarting the Spark Streaming application solved the problem.
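For anyone who wants to script that check, here is a minimal sketch. The helper name spark_pids is my own (hypothetical), and it assumes the usual `ps aux` layout where the PID is the second column. The `[s]park` pattern is there so that the awk process does not match its own command line.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `ps aux`-style lines on stdin and prints
# the PID (assumed to be column 2) of every line that mentions "spark",
# case-insensitively. Writing the pattern as [s]park keeps this awk
# process from matching itself in the listing.
spark_pids() {
    awk 'tolower($0) ~ /[s]park/ { print $2 }'
}

# Typical use (not run here):
#   ps aux | spark_pids              # inspect the candidates first
#   kill <pid>                       # then stop the stale one by hand
```

I would look at the list before killing anything, since the pattern also matches the Spark job you actually want to keep running.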