Spark job (Java) cannot write data to Elasticsearch

Posted 2019-07-28 05:59

Question:

I am using Docker 17.04.0-ce and Compose 1.12.0 on Windows to deploy an Elasticsearch cluster (version 5.4.0). So far I have done the following:

1) I have created a single Elasticsearch node via Compose with the following configuration:

  elasticsearch1:
    build: elasticsearch/
    container_name: es_1
    cap_add:
      - IPC_LOCK
    environment:
      - cluster.name=cp-es-cluster
      - node.name=cloud1
      - node.master=true
      - http.cors.enabled=true
      - http.cors.allow-origin="*"
      - bootstrap.memory_lock=true
      - discovery.zen.minimum_master_nodes=1
      - xpack.security.enabled=false
      - xpack.monitoring.enabled=false
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    networks:
      docker_elk:
        aliases:
          - elasticsearch
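
For completeness (not shown above), the service references a named volume and a user-defined network, which both need top-level definitions in the same docker-compose.yml. A minimal sketch, assuming default drivers:

volumes:
  esdata1:        # esdata2/esdata3 are declared the same way for the extra nodes in step 3

networks:
  docker_elk:
    driver: bridge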

This results in the node being deployed, but it is not accessible from Spark. I write data as

JavaEsSparkSQL.saveToEs(aggregators.toDF(), collectionName +"/record");

and I get the following error, although the node is running:

I/O exception (java.net.ConnectException) caught when processing request: Connection timed out: connect
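
For reference, the write call runs inside a job wired up roughly as follows. This is a minimal sketch, not the actual job: the input data, index name, and Docker host IP are placeholders, and es.nodes/es.port are the connector settings also mentioned in the answer below.

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

public class EsWriteJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("es-write-job")
                .set("es.nodes", "192.168.1.10")   // placeholder: IP of the Docker host
                .set("es.port", "9200");           // the published HTTP port
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

        // Placeholder input; the real job builds 'aggregators' elsewhere.
        Dataset<Row> aggregators = spark.read().json("aggregators.json");
        String collectionName = "aggregators";     // placeholder index name

        // The failing call from the question: index each row as a document
        // of type 'record' in the given index.
        JavaEsSparkSQL.saveToEs(aggregators, collectionName + "/record");
        spark.stop();
    }
}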

2) I found out that this problem is solved if I add the following line to the node configuration:

- network.publish_host=${ENV_IP}
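
(Note that ${ENV_IP} has to resolve when Compose parses the file. One common way, assumed here, is a .env file next to docker-compose.yml, or an exported shell variable, holding the host's externally reachable IP:)

# .env (placeholder value)
ENV_IP=192.168.1.10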

3) I then created similar configurations for two additional nodes:

  elasticsearch1:
    build: elasticsearch/
    container_name: es_1
    cap_add:
      - IPC_LOCK
    environment:
      - cluster.name=cp-es-cluster
      - node.name=cloud1
      - node.master=true
      - http.cors.enabled=true
      - http.cors.allow-origin="*"
      - bootstrap.memory_lock=true
      - discovery.zen.minimum_master_nodes=1
      - xpack.security.enabled=false
      - xpack.monitoring.enabled=false
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
      - network.publish_host=${ENV_IP}
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    networks:
      docker_elk:
        aliases:
          - elasticsearch

  elasticsearch2:
    build: elasticsearch/
    container_name: es_2
    cap_add:
      - IPC_LOCK
    environment:
      - cluster.name=cp-es-cluster
      - node.name=cloud2
      - http.cors.enabled=true
      - http.cors.allow-origin="*"
      - bootstrap.memory_lock=true
      - discovery.zen.minimum_master_nodes=2
      - xpack.security.enabled=false
      - xpack.monitoring.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - "discovery.zen.ping.unicast.hosts=elasticsearch1"
      - node.master=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - esdata2:/usr/share/elasticsearch/data
    ports:
      - 9201:9200
      - 9301:9300
    networks:
      - docker_elk

  elasticsearch3:
    build: elasticsearch/
    container_name: es_3
    cap_add:
      - IPC_LOCK
    environment:
      - cluster.name=cp-es-cluster
      - node.name=cloud3
      - http.cors.enabled=true
      - http.cors.allow-origin="*"
      - bootstrap.memory_lock=true
      - discovery.zen.minimum_master_nodes=2
      - xpack.security.enabled=false
      - xpack.monitoring.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - "discovery.zen.ping.unicast.hosts=elasticsearch1"
      - node.master=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - esdata3:/usr/share/elasticsearch/data
    ports:
      - 9202:9200
      - 9302:9300
    networks:
      - docker_elk

This results in a cluster of three nodes being created successfully. However, the same error reappears in Spark and data cannot be written to the cluster. I get the same behavior even if I add network.publish_host to all nodes.

On the Spark side, I use elasticsearch-spark-20_2.11 version 5.4.0 (the same as the ES version). Any ideas on how to solve this issue?

Answer 1:

I managed to solve this. Apart from setting es.nodes and es.port on the Spark side, the problem goes away if I also set es.nodes.wan.only to true.
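
In code, that amounts to something like the sketch below (host, port, and app name are placeholders). es.nodes.wan.only tells elasticsearch-hadoop to skip node discovery and talk only to the addresses listed in es.nodes, so the connector never tries the container-internal addresses that the Elasticsearch nodes publish:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class EsSessionFactory {
    static SparkSession build() {
        SparkConf conf = new SparkConf()
                .setAppName("es-write-job")
                .set("es.nodes", "192.168.1.10")    // placeholder: Docker host IP
                .set("es.port", "9200")             // published HTTP port
                // Use only the nodes listed above and disable node discovery,
                // so container-internal publish addresses are never contacted.
                .set("es.nodes.wan.only", "true");
        return SparkSession.builder().config(conf).getOrCreate();
    }
}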