How to connect (Py)Spark to Postgres database usin

2020-02-11 06:57发布

问题:

I have followed instructions from this posting to read data from an existing Postgres database with table named "objects" as defined and created by the Objects class in SQLalchemy. In my Jupyter notebook, my code is

from pyspark import SparkContext
from pyspark import SparkConf
from random import random

#spark conf
conf = SparkConf()
conf.setMaster("local[*]")
conf.setAppName('pyspark')

sc = SparkContext(conf=conf)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
properties = {
    "driver": "org.postgresql.Driver"
}
url = 'jdbc:postgresql://PG_USER:PASSWORD@PG_SERVER_IP/db_name'
df = sqlContext.read.jdbc(url=url, table='objects', properties=properties)

the last line results in the following:

Py4JJavaError: An error occurred while calling o25.jdbc.
: java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:158)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:237)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:159)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)

so it looks like it can't resolve the table. How do I test from here to make sure that I am connected to the database properly?

回答1:

Problems with name resolving are indicated by org.postgresql.util.PSQLException and don't result in NPE. The source of the issue is actually a connection string and in particular the way you provide user credentials. At first glance it looks like a bug but if you're looking for a quick solution you can either use URL properties:

url = 'jdbc:postgresql://PG_SERVER_IP/db_name?user=PG_USER&password=PASSWORD'

or properties argument:

properties = {
    "user": "PG_USER",
    "password": "PASSWORD",
    "driver": "org.postgresql.Driver"
}