Why different count results on consecutive reads?

Posted 2019-08-03 07:54

Question:

I have the following code that reads a table into an Apache Spark DataFrame:

val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host/database")
  .option("dbtable", "tablename")
  .option("user", "username")
  .option("password", "password")
  .load()

When I first invoke df.count(), I get a smaller number than on the next invocation of the same count method.

Why does this happen?

Doesn't Spark load a snapshot of my table into a DataFrame on my Spark cluster when I first read that table?

My table on Postgres keeps being fed with new rows, and my DataFrame seems to reflect this.

How can I load a static snapshot of my table into Spark's DataFrame, fixed as of the time the read method was invoked?

Answer 1:

Unless the Dataset is cached using reliable storage (the standard Spark cache gives you only weak guarantees), the database may be accessed multiple times, each time showing the current state of the database. Since

table on postgres keeps being fed

seeing different counts is expected behavior.
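As a minimal sketch of the "reliable storage" option, assuming you can write to a path such as /tmp/table_snapshot (the path is hypothetical): materialize the JDBC read once, then work only from the frozen copy.

// Materialize the DataFrame from the question to reliable storage.
// After this write the data is fixed, regardless of later inserts
// into the Postgres table.
df.write.mode("overwrite").parquet("/tmp/table_snapshot")

// All further work reads the frozen copy, so repeated counts agree.
val snapshot = spark.read.parquet("/tmp/table_snapshot")
snapshot.count() // returns the same value on every invocation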

Furthermore, if the JDBC source is used in distributed mode (with a partitioning column or predicates), each executor thread uses its own transaction. As a result, the state of the Dataset may not be fully consistent.
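For illustration, here is the kind of partitioned read that paragraph refers to; the column name id and the bounds are assumptions, not part of the question. Spark issues one query per partition, each over its own connection and transaction, so rows committed between those queries can appear in some partitions but not others.

val partitioned = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host/database")
  .option("dbtable", "tablename")
  .option("user", "username")
  .option("password", "password")
  .option("partitionColumn", "id") // assumed numeric key column
  .option("lowerBound", "1")       // assumed id range
  .option("upperBound", "1000000")
  .option("numPartitions", "8")    // eight independent queries/transactions
  .load()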

How can I load a static snapshot

Don't use JDBC. You can, for example:

  • COPY the data to a file system and load it from there (see the sketch after this list).
  • Use a replication solution of your choice to create a replica dedicated to analytics, and pause replication while you analyze the data.
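As a rough sketch of the COPY route (the paths and the psql command below are illustrative): export the table once with Postgres's COPY, then point Spark at the resulting file, which can no longer change underneath you.

// Run once on the database side, for example from psql:
//   \copy tablename TO '/data/tablename.csv' WITH (FORMAT csv, HEADER true)

// Then load the exported file in Spark. The file is a fixed
// snapshot, so every count() over it returns the same value.
val snapshot = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/tablename.csv")

snapshot.count()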