The problem is quite simple: you have a local Spark instance (either a cluster or just running in local mode) and you want to read from gs://
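For concreteness, this is the kind of read that should work from a local session once the connector is set up; a minimal sketch, where the bucket and object path are placeholders rather than anything from the question:

    // Sketch: reading a gs:// path from a local Spark session.
    // "my-bucket" and the file path are placeholders.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("gcs-read")
      .getOrCreate()

    // Without the gcs-connector jar and its Hadoop configuration, this typically
    // fails with an error along the lines of "No FileSystem for scheme: gs".
    val df = spark.read.text("gs://my-bucket/some/path/file.txt")
    df.show()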
I am submitting here the solution I have come up with by combining different resources:
1. Download the Google Cloud Storage connector (gcs-connector) and store it in the $SPARK/jars/ folder (check Alternative 1 at the bottom).
2. Download the core-site.xml file from here, or copy it from below. This is a configuration file used by Hadoop (which is used by Spark).
3. Store the core-site.xml file in a folder. Personally I create the $SPARK/conf/hadoop/conf/ folder and store it there.
4. In the spark-env.sh file, indicate the Hadoop conf folder by adding the following line:

       export HADOOP_CONF_DIR=/absolute/path/to/hadoop/conf/

5. Create an OAUTH2 key from the respective page of Google (Google Console -> API-Manager -> Credentials).
6. Copy the credentials to the core-site.xml file; a sketch of such a file follows this list.
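The full core-site.xml content is not reproduced above, so the following is only a minimal sketch, assuming the connector's installed-application OAuth flow with a client id and secret. The property names come from the gcs-connector documentation as I understand it, and the project id and credential values are placeholders:

    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
      <!-- Route the gs:// scheme to the GCS connector classes from the jar. -->
      <property>
        <name>fs.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
      </property>
      <property>
        <name>fs.AbstractFileSystem.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
      </property>
      <!-- Project that owns the buckets; placeholder value. -->
      <property>
        <name>fs.gs.project.id</name>
        <value>your-project-id</value>
      </property>
      <!-- Use the OAuth2 client credentials created in Google Console -> API-Manager -> Credentials
           instead of a service account. Values below are placeholders. -->
      <property>
        <name>fs.gs.auth.service.account.enable</name>
        <value>false</value>
      </property>
      <property>
        <name>fs.gs.auth.client.id</name>
        <value>your-client-id.apps.googleusercontent.com</value>
      </property>
      <property>
        <name>fs.gs.auth.client.secret</name>
        <value>your-client-secret</value>
      </property>
    </configuration>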
Alternative 1: Instead of copying the jar to the $SPARK/jars folder, you can store the jar in any folder and add that folder to the Spark classpath. One way is to edit SPARK_CLASSPATH in the spark-env.sh file, but SPARK_CLASSPATH is now deprecated. Therefore one can look here on how to add a jar to the Spark classpath; one possibility is sketched below.
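As one illustration (my own sketch, not part of the original answer), the connector jar can simply be handed to Spark at launch time with --jars; the path and artifact name below are placeholders:

    # Pass the connector jar explicitly when starting Spark; adjust the path to
    # wherever you stored the jar.
    spark-shell --jars /path/to/gcs-connector-hadoop2-1.9.17-shaded.jar
    spark-submit --jars /path/to/gcs-connector-hadoop2-1.9.17-shaded.jar your_app.jar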
In my case on Spark 2.4.3 I needed to do the following to enable GCS access from Spark local. I used a JSON keyfile vs. the client.id/secret proposed above.

In $SPARK_HOME/jars/, use the shaded gcs-connector jar from here: http://repo2.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-1.9.17/ or else I had various failures with transitive dependencies.

(Optional) To my build.sbt add the gcs-connector dependency (a possible form is sketched below).
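The answer's original build.sbt snippet is not shown above; a minimal sketch, assuming the same hadoop2-1.9.17 shaded artifact, could look like this:

    // build.sbt (sketch): depend on the shaded GCS connector used above.
    // If the jar already sits in $SPARK_HOME/jars/, a Provided scope may also be appropriate.
    libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-1.9.17" classifier "shaded"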
In $SPARK_HOME/conf/spark-defaults.conf, add the GCS filesystem and keyfile settings (a sketch follows). And everything is working.
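The exact spark-defaults.conf lines are not preserved above; a minimal sketch for a JSON keyfile, using the connector's service-account properties as I understand them (the keyfile path is a placeholder), would be:

    # spark-defaults.conf (sketch): route gs:// URIs to the GCS connector and
    # authenticate with a service-account JSON keyfile (placeholder path).
    spark.hadoop.fs.gs.impl                                      com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
    spark.hadoop.fs.AbstractFileSystem.gs.impl                   com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
    spark.hadoop.google.cloud.auth.service.account.enable        true
    spark.hadoop.google.cloud.auth.service.account.json.keyfile  /path/to/keyfile.json

With something like this in place, spark.read on a gs:// path resolves through the connector jar placed in $SPARK_HOME/jars/.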
Considering that it has been a while since the last answer, I thought I would share my recent solution. Note, the following instructions are for Spark 2.4.4.
Make sure that all the environment variables are properly set up for your Spark application to run. These are:
a. SPARK_HOME pointing to the location where you have saved the Spark installation.
b. GOOGLE_APPLICATION_CREDENTIALS pointing to the location where the JSON key is. If you have just downloaded it, it will be in your ~/Downloads folder.
c. JAVA_HOME pointing to the location where you have your Java 8* "Home" folder.
If you are on Linux/macOS you can use export VAR=DIR, where VAR is the variable name and DIR the location; if you want to set them up permanently, you can add them to your ~/.bash_profile or ~/.zshrc files. For Windows users, in cmd write set VAR=DIR for the current shell session, or setx VAR DIR to store the variables permanently (see the sketch after the footnote below). That has worked for me and I hope it helps others too.
* Spark works on Java 8, therefore some of its features might not be compatible with the latest Java Development Kit.
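A minimal sketch of that environment setup on Linux/macOS, with placeholder paths rather than anything taken from the answer:

    # Placeholder paths; adjust to your machine.
    export SPARK_HOME=/opt/spark-2.4.4-bin-hadoop2.7
    export GOOGLE_APPLICATION_CREDENTIALS=~/Downloads/my-project-key.json
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

    # With the gcs-connector jar on the classpath, a gs:// path can then be read,
    # e.g. from spark-shell:  spark.read.json("gs://my-bucket/some/file.json")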