This page inspired me to try out spark-csv for reading .csv files in PySpark. I found a couple of posts, such as this one, describing how to use spark-csv. But I am not able to initialize the IPython instance by including either the .jar file or the package extension at start-up, as can be done through spark-shell.
That is, instead of `ipython notebook --profile=pyspark`, I tried `ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3`, but it is not supported.
Please advise.
I believe you can also add this as a variable to your spark-defaults.conf file. So something like:
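A minimal sketch of such an entry, assuming the standard `spark.jars.packages` property and the package coordinates from the question:

```
# $SPARK_HOME/conf/spark-defaults.conf
spark.jars.packages   com.databricks:spark-csv_2.10:1.0.3
```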
This will load the spark-csv library into PySpark every time you launch the driver.
Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:
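A sketch of what those lines might look like, assuming the package coordinates from the question; the key point is that the variable is set before `pyspark` is imported:

```python
import os

# Must be set before pyspark is imported, so that the --packages flag
# is picked up when the JVM is launched.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell"
)

import pyspark
```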
This way you are only importing the packages you actually need for your script.
You can simply pass it in the `PYSPARK_SUBMIT_ARGS` environment variable. For example:
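A sketch, assuming a bash shell and the spark-csv coordinates from the question; the trailing `pyspark-shell` token is required when the arguments are supplied this way:

```bash
export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell"
ipython notebook --profile=pyspark
```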
These properties can also be set dynamically in your code, before `SparkContext`/`SparkSession` and the corresponding JVM have been started:
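A minimal sketch, again assuming the package coordinates from the question; the local master and the CSV path are hypothetical placeholders:

```python
import os

# Must run before the SparkContext is created; --packages is only
# read when the JVM is launched.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "spark-csv-example")  # assumed local master
sqlContext = SQLContext(sc)

# spark-csv is now on the classpath (hypothetical file path):
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("some_file.csv")
```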