I have spent almost two days scrolling through the internet and I have been unable to sort out this problem. I am trying to install the graphframes package (version 0.2.0-spark2.0-s_2.11) to run with Spark through PyCharm, but, despite my best efforts, it has been impossible.
I have tried almost everything. Please know that I checked this site here as well before posting this question.
Here is the code I am trying to run:
# IMPORT OTHER LIBS --------------------------------------------------------
import os
import sys
import pandas as pd
# IMPORT SPARK ------------------------------------------------------------------------------------#
# Path to Spark source folder
USER_FILE_PATH = "/Users/<username>"
SPARK_PATH = "/PycharmProjects/GenesAssociation"
SPARK_FILE = "/spark-2.0.0-bin-hadoop2.7"
SPARK_HOME = USER_FILE_PATH + SPARK_PATH + SPARK_FILE
os.environ['SPARK_HOME'] = SPARK_HOME
# Append pySpark to Python Path
sys.path.append(SPARK_HOME + "/python")
sys.path.append(SPARK_HOME + "/python" + "/lib/py4j-0.10.1-src.zip")
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SQLContext
    # graphframes was copied into pyspark's own directory, hence the prefix
    from pyspark import graphframes as gf
except ImportError as ex:
    print "Cannot import Spark modules:", ex
    sys.exit(1)
# GLOBAL VARIABLES ---------------------------------------------------------------------------------#
SC = SparkContext('local')
SQL_CONTEXT = SQLContext(SC)
# MAIN CODE ---------------------------------------------------------------------------------------#
if __name__ == "__main__":
    # Main Path to CSV files
    DATA_PATH = '/PycharmProjects/GenesAssociation/data/'
    FILE_NAME = 'gene_gene_associations_50k.csv'

    # LOAD DATA CSV USING PANDAS -----------------------------------------------------------------#
    print "STEP 1: Loading Gene Nodes -------------------------------------------------------------"
    # Read csv file and load as df
    GENES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A'],
                        low_memory=True,
                        iterator=True,
                        chunksize=1000)
    # Concatenate chunks into list & convert to dataFrame
    GENES_DF = pd.DataFrame(pd.concat(list(GENES), ignore_index=True))
    # Remove duplicates
    GENES_DF_CLEAN = GENES_DF.drop_duplicates(keep='first')
    # Name Columns
    GENES_DF_CLEAN.columns = ['gene_id']
    # Output dataFrame
    print GENES_DF_CLEAN
    # Create vertices
    VERTICES = SQL_CONTEXT.createDataFrame(GENES_DF_CLEAN)
    # Show some vertices
    print VERTICES.take(5)

    print "STEP 2: Loading Gene Edges -------------------------------------------------------------"
    # Read csv file and load as df
    EDGES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A', 'OFFICIAL_SYMBOL_B', 'EXPERIMENTAL_SYSTEM'],
                        low_memory=True,
                        iterator=True,
                        chunksize=1000)
    # Concatenate chunks into list & convert to dataFrame
    EDGES_DF = pd.DataFrame(pd.concat(list(EDGES), ignore_index=True))
    # Name Columns
    EDGES_DF.columns = ["src", "dst", "rel_type"]
    # Output dataFrame
    print EDGES_DF
    # Create edges
    EDGES = SQL_CONTEXT.createDataFrame(EDGES_DF)
    # Show some edges
    print EDGES.take(5)

    # Build the graph
    g = gf.GraphFrame(VERTICES, EDGES)
Needless to say, I have tried copying the graphframes directory (look here to understand what I did) into Spark's pyspark directory, but it seems this is not enough... Everything else I have tried has failed as well. I would appreciate some help with this. You can see the error message I am getting below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/09/19 12:46:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/19 12:46:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
STEP 1: Loading Gene Nodes -------------------------------------------------------------
gene_id
0 MAP2K4
1 MYPN
2 ACVR1
3 GATA2
4 RPA2
5 ARF1
6 ARF3
8 XRN1
9 APP
10 APLP1
11 CITED2
12 EP300
13 APOB
14 ARRB2
15 CSF1R
16 PRRC2A
17 LSM1
18 SLC4A1
19 BCL3
20 ADRB1
21 BRCA1
25 ARVCF
26 PCBD1
27 PSEN2
28 CAPN3
29 ITPR1
30 MAGI1
31 RB1
32 TSG101
33 ORC1
... ...
49379 WDR26
49380 WDR5B
49382 NLE1
49383 WDR12
49385 WDR53
49386 WDR59
49387 WDR61
49409 CHD6
49422 DACT1
49424 KMT2B
49438 SMARCA1
49459 DCLRE1A
49469 F2RL1
49472 SENP8
49475 TSPY1
49479 SERPINB5
49521 HOXA11
49548 SYF2
49553 FOXN3
49557 MLANA
49608 REPIN1
49609 GMNN
49670 HIST2H2BE
49767 BCL7C
49797 SIRT3
49810 KLF4
49858 RHO
49896 MAGEA2
49907 SUV420H2
49958 SAP30L
[6025 rows x 1 columns]
16/09/19 12:46:08 WARN TaskSetManager: Stage 0 contains a task of very large size (107 KB). The maximum recommended task size is 100 KB.
[Row(gene_id=u'MAP2K4'), Row(gene_id=u'MYPN'), Row(gene_id=u'ACVR1'), Row(gene_id=u'GATA2'), Row(gene_id=u'RPA2')]
STEP 2: Loading Gene Edges -------------------------------------------------------------
src dst rel_type
0 MAP2K4 FLNC Two-hybrid
1 MYPN ACTN2 Two-hybrid
2 ACVR1 FNTA Two-hybrid
3 GATA2 PML Two-hybrid
4 RPA2 STAT3 Two-hybrid
5 ARF1 GGA3 Two-hybrid
6 ARF3 ARFIP2 Two-hybrid
7 ARF3 ARFIP1 Two-hybrid
8 XRN1 ALDOA Two-hybrid
9 APP APPBP2 Two-hybrid
10 APLP1 DAB1 Two-hybrid
11 CITED2 TFAP2A Two-hybrid
12 EP300 TFAP2A Two-hybrid
13 APOB MTTP Two-hybrid
14 ARRB2 RALGDS Two-hybrid
15 CSF1R GRB2 Two-hybrid
16 PRRC2A GRB2 Two-hybrid
17 LSM1 NARS Two-hybrid
18 SLC4A1 SLC4A1AP Two-hybrid
19 BCL3 BARD1 Two-hybrid
20 ADRB1 GIPC1 Two-hybrid
21 BRCA1 ATF1 Two-hybrid
22 BRCA1 MSH2 Two-hybrid
23 BRCA1 BARD1 Two-hybrid
24 BRCA1 MSH6 Two-hybrid
25 ARVCF CDH15 Two-hybrid
26 PCBD1 CACNA1C Two-hybrid
27 PSEN2 CAPN1 Two-hybrid
28 CAPN3 TTN Two-hybrid
29 ITPR1 CA8 Two-hybrid
... ... ... ...
49969 SAP30 HDAC3 Affinity Capture-Western
49970 BRCA1 RBBP8 Co-localization
49971 BRCA1 BRCA1 Biochemical Activity
49972 SET TREX1 Co-purification
49973 SET TREX1 Reconstituted Complex
49974 PLAGL1 EP300 Reconstituted Complex
49975 PLAGL1 CREBBP Reconstituted Complex
49976 EP300 PLAGL1 Affinity Capture-Western
49977 MTA1 ESR1 Reconstituted Complex
49978 SIRT2 EP300 Affinity Capture-Western
49979 EP300 SIRT2 Affinity Capture-Western
49980 EP300 HDAC1 Affinity Capture-Western
49981 EP300 SIRT2 Biochemical Activity
49982 MIER1 CREBBP Reconstituted Complex
49983 SMARCA4 SIN3A Affinity Capture-Western
49984 SMARCA4 HDAC2 Affinity Capture-Western
49985 ESR1 NCOA6 Affinity Capture-Western
49986 ESR1 TOP2B Affinity Capture-Western
49987 ESR1 PRKDC Affinity Capture-Western
49988 ESR1 PARP1 Affinity Capture-Western
49989 ESR1 XRCC5 Affinity Capture-Western
49990 ESR1 XRCC6 Affinity Capture-Western
49991 PARP1 TOP2B Affinity Capture-Western
49992 PARP1 PRKDC Affinity Capture-Western
49993 PARP1 XRCC5 Affinity Capture-Western
49994 PARP1 XRCC6 Affinity Capture-Western
49995 SIRT3 XRCC6 Affinity Capture-Western
49996 SIRT3 XRCC6 Reconstituted Complex
49997 SIRT3 XRCC6 Biochemical Activity
49998 HDAC1 PAX3 Affinity Capture-Western
[49999 rows x 3 columns]
16/09/19 12:46:11 WARN TaskSetManager: Stage 1 contains a task of very large size (1211 KB). The maximum recommended task size is 100 KB.
[Row(src=u'MAP2K4', dst=u'FLNC', rel_type=u'Two-hybrid'), Row(src=u'MYPN', dst=u'ACTN2', rel_type=u'Two-hybrid'), Row(src=u'ACVR1', dst=u'FNTA', rel_type=u'Two-hybrid'), Row(src=u'GATA2', dst=u'PML', rel_type=u'Two-hybrid'), Row(src=u'RPA2', dst=u'STAT3', rel_type=u'Two-hybrid')]
Traceback (most recent call last):
File "/Users/username/PycharmProjects/GenesAssociation/__init__.py", line 99, in <module>
g = gf.GraphFrame(VERTICES, EDGES)
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 62, in __init__
self._jvm_gf_api = _java_api(self._sc)
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 34, in _java_api
return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Process finished with exit code 1
Thanks in advance.
You can set PYSPARK_SUBMIT_ARGS either in your code or in a PyCharm run configuration (Run -> Edit Configurations -> choose your configuration -> select the Configuration tab -> Environment variables -> add PYSPARK_SUBMIT_ARGS), with a minimal working example below.
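A sketch of what that looks like in code, assuming the graphframes:graphframes:0.2.0-spark2.0-s_2.11 package coordinates match the version from your question; note that the variable has to be set before the SparkContext is created:

import os

# Tell spark-submit to fetch the graphframes package before starting the JVM.
# Must be set before the SparkContext is created; the trailing 'pyspark-shell'
# is required when launching from a plain Python process.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext('local')
sql_context = SQLContext(sc)

# A toy graph, just enough to confirm that org.graphframes.GraphFramePythonAPI resolves
v = sql_context.createDataFrame([("a",), ("b",)], ["id"])
e = sql_context.createDataFrame([("a", "b", "test")], ["src", "dst", "rel_type"])

from graphframes import GraphFrame

g = GraphFrame(v, e)
g.inDegrees.show()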
You could also add the packages or jars to your spark-defaults.conf.
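For example, assuming the standard spark.jars.packages property, a line like this in SPARK_HOME/conf/spark-defaults.conf should have the same effect:

spark.jars.packages  graphframes:graphframes:0.2.0-spark2.0-s_2.11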
If you use Python 3 with graphframes 0.2, there is a known issue with extracting the Python libraries from the JAR, so you'll have to do it manually. You can, for example, download the JAR file, unzip it, and make sure that the root directory containing graphframes is on your Python path. This has been fixed in graphframes 0.3.
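Since a JAR is just a zip archive, that manual extraction can be scripted; this is only a sketch, and the JAR path is a hypothetical placeholder:

import sys
import zipfile

# Hypothetical path to the downloaded graphframes JAR; adjust to your setup.
JAR = "/path/to/graphframes-0.2.0-spark2.0-s_2.11.jar"

with zipfile.ZipFile(JAR) as jar:
    # Pull out only the bundled Python package, skipping the class files.
    py_files = [name for name in jar.namelist() if name.startswith("graphframes/")]
    jar.extractall("graphframes_py", members=py_files)

# Make the extracted 'graphframes' package importable.
sys.path.append("graphframes_py")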