I created a two-column pandas DataFrame with np.random.randint and then applied groupby operations to generate a second two-column DataFrame. In the result, df.col1 is a Series of lists and df.col2 a Series of integers; the elements inside the lists are of type numpy.int64, as are the elements of the second column, since they come from np.random.randint.
df.a   df.b
3      7
5      2
1      8
...

--- groupby operations --->

df.col1          df.col2
[1, 2, 3, ...]   1
[2, 5, 6, ...]   2
[6, 4, ...]      3
...
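Roughly, the setup looks like this (a minimal sketch; the exact groupby and the col1/col2 names are illustrative, not my real code):

import numpy as np
import pandas as pd

# two random integer columns; every element comes out as numpy.int64
np_list = np.random.randint(0, 2500, size=(10000, 2))
df = pd.DataFrame(np_list, columns=list('ab'))

# collect column 'a' into lists keyed by 'b'
grouped = df.groupby('b')['a'].apply(list).reset_index()
grouped.columns = ['col2', 'col1']   # key column, list column
grouped = grouped[['col1', 'col2']]  # match the layout shown above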
When I try to create the pyspark.sql DataFrame with spark.createDataFrame(df), I get this error: TypeError: not supported type: type 'numpy.int64'.
Going back to the DataFrame generation, I tried different methods to convert the elements from numpy.int64 to Python int, but none of them worked:
# attempt: force plain Python ints at creation time
# (astype(int) still yields an int64 array on 64-bit systems,
# and dtype='int' in the constructor maps to int64 as well)
np_list = np.random.randint(0, 2500, size=(10000, 2)).astype(int)
df = pd.DataFrame(np_list, columns=list('ab'), dtype='int')
I also tried to map the values with lambda x: int(x) or x.item(), but the element type still comes back as numpy.int64.
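For completeness, the map attempts look like this (same illustrative names as above; if I understand pandas correctly, Python ints assigned to a numeric column are coerced back into an int64 dtype, which would explain why the scalar column keeps reporting numpy.int64):

# scalar column: runs without error, but pandas stores the result in an
# int64 column again, so type(grouped['col2'].iloc[0]) is still numpy.int64
grouped['col2'] = grouped['col2'].map(lambda x: int(x))
grouped['col2'] = grouped['col2'].map(lambda x: x.item())

# list column (object dtype): per-element conversion attempt
grouped['col1'] = grouped['col1'].map(lambda lst: [int(x) for x in lst])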
According to the pyspark.sql documentation, it should be possible to load a pandas DataFrame directly, but it seems to break as soon as numpy values are involved.
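For reference, the call follows the documented pattern (minimal sketch; SparkSession setup assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# documented usage: pass the pandas DataFrame straight in
sdf = spark.createDataFrame(grouped)
# -> TypeError: not supported type: type 'numpy.int64'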
Any hints? Thanks!