Converting pandas DataFrame with NumPy values to pyspark DataFrame

Published 2020-07-30 15:11

Question:

I created a two-column pandas DataFrame with np.random.randint, then applied groupby operations to generate a second two-column DataFrame. df.col1 is a series of lists and df.col2 a series of integers; the elements inside the lists are of type numpy.int64, as are the elements of the second column, since they come from random.randint.

df.a        df.b
3            7
5            2
1            8
...

After the groupby operations:

df.col1        df.col2
[1,2,3...]    1
[2,5,6...]    2
[6,4,....]    3
...

When I try to create the pyspark.sql DataFrame with spark.createDataFrame(df), I get this error: TypeError: not supported type: type 'numpy.int64'.

Going back to the DataFrame generation, I tried different methods to convert the elements from numpy.int64 to Python int, but none of them worked:

np_list = np.random.randint(0,2500, size = (10000,2)).astype(IntegerType)
df = pd.DataFrame(np_list,columns = list('ab'), dtype = 'int')

I also tried to map with lambda x: int(x) or x.item(), but the type still remains numpy.int64.
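The mapping attempt above fails for a subtle reason, which this plain pandas/NumPy sketch (no Spark needed) tries to illustrate: int() really does produce a Python int, but storing the result back into a numeric pandas column re-coerces it to numpy.int64, while pulling values out of pandas with .tolist() yields native ints.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2500, size=(5, 2)), columns=list("ab"))

# int() (or .item()) really does give a plain Python int for one value...
v = int(df["a"].iloc[0])

# ...but assigning mapped values back into the DataFrame re-coerces them,
# because the resulting column dtype is still int64:
df["a"] = df["a"].map(int)
still_numpy = df["a"].iloc[0]  # numpy.int64 again

# Pulling the values out of pandas (e.g. with .tolist()) gives native ints:
python_ints = df["a"].tolist()
```

So any fix has to keep the values outside pandas' numeric dtypes (lists, object dtype) right up until they are handed to Spark.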

According to the pyspark.sql documentation, it should be possible to load a pandas DataFrame, but it seems incompatible when it comes with NumPy values. Any hints?

Thanks!

Answer 1:

Well, the way you are doing it doesn't work. If you have something like this, you will get the error because of the first column: Spark doesn't understand a list whose elements are of type numpy.int64.

df.col1        df.col2
[1,2,3...]    1
[2,5,6...]    2
[6,4,....]    3
...

If you have something like this instead, then it should be okay:

df.a        df.b
3            7
5            2
1            8

In terms of your code, try this:

np_list = np.random.randint(0,2500, size = (10000,2))
df = pd.DataFrame(np_list,columns = list('ab'))
spark_df = spark.createDataFrame(df)

You don't really need to cast this as int again, and if you want to do it explicitly, it is array.astype(int). Then just call spark_df.head(). This should work!
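If the list column is actually needed, the numpy.int64 elements have to be converted to native Python ints before handing the frame to Spark. A minimal sketch of that conversion (the spark.createDataFrame call is left commented out, since it assumes a running SparkSession named spark):

```python
import numpy as np
import pandas as pd

# Rebuild a frame like the one in the question: a list column produced
# by groupby, whose list elements start out as numpy.int64:
raw = pd.DataFrame(np.random.randint(0, 10, size=(20, 2)), columns=list("ab"))
grouped = raw.groupby("b")["a"].apply(list).reset_index(name="col1")

# Convert every element of every list to a native Python int:
grouped["col1"] = grouped["col1"].apply(lambda xs: [int(x) for x in xs])

# Now the lists contain plain ints, which Spark can map to an array type:
# spark_df = spark.createDataFrame(grouped)  # assumes a SparkSession `spark`
```

The list contents survive the assignment as Python ints because the column is object dtype, so pandas has no numeric dtype to coerce them back into.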