Converting pandas DataFrame with NumPy values to pyspark DataFrame

Published 2020-07-30 15:11

Question:

I created a two-column pandas DataFrame with np.random.randint, then applied groupby operations to generate a second two-column DataFrame. df.col1 is a series of lists and df.col2 a series of integers; the elements inside the lists are of type numpy.int64, as are the elements of the second column, since they come from random.randint.

df.a        df.b
3            7
5            2
1            8
...

After the groupby operations:

df.col1        df.col2
[1,2,3...]    1
[2,5,6...]    2
[6,4,....]    3
...

When I try to create the pyspark.sql DataFrame with spark.createDataFrame(df), I get this error: TypeError: not supported type: type 'numpy.int64'.

Going back to the DataFrame generation, I tried different methods to convert the elements from numpy.int64 to Python int, but none of them worked:

np_list = np.random.randint(0,2500, size = (10000,2)).astype(IntegerType)
df = pd.DataFrame(np_list,columns = list('ab'), dtype = 'int')

I also tried to map with lambda x: int(x) or x.item(), but the type still remains numpy.int64.
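The mapping attempt above fails for a subtle reason, which this plain pandas/NumPy sketch (no Spark needed) tries to illustrate: int() really does produce a Python int, but storing the result back into a numeric pandas column re-coerces it to numpy.int64, while pulling values out of pandas with .tolist() yields native ints.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2500, size=(5, 2)), columns=list("ab"))

# int() (or .item()) really does give a plain Python int for one value...
v = int(df["a"].iloc[0])

# ...but assigning mapped values back into the DataFrame re-coerces them,
# because the resulting column dtype is still int64:
df["a"] = df["a"].map(int)
still_numpy = df["a"].iloc[0]  # numpy.int64 again

# Pulling the values out of pandas (e.g. with .tolist()) gives native ints:
python_ints = df["a"].tolist()
```

So any fix has to keep the values outside pandas' numeric dtypes (lists, object dtype) right up until they are handed to Spark.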

According to the pyspark.sql documentation, it should be possible to load a pandas DataFrame, but it seems incompatible when it comes with NumPy values. Any hints?

Thanks!

Answer 1:

Well, the way you are doing it doesn't work. If you have something like this, you will get the error because of the first column: Spark doesn't understand a list whose elements are of type numpy.int64.

df.col1        df.col2
[1,2,3...]    1
[2,5,6...]    2
[6,4,....]    3
...

If you have something like this instead, then it should be okay:

df.a        df.b
3            7
5            2
1            8

In terms of your code, try this:

np_list = np.random.randint(0,2500, size = (10000,2))
df = pd.DataFrame(np_list,columns = list('ab'))
spark_df = spark.createDataFrame(df)

You don't really need to cast this as int again, and if you want to do it explicitly, it is array.astype(int). Then just call spark_df.head(). This should work!
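If the list column is actually needed, the numpy.int64 elements have to be converted to native Python ints before handing the frame to Spark. A minimal sketch of that conversion (the spark.createDataFrame call is left commented out, since it assumes a running SparkSession named spark):

```python
import numpy as np
import pandas as pd

# Rebuild a frame like the one in the question: a list column produced
# by groupby, whose list elements start out as numpy.int64:
raw = pd.DataFrame(np.random.randint(0, 10, size=(20, 2)), columns=list("ab"))
grouped = raw.groupby("b")["a"].apply(list).reset_index(name="col1")

# Convert every element of every list to a native Python int:
grouped["col1"] = grouped["col1"].apply(lambda xs: [int(x) for x in xs])

# Now the lists contain plain ints, which Spark can map to an array type:
# spark_df = spark.createDataFrame(grouped)  # assumes a SparkSession `spark`
```

The list contents survive the assignment as Python ints because the column is object dtype, so pandas has no numeric dtype to coerce them back into.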