How do I extract a value (I want an int not row) f

2019-09-21 11:27发布

问题:

I've got a dataframe lets call it "df" in apache spark with 3 colums and about 1000 rows. One of the colums "stores" a double in each row that either is 1.00 or 0.00 lets call it "column x" I need to get the amount of rows in "column x" that is 1.00 to use as a variable.

I know at least 2 ways of doing it but I can't figure out how to finish either of them.

For the first one I first off made new dataframe and selecting "column x" lets call it df2 (getting rid of the other columns that I dont need for this):

df2 = df.select('column_x')

then I created another dataframe that groups up the 1.00 and 0.00 lets call it grouped_df:

grouped_df = df2.map(lambda label : (label, 1)).reduceByKey(lambda a, b: a +b)

This dataframe now only consist of 2 rows instead of 1000. The first row are the 1.00 rows added together into a double and the second rows 0.00.

Now here is the problem, I have no idea how to "extract" the element into a value so I can use it for a calculation. I only managed to use .take(1) or collect() to display that the dataframes element is correct but I cant make for example simple division with that since it doesnt return an int

The other way of doing this is by just filtering out all the 0.00 in df2 and then use .count() on the filtered dataframe since that seem to return an int I can use.

EDIT: This is what it looks like:

回答1:

Once you have the final dataframe with aggregated counts for column, then you can call 'collect' on that Dataframe, this will return the rows of DataFrame as List of Rows datatype.

From the list of Rows, you can query the access the column value by column name and assign to the variable, as below:

>>> df.show()
+--------+----+
|    col1|col2|
+--------+----+
|column_x|1000|
|column_y|2000|
+--------+----+

>>>
>>> test = df.collect()
>>> test
[Row(col1=u'column_x', col2=1000), Row(col1=u'column_y', col2=2000)]
>>>
>>> count_x = test[0].col2
>>> count_x
1000
>>>
>>> count_y = test[1].col2
>>> count_y
2000
>>>


回答2:

edit I didn't notice that you're asking about python, and I wrote the code in Scala, but in principle the solution should be the same, you should only use the python API

The dataframe is essentially a wrapper on a collection of data. Distributed, but a collection nevertheless. There is an operation org.apache.spark.sql.Dataset#collect, which essentially unwraps that collection into a simple scala Array. When you have an array, you can simply take the n-th element from it, or, since you care only about the first element, you can call head() on an array to get the first element. Since you're using a DataFrame, you've got a collection of org.apache.spark.sql.Row elements. To retrieve the value of an element you'd have to call getDouble or whatever value you want to extract from it.

To summarise this is the code that would do what you want (roughly):

val grouped_df = df2.map(lambda label : (label, 1)).reduceByKey(lambda a, b: a +b)
val collectionOfValues: Array[Row] = grouped_df.collect
val topRow: Row = collectionOfValues.head
val value: Double = topRow.getDouble

Hope this is what you're looking for.

Please note as per documentation:

Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError



回答3:

edit: I forgor to write the import.

I solved it by transforming the result into a Panda's dataFrame and then using the int() function on the cell in position [[0][0]] to get the result in the variable x as an integer. Alternatively, you can use float().

import pyspark.sql.functions as f
data=[(1,1,1),(1,2,0),(0,3,1),(1,4,1),(0,1,0),(0,2,0),(1,3,1)]
df=spark.createDataFrame(data,['class_label','review','words'])

print(type(df))

> <class 'pyspark.sql.dataframe.DataFrame'>

print(df)

+-----------+------+-----+ 
|class_label|review|words| 
+-----------+------+-----+ 
|          1|     1|    1| 
|          1|     2|    0| 
|          0|     3|    1| 
|          1|     4|    1| 
|          0|     1|    0| 
|          0|     2|    0| 
|          1|     3|    1| 
+-----------+------+-----+

df2 = df.groupBy().agg(f.sum('class_label').alias('result')).toPandas()

x = int(df2.iloc[[0][0]])

print(type(x))
> <type 'int'>
print(x)
> 4