I've got a dataframe, let's call it "df", in Apache Spark with 3 columns and about 1000 rows. One of the columns stores a double in each row that is either 1.00 or 0.00; let's call it "column x". I need to get the number of rows in "column x" that are 1.00, to use as a variable.
I know at least 2 ways of doing it but I can't figure out how to finish either of them.
For the first one, I first made a new dataframe selecting only "column x", let's call it df2 (getting rid of the other columns that I don't need for this):
df2 = df.select('column_x')
Then I created another dataframe that groups up the 1.00 and 0.00 values, let's call it grouped_df:
grouped_df = df2.rdd.map(lambda row: (row[0], 1)).reduceByKey(lambda a, b: a + b)
This dataframe now only consists of 2 rows instead of 1000: the first row is the 1.00 rows added together into a double, and the second is the 0.00 rows.
Now here is the problem: I have no idea how to "extract" that element into a value so I can use it for a calculation. I only managed to use .take(1) or collect() to display that the dataframe's element is correct, but I can't do, for example, a simple division with that, since it doesn't return an int.
The other way of doing this is to just filter out all the 0.00 rows in df2 and then use .count() on the filtered dataframe, since that seems to return an int I can use.
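(For reference, a minimal sketch of that second approach, assuming df2 is the single-column dataframe from above and the column is actually named 'column_x':)
ones = df2.filter(df2['column_x'] == 1.00).count()  # count() returns a plain Python int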
Once you have the final dataframe with the aggregated counts for the column, you can call collect on that DataFrame; this returns the rows of the DataFrame as a list of Row objects.
From the list of Rows, you can access the column value by column name and assign it to a variable, as below:
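A rough PySpark sketch of that idea (the exact code isn't shown here, so the groupBy/collect step and the column names are assumptions):
# assuming df is the original dataframe from the question
grouped_df = df.groupBy('column_x').count()   # one row per distinct value, plus its count
rows = grouped_df.collect()                   # list of Row objects, now on the driver
ones = [row['count'] for row in rows if row['column_x'] == 1.00][0]  # count of the 1.00 rows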
Edit: I didn't notice that you're asking about Python, and I wrote the code in Scala, but in principle the solution should be the same; you just need to use the Python API.
The dataframe is essentially a wrapper on a collection of data. Distributed, but a collection nevertheless. There is an operation, org.apache.spark.sql.Dataset#collect, which essentially unwraps that collection into a simple Scala Array. When you have an array, you can simply take the n-th element from it, or, since you only care about the first element, you can call head() on the array to get the first element. Since you're using a DataFrame, you've got a collection of org.apache.spark.sql.Row elements. To retrieve the value of an element you'd have to call getDouble, or whatever type of value you want to extract from it.
To summarise, this is the code that would do what you want (roughly):
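The original Scala snippet isn't reproduced here; a PySpark version of the same idea would look roughly like this (a sketch, with an assumed column name and toy data standing in for the asker's df):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy stand-in for the asker's data; the column name is an assumption
df = spark.createDataFrame([(1.00,), (0.00,), (1.00,)], ['column_x'])

counts = df.groupBy('column_x').count()   # aggregated counts per value
first_row = counts.head()                 # head() gives the first Row of the collection
value = first_row['count']                # in Python you read the value by name instead of getDouble
print(value)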
Hope this is what you're looking for.
Please note, as per the documentation: collect brings the entire result to the driver process, so it should only be called on small DataFrames such as this aggregated one.
Edit: I forgot to write the import.
I solved it by transforming the result into a pandas DataFrame and then using the int() function on the cell at position [0][0] to get the result into the variable x as an integer. Alternatively, you can use float().
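For completeness, a sketch of that pandas route (the groupBy aggregation and the label-based cell lookup here are assumptions; I originally indexed the cell by position):
counts = df.groupBy('column_x').count()                      # aggregated counts as a Spark DataFrame
pdf = counts.toPandas()                                      # small result, safe to convert to pandas
x = int(pdf.loc[pdf['column_x'] == 1.00, 'count'].iloc[0])   # the 1.00 count as a plain Python int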