I am new to Spark, and I want to use group-by & reduce to find the following from a CSV (one line per employee):
Department, Designation, costToCompany, State
Sales, Trainee, 12000, UP
Sales, Lead, 32000, AP
Sales, Lead, 32000, LA
Sales, Lead, 32000, TN
Sales, Lead, 32000, AP
Sales, Lead, 32000, TN
Sales, Lead, 32000, LA
Sales, Lead, 32000, LA
Marketing, Associate, 18000, TN
Marketing, Associate, 18000, TN
HR, Manager, 58000, TN
I would like to simplify the above CSV with a group by on Department, Designation, State, with additional columns for sum(costToCompany) and TotalEmployeeCount.
Should get a result like:
Dept, Desg, state, empCount, totalCost
Sales,Lead,AP,2,64000
Sales,Lead,LA,3,96000
Sales,Lead,TN,2,64000
Is there any way to achieve this using transformations and actions? Or should we go for RDD operations?
Procedure
Create a class (schema) to encapsulate your structure (this is not required for approach B, but it will make your code easier to read if you are using Java)
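A minimal sketch of such a class (the class and field names are my own choice, mirroring the CSV columns):

```java
import java.io.Serializable;

// Simple schema/bean class describing one CSV row
public class Record implements Serializable {
    private String department;
    private String designation;
    private double costToCompany;
    private String state;

    public Record() {}

    public Record(String department, String designation, double costToCompany, String state) {
        this.department = department;
        this.designation = designation;
        this.costToCompany = costToCompany;
        this.state = state;
    }

    public String getDepartment() { return department; }
    public void setDepartment(String department) { this.department = department; }
    public String getDesignation() { return designation; }
    public void setDesignation(String designation) { this.designation = designation; }
    public double getCostToCompany() { return costToCompany; }
    public void setCostToCompany(double costToCompany) { this.costToCompany = costToCompany; }
    public String getState() { return state; }
    public void setState(String state) { this.state = state; }
}
```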
Load the CSV (or JSON) file
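For example, assuming a local file employees.csv with the layout from the question, the loading step might look roughly like this:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("employee-aggregation")   // app name and master are arbitrary choices
        .master("local[*]")
        .getOrCreate();
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

JavaRDD<Record> records = sc.textFile("employees.csv")            // path is an assumption
        .filter(line -> !line.startsWith("Department"))           // skip the header row
        .map(line -> {
            String[] f = line.split("\\s*,\\s*");
            return new Record(f[0], f[1], Double.parseDouble(f[2]), f[3]);
        });
```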
At this point you have 2 approaches:
A. SparkSQL
Register a table (using your defined schema class)
Query the table with your desired group-by query
Here you would also be able to do any other query you desire, using a SQL approach
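A sketch of approach A, reusing the spark session and the records RDD from the loading step (the view and alias names are arbitrary):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> employeeDf = spark.createDataFrame(records, Record.class);
employeeDf.createOrReplaceTempView("employee");

spark.sql("SELECT department, designation, state, COUNT(*) AS empCount, SUM(costToCompany) AS totalCost "
        + "FROM employee "
        + "GROUP BY department, designation, state")
     .show();
```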
B. Spark
Map using a composite key: Department, Designation, State (both B steps are sketched together after the next step)
reduceByKey using the composite key, summing the costToCompany column, and accumulating the number of records by key
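A sketch of approach B, again reusing the records RDD; encoding the composite key as a delimited string is just one possible choice:

```java
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Key: "department|designation|state"; value: (employee count, total cost)
JavaPairRDD<String, Tuple2<Integer, Double>> keyed = records.mapToPair(r ->
        new Tuple2<>(r.getDepartment() + "|" + r.getDesignation() + "|" + r.getState(),
                     new Tuple2<Integer, Double>(1, r.getCostToCompany())));

JavaPairRDD<String, Tuple2<Integer, Double>> aggregated = keyed.reduceByKey((a, b) ->
        new Tuple2<>(a._1() + b._1(), a._2() + b._2()));

aggregated.collect().forEach(t ->
        System.out.println(t._1() + " -> empCount=" + t._2()._1() + ", totalCost=" + t._2()._2()));
```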
The following might not be entirely correct, but it should give you some idea of how to juggle the data. It's not pretty and should be replaced with case classes etc., but as a quick example of how to use the Spark API, I hope it's enough :)
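A rough, standalone approximation of that idea in Java (rather than Scala case classes), working directly on the raw lines and printing rows in the output layout from the question; the file name and parsing details are assumptions:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Raw lines in, "Dept,Desg,State,empCount,totalCost" strings out
JavaSparkContext sc = new JavaSparkContext("local[*]", "quick-example");

JavaPairRDD<String, Tuple2<Integer, Double>> perGroup = sc.textFile("employees.csv")
        .filter(line -> !line.startsWith("Department"))            // drop the header row
        .mapToPair(line -> {
            String[] f = line.split("\\s*,\\s*");
            return new Tuple2<>(f[0] + "," + f[1] + "," + f[3],    // Department,Designation,State
                                new Tuple2<Integer, Double>(1, Double.parseDouble(f[2])));
        })
        .reduceByKey((a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()));

perGroup.map(t -> t._1() + "," + t._2()._1() + "," + t._2()._2())
        .collect()
        .forEach(System.out::println);
```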
Or you can use SparkSQL:
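For example (a sketch that reads the CSV straight into a DataFrame and aggregates with the DataFrame API; the file name and a clean header row are assumptions):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

SparkSession spark = SparkSession.builder().appName("quick-example").master("local[*]").getOrCreate();

Dataset<Row> employees = spark.read()
        .option("header", "true")        // assumes a header row: Department,Designation,costToCompany,State
        .option("inferSchema", "true")
        .csv("employees.csv");           // file name is an assumption

employees.groupBy("Department", "Designation", "State")
        .agg(count(lit(1)).alias("empCount"), sum("costToCompany").alias("totalCost"))
        .show();
```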
Using Spark 2.x (and above) with Java
Create a SparkSession object, aka spark
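For example (the app name and master are placeholders):

```java
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("EmployeeAggregation")   // app name is an arbitrary choice
        .master("local[*]")               // or your cluster's master URL
        .getOrCreate();
```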
Create a schema for Row with StructType
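One way to build it (the chosen data types are assumptions based on the sample data):

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

StructType schema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("Department",    DataTypes.StringType, true),
        DataTypes.createStructField("Designation",   DataTypes.StringType, true),
        DataTypes.createStructField("costToCompany", DataTypes.LongType,   true),
        DataTypes.createStructField("State",         DataTypes.StringType, true)
});
```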
Create a dataframe from the CSV file and apply the schema to it
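For example, reusing the spark session and schema from the previous steps (the path is an assumption):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read()
        .option("header", "true")     // the sample data has a header row
        .schema(schema)               // the StructType defined above
        .csv("employees.csv");        // the path is an assumption
```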
(There are more options available when reading data from a CSV file.)
Now we can aggregate the data in 2 ways:
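A sketch of both ways, reusing the df from the previous step (the view and alias names are arbitrary):

```java
import static org.apache.spark.sql.functions.*;

// 1. SQL way: register a temporary view and run plain SQL against it
df.createOrReplaceTempView("employee");
spark.sql("SELECT Department, Designation, State, COUNT(*) AS empCount, SUM(costToCompany) AS totalCost "
        + "FROM employee GROUP BY Department, Designation, State").show();

// 2. DataFrame API way: method chaining on the Dataset itself
df.groupBy("Department", "Designation", "State")
  .agg(count(lit(1)).alias("empCount"), sum("costToCompany").alias("totalCost"))
  .show();
```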
Dependent libraries: spark-core and spark-sql for your Spark 2.x version
For JSON, if your text file contains one JSON object per line, you can use sqlContext.jsonFile(path) to let Spark SQL load it as a SchemaRDD (the schema will be automatically inferred). Then, you can register it as a table and query it with SQL. You can also manually load the text file as an RDD[String] containing one JSON object per record and use sqlContext.jsonRDD(rdd) to turn it into a SchemaRDD. jsonRDD is useful when you need to pre-process your data.
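jsonFile and jsonRDD belong to the older SQLContext/SchemaRDD API; with the newer SparkSession API the same idea looks roughly like this (the file name and field names are assumptions):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("json-example").master("local[*]").getOrCreate();

// One JSON object per line; the schema is inferred automatically
Dataset<Row> employees = spark.read().json("employees.json");   // file name is an assumption

employees.createOrReplaceTempView("employee");
spark.sql("SELECT Department, Designation, State, COUNT(*) AS empCount, SUM(costToCompany) AS totalCost "
        + "FROM employee GROUP BY Department, Designation, State").show();
```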