Question
Problem 8. Suppose we have the following workers' information.

name: Mike, Paul, Bob, Olivia, Rob, Susan, David, Emma, Lisa
age: 19, 26, 25, 30, 32, 36, 35, 32
gender: Male, Male, Male, Female, Male, Female, Male, Female, Female
occupation: Computer Scientist, Computer Scientist, Computer Scientist, Accountant, Computer Scientist, Computer Scientist, Accountant, Accountant, Accountant

The data is stored in a JSON file "/home/rob/exam2/workers_spark.json". The data file is uploaded into iCollege.

We want to compute the number of workers above age 20 in each gender and each occupation. That is, we want to get the following table from the above one.

gender: Male, Female, Male, Female
occupation: Computer Scientist, Computer Scientist, Accountant, Accountant
count:

Note that Mike is 19, which is less than 20. Therefore, he is filtered out and not counted.

From the above result table, we can see that there are more male workers in Computer Science and more female workers in Accounting. This is what we learnt from the original workers' information table.

Explanation / Answer
Below is code in PySpark, Spark SQL and Pig that filters, groups and aggregates the given data.
1) PySpark
The JSON file is read in the first statement. The filter, group-by and count are then applied to the data in the second statement.
df = sqlContext.read.json('/home/rob/exam2/workers_spark.json')
df.filter(df.age >= 20).groupby(df.gender, df.occupation).count().show()
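The same filter, group-by and count steps can be sketched in plain Python without Spark, which may help when checking the logic by hand. This is only an illustration: the records in `sample_workers`, the file name `workers_sample.json` and all function names are hypothetical and are not the exam data. The sketch writes one JSON object per line, which is the layout Spark's JSON reader expects by default.

```python
import json
import os
import tempfile
from collections import Counter

# Hypothetical sample records (NOT the exam data), one dict per worker.
sample_workers = [
    {"name": "A", "age": 19, "gender": "Male", "occupation": "Computer Scientist"},
    {"name": "B", "age": 26, "gender": "Male", "occupation": "Computer Scientist"},
    {"name": "C", "age": 30, "gender": "Female", "occupation": "Accountant"},
    {"name": "D", "age": 41, "gender": "Female", "occupation": "Accountant"},
]


def write_json_lines(records, path):
    """Write one JSON object per line (JSON Lines layout)."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def count_by_gender_occupation(path, min_age=20):
    """Keep workers with age >= min_age, then count per (gender, occupation)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            worker = json.loads(line)
            if worker["age"] >= min_age:  # the filter step
                # the group-by key and the count in one update
                counts[(worker["gender"], worker["occupation"])] += 1
    return counts


def run_example():
    """Write the sample file to a temporary directory and aggregate it."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "workers_sample.json")
        write_json_lines(sample_workers, path)
        return count_by_gender_occupation(path)


if __name__ == "__main__":
    for (gender, occupation), n in sorted(run_example().items()):
        print(gender, occupation, n)
```

The `Counter` keyed by a `(gender, occupation)` tuple plays the role of `groupby(...).count()`; the 19-year-old sample worker is dropped by the age test before anything is counted.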
2) Spark SQL
val df = spark.read.json("/home/rob/exam2/workers_spark.json")
df.createOrReplaceTempView("iCollege")
val sqlDF = spark.sql("SELECT gender, occupation, COUNT(name) FROM iCollege WHERE age >= 20 GROUP BY gender, occupation")
sqlDF.show()
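The query itself is standard SQL, so its behaviour can be checked without a Spark session. The sketch below runs the same SELECT against an in-memory SQLite table using Python's built-in `sqlite3` module. SQLite is only a stand-in for Spark SQL here, the rows in `sample_rows` are hypothetical rather than the exam data, and an `ORDER BY` clause is added purely to make the output order deterministic.

```python
import sqlite3

# Hypothetical sample rows (NOT the exam data): name, age, gender, occupation.
sample_rows = [
    ("A", 19, "Male", "Computer Scientist"),
    ("B", 26, "Male", "Computer Scientist"),
    ("C", 30, "Female", "Accountant"),
    ("D", 41, "Female", "Accountant"),
]


def count_workers(rows):
    """Load rows into an in-memory table and run the grouping query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE iCollege "
            "(name TEXT, age INTEGER, gender TEXT, occupation TEXT)"
        )
        conn.executemany("INSERT INTO iCollege VALUES (?, ?, ?, ?)", rows)
        query = (
            "SELECT gender, occupation, COUNT(name) FROM iCollege "
            "WHERE age >= 20 GROUP BY gender, occupation "
            "ORDER BY gender, occupation"  # added only for a stable order
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in count_workers(sample_rows):
        print(row)
```

The `WHERE` clause is applied before grouping, so a worker under 20 never reaches the `COUNT`.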
3) Pig
The data is assumed to be already loaded into a relation named iCollege. The rows are first filtered by age, then grouped, and the count is computed in the last statement.
above_20 = FILTER iCollege BY age >= 20;
by_gender_occu = GROUP above_20 BY (gender, occupation);
by_gender_occupation_count = FOREACH by_gender_occu GENERATE FLATTEN(group) AS (gender, occupation), COUNT(above_20);