Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

Ineed help in writing hive command for these questions: u.data table -- The data

ID: 3586236 • Letter: I

Question

Ineed help in writing hive command for these questions:

u.data table -- The dataset has 100000 ratings by 943 users on 1682 movies.   The file has 4 tab (" ") separated columns. The first column is the user id, the second column is the movie id, the third column is the rating, and the fourth column is a timestamp.

u.user table - Demographic information about the users; this has 5 pipe "|" separated columns. the first column is the user id, the second column is the age, the third column is the gender (Male denoted by 'M' and Female denoted by 'F'), fourth column is the occupation, and the fifth column is the zip code. The user ids are the ones used in the u.data data set.

1.Find the user id who has rated the most number of movies

2.Find average rating received by movie with id 178.

3.The users belonging to which 3 occupations provided the most number of ratings

4.How many unique male users provided at least one rating of 5.

Explanation / Answer

The answers and the explanation of all above questiones/Queries are below:

1.Find the user id who has rated the most number of movies

-- Here we need to calculate what is max numbers of movies a user rated and filter the data

-- Count(1) will have more performance then Count(*)

SELECT userID

FROM(

SELECT userID, COUNT(1) as cnt

FROM Table_data

GROUP BY userID

HAVING COUNT(1) = (

SELECT MAX(c1)

FROM (

SELECT userID, COUNT(1) as c1

FROM Table_data

GROUP BY userID

) t1

)

) t2

---------------------------------------------

2.Find average rating received by movie with id 178.

-- This is very simple. First Filter the movie on basis of ID, then Calculate average using AVG Function

SELECT AVG(rating)

FROM Table_data

WHERE movieID = 178

---------------------------------------------------------

3.The users belonging to which 3 occupations provided the most number of ratings

-- Here Firstly we are having Count of movies rated by a user.

-- Take First 3 userIDs based upon max count of movies

-- Join with user table and get occupation of those three users

SELECT u.occupation

FROM(

SELECT userID,ROW_NUMBER() OVER (partition by userID order by cnt desc) AS row_num

FROM(

SELECT userID, COUNT(1) as cnt

FROM Table_data

GROUP BY userID

) t1

) t2

JOIN Table_user u on (t2.userID = u.userID)

WHERE row_num < 4

---------------------------------------------------------

4.How many unique male users provided at least one rating of 5.

-- This is again simple, First , Join the data and user data on userID

-- Then filter the data having rating = 5 and gender = 'M'

-- Count the distinct users using COUNT(DISTINCT x) clause

SELECT COUNT(DISTINCT c.userID)

FROM TABLE_DATA d

JOIN TABLE_USER u on (d.userID = c.userID)

WHERE d.rating = 5 and u.gender = 'M'