Ineed help in writing hive command for these questions: u.data table -- The data
ID: 3586236 • Letter: I
Question
Ineed help in writing hive command for these questions:
u.data table -- The dataset has 100000 ratings by 943 users on 1682 movies. The file has 4 tab (" ") separated columns. The first column is the user id, the second column is the movie id, the third column is the rating, and the fourth column is a timestamp.
u.user table - Demographic information about the users; this has 5 pipe "|" separated columns. the first column is the user id, the second column is the age, the third column is the gender (Male denoted by 'M' and Female denoted by 'F'), fourth column is the occupation, and the fifth column is the zip code. The user ids are the ones used in the u.data data set.
1.Find the user id who has rated the most number of movies
2.Find average rating received by movie with id 178.
3.The users belonging to which 3 occupations provided the most number of ratings
4.How many unique male users provided at least one rating of 5.
Explanation / Answer
The answers and the explanation of all above questiones/Queries are below:
1.Find the user id who has rated the most number of movies
-- Here we need to calculate what is max numbers of movies a user rated and filter the data
-- Count(1) will have more performance then Count(*)
SELECT userID
FROM(
SELECT userID, COUNT(1) as cnt
FROM Table_data
GROUP BY userID
HAVING COUNT(1) = (
SELECT MAX(c1)
FROM (
SELECT userID, COUNT(1) as c1
FROM Table_data
GROUP BY userID
) t1
)
) t2
---------------------------------------------
2.Find average rating received by movie with id 178.
-- This is very simple. First Filter the movie on basis of ID, then Calculate average using AVG Function
SELECT AVG(rating)
FROM Table_data
WHERE movieID = 178
---------------------------------------------------------
3.The users belonging to which 3 occupations provided the most number of ratings
-- Here Firstly we are having Count of movies rated by a user.
-- Take First 3 userIDs based upon max count of movies
-- Join with user table and get occupation of those three users
SELECT u.occupation
FROM(
SELECT userID,ROW_NUMBER() OVER (partition by userID order by cnt desc) AS row_num
FROM(
SELECT userID, COUNT(1) as cnt
FROM Table_data
GROUP BY userID
) t1
) t2
JOIN Table_user u on (t2.userID = u.userID)
WHERE row_num < 4
---------------------------------------------------------
4.How many unique male users provided at least one rating of 5.
-- This is again simple, First , Join the data and user data on userID
-- Then filter the data having rating = 5 and gender = 'M'
-- Count the distinct users using COUNT(DISTINCT x) clause
SELECT COUNT(DISTINCT c.userID)
FROM TABLE_DATA d
JOIN TABLE_USER u on (d.userID = c.userID)
WHERE d.rating = 5 and u.gender = 'M'