Project - MovieLens Data Analysis

The GroupLens Research Project is a research group in the Department of Computer Science and Engineering at the University of Minnesota. The data is widely used for collaborative filtering and other filtering solutions. However, we will be using this data to act as a means to demonstrate our skill in using Python to “play” with data.

Objective:

  • To implement the techniques learnt as a part of the course.

Learning Outcomes:

  • Exploratory Data Analysis

  • Visualization using Python

  • Pandas – groupby, merging

Domain

  • Internet and Entertainment

Note that the project will need you to apply the concepts of groupby and merging extensively.

Datasets Information:

rating.csv: It contains information on ratings given by the users to a particular movie.

  • user id: id assigned to every user
  • movie id: id assigned to every movie
  • rating: rating given by the user
  • timestamp: Time recorded when the user gave a rating

movie.csv: File contains information related to the movies and their genre.

  • movie id: id assigned to every movie
  • movie title: Title of the movie
  • release date: Date of release of the movie
  • Action: Genre containing binary values (1 - for action 0 - not action)
  • Adventure: Genre containing binary values (1 - for adventure 0 - not adventure)
  • Animation: Genre containing binary values (1 - for animation 0 - not animation)
  • Children’s: Genre containing binary values (1 - for children's 0 - not children's)
  • Comedy: Genre containing binary values (1 - for comedy 0 - not comedy)
  • Crime: Genre containing binary values (1 - for crime 0 - not crime)
  • Documentary: Genre containing binary values (1 - for documentary 0 - not documentary)
  • Drama: Genre containing binary values (1 - for drama 0 - not drama)
  • Fantasy: Genre containing binary values (1 - for fantasy 0 - not fantasy)
  • Film-Noir: Genre containing binary values (1 - for film-noir 0 - not film-noir)
  • Horror: Genre containing binary values (1 - for horror 0 - not horror)
  • Musical: Genre containing binary values (1 - for musical 0 - not musical)
  • Mystery: Genre containing binary values (1 - for mystery 0 - not mystery)
  • Romance: Genre containing binary values (1 - for romance 0 - not romance)
  • Sci-Fi: Genre containing binary values (1 - for sci-fi 0 - not sci-fi)
  • Thriller: Genre containing binary values (1 - for thriller 0 - not thriller)
  • War: Genre containing binary values (1 - for war 0 - not war)
  • Western: Genre containing binary values (1 - for western - not western)

user.csv: It contains information of the users who have rated the movies.

  • user id: id assigned to every user
  • age: Age of the user
  • gender: Gender of the user
  • occupation: Occupation of the user
  • zip code: Zip code of the use

Please provide you insights wherever necessary.

1. Import the necessary packages - 2.5 marks

In [539]:
#Importing Numpy, Pandas and Seaborn

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mplt

2. Read the 3 datasets into dataframes - 2.5 marks

In [540]:
#Reading the CSV files into Jupyter Notebook.

user = pd.read_csv('user.csv')
item = pd.read_csv('item.csv')
data = pd.read_csv('Data.csv')

3. Apply info, shape, describe, and find the number of missing values in the data - 5 marks

  • Note that you will need to do it for all the three datasets seperately
In [541]:
#Finding a basic understanding of the datasets: its contents, shape, descriptive statistics, null values, etc.

print('\033[1m' + 'User:')
print('\033[0m')
print(user.info)        #User dataframe
print(user.shape)       #User dataframe number of rows and columns
print(user.describe())  #Descriptive Statistics for user
print(user.isnull().sum().sum())     #Number of missing observations for user
print('')
print('')
print('Item')           #Item dataframe.
print(item.info)
print(item.shape)
print(item.describe())
print(item.isnull().sum().sum())
print('')
print('')
print('Data')           #Data dataframe
print(data.info)
print(data.shape)
print(data.describe())
print(data.isnull().sum().sum())
User:

<bound method DataFrame.info of      user id  age gender     occupation zip code
0          1   24      M     technician    85711
1          2   53      F          other    94043
2          3   23      M         writer    32067
3          4   24      M     technician    43537
4          5   33      F          other    15213
..       ...  ...    ...            ...      ...
938      939   26      F        student    33319
939      940   32      M  administrator    02215
940      941   20      M        student    97229
941      942   48      F      librarian    78209
942      943   22      M        student    77841

[943 rows x 5 columns]>
(943, 5)
          user id         age
count  943.000000  943.000000
mean   472.000000   34.051962
std    272.364951   12.192740
min      1.000000    7.000000
25%    236.500000   25.000000
50%    472.000000   31.000000
75%    707.500000   43.000000
max    943.000000   73.000000
0


Item
<bound method DataFrame.info of       movie id                          movie title release date  unknown  \
0            1                           Toy Story   01-Jan-1995        0   
1            2                           GoldenEye   01-Jan-1995        0   
2            3                          Four Rooms   01-Jan-1995        0   
3            4                          Get Shorty   01-Jan-1995        0   
4            5                             Copycat   01-Jan-1995        0   
...        ...                                  ...          ...      ...   
1676      1678                          Mat' i syn   06-Feb-1998        0   
1677      1679                           B. Monkey   06-Feb-1998        0   
1678      1680                       Sliding Doors   01-Jan-1998        0   
1679      1681                        You So Crazy   01-Jan-1994        0   
1680      1682  Scream of Stone (Schrei aus Stein)   08-Mar-1996        0   

      Action  Adventure  Animation  Childrens  Comedy  Crime  ...  Fantasy  \
0          0          0          1          1       1      0  ...        0   
1          1          1          0          0       0      0  ...        0   
2          0          0          0          0       0      0  ...        0   
3          1          0          0          0       1      0  ...        0   
4          0          0          0          0       0      1  ...        0   
...      ...        ...        ...        ...     ...    ...  ...      ...   
1676       0          0          0          0       0      0  ...        0   
1677       0          0          0          0       0      0  ...        0   
1678       0          0          0          0       0      0  ...        0   
1679       0          0          0          0       1      0  ...        0   
1680       0          0          0          0       0      0  ...        0   

      Film-Noir  Horror  Musical  Mystery  Romance  Sci-Fi  Thriller  War  \
0             0       0        0        0        0       0         0    0   
1             0       0        0        0        0       0         1    0   
2             0       0        0        0        0       0         1    0   
3             0       0        0        0        0       0         0    0   
4             0       0        0        0        0       0         1    0   
...         ...     ...      ...      ...      ...     ...       ...  ...   
1676          0       0        0        0        0       0         0    0   
1677          0       0        0        0        1       0         1    0   
1678          0       0        0        0        1       0         0    0   
1679          0       0        0        0        0       0         0    0   
1680          0       0        0        0        0       0         0    0   

      Western  
0           0  
1           0  
2           0  
3           0  
4           0  
...       ...  
1676        0  
1677        0  
1678        0  
1679        0  
1680        0  

[1681 rows x 22 columns]>
(1681, 22)
          movie id      unknown       Action    Adventure    Animation  \
count  1681.000000  1681.000000  1681.000000  1681.000000  1681.000000   
mean    841.841761     0.000595     0.149316     0.080309     0.024985   
std     485.638077     0.024390     0.356506     0.271852     0.156126   
min       1.000000     0.000000     0.000000     0.000000     0.000000   
25%     422.000000     0.000000     0.000000     0.000000     0.000000   
50%     842.000000     0.000000     0.000000     0.000000     0.000000   
75%    1262.000000     0.000000     0.000000     0.000000     0.000000   
max    1682.000000     1.000000     1.000000     1.000000     1.000000   

         Childrens       Comedy        Crime  Documentary        Drama  \
count  1681.000000  1681.000000  1681.000000  1681.000000  1681.000000   
mean      0.072576     0.300416     0.064842     0.029744     0.431291   
std       0.259516     0.458576     0.246321     0.169931     0.495404   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.000000     0.000000     0.000000     0.000000     0.000000   
50%       0.000000     0.000000     0.000000     0.000000     0.000000   
75%       0.000000     1.000000     0.000000     0.000000     1.000000   
max       1.000000     1.000000     1.000000     1.000000     1.000000   

           Fantasy    Film-Noir       Horror      Musical      Mystery  \
count  1681.000000  1681.000000  1681.000000  1681.000000  1681.000000   
mean      0.013087     0.014277     0.054729     0.033314     0.036288   
std       0.113683     0.118667     0.227519     0.179507     0.187061   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.000000     0.000000     0.000000     0.000000     0.000000   
50%       0.000000     0.000000     0.000000     0.000000     0.000000   
75%       0.000000     0.000000     0.000000     0.000000     0.000000   
max       1.000000     1.000000     1.000000     1.000000     1.000000   

           Romance       Sci-Fi     Thriller          War      Western  
count  1681.000000  1681.000000  1681.000000  1681.000000  1681.000000  
mean      0.146936     0.060083     0.149316     0.042237     0.016062  
std       0.354148     0.237712     0.356506     0.201189     0.125751  
min       0.000000     0.000000     0.000000     0.000000     0.000000  
25%       0.000000     0.000000     0.000000     0.000000     0.000000  
50%       0.000000     0.000000     0.000000     0.000000     0.000000  
75%       0.000000     0.000000     0.000000     0.000000     0.000000  
max       1.000000     1.000000     1.000000     1.000000     1.000000  
0


Data
<bound method DataFrame.info of        user id  movie id  rating  timestamp
0          196       242       3  881250949
1          186       302       3  891717742
2           22       377       1  878887116
3          244        51       2  880606923
4          166       346       1  886397596
...        ...       ...     ...        ...
99995      880       476       3  880175444
99996      716       204       5  879795543
99997      276      1090       1  874795795
99998       13       225       2  882399156
99999       12       203       3  879959583

[100000 rows x 4 columns]>
(100000, 4)
            user id       movie id         rating     timestamp
count  100000.00000  100000.000000  100000.000000  1.000000e+05
mean      462.48475     425.530130       3.529860  8.835289e+08
std       266.61442     330.798356       1.125674  5.343856e+06
min         1.00000       1.000000       1.000000  8.747247e+08
25%       254.00000     175.000000       3.000000  8.794487e+08
50%       447.00000     322.000000       4.000000  8.828269e+08
75%       682.00000     631.000000       4.000000  8.882600e+08
max       943.00000    1682.000000       5.000000  8.932866e+08
0

USER dataset:

The user dataset has the general information of individuals who rated the movies. It contains their user id, age, gender, occupation and zip code. It has 943 rows, and 5 columns. User IDs range from 1-943. The average age is 34, and the median age of this sample population is 31. There are no missing values in this dataset.

ITEM dataset:

The item dataset contains the list of movies titles that were evaluated. Each movie is accompanied by a movie id, the release date, and the genres into which they are classified. The dataset contains 1681 rows and 22 columns. This dataset contains no null values.

DATA dataset:

The data dataset contains the ratings of the movies of users in the other 2 datasets. There are in total 100,000 rows and 4 columns. Ratings had a mean of 3.5 and a median of 4. The lowest rating given is 1, and the maximum rating given is 5. This dataset contains no null values.

4. Find the number of movies per genre using the item data - 2.5 marks

In [542]:
item.sum() #Number of movies by genre
Out[542]:
movie id                                                  1415136
movie title     Toy Story GoldenEye Four Rooms Get Shorty Copy...
release date    01-Jan-199501-Jan-199501-Jan-199501-Jan-199501...
unknown                                                         1
Action                                                        251
Adventure                                                     135
Animation                                                      42
Childrens                                                     122
Comedy                                                        505
Crime                                                         109
Documentary                                                    50
Drama                                                         725
Fantasy                                                        22
Film-Noir                                                      24
Horror                                                         92
Musical                                                        56
Mystery                                                        61
Romance                                                       247
Sci-Fi                                                        101
Thriller                                                      251
War                                                            71
Western                                                        27
dtype: object

Of all the genres, Drama had the most number of movies, followed by Comedy. The unknown category had one movie only.

5. Drop the movie where the genre is unknown - 2.5 marks

In [543]:
#Find row where unknown does not equal to 0.
print(item[(item['unknown']!= 0)])

#Copy original dataframe. Drop Row Index# 1371 from new dataframe 
item2 = item.copy()
item2 = item2.drop(item.index[1371])    #Delete row with index number 1371 from new dataframe
item2.iloc[1371]                        #Row is deleted from new original frame because the movie id is 1374 instead of 1373
      movie id    movie title release date  unknown  Action  Adventure  \
1371      1373  Good Morning    4-Feb-1971        1       0          0   

      Animation  Childrens  Comedy  Crime  ...  Fantasy  Film-Noir  Horror  \
1371          0          0       0      0  ...        0          0       0   

      Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western  
1371        0        0        0       0         0    0        0  

[1 rows x 22 columns]
Out[543]:
movie id                          1374
movie title     Falling in Love Again 
release date               01-Jan-1980
unknown                              0
Action                               0
Adventure                            0
Animation                            0
Childrens                            0
Comedy                               1
Crime                                0
Documentary                          0
Drama                                0
Fantasy                              0
Film-Noir                            0
Horror                               0
Musical                              0
Mystery                              0
Romance                              0
Sci-Fi                               0
Thriller                             0
War                                  0
Western                              0
Name: 1372, dtype: object

The movie that was under the unknown category was "Good Morning" (Movie ID: 1373, and index: 1371). Once the movie is dropped, the next movie that appears with index number 1371 has a movie id of 1374. The movie is "Falling in Love Again".

6. Find the movies that have more than one genre - 5 marks

hint: use sum on the axis = 1

Display movie name, number of genres for the movie in dataframe

and also print(total number of movies which have more than one genres)

In [546]:
#Create new column('genre_total')that adds up genre columns, which are the last 19 columns

item2['genre_total']= item2.iloc[:, -19:-1].sum(axis=1)
print(item2.head())                     #Check that we added all the genres properly

#List of movie names, number of genres
item3 = item2[['movie title', 'genre_total']]
print(item3)

#Subset dataframe and create new dataframe (item_multiple_genres) where genre_total > 1.
item_multiple_genres = item3[item3['genre_total'] > 1]             # Creating new dataframe based off of condition genre_total > 1
item_multiple_genres.head()
item_multiple_genres['genre_total'].value_counts().sum()           # Count the number of movies that have more than 1 genre and calculate the sum.
   movie id  movie title release date  unknown  Action  Adventure  Animation  \
0         1   Toy Story   01-Jan-1995        0       0          0          1   
1         2   GoldenEye   01-Jan-1995        0       1          1          0   
2         3  Four Rooms   01-Jan-1995        0       0          0          0   
3         4  Get Shorty   01-Jan-1995        0       1          0          0   
4         5     Copycat   01-Jan-1995        0       0          0          0   

   Childrens  Comedy  Crime  ...  Film-Noir  Horror  Musical  Mystery  \
0          1       1      0  ...          0       0        0        0   
1          0       0      0  ...          0       0        0        0   
2          0       0      0  ...          0       0        0        0   
3          0       1      0  ...          0       0        0        0   
4          0       0      1  ...          0       0        0        0   

   Romance  Sci-Fi  Thriller  War  Western  genre_total  
0        0       0         0    0        0            3  
1        0       0         1    0        0            3  
2        0       0         1    0        0            1  
3        0       0         0    0        0            3  
4        0       0         1    0        0            3  

[5 rows x 23 columns]
                              movie title  genre_total
0                              Toy Story             3
1                              GoldenEye             3
2                             Four Rooms             1
3                             Get Shorty             3
4                                Copycat             3
...                                   ...          ...
1676                          Mat' i syn             1
1677                           B. Monkey             2
1678                       Sliding Doors             2
1679                        You So Crazy             1
1680  Scream of Stone (Schrei aus Stein)             1

[1680 rows x 2 columns]
Out[546]:
849

There are in total 849 movies that have more than one genre.

*** Although there are duplicate titles, some are not the same movie because they are released in different years. There are also instances where 2 movies with the same title were released at the same time; these are called "Twin Movies" (Example: "Chaos", 1995). For these reasons, I did not delete any films with the same title.

7. Univariate plots of columns: 'rating', 'Age', 'release year', 'Gender' and 'Occupation' - 10 marks

HINT: Use distplot for age. Use lineplot or countplot for release year.

HINT: Plot percentages in y-axis and categories in x-axis for ratings, gender and occupation

HINT: Please refer to the below snippet to understand how to get to release year from release date. You can use str.split() as depicted below or you could convert it to pandas datetime format and extract year (.dt.year)

In [ ]:
a = 'My*cat*is*brown' 
print(a.split('*')[3])

#similarly, the release year needs to be taken out from release date

#also you can simply slice existing string to get the desired data, if we want to take out the colour of the cat

print(a[10:])
print(a[-5:])
In [547]:
#AGE
print('Age')                                    #AGE
sns.distplot(user['age']);
Age

Age is somewhat right-skewed, and it does have a bimodal distribution between the ages of 20-40.

In [548]:
#RELEASE YEAR
print('Release Year')                                                            
item['release_year_string'] = pd.Series(item['release date'], dtype="string")     #Change release date from Series object to string
print(item.head())                                                                 
print(item.info())                                                                #Check to see if feature is a string

def releaseyr(release_year_string):
    return release_year_string.split('-')[2]                                      #Because series object doesn't have split, you have to use apply()
item['release_year'] = item['release_year_string'].apply(releaseyr)               #Get year from function
print(item.head())
print(item.tail())

mplt.figure(figsize=(15,10))                                                       #Countplot for Release Year
chart = sns.countplot(item['release_year'])
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
Release Year
   movie id  movie title release date  unknown  Action  Adventure  Animation  \
0         1   Toy Story   01-Jan-1995        0       0          0          1   
1         2   GoldenEye   01-Jan-1995        0       1          1          0   
2         3  Four Rooms   01-Jan-1995        0       0          0          0   
3         4  Get Shorty   01-Jan-1995        0       1          0          0   
4         5     Copycat   01-Jan-1995        0       0          0          0   

   Childrens  Comedy  Crime  ...  Film-Noir  Horror  Musical  Mystery  \
0          1       1      0  ...          0       0        0        0   
1          0       0      0  ...          0       0        0        0   
2          0       0      0  ...          0       0        0        0   
3          0       1      0  ...          0       0        0        0   
4          0       0      1  ...          0       0        0        0   

   Romance  Sci-Fi  Thriller  War  Western  release_year_string  
0        0       0         0    0        0          01-Jan-1995  
1        0       0         1    0        0          01-Jan-1995  
2        0       0         1    0        0          01-Jan-1995  
3        0       0         0    0        0          01-Jan-1995  
4        0       0         1    0        0          01-Jan-1995  

[5 rows x 23 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1681 entries, 0 to 1680
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   movie id             1681 non-null   int64 
 1   movie title          1681 non-null   object
 2   release date         1681 non-null   object
 3   unknown              1681 non-null   int64 
 4   Action               1681 non-null   int64 
 5   Adventure            1681 non-null   int64 
 6   Animation            1681 non-null   int64 
 7   Childrens            1681 non-null   int64 
 8   Comedy               1681 non-null   int64 
 9   Crime                1681 non-null   int64 
 10  Documentary          1681 non-null   int64 
 11  Drama                1681 non-null   int64 
 12  Fantasy              1681 non-null   int64 
 13  Film-Noir            1681 non-null   int64 
 14  Horror               1681 non-null   int64 
 15  Musical              1681 non-null   int64 
 16  Mystery              1681 non-null   int64 
 17  Romance              1681 non-null   int64 
 18  Sci-Fi               1681 non-null   int64 
 19  Thriller             1681 non-null   int64 
 20  War                  1681 non-null   int64 
 21  Western              1681 non-null   int64 
 22  release_year_string  1681 non-null   string
dtypes: int64(20), object(2), string(1)
memory usage: 302.2+ KB
None
   movie id  movie title release date  unknown  Action  Adventure  Animation  \
0         1   Toy Story   01-Jan-1995        0       0          0          1   
1         2   GoldenEye   01-Jan-1995        0       1          1          0   
2         3  Four Rooms   01-Jan-1995        0       0          0          0   
3         4  Get Shorty   01-Jan-1995        0       1          0          0   
4         5     Copycat   01-Jan-1995        0       0          0          0   

   Childrens  Comedy  Crime  ...  Horror  Musical  Mystery  Romance  Sci-Fi  \
0          1       1      0  ...       0        0        0        0       0   
1          0       0      0  ...       0        0        0        0       0   
2          0       0      0  ...       0        0        0        0       0   
3          0       1      0  ...       0        0        0        0       0   
4          0       0      1  ...       0        0        0        0       0   

   Thriller  War  Western  release_year_string  release_year  
0         0    0        0          01-Jan-1995          1995  
1         1    0        0          01-Jan-1995          1995  
2         1    0        0          01-Jan-1995          1995  
3         0    0        0          01-Jan-1995          1995  
4         1    0        0          01-Jan-1995          1995  

[5 rows x 24 columns]
      movie id                          movie title release date  unknown  \
1676      1678                          Mat' i syn   06-Feb-1998        0   
1677      1679                           B. Monkey   06-Feb-1998        0   
1678      1680                       Sliding Doors   01-Jan-1998        0   
1679      1681                        You So Crazy   01-Jan-1994        0   
1680      1682  Scream of Stone (Schrei aus Stein)   08-Mar-1996        0   

      Action  Adventure  Animation  Childrens  Comedy  Crime  ...  Horror  \
1676       0          0          0          0       0      0  ...       0   
1677       0          0          0          0       0      0  ...       0   
1678       0          0          0          0       0      0  ...       0   
1679       0          0          0          0       1      0  ...       0   
1680       0          0          0          0       0      0  ...       0   

      Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western  \
1676        0        0        0       0         0    0        0   
1677        0        0        1       0         1    0        0   
1678        0        0        1       0         0    0        0   
1679        0        0        0       0         0    0        0   
1680        0        0        0       0         0    0        0   

      release_year_string  release_year  
1676          06-Feb-1998          1998  
1677          06-Feb-1998          1998  
1678          01-Jan-1998          1998  
1679          01-Jan-1994          1994  
1680          08-Mar-1996          1996  

[5 rows x 24 columns]
Out[548]:
[Text(0, 0, '1922'),
 Text(0, 0, '1926'),
 Text(0, 0, '1930'),
 Text(0, 0, '1931'),
 Text(0, 0, '1932'),
 Text(0, 0, '1933'),
 Text(0, 0, '1934'),
 Text(0, 0, '1935'),
 Text(0, 0, '1936'),
 Text(0, 0, '1937'),
 Text(0, 0, '1938'),
 Text(0, 0, '1939'),
 Text(0, 0, '1940'),
 Text(0, 0, '1941'),
 Text(0, 0, '1942'),
 Text(0, 0, '1943'),
 Text(0, 0, '1944'),
 Text(0, 0, '1945'),
 Text(0, 0, '1946'),
 Text(0, 0, '1947'),
 Text(0, 0, '1948'),
 Text(0, 0, '1949'),
 Text(0, 0, '1950'),
 Text(0, 0, '1951'),
 Text(0, 0, '1952'),
 Text(0, 0, '1953'),
 Text(0, 0, '1954'),
 Text(0, 0, '1955'),
 Text(0, 0, '1956'),
 Text(0, 0, '1957'),
 Text(0, 0, '1958'),
 Text(0, 0, '1959'),
 Text(0, 0, '1960'),
 Text(0, 0, '1961'),
 Text(0, 0, '1962'),
 Text(0, 0, '1963'),
 Text(0, 0, '1964'),
 Text(0, 0, '1965'),
 Text(0, 0, '1966'),
 Text(0, 0, '1967'),
 Text(0, 0, '1968'),
 Text(0, 0, '1969'),
 Text(0, 0, '1970'),
 Text(0, 0, '1971'),
 Text(0, 0, '1972'),
 Text(0, 0, '1973'),
 Text(0, 0, '1974'),
 Text(0, 0, '1975'),
 Text(0, 0, '1976'),
 Text(0, 0, '1977'),
 Text(0, 0, '1978'),
 Text(0, 0, '1979'),
 Text(0, 0, '1980'),
 Text(0, 0, '1981'),
 Text(0, 0, '1982'),
 Text(0, 0, '1983'),
 Text(0, 0, '1984'),
 Text(0, 0, '1985'),
 Text(0, 0, '1986'),
 Text(0, 0, '1987'),
 Text(0, 0, '1988'),
 Text(0, 0, '1989'),
 Text(0, 0, '1990'),
 Text(0, 0, '1991'),
 Text(0, 0, '1992'),
 Text(0, 0, '1993'),
 Text(0, 0, '1994'),
 Text(0, 0, '1995'),
 Text(0, 0, '1996'),
 Text(0, 0, '1997'),
 Text(0, 0, '1998')]

When we plot release year on its own, it is left skewed. 1996 released the most films.

In [549]:
#RATING
rating = pd.DataFrame(data['rating'].value_counts())
total_ratings = data['rating'].value_counts().sum()          #Find frequency of ratings and make new dataframe
rating.head(6)
rating['percent'] = rating['rating']/total_ratings*100       #Calculating percentage of ratings
rating['percent'] = round(rating['percent'], 2)
rating.head(6)
rating.reset_index(inplace=True)
rating.head(6)
rating.columns
rating.columns = ['Rating', 'Frequency', 'Percentage (%)']   #Naming table columns
print(rating.head())

## Making lollipop chart in Matplotlib to show percentage of ratings.
(markerline, stemlines, baseline) = mplt.stem(rating['Rating'],
                                             rating['Percentage (%)'])
mplt.setp(markerline, marker='.', markersize=15,
         markeredgewidth=2, color='red')
mplt.setp(stemlines, color='red')
mplt.setp(baseline, visible=False)

mplt.tick_params(labelsize=12)
mplt.xlabel('Rating', size=12)
mplt.ylabel('Percentage (%)', size=12)
mplt.xticks([1, 2, 3, 4, 5])
mplt.ylim(bottom=0)

mplt.show()
   Rating  Frequency  Percentage (%)
0       4      34174           34.17
1       3      27145           27.15
2       5      21201           21.20
3       2      11370           11.37
4       1       6110            6.11
<ipython-input-549-d3e587502f93>:15: UserWarning: In Matplotlib 3.3 individual lines on a stem plot will be added as a LineCollection instead of individual lines. This significantly improves the performance of a stem plot. To remove this warning and switch to the new behaviour, set the "use_line_collection" keyword argument to True.
  (markerline, stemlines, baseline) = mplt.stem(rating['Rating'],
In [550]:
#GENDER
gender = pd.DataFrame(user['gender'].value_counts())
total = user['gender'].value_counts().sum()
gender['percent'] = gender['gender']/total*100
gender['percent'] = round(gender['percent'], 2)
gender.head()

gender.reset_index(inplace=True)
gender.head(6)
gender.columns
gender.columns = ['Gender', 'Frequency', 'Percentage (%)']
print(gender.head())

(markerline, stemlines, baseline) = mplt.stem(gender['Gender'],
                                             gender['Percentage (%)'])
mplt.setp(markerline, marker='x', markersize=15,
         markeredgewidth=2, color='blue')
mplt.setp(stemlines, color='blue')
mplt.setp(baseline, visible=False)

mplt.tick_params(labelsize=12)
mplt.xlabel('Gender', size=12)
mplt.ylabel('Percentage (%)', size=12)
mplt.ylim(bottom=0)

mplt.show()
  Gender  Frequency  Percentage (%)
0      M        670           71.05
1      F        273           28.95
<ipython-input-550-cfd909ebe7ac>:14: UserWarning: In Matplotlib 3.3 individual lines on a stem plot will be added as a LineCollection instead of individual lines. This significantly improves the performance of a stem plot. To remove this warning and switch to the new behaviour, set the "use_line_collection" keyword argument to True.
  (markerline, stemlines, baseline) = mplt.stem(gender['Gender'],
In [551]:
##OCCUPATION

occupation = pd.DataFrame(user['occupation'].value_counts())
total_occup = user['occupation'].value_counts().sum()
occupation['percent'] = occupation['occupation']/total_occup*100
occupation['percent'] = round(occupation['percent'], 2)
occupation.head()

occupation.reset_index(inplace=True)
occupation.head(6)
occupation.columns
occupation.columns = ['Occupation', 'Frequency', 'Percentage (%)']
print(occupation)

(markerline, stemlines, baseline) = mplt.stem(occupation['Occupation'],
                                             occupation['Percentage (%)'])
mplt.setp(markerline, marker='*', markersize=15,
         markeredgewidth=2, color='green')
mplt.setp(stemlines, color='green')
mplt.setp(baseline, visible=False)

mplt.tick_params(labelsize=12)
mplt.xlabel('Occupation', size=12)
mplt.ylabel('Percentage (%)', size=12)
mplt.xticks(rotation=90)
mplt.ylim(bottom=0)

mplt.show()
       Occupation  Frequency  Percentage (%)
0         student        196           20.78
1           other        105           11.13
2        educator         95           10.07
3   administrator         79            8.38
4        engineer         67            7.10
5      programmer         66            7.00
6       librarian         51            5.41
7          writer         45            4.77
8       executive         32            3.39
9       scientist         31            3.29
10         artist         28            2.97
11     technician         27            2.86
12      marketing         26            2.76
13  entertainment         18            1.91
14     healthcare         16            1.70
15        retired         14            1.48
16         lawyer         12            1.27
17       salesman         12            1.27
18           none          9            0.95
19      homemaker          7            0.74
20         doctor          7            0.74
<ipython-input-551-856a44b1ee73>:15: UserWarning: In Matplotlib 3.3 individual lines on a stem plot will be added as a LineCollection instead of individual lines. This significantly improves the performance of a stem plot. To remove this warning and switch to the new behaviour, set the "use_line_collection" keyword argument to True.
  (markerline, stemlines, baseline) = mplt.stem(occupation['Occupation'],

8. Visualize how popularity of genres has changed over the years - 10 marks

Note that you need to use the percent of number of releases in a year as a parameter of popularity of a genre

Hint 1: You need to reach to a data frame where the release year is the index and the genre is the column names (one cell shows the number of release in a year in one genre) or vice versa. (Drop unnecessary column if there are any)

Hint 2: Find the total number of movies release in a year(use sum(axis=1) store that value in a new column as 'total'). Now divide the value of each genre in that year by total to get percentage number of release in a particular year. (df.div(df['total'], axis= 0) * 100)

Once that is achieved, you can either use univariate plots or can use the heatmap to visualise all the changes over the years in one go.

Hint 3: Use groupby on the relevant column and use sum() on the same to find out the number of releases in a year/genre.

In [552]:
item.head()
item_year_action = item.groupby(['release_year'])[['unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']].sum().reset_index()
item_year_action.tail()                                                           #Using groupby function, and sum()

item_year_action['total'] = item_year_action.sum(axis=1)                         #Sum the genre columns and place result in 'total' column
item_year_action.tail()
item_year_action2 = item_year_action.set_index('release_year')                   #Make release_year the index
item_year_action2.info
item_year_action2.dtypes
movie_release_per_year_percent = item_year_action2.iloc[:,1:19].div(item_year_action2.total, axis=0)*100      #Get percentage of each genre in each year
print(movie_release_per_year_percent)

fig, ax = mplt.subplots(figsize=(15,10))                                  #Setting dimensions of heatmap
sns.heatmap(movie_release_per_year_percent, cmap = "cividis");            #Change color of heatmap
                 Action  Adventure  Animation  Childrens     Comedy  \
release_year                                                          
1922           0.000000   0.000000   0.000000   0.000000   0.000000   
1926           0.000000   0.000000   0.000000   0.000000   0.000000   
1930           0.000000   0.000000   0.000000   0.000000   0.000000   
1931           0.000000   0.000000   0.000000   0.000000   0.000000   
1932           0.000000   0.000000   0.000000   0.000000   0.000000   
...                 ...        ...        ...        ...        ...   
1994           8.498584   3.682720   1.133144   4.249292  23.229462   
1995          10.498688   5.774278   1.574803   5.511811  16.535433   
1996           7.871199   4.293381   1.610018   3.756708  19.320215   
1997           9.274194   4.032258   0.604839   4.435484  17.540323   
1998          10.256410   2.564103   0.000000   0.854701  11.111111   

                  Crime  Documentary       Drama   Fantasy  Film-Noir  \
release_year                                                            
1922           0.000000     0.000000    0.000000  0.000000   0.000000   
1926           0.000000     0.000000  100.000000  0.000000   0.000000   
1930           0.000000     0.000000  100.000000  0.000000   0.000000   
1931          33.333333     0.000000    0.000000  0.000000  33.333333   
1932           0.000000     0.000000    0.000000  0.000000   0.000000   
...                 ...          ...         ...       ...        ...   
1994           2.266289     2.549575   27.478754  0.849858   0.000000   
1995           2.887139     1.312336   23.359580  0.787402   0.262467   
1996           3.756708     3.220036   30.411449  1.073345   0.178891   
1997           6.048387     1.209677   22.782258  0.806452   0.403226   
1998           5.982906     2.564103   28.205128  0.854701   1.709402   

                  Horror   Musical   Mystery    Romance    Sci-Fi   Thriller  \
release_year                                                                   
1922          100.000000  0.000000  0.000000   0.000000  0.000000   0.000000   
1926            0.000000  0.000000  0.000000   0.000000  0.000000   0.000000   
1930            0.000000  0.000000  0.000000   0.000000  0.000000   0.000000   
1931            0.000000  0.000000  0.000000   0.000000  0.000000  33.333333   
1932            0.000000  0.000000  0.000000  50.000000  0.000000   0.000000   
...                  ...       ...       ...        ...       ...        ...   
1994            2.266289  0.566572  0.566572   9.915014  1.983003   7.082153   
1995            3.674541  0.787402  1.312336   9.711286  3.937008  10.236220   
1996            1.967800  1.610018  1.073345   6.797853  2.683363   8.407871   
1997            1.612903  1.008065  3.629032  10.282258  2.620968  10.887097   
1998            3.418803  0.854701  2.564103   9.401709  4.273504  15.384615   

                    War   Western  
release_year                       
1922           0.000000  0.000000  
1926           0.000000  0.000000  
1930           0.000000  0.000000  
1931           0.000000  0.000000  
1932          50.000000  0.000000  
...                 ...       ...  
1994           1.983003  1.699717  
1995           1.312336  0.524934  
1996           1.610018  0.357782  
1997           2.822581  0.000000  
1998           0.000000  0.000000  

[71 rows x 18 columns]

9. Find the top 25 movies according to average ratings such that each movie has number of ratings more than 100 - 10 marks

Hints :

  1. Find the count of ratings and average ratings for every movie.
  2. Slice the movies which have ratings more than 100.
  3. Sort values according to average rating such that movie which highest rating is on top.
  4. Select top 25 movies.
  5. You will have to use the .merge() function to get the movie titles.

Note: This question will need you to research about groupby and apply your findings. You can find more on groupby on https://realpython.com/pandas-groupby/.

In [553]:
data.head()
item.head()
movie_rating = pd.merge(item,data,how='left',on='movie id')   #Left merge to retain the movie list from the item dataset
movie_rating.head()
movie_rating2 = movie_rating.groupby(['movie title'])[['rating']].mean().reset_index()  #Average Ratings
movie_rating2.tail()
movie_rating.head()
movie_rating3 = movie_rating.groupby(['movie title'])[['rating']].sum().reset_index()   #Count of Ratings
movie_rating3.tail()
movie_rating_over_100 = movie_rating3[(movie_rating3['rating'] > 100)]               #Subsetting movies with more than 100 ratings
movie_rating_over_100.tail()
movie_rating_over_100_avg_rating = pd.merge(movie_rating_over_100, movie_rating2, how = 'left', on='movie title')     # Making frequency table to show frequency and average rating
movie_rating_over_100_avg_rating.head()
movie_rating_over_100_avg_rating.columns = ['movie title', 'Rating Count', 'Rating Average']   #Change the column titles.

print(movie_rating_over_100_avg_rating.sort_values(['Rating Average'],                   #Display table
                    ascending=False).head(25))
movie_rating_over_100_avg_rating.sort_values(['Rating Average'],                         #Creating bar chart of top 25 films with oveer 100 ratings
                    ascending=False).head(25).plot(x='movie title',y='Rating Average',kind='bar', figsize=(15,5))

mplt.show()
                                           movie title  Rating Count  \
152                                    Close Shave, A            503   
590                                  Schindler's List           1331   
758                               Wrong Trousers, The            527   
127                                        Casablanca           1083   
736   Wallace & Gromit: The Best of Aardman Animation            298   
610                         Shawshank Redemption, The           1258   
555                                       Rear Window            917   
726                               Usual Suspects, The           1171   
654                                         Star Wars           2541   
1                                        12 Angry Men            543   
684                                    Third Man, The            312   
142                                      Citizen Kane            850   
629                  Some Folks Call It a Sling Blade            176   
700                             To Kill a Mockingbird            940   
502                   One Flew Over the Cuckoo's Nest           1133   
616                         Silence of the Lambs, The           1673   
490                                North by Northwest            767   
292                                    Godfather, The           1769   
597                                    Secrets & Lies            691   
296                                 Good Will Hunting            844   
420                         Manchurian Candidate, The            558   
210  Dr. Strangelove or: How I Learned to Stop Worr...           825   
547                           Raiders of the Lost Ark           1786   
728                                           Vertigo            761   
696                                           Titanic           1486   

     Rating Average  
152        4.491071  
590        4.466443  
758        4.466102  
127        4.456790  
736        4.447761  
610        4.445230  
555        4.387560  
726        4.385768  
654        4.358491  
1          4.344000  
684        4.333333  
142        4.292929  
629        4.292683  
700        4.292237  
502        4.291667  
616        4.289744  
490        4.284916  
292        4.283293  
597        4.265432  
296        4.262626  
420        4.259542  
210        4.252577  
547        4.252381  
728        4.251397  
696        4.245714  

"A Close Shave" had the highest average rating (4.49) of movies with more than 100 rating entries. The top 25 movies with over 100 ratings have an average rating between 4.25-4.49. I merged the original datasets with no changes from previous questions in order to get the actual top 25 films with over 100 rating entries.

10. See gender distribution across different genres check for the validity of the below statements - 10 marks

  • Men watch more drama than women
  • Women watch more Sci-Fi than men
  • Men watch more Romance than women

compare the percentages

  1. Merge all the datasets

  2. There is no need to conduct statistical tests around this. Just compare the percentages and comment on the validity of the above statements.

  3. you might want ot use the .sum(), .div() function here.

  4. Use number of ratings to validate the numbers. For example, if out of 4000 ratings received by women, 3000 are for drama, we will assume that 75% of the women watch drama.

In [554]:
#Merge all datasets
movie = pd.merge(item, data, how = 'outer', on = 'movie id') #Merge item and data datasets
movie_user = pd.merge(movie, user, how = 'outer', on = 'user id') #Merge previous merged dataset with user dataset
movie_user.head()


#Verifying if men watch more drama than women.

gender_drama = movie_user.groupby(by=['gender'])['Drama'].sum().reset_index().sort_values(['Drama']).tail(10)
gender_total_drama = gender_drama['Drama'].sum()
gender_drama['Percent'] = round(gender_drama['Drama']/gender_total_drama * 100, 2)
print('Question 2 : Percentage of males and females who watch drama.')
print(gender_drama)

#Number of Rankings by Gender
rating_no_gender = movie_user.groupby(by=['gender'])['rating'].sum().reset_index().sort_values(['rating']).tail(10)
print('Number of rankings by Gender')
print(rating_no_gender)

#Checking to see the percentage of drama viewers amongst male and female raters
gender_drama['Percent_rankings'] = round(gender_drama['Drama']/rating_no_gender['rating'] * 100, 2)     #Getting the percentage of each gender who watch drama from all the ratings from that gender
print('Question 4: Percentage of drama viewers amongs male and female rating entries')
print(gender_drama)
Question 2 : Percentage of males and females who watch drama.
  gender    Drama  Percent
0      F  11008.0    27.59
1      M  28887.0    72.41
Number of rankings by Gender
  gender  rating
0      F   90901
1      M  262085
Question 4: Percentage of drama viewers amongs male and female rating entries
  gender    Drama  Percent  Percent_rankings
0      F  11008.0    27.59             12.11
1      M  28887.0    72.41             11.02

Question 2: Among people who watched drama films, more men (N=28887, 72.41%) watched drama than women (N=11008, 27.59%) did.

Question 4: However, among female raters, 12.11% watched drama. Among male raters, only 11.02% view drama films.

In [555]:
#Verifying if women watch more sci-fi than men


gender_scifi = movie_user.groupby(by=['gender'])['Sci-Fi'].sum().reset_index().sort_values(['Sci-Fi']).tail(10)
gender_total_scifi = gender_scifi['Sci-Fi'].sum()
gender_scifi['Percent'] = round(gender_scifi['Sci-Fi']/gender_total_scifi * 100, 2)
print('Question 2 : Percentage of males and females who watch sci-fi.')
print(gender_scifi)


#Checking to see the percentage of sci-fi viewers amongst male and female raters
gender_scifi['Percent_rankings'] = round(gender_scifi['Sci-Fi']/rating_no_gender['rating'] * 100, 2)     #Getting the percentage of each gender who watch sci-fi from all the ratings from that gender
print('Question 4: Percentage of sci-fi viewers amongs male and female rating entries')
print(gender_scifi)
Question 2 : Percentage of males and females who watch sci-fi.
  gender   Sci-Fi  Percent
0      F   2629.0    20.65
1      M  10101.0    79.35
Question 4: Percentage of sci-fi viewers amongs male and female rating entries
  gender   Sci-Fi  Percent  Percent_rankings
0      F   2629.0    20.65              2.89
1      M  10101.0    79.35              3.85

Question 2: According to these merged datasets, more men (N=10101, 79.35%) watched sci-fi compared to women (N= 2629, 20.65%).

Question 4: However, among female raters, 2.89% watched sci-fi. Among male raters, only 3.85% viewed sci-fi films.

In [ ]:
 
In [556]:
#Verifying if men watch more romance than women

gender_romance = movie_user.groupby(by=['gender'])['Romance'].sum().reset_index().sort_values(['Romance']).tail(10)
gender_total_romance = gender_romance['Romance'].sum()
gender_romance['Percent'] = round(gender_romance['Romance']/gender_total_romance * 100, 2)
print('Question 2 : Percentage of males and females who watch romance.')
print(gender_romance)


#Checking to see the percentage of romance viewers amongst male and female raters
gender_romance['Percent_rankings'] = round(gender_romance['Romance']/rating_no_gender['rating'] * 100, 2)     #Getting the percentage of each gender who watch romance from all the ratings from that gender
print('Question 4: Percentage of romance viewers amongs male and female rating entries')
print(gender_romance)
Question 2 : Percentage of males and females who watch romance.
  gender  Romance  Percent
0      F   5858.0     30.1
1      M  13603.0     69.9
Question 4: Percentage of romance viewers amongs male and female rating entries
  gender  Romance  Percent  Percent_rankings
0      F   5858.0     30.1              6.44
1      M  13603.0     69.9              5.19

Question 2: Amongst people who watched romance films, there were more males (N= 13603, 69.9%) compared to females (N= 5858, 30.1%)

Question 4: Amongst all male raters who participated in the study, 5.19% watched romance movies. Amongst all females who participated in the study, 6.44% watched romance films.

In [ ]: