It’s actually very easy to build a simple recommendation system in Python. I’ll show you how using a movie dataset of user ratings. In this case we’ll compare two movies, Star Wars and Liar Liar, against all the others to recommend what else a user who enjoyed either of them might like.

```
#Initial Imports
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
```

### Data Imports

```
#Set column names
column_names = ['user_id','item_id','rating','timestamp']
#Read in data
df = pd.read_csv('u.data',sep='\t',names=column_names)
```

```
df.head()
```

```
#Read in data
movie_titles = pd.read_csv('Movie_Id_Titles')
```

```
movie_titles.head()
```

```
#Merge these two datasets on the item id
df = pd.merge(df,movie_titles,on='item_id')
```

```
df.head()
```

### Data Manipulation

First we want to create a ratings dataframe to hold the average rating for every movie title.

Then we want to add the number of users that rated each movie onto that dataframe. This matters because a movie rated by only one or two users is an outlier that will skew the results.

```
#Create base ratings dataframe with avg. rating by title
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
```

```
ratings.head()
```

```
#Tack on a new column called 'num of ratings' to hold the num of users who rated that movie
ratings['num of ratings'] = pd.DataFrame(df.groupby('title')['rating'].count())
```

```
ratings.head()
```

### Light Visual Exploration

We can visualize the number of ratings and the ratings themselves to help spot outliers. Here we can see that a small number of movies account for most of the ratings (one could surmise that a handful of popular movies are far more likely to be rated). We can also see a disproportionate number of 1.0 ratings (one could surmise that highly unpopular movies tend to be rated 1.0, or that these are simply movies few people have heard of).

```
sns.set_style('whitegrid')
ratings['num of ratings'].hist(bins=70)
```

```
ratings['rating'].hist(bins=70)
```

Notice in a jointplot that a higher number of ratings tends to go hand in hand with higher ratings - again, this could mean that more popular movies are likely to receive both higher ratings and a larger volume of them.

```
sns.jointplot(x='rating',y='num of ratings',data=ratings,alpha=0.5)
```

### Data Manipulation Pt. 2

Now that we have a dataframe to hold the number of ratings per movie, we'll want to pivot the data to more easily obtain individual ratings for specific movies. In this example we're going to test out the correlation of all movies against Star Wars and Liar Liar.

```
#Create pivoted dataframe to show user ratings by movie titles
moviemat = df.pivot_table(index='user_id',columns='title',values='rating')
```

```
moviemat.head()
```

```
#Grab the ratings for Star Wars and Liar Liar individually
starwars_user_ratings = moviemat['Star Wars (1977)']
liarliar_user_ratings = moviemat['Liar Liar (1997)']
```

```
starwars_user_ratings.head()
```

### Correlate - Star Wars

We can now use the corrwith method on the pivoted dataframe (moviemat) to get the correlation of every movie's ratings with the Star Wars ratings (1.0 meaning perfectly correlated).
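Before running it on the full matrix, here's a toy sketch (entirely hypothetical data) of what corrwith does: it computes the Pearson correlation between one column and every other column, pairing up only the users who rated both movies.

```
import pandas as pd

#Toy user-by-movie ratings (hypothetical)
toy = pd.DataFrame({
    'Movie A': [5, 4, 1, 2],
    'Movie B': [5, 5, 1, 1],   #moves with Movie A -> correlation near +1
    'Movie C': [1, 2, 5, 4],   #moves against Movie A -> correlation of -1
}, index=['u1', 'u2', 'u3', 'u4'])

#Correlate every column against Movie A's ratings
print(toy.corrwith(toy['Movie A']))
```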

```
#Find all movies similar to Star Wars
similar_to_starwars = moviemat.corrwith(starwars_user_ratings)
```

```
#Build correlation dataframe for Star Wars
corr_starwars = pd.DataFrame(similar_to_starwars, columns=['Correlation'])
#Drop out NaN
corr_starwars.dropna(inplace=True)
```

```
corr_starwars.head()
```

```
corr_starwars.sort_values('Correlation',ascending=False).head(10)
```

So what is the problem here? We're receiving perfect correlations for a large number of movies. Most likely, these movies have only a handful of raters, and those few users happened to rate Star Wars as well - making them outliers due to the low number of users who rated them. We can verify this by pulling the number of ratings for these top 10.
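To see why low-count titles float to the top, here's a hypothetical sketch: when a movie shares only two raters with Star Wars, the Pearson correlation runs through just two points and is forced to exactly +1 or -1.

```
import numpy as np
import pandas as pd

#Hypothetical: only users u1 and u2 rated the obscure title
toy = pd.DataFrame({
    'Star Wars (1977)': [5.0, 3.0, 4.0, 5.0],
    'Obscure Movie':    [2.0, 1.0, np.nan, np.nan],
}, index=['u1', 'u2', 'u3', 'u4'])

#Two shared raters always yield a "perfect" correlation
print(toy.corrwith(toy['Star Wars (1977)']))
```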

```
#Pull number of ratings for these titles
tempDF = corr_starwars.sort_values('Correlation',ascending=False).head(10)
pd.merge(ratings,tempDF,on='title')
```

```
#Pull mean of num of ratings
ratings['num of ratings'].mean()
```

Given that the mean number of ratings is ~60, we can reasonably conclude that these are just outliers that can be removed to improve accuracy. To complete our correlation, we can bring the num of ratings for all movies into our dataframe and filter them out.

```
#Join in the ratings for all movies
corr_starwars = corr_starwars.join(ratings['num of ratings'])
```

```
corr_starwars.head()
```

```
#Filter out anything that falls below the mean of ~60 ratings, then sort by correlation
corr_starwars[corr_starwars['num of ratings'] > 60].sort_values('Correlation',ascending=False)
```

This data makes a lot of sense, as the Star Wars sequels correlate most highly.

### Correlate - Liar Liar

Now we can repeat the same process for the next movie...

```
#Find all movies similar to Liar Liar
similar_to_liarliar = moviemat.corrwith(liarliar_user_ratings)
```

```
#Build correlation dataframe
corr_liarliar = pd.DataFrame(similar_to_liarliar, columns=['Correlation'])
#Drop NaN
corr_liarliar.dropna(inplace=True)
```

```
#Join in the num of ratings
corr_liarliar = corr_liarliar.join(ratings['num of ratings'])
```

```
#Show final correlation, this time with a stricter cutoff of 100 ratings
corr_liarliar[corr_liarliar['num of ratings'] > 100].sort_values('Correlation',ascending=False)
```

Given the final results here, I'd have to say this also makes a lot of sense: the movies shown were all quite popular and from roughly the same general timeframe.
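As a wrap-up, the whole pipeline can be folded into one reusable helper. This is just a sketch: the function name and its min_ratings parameter are my own invention, and the tiny stand-in data below replaces the real moviemat and ratings built earlier so the snippet runs on its own.

```
import pandas as pd

def similar_movies(title, moviemat, ratings, min_ratings=60):
    """Return titles correlated with `title`, filtered by rating count."""
    corr = moviemat.corrwith(moviemat[title])
    corr = pd.DataFrame(corr, columns=['Correlation']).dropna()
    corr = corr.join(ratings['num of ratings'])
    return corr[corr['num of ratings'] > min_ratings].sort_values(
        'Correlation', ascending=False)

#Tiny stand-in data (hypothetical ratings: 3 users x 3 movies)
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'title':   ['A', 'B', 'C'] * 3,
    'rating':  [5, 5, 1, 3, 3, 3, 1, 2, 5],
})
moviemat = df.pivot_table(index='user_id', columns='title', values='rating')
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings['num of ratings'] = df.groupby('title')['rating'].count()

print(similar_movies('A', moviemat, ratings, min_ratings=2))
```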