Content-based Filtering in a Nutshell#
In this section, we will go through a straightforward way to generate candidates for recommendations. As we mentioned before, one of these methods is content-based filtering. We will explain the method with an example and then discuss a particular library that implements it. Before that, we need to define and understand embeddings. As you might have noticed, we have mentioned “similar items”, “similar users”, etc. a lot, and the question arises: how do we define that similarity? The calculation itself is pretty straightforward: we compute the cosine similarity between two arrays. The intriguing part is how we get these arrays from our data.
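To make that concrete, here is a minimal sketch of the similarity computation, assuming we already have two embedding vectors (the numbers below are made up purely for illustration):
# cosine similarity between two hypothetical item embeddings
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine of the angle between the vectors: 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

item_a = np.array([0.1, 0.9, 0.4])  # made-up embedding of item A
item_b = np.array([0.2, 0.8, 0.5])  # made-up embedding of item B
print(cosine_similarity(item_a, item_b))  # close to 1.0 -> the items look similar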
Embeddings Explained#
The evolution of text processing started with one-hot encoding. Given text data, Data Scientists would preprocess it (lowercasing, removing symbols, etc.), create one-hot representations of words or character n-grams (splitting words or text into chunks of 2, 3, …, n characters), and finally fit some ML model on top of them. Despite the simplicity and interpretability of this approach, human language is sophisticated: the same word can carry different meanings depending on context, so this technique fails in many cases.
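As a quick illustration of why one-hot representations struggle with meaning, here is a toy sketch (the vocabulary is made up): every word becomes its own dimension, so any two distinct words are orthogonal and their similarity is zero regardless of meaning.
# toy one-hot encoding: each word gets its own axis
import numpy as np

vocab = ['movie', 'good', 'great', 'boring']
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

print(one_hot['good'])   # [0. 1. 0. 0.]
print(one_hot['great'])  # [0. 0. 1. 0.]
print(np.dot(one_hot['good'], one_hot['great']))  # 0.0 -- "good" and "great" look unrelated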
Therefore, embeddings became the next stage in the text-processing pipeline. An embedding is a type of word representation that allows words with similar meanings to have similar representations. Unlike methods such as one-hot encoding, word embeddings represent words in a more meaningful way by mapping them to vectors of real numbers in a continuous vector space. The idea behind word embeddings is to use a neural network to learn relationships between words in a dataset: the network is trained to assign a numeric vector of fixed length to each word, with the goal of finding vectors that accurately represent the meaning of each word in the context of that dataset. As a result, words that appear in similar contexts end up with similar vector representations.
For example, imagine a dataset of movie reviews. Let’s say that the neural network has been trained to assign a vector to each word in the dataset. If the word “amazing” is used in a movie review, then the vector assigned to “amazing” will be similar to the vector assigned to “incredible”. This is because the meanings of these two words are similar and they are often used in similar contexts. Word embeddings can also be used to identify relationships between words. For example, consider the words “man” and “woman”. If the neural network assigned similar vectors to these two words, this would indicate that the two words are related. In addition to identifying relationships between words, word embeddings can also be used to classify documents. For example, if a document contains the words “amazing” and “incredible”, then the neural network can assign an appropriate vector to each of these words. If a second document contains similar words, then the neural network can assign similar vectors to these words. This allows the neural network to accurately classify the documents as being similar.
Finally, word embeddings can be used for data visualization. By plotting the vectors assigned to words in a two-dimensional space, it is possible to see how words are related. This can be a useful tool for understanding the relationships between words in a given dataset. In summary, word embeddings are a powerful tool for representing words in a meaningful way. They can be used to identify relationships between words, classify documents, and visualize data.
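To see what this looks like in practice, here is a toy sketch with gensim’s Word2Vec; the corpus is tiny and made up, so the learned vectors will be poor, but it shows the shape of the API we will rely on later.
from gensim.models import Word2Vec

# tiny made-up corpus: each document is a list of tokens
corpus = [
    ['this', 'movie', 'was', 'amazing'],
    ['an', 'incredible', 'amazing', 'film'],
    ['boring', 'and', 'predictable', 'movie'],
]

# train a small Word2Vec model; the parameter values are purely illustrative
w2v = Word2Vec(sentences = corpus, vector_size = 16, min_count = 1, epochs = 50, seed = 42)

print(w2v.wv['amazing'])                          # the learned vector for one word
print(w2v.wv.most_similar('amazing', topn = 3))   # words closest to it in the embedding space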
Now, let’s consider content-based filtering and use a simple Word2Vec/Doc2Vec model to get such recommendations.
Content-based Filtering#
Content-based filtering can be used in a variety of applications, from recommending films and music to suggesting restaurants and travel destinations. In this part, we’ll discuss how content-based filtering works and provide some examples.
Content-based filtering is a type of recommender system that recommends items to users based on their past preferences and behaviors. It works by analyzing a user’s preferences, in terms of attributes such as genre, director, actor, or even a combination of these, and then recommending other items that have similar attributes. For example, if a user has previously watched romantic comedies with Julia Roberts, content-based filtering would recommend other romantic comedies with Julia Roberts, or other films featuring similar actors or directors.
Content-based filtering is based on the assumption that users who liked one item will likely like similar items. To generate recommendations, the system first identifies the attributes of the items that the user has previously interacted with. It then finds other items with similar attributes and recommends them to the user. For example, if a user has previously listened to Taylor Swift songs, the system will identify other Taylor Swift songs as well as songs with similar attributes, such as a similar genre or artist. In industry, this type of recommendation is shown with “Similar to …”. It is an additional nudge to increase a user’s interest, since recommendations with explanations feel more personalized from the user’s point of view.
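A minimal sketch of that logic, assuming items are described by hand-crafted attribute vectors (the genre flags below are hypothetical):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical items described by [romance, comedy, action] genre flags
titles = ['notting hill', 'pretty woman', 'die hard']
attributes = np.array([
    [1, 1, 0],  # notting hill
    [1, 1, 0],  # pretty woman
    [0, 0, 1],  # die hard
])

# the user liked 'notting hill' (row 0) -> score every item against it by attribute similarity
scores = cosine_similarity(attributes[:1], attributes)[0]
ranked = sorted(zip(titles, scores), key = lambda pair: pair[1], reverse = True)
print(ranked)  # 'pretty woman' scores 1.0, 'die hard' scores 0.0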
In conclusion, content-based filtering is a type of recommender system that recommends items to users based on their past preferences and behaviors. Next, we jump to the coding part and create a simple Word2Vec/Doc2Vec model via the gensim library.
A well-explained walkthrough of the Word2Vec model’s logic can be found here; we will not discuss the implementation details in this section.
gensim: example of content-based recommendations based on Doc2Vec approach#
Now, we move on to the implementation of a content-based recommender using the gensim library and Doc2Vec. Doc2Vec is almost the same as Word2Vec with slight modifications, but the idea remains the same.
0. Configuration#
# links to shared data MovieLens
# source on kaggle: https://www.kaggle.com/code/quangnhatbui/movie-recommender/data
MOVIES_METADATA_URL = 'https://drive.google.com/file/d/19g6-apYbZb5D-wRj4L7aYKhxS-fDM4Fb/view?usp=share_link'
1. Modules and functions#
# just to make it available to download w/o SSL verification
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import re
import nltk
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook
from ast import literal_eval
from pymystem3 import Mystem
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import warnings
warnings.filterwarnings('ignore')
# download stop words beforehand
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /home/runner/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
True
1.1. Helper functions to avoid copypaste#
def read_csv_from_gdrive(url):
    """
    gets csv data from a given url (taken from file -> share -> copy link)
    :url: example https://drive.google.com/file/d/1BlZfCLLs5A13tbNSJZ1GPkHLWQOnPlE4/view?usp=share_link
    """
    file_id = url.split('/')[-2]
    file_path = 'https://drive.google.com/uc?export=download&id=' + file_id
    data = pd.read_csv(file_path)

    return data
# init lemmatizer to avoid slow performance
mystem = Mystem()
def word_tokenize_clean(doc: str, stop_words: list):
    '''
    tokenize from string to list of words
    '''
    # split into lower case word tokens with lemmatization
    tokens = list(set(mystem.lemmatize(doc.lower())))

    # remove tokens that are not alphabetic (this also drops punctuation) or that are stop words
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words and word not in punctuation]

    return tokens
Installing mystem to /home/runner/.local/bin/mystem from http://download.cdn.yandex.net/mystem/mystem-3.1-linux-64bit.tar.gz
2. Main#
2.1. Data Preparation#
# read csv information about films etc
movies_metadata = read_csv_from_gdrive(MOVIES_METADATA_URL)
movies_metadata.dtypes
adult object
belongs_to_collection object
budget object
genres object
homepage object
id object
imdb_id object
original_language object
original_title object
overview object
popularity object
poster_path object
production_companies object
production_countries object
release_date object
revenue float64
runtime float64
spoken_languages object
status object
tagline object
title object
video object
vote_average float64
vote_count float64
dtype: object
To get accurate results we need to preprocess the text a bit. The pipeline will be as follows:

- Filter only the necessary columns from movies_metadata: id, original_title, overview;
- Define model_index for the model to match back with the id column;
- Text cleaning: remove stopwords & punctuation, lemmatize for further tokenization, and create the tagged documents required for gensim.Doc2Vec.
# filter cols
sample = movies_metadata[['id', 'original_title', 'overview']].copy()
sample.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 45466 non-null object
1 original_title 45466 non-null object
2 overview 44512 non-null object
dtypes: object(3)
memory usage: 1.0+ MB
# as you see from above, we have missing overview in some cases -- let's fill it with the original title
sample.loc[sample['overview'].isnull(), 'overview'] = sample.loc[sample['overview'].isnull(), 'original_title']
sample.isnull().sum()
id 0
original_title 0
overview 0
dtype: int64
# define model_index and make it as string
sample = sample.reset_index().rename(columns = {'index': 'model_index'})
sample['model_index'] = sample['model_index'].astype(str)
# create mapper with title and model_index to use it further in evaluation
movies_inv_mapper = dict(zip(sample['original_title'].str.lower(), sample['model_index'].astype(int)))
# preprocess by removing non-character data, stopwords
tags_corpus = sample['overview'].values
tags_corpus = [re.sub(r'[-!/()0-9]', '', x) for x in tags_corpus]
stop_words = stopwords.words('english')
tags_doc = [word_tokenize_clean(description, stop_words) for description in tags_corpus]
tags_corpus[:1]
["Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."]
# prepare data as model input for Doc2Vec
## it takes some time to execute
tags_doc = [TaggedDocument(words = word_tokenize_clean(D, stop_words), tags = [str(i)]) for i, D in enumerate(tags_corpus)]
# let's check what do we have
## tag = movie index
tags_doc[1]
TaggedDocument(words=['freedom', 'evil', 'invite', 'world', 'siblings', 'running', 'trapped', 'enchanted', 'judy', 'peter', 'inside', 'hope', 'giant', 'living', 'proves', 'finish', 'find', 'rhinoceroses', 'board', 'risky', 'magical', 'terrifying', 'adult', 'unwittingly', 'creatures', 'opens', 'alan', 'years', 'monkeys', 'game', 'three', 'door', 'discover', 'room'], tags=['1'])
2.2. Model Training and Evaluation#
First, let’s define some parameters for the Doc2Vec model.
VEC_SIZE = 50 # length of the vector for each movie
ALPHA = .02 # initial learning rate
MIN_ALPHA = .00025 # learning rate drops linearly to this value during training
MIN_COUNT = 5 # min occurrence of a word in the dictionary
EPOCHS = 20 # number of training epochs
# initialize the model
model = Doc2Vec(vector_size = VEC_SIZE,
                alpha = ALPHA,
                min_alpha = MIN_ALPHA,
                min_count = MIN_COUNT,
                dm = 0) # dm = 0 corresponds to the PV-DBOW training algorithm
# generate vocab from all tag docs
model.build_vocab(tags_doc)
# train model
model.train(tags_doc,
total_examples = model.corpus_count,
epochs = EPOCHS)
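As an optional aside (not part of the original pipeline), the trained model can be persisted and reloaded with gensim’s standard save/load methods; the file name below is arbitrary.
# optionally persist the trained model for later reuse
model.save('doc2vec_movies.model')
model = Doc2Vec.load('doc2vec_movies.model')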
Now, let’s sanity-check the model by picking a film ourselves. Assume that we watched the movie batman and, based on that, we want to generate recommendations similar to its description. To do that we need to:

- Extract the movie id from movies_inv_mapper, which we created earlier to map between titles and model indices;
- Load the embeddings from the trained model;
- Use the built-in most_similar() method to get the most relevant recommendations based on the film’s embedding;
- Finally, map the title names back for a sense check.
# get id
movie_id = movies_inv_mapper['batman']
movie_id
8603
# load trained embeddings
movies_vectors = model.dv.vectors
movie_embeddings = movies_vectors[movie_id]
# get recommendations
similars = model.dv.most_similar(positive = [movie_embeddings], topn = 20)
output = pd.DataFrame(similars, columns = ['model_index', 'model_score'])
output.head()
|   | model_index | model_score |
|---|---|---|
| 0 | 8603 | 1.000000 |
| 1 | 5713 | 0.954307 |
| 2 | 7772 | 0.949543 |
| 3 | 29872 | 0.948109 |
| 4 | 43009 | 0.948001 |
# reverse values and indices to map names in dataframe
name_mapper = {v: k for k, v in movies_inv_mapper.items()}
output['title_name'] = output['model_index'].astype(int).map(name_mapper)
output
|   | model_index | model_score | title_name |
|---|---|---|---|
| 0 | 8603 | 1.000000 | batman |
| 1 | 5713 | 0.954307 | rollover |
| 2 | 7772 | 0.949543 | this island earth |
| 3 | 29872 | 0.948109 | angels die hard |
| 4 | 43009 | 0.948001 | ultimate avengers 2 |
| 5 | 28001 | 0.947121 | reach me |
| 6 | 13835 | 0.946621 | k2 |
| 7 | 44366 | 0.944187 | abraxas, guardian of the universe |
| 8 | 24433 | 0.942500 | the creeping terror |
| 9 | 42040 | 0.942463 | equalizer 2000 |
| 10 | 43461 | 0.942375 | megafault |
| 11 | 33298 | 0.941947 | necessary evil |
| 12 | 44339 | 0.941848 | the underground world |
| 13 | 18468 | 0.941571 | the incredible petrified world |
| 14 | 15627 | 0.941512 | crossworlds |
| 15 | 18294 | 0.941455 | the darkest hour |
| 16 | 26511 | 0.941257 | the 7 adventures of sinbad |
| 17 | 43165 | 0.940965 | the zookeeper's wife |
| 18 | 14178 | 0.940437 | battle for terra |
| 19 | 11256 | 0.940303 | sun faa sau si |
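To make such checks repeatable, the steps above can be wrapped into a small helper. This is only a sketch that reuses the objects already defined in this notebook (model, movies_inv_mapper, name_mapper); it additionally drops the query film itself from the output.
def recommend_by_title(title: str, topn: int = 10) -> pd.DataFrame:
    # look up the internal id, take its embedding and find the closest films
    movie_id = movies_inv_mapper[title.lower()]
    embedding = model.dv.vectors[movie_id]
    similars = model.dv.most_similar(positive = [embedding], topn = topn + 1)  # +1 to account for the film itself
    output = pd.DataFrame(similars, columns = ['model_index', 'model_score'])
    output['title_name'] = output['model_index'].astype(int).map(name_mapper)
    # drop the query film and keep the top-n recommendations
    return output[output['model_index'].astype(int) != movie_id].head(topn)

recommend_by_title('batman', topn = 5)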