Content-based Filtering in a Nutshell#
In this section, we will go through a straightforward way to generate candidates for recommendations. As we mentioned before, one of these methods is content-based filtering. We will explain the method with an example and then discuss a particular library that implements it. Before that, we need to define and understand embeddings. As you might have noticed, we have mentioned “similar items”, “similar users”, etc. a lot, and the question arises: how do we define that similarity? The calculation itself is pretty straightforward: we compute the cosine similarity between two arrays. The intriguing part is how we get these arrays from our data.
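To make that concrete, here is a minimal sketch of the similarity computation, assuming we already have two embedding vectors (the numbers below are made up purely for illustration):
# cosine similarity between two hypothetical item embeddings
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine of the angle between the vectors: 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

item_a = np.array([0.1, 0.9, 0.4])  # made-up embedding of item A
item_b = np.array([0.2, 0.8, 0.5])  # made-up embedding of item B
print(cosine_similarity(item_a, item_b))  # close to 1.0 -> the items look similar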
Embeddings Explained#
The evolution of text processing started with one-hot encoding. Given text data, Data Scientists would preprocess it (lowercasing, removing symbols, etc.), create one-hot representations of words or character n-grams (splitting words or text into chunks of 2, 3, …, n characters), and finally fit some ML model on top of them. Despite the simplicity and interpretability of this approach, human language is sophisticated: the same word can carry different meanings depending on context, so this technique fails in many cases.
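As a quick illustration of why one-hot representations struggle with meaning, here is a toy sketch (the vocabulary is made up): every word becomes its own dimension, so any two distinct words are orthogonal and their similarity is zero regardless of meaning.
# toy one-hot encoding: each word gets its own axis
import numpy as np

vocab = ['movie', 'good', 'great', 'boring']
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

print(one_hot['good'])   # [0. 1. 0. 0.]
print(one_hot['great'])  # [0. 0. 1. 0.]
print(np.dot(one_hot['good'], one_hot['great']))  # 0.0 -- "good" and "great" look unrelated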
Therefore, embeddings became the next stage in the text-processing pipeline. An embedding is a type of word representation that allows words with similar meanings to have similar representations. Unlike methods such as one-hot encoding, word embeddings represent words in a more meaningful way by mapping them to vectors of real numbers in a continuous vector space. The idea behind word embeddings is to use a neural network to learn relationships between words in a dataset: the network is trained to assign a numeric vector of fixed length to each word, with the goal of finding vectors that accurately represent the meaning of each word in the context of that dataset. As a result, words that appear in similar contexts end up with similar vector representations.
For example, imagine a dataset of movie reviews. Let’s say that the neural network has been trained to assign a vector to each word in the dataset. If the word “amazing” is used in a movie review, then the vector assigned to “amazing” will be similar to the vector assigned to “incredible”. This is because the meanings of these two words are similar and they are often used in similar contexts. Word embeddings can also be used to identify relationships between words. For example, consider the words “man” and “woman”. If the neural network assigned similar vectors to these two words, this would indicate that the two words are related. In addition to identifying relationships between words, word embeddings can also be used to classify documents. For example, if a document contains the words “amazing” and “incredible”, then the neural network can assign an appropriate vector to each of these words. If a second document contains similar words, then the neural network can assign similar vectors to these words. This allows the neural network to accurately classify the documents as being similar.
Finally, word embeddings can be used for data visualization. By plotting the vectors assigned to words in a two-dimensional space, it is possible to see how words are related. This can be a useful tool for understanding the relationships between words in a given dataset. In summary, word embeddings are a powerful tool for representing words in a meaningful way. They can be used to identify relationships between words, classify documents, and visualize data.
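To see what this looks like in practice, here is a toy sketch with gensim’s Word2Vec; the corpus is tiny and made up, so the learned vectors will be poor, but it shows the shape of the API we will rely on later.
from gensim.models import Word2Vec

# tiny made-up corpus: each document is a list of tokens
corpus = [
    ['this', 'movie', 'was', 'amazing'],
    ['an', 'incredible', 'amazing', 'film'],
    ['boring', 'and', 'predictable', 'movie'],
]

# train a small Word2Vec model; the parameter values are purely illustrative
w2v = Word2Vec(sentences = corpus, vector_size = 16, min_count = 1, epochs = 50, seed = 42)

print(w2v.wv['amazing'])                          # the learned vector for one word
print(w2v.wv.most_similar('amazing', topn = 3))   # words closest to it in the embedding space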
Now, let’s consider content-based filtering and use a simple Word2Vec/Doc2Vec model to get such recommendations.
Content-based Filtering#
Content-based filtering can be used in a variety of applications, from recommending films and music to suggesting restaurants and travel destinations. In this part, we’ll discuss how content-based filtering works and provide some examples.
Content-based filtering is a type of recommender system that recommends items to users based on their past preferences and behaviors. It works by analyzing a user’s preferences, in terms of attributes such as genre, director, actor, or even a combination of these, and then recommending other items that have similar attributes. For example, if a user has previously watched romantic comedies with Julia Roberts, content-based filtering would recommend other romantic comedies with Julia Roberts, or other films featuring similar actors or directors.
Content-based filtering is based on the assumption that users who liked one item will likely like similar items. To generate recommendations, the system first identifies the attributes of the items that the user has previously interacted with. It then finds other items with similar attributes and recommends them to the user. For example, if a user has previously listened to Taylor Swift songs, the system will identify other Taylor Swift songs as well as songs with similar attributes, such as a similar genre or artist. In industry, this type of recommendation is shown with “Similar to …”. It is an additional nudge to increase a user’s interest, since recommendations with explanations feel more personalized from the user’s point of view.
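A minimal sketch of that logic, assuming items are described by hand-crafted attribute vectors (the genre flags below are hypothetical):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical items described by [romance, comedy, action] genre flags
titles = ['notting hill', 'pretty woman', 'die hard']
attributes = np.array([
    [1, 1, 0],  # notting hill
    [1, 1, 0],  # pretty woman
    [0, 0, 1],  # die hard
])

# the user liked 'notting hill' (row 0) -> score every item against it by attribute similarity
scores = cosine_similarity(attributes[:1], attributes)[0]
ranked = sorted(zip(titles, scores), key = lambda pair: pair[1], reverse = True)
print(ranked)  # 'pretty woman' scores 1.0, 'die hard' scores 0.0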
In conclusion, content-based filtering is a type of recommender system that recommends items to users based on their past preferences and behaviors. Next, we jump to the coding part and create a simple Word2Vec/Doc2Vec model via the gensim library.
A well-explained walkthrough of the Word2Vec model’s logic can be found here; we will not discuss the implementation details in this section.
gensim: example of content-based recommendations based on Doc2Vec approach#
Now, we move on to the implementation of a content-based recommender using the gensim library and Doc2Vec. Doc2Vec is almost the same as Word2Vec with slight modifications, but the idea remains the same.
0. Configuration#
# links to shared data MovieLens
# source on kaggle: https://www.kaggle.com/code/quangnhatbui/movie-recommender/data
MOVIES_METADATA_URL = 'https://drive.google.com/file/d/19g6-apYbZb5D-wRj4L7aYKhxS-fDM4Fb/view?usp=share_link'
1. Modules and functions#
# just to make it available to download w/o SSL verification
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import re
import nltk
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook
from ast import literal_eval
from pymystem3 import Mystem
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import warnings
warnings.filterwarnings('ignore')
# download stop words beforehand
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /home/runner/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
True
1.1. Helper functions to avoid copypaste#
def read_csv_from_gdrive(url):
    """
    gets csv data from a given url (taken from file -> share -> copy link)
    :url: example https://drive.google.com/file/d/1BlZfCLLs5A13tbNSJZ1GPkHLWQOnPlE4/view?usp=share_link
    """
    file_id = url.split('/')[-2]
    file_path = 'https://drive.google.com/uc?export=download&id=' + file_id
    data = pd.read_csv(file_path)

    return data
# init lemmatizer to avoid slow performance
mystem = Mystem()
def word_tokenize_clean(doc: str, stop_words: list):
    '''
    tokenize from string to list of words
    '''
    # split into lower case word tokens with lemmatization
    tokens = list(set(mystem.lemmatize(doc.lower())))

    # remove tokens that are not alphabetic (this also drops punctuation) or that are stop words
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words and word not in punctuation]

    return tokens
Installing mystem to /home/runner/.local/bin/mystem from http://download.cdn.yandex.net/mystem/mystem-3.1-linux-64bit.tar.gz
2. Main#
2.1. Data Preparation#
# read csv information about films etc
movies_metadata = read_csv_from_gdrive(MOVIES_METADATA_URL)
movies_metadata.dtypes
adult object
belongs_to_collection object
budget object
genres object
homepage object
id object
imdb_id object
original_language object
original_title object
overview object
popularity object
poster_path object
production_companies object
production_countries object
release_date object
revenue float64
runtime float64
spoken_languages object
status object
tagline object
title object
video object
vote_average float64
vote_count float64
dtype: object
To get accurate results we need to preprocess the text a bit. The pipeline will be as follows:

- Filter only the necessary columns from movies_metadata: id, original_title, overview;
- Define model_index for the model to match back with the id column;
- Text cleaning: remove stopwords & punctuation, lemmatize for further tokenization, and create the tagged documents required for gensim.Doc2Vec.
# filter cols
sample = movies_metadata[['id', 'original_title', 'overview']].copy()
sample.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 45466 non-null object
1 original_title 45466 non-null object
2 overview 44512 non-null object
dtypes: object(3)
memory usage: 1.0+ MB
# as you see from above, we have missing overview in some cases -- let's fill it with the original title
sample.loc[sample['overview'].isnull(), 'overview'] = sample.loc[sample['overview'].isnull(), 'original_title']
sample.isnull().sum()
id 0
original_title 0
overview 0
dtype: int64
# define model_index and make it as string
sample = sample.reset_index().rename(columns = {'index': 'model_index'})
sample['model_index'] = sample['model_index'].astype(str)
# create mapper with title and model_index to use it further in evaluation
movies_inv_mapper = dict(zip(sample['original_title'].str.lower(), sample['model_index'].astype(int)))
# preprocess by removing non-character data, stopwords
tags_corpus = sample['overview'].values
tags_corpus = [re.sub(r'[-!/()0-9]', '', x) for x in tags_corpus]
stop_words = stopwords.words('english')
tags_doc = [word_tokenize_clean(description, stop_words) for description in tags_corpus]
tags_corpus[:1]
["Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."]
# prepare data as model input for Doc2Vec
## it takes some time to execute
tags_doc = [TaggedDocument(words = word_tokenize_clean(D, stop_words), tags = [str(i)]) for i, D in enumerate(tags_corpus)]
# let's check what do we have
## tag = movie index
tags_doc[1]
TaggedDocument(words=['freedom', 'evil', 'invite', 'world', 'siblings', 'running', 'trapped', 'enchanted', 'judy', 'peter', 'inside', 'hope', 'giant', 'living', 'proves', 'finish', 'find', 'rhinoceroses', 'board', 'risky', 'magical', 'terrifying', 'adult', 'unwittingly', 'creatures', 'opens', 'alan', 'years', 'monkeys', 'game', 'three', 'door', 'discover', 'room'], tags=['1'])
2.2. Model Training and Evaluation#
First, let’s define some parameters for the Doc2Vec model.
VEC_SIZE = 50 # length of the vector for each movie
ALPHA = .02 # initial learning rate
MIN_ALPHA = .00025 # learning rate drops linearly to this value during training
MIN_COUNT = 5 # min occurrence of a word in the dictionary
EPOCHS = 20 # number of training epochs
# initialize the model
model = Doc2Vec(vector_size = VEC_SIZE,
                alpha = ALPHA,
                min_alpha = MIN_ALPHA,
                min_count = MIN_COUNT,
                dm = 0) # dm = 0 corresponds to the PV-DBOW training algorithm
# generate vocab from all tag docs
model.build_vocab(tags_doc)
# train model
model.train(tags_doc,
total_examples = model.corpus_count,
epochs = EPOCHS)
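As an optional aside (not part of the original pipeline), the trained model can be persisted and reloaded with gensim’s standard save/load methods; the file name below is arbitrary.
# optionally persist the trained model for later reuse
model.save('doc2vec_movies.model')
model = Doc2Vec.load('doc2vec_movies.model')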
Now, let’s sanity-check the model by picking a film ourselves. Assume that we watched the movie batman and, based on that, we want to generate recommendations similar to its description. To do that we need to:

- Extract the movie id from movies_inv_mapper, which we created earlier to map between titles and model indices;
- Load the embeddings from the trained model;
- Use the built-in most_similar() method to get the most relevant recommendations based on the film’s embedding;
- Finally, map the title names back for a sense check.
# get id
movie_id = movies_inv_mapper['batman']
movie_id
8603
# load trained embeddings
movies_vectors = model.dv.vectors
movie_embeddings = movies_vectors[movie_id]
# get recommendations
similars = model.dv.most_similar(positive = [movie_embeddings], topn = 20)
output = pd.DataFrame(similars, columns = ['model_index', 'model_score'])
output.head()
|   | model_index | model_score |
|---|---|---|
| 0 | 8603 | 1.000000 |
| 1 | 5713 | 0.954307 |
| 2 | 7772 | 0.949543 |
| 3 | 29872 | 0.948109 |
| 4 | 43009 | 0.948001 |
# reverse values and indices to map names in dataframe
name_mapper = {v: k for k, v in movies_inv_mapper.items()}
output['title_name'] = output['model_index'].astype(int).map(name_mapper)
output
|   | model_index | model_score | title_name |
|---|---|---|---|
| 0 | 8603 | 1.000000 | batman |
| 1 | 5713 | 0.954307 | rollover |
| 2 | 7772 | 0.949543 | this island earth |
| 3 | 29872 | 0.948109 | angels die hard |
| 4 | 43009 | 0.948001 | ultimate avengers 2 |
| 5 | 28001 | 0.947121 | reach me |
| 6 | 13835 | 0.946621 | k2 |
| 7 | 44366 | 0.944187 | abraxas, guardian of the universe |
| 8 | 24433 | 0.942500 | the creeping terror |
| 9 | 42040 | 0.942463 | equalizer 2000 |
| 10 | 43461 | 0.942375 | megafault |
| 11 | 33298 | 0.941947 | necessary evil |
| 12 | 44339 | 0.941848 | the underground world |
| 13 | 18468 | 0.941571 | the incredible petrified world |
| 14 | 15627 | 0.941512 | crossworlds |
| 15 | 18294 | 0.941455 | the darkest hour |
| 16 | 26511 | 0.941257 | the 7 adventures of sinbad |
| 17 | 43165 | 0.940965 | the zookeeper's wife |
| 18 | 14178 | 0.940437 | battle for terra |
| 19 | 11256 | 0.940303 | sun faa sau si |
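To make such checks repeatable, the steps above can be wrapped into a small helper. This is only a sketch that reuses the objects already defined in this notebook (model, movies_inv_mapper, name_mapper); it additionally drops the query film itself from the output.
def recommend_by_title(title: str, topn: int = 10) -> pd.DataFrame:
    # look up the internal id, take its embedding and find the closest films
    movie_id = movies_inv_mapper[title.lower()]
    embedding = model.dv.vectors[movie_id]
    similars = model.dv.most_similar(positive = [embedding], topn = topn + 1)  # +1 to account for the film itself
    output = pd.DataFrame(similars, columns = ['model_index', 'model_score'])
    output['title_name'] = output['model_index'].astype(int).map(name_mapper)
    # drop the query film and keep the top-n recommendations
    return output[output['model_index'].astype(int) != movie_id].head(topn)

recommend_by_title('batman', topn = 5)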