In this post I explore an emerging type of recommendation system that focuses on context rather than traditional similarity-based methods. The post is inspired by my past work on a user-to-user pairing system, which is similar to user-to-item recommendation. I explain how to extract meaning from unstructured data and store it in a representative vector. These vectors are then stored in a vector database, enabling the identification of similar contexts. Finally, I provide an example use case, demonstrating the code for implementing such a system. While I cover the basics here, dealing with real-world data requires more complex and scalable systems with low-latency requirements.
The Recommendation Problem
Recommendation systems aim to understand what users want, which can be challenging with limited interactions. By linking behaviors to available options, we can better predict what might interest them. I like to think of a recommendation system as a function that takes in user information and outputs a list of top items the user might like. This is not confined to recommending a product to a user; user-to-user pairing in chess tournaments or profile ranking in dating apps are also recommendations in essence. In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items a user has not yet seen. While it's tricky to guess exactly what a user wants, getting it right can benefit a business: a smart recommendation system can increase sales by about 22.66% on average!
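To make that mental model concrete, here is a minimal sketch of the "recommendation as a function" idea. All names and the scoring rule below are hypothetical placeholders, not a real system:

def score(user_features: dict, item: dict) -> float:
    # Placeholder scoring rule: count shared interests/tags.
    # A real system would estimate a rating or relevance here.
    return len(set(user_features["interests"]) & set(item["tags"]))

def recommend(user_features: dict, candidates: list, k: int = 10) -> list:
    # Score every candidate item for this user and return the top-k.
    ranked = sorted(candidates, key=lambda item: score(user_features, item), reverse=True)
    return ranked[:k]

user = {"interests": ["chess", "ai"]}
items = [{"title": "AlphaZero paper", "tags": ["chess", "ai"]},
         {"title": "Cooking 101", "tags": ["food"]}]
print(recommend(user, items, k=1))  # [{'title': 'AlphaZero paper', ...}]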
Many recommendation algorithms rely on similarity, either between users or between items. However, these two recommendation types require a deep understanding of both the user/item data and the scenario to establish connections between them. Let's delve into a hypothetical scenario where an online library wants to design an academic paper recommendation system for its users. Common approaches would be:
- content-based filtering: items similar to those the user interacted with previously, and/or
- collaborative filtering: items that similar users engaged with (a toy sketch follows the figure below)
Figure: User similarity (collaborative filtering) and item
similarity (content-based filtering)
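As a toy illustration of the collaborative idea, we can find the user most similar to our target user and recommend items that neighbor rated highly. The ratings matrix below is made up:

import numpy as np

# Toy ratings matrix: rows = users, columns = papers (0 = unseen).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Collaborative filtering: find the user most similar to user 0 ...
target = 0
sims = [cosine(R[target], R[u]) if u != target else -1 for u in range(len(R))]
neighbor = int(np.argmax(sims))
# ... and recommend items the neighbor engaged with that user 0 hasn't seen.
recs = [i for i in range(R.shape[1]) if R[target, i] == 0 and R[neighbor, i] > 0]
print(recs)  # [2]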
Context-Aware Recommendation Systems
While similarity-based approaches are widely adopted, they require a decent amount of data to establish statistically significant patterns. Unfortunately, in many business cases they fail to perform well because interaction data is limited. Ambiguous similarity patterns between users or items can lead to unsuitable recommendations. Additionally, the cold-start problem poses a challenge, particularly for new users with no prior interactions. In such cases, similarity-based recommendation systems may only offer default suggestions.
To address these issues, context-aware recommendation systems offer a promising alternative by considering contextual information about the user, the item, and the interaction. Contextual information can be acquired directly from the user/item, or indirectly by generating a representative query as metadata. This query creates the context we need, and relevance search in these systems is carried out against that context.
" I assume that as LLMs take over, generating metadata such as queries will become more prevalent and accurate "
Figure: A query can be generated from different context aspects, such as the item's name/category and the interaction's setting and scenario
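As a sketch of what such query generation could look like (the template and field names below are hypothetical, and an LLM could produce far richer queries):

def build_query(item_name: str, category: str, setting: str) -> str:
    # Hypothetical template that flattens context aspects into one free-text query.
    return f"{item_name}, a {category} item, in a {setting} setting"

print(build_query("Deep Residual Learning", "computer vision", "literature survey"))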
A traditional approach to context-based recommendation could be running a full-text search with the query. This approach is inefficient in real life because the texts are rarely exactly the same, even though they may carry similar meaning. Strings are lexical references, but what we are looking for is the semantics (i.e., the meaning).
Figure: An example from Weaviate where full-text search does not retrieve the desired results. Although fish and seafood are semantically close, the algorithm could not retrieve the relevant item with full-text search.
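To see the difference concretely, here is a small sketch contrasting exact keyword matching with embedding similarity, using the sentence-transformers library (the toy documents are made up):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["fresh seafood platter", "wireless headphones"]
query = "fish"

# Full-text (exact keyword) matching finds nothing: "fish" appears in neither document.
print([d for d in docs if "fish" in d.lower().split()])  # []

# Semantic search should rank the seafood item first, because embeddings capture meaning.
scores = util.cos_sim(model.encode(query), model.encode(docs))
print(docs[int(scores.argmax())])  # expected: "fresh seafood platter"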
Extracting the Meaning out of Text
In order to work with semantics, we need to capture the meaning of a word. The state-of-the-art way of capturing meaning is to represent the text as an array of numbers; we call this data structure an embedding. It is a vector that represents features of the data extracted from the source. The great thing about embeddings is that the input data is not restricted to text: images and audio can also be transformed into embeddings.
Transforming data of any kind into embeddings is called vectorization. Vectorization outputs similar arrays when given similar inputs, and we can calculate how similar those arrays are to estimate the semantic relation of the inputs. For example, cat and kitty are two close words with similar features but very different as strings. Their embeddings might look like:
cat = [1.5, -0.4, 7.2, 19.6, 3.1, ..., 20.2]
kitty = [1.5, -0.4, 7.2, 19.5, 3.2, ..., 20.8]
The values in an embedding vector do not represent a specific feature a human can grasp. Since vectorization is a deep learning method, it is almost impossible to explain what those numbers represent in the total meaning of the word. However, some calculations with the vectors can give us an intuition about the meanings. A great example, given by Jay Alammar, illustrates that if we subtract man from king and add woman, we get a result that is far more similar to queen than to any other word in the corpus.
Figure: An example of inferring feature representations from basic calculations with the embedding vectors.
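You can reproduce this analogy yourself with off-the-shelf word vectors, for example with gensim's pretrained GloVe embeddings (a sketch; the exact similarity score depends on the model):

import gensim.downloader as api

# Load small pretrained GloVe vectors (downloads ~66 MB on first run).
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected top result: ('queen', <similarity score>)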
Modern embeddings are created by deep learning models and are context aware: each occurrence of a word has a different embedding, modified by the surrounding text. For example, light in "bright light" and "light weight dumbbell" is not the same. In the last couple of years we've also seen multimodal embedding models like CLIP, which enable algorithms to link an image to a text representation. This is the core of text-to-image algorithms.
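A quick way to observe this context sensitivity is to pull the contextual vector of the same word from two sentences with a vanilla BERT model (a sketch; it assumes the word maps to a single vocabulary token):

from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    # Return the contextual embedding of `word`'s first occurrence in `sentence`.
    inputs = tok(sentence, return_tensors="pt")
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids(word))
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state[0]
    return hidden[idx]

a = word_vector("the bright light hurt my eyes", "light")
b = word_vector("she bought a light weight dumbbell", "light")
print(torch.cosine_similarity(a, b, dim=0))  # well below 1.0: same word, different vectors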
Implementing a Recommendation System with a Vector Database
For recommendation, we need to find items that are semantically similar to a given query. First we need a corpus of item embeddings; then we vectorize the query with the same method and find the items most similar to it. At this point, finding similar items is just a vector operation: calculate the distances and take the nearest neighbours of the query vector.
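In its simplest form this is a brute-force nearest-neighbour computation, sketched below with random stand-in embeddings; vector databases exist to make exactly this fast at scale:

import numpy as np

def top_k(query_vec: np.ndarray, item_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    # Brute-force nearest neighbours by Euclidean distance.
    dists = np.linalg.norm(item_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]  # indices of the k closest items

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 384))  # pretend corpus of item embeddings
query = rng.normal(size=384)          # pretend query embedding
print(top_k(query, items))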
We need a database to store our items' embeddings and a database management system with CRUD capabilities to search for relevant items. Vector databases are a special type of database that specializes in storing and searching vector data structures. The operation is simple: first create embeddings of the items, then store them in a vector database.
Figure: The query is vectorized by the same embedding model that vectorized the item data. Vector databases are specialized to retrieve the most similar vectors
Vector databases build an index over the stored vectors, which allows for fast lookup when querying. All vectors in an index must share the same dimensionality and the same distance metric for measuring similarity. To judge how similar two vectors are, we use distance metrics: the lower the distance between the user query and the item, the more similar the two. It's essential to choose a distance metric that aligns with the specific model you're using. For example, if you're using a Siamese Neural Network (SNN) for embedding, the contrastive loss function typically incorporates the Euclidean distance. Similarly, when fine-tuning a sentence transformer you define the loss function, and CosineSimilarityLoss computes similarity based on the cosine similarity between two embeddings. In essence, there isn't a universal distance metric that fits all scenarios; the choice depends on factors such as your data characteristics, the model architecture, and the intended application. Cosine distance and dot product may yield similar results in some cases, while Manhattan and Euclidean distances trade off accuracy and speed, making them suitable for different contexts.
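The sketch below computes the common metrics on two toy vectors and shows why cosine similarity and dot product coincide once vectors are normalized:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.9])

euclidean = np.linalg.norm(a - b)
manhattan = np.abs(a - b).sum()
dot = a @ b
cosine_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))
print(euclidean, manhattan, dot, cosine_sim)

# On unit-normalized vectors, cosine similarity equals the dot product, and squared
# Euclidean distance becomes a monotone function of it: ||a-b||^2 = 2 - 2*(a.b)
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(a_n @ b_n, cosine_sim))                               # True
print(np.isclose(np.linalg.norm(a_n - b_n)**2, 2 - 2 * (a_n @ b_n)))   # True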
Example with Code
Let's say we're building a recommendation system for academic papers. Each paper has a summary from which we can extract meaningful insights about its context. So instead of relying on user/paper similarity over past interaction data, the assumption is that the summary holds more information about what the paper is about. We will populate a vector database with summary embeddings. Then, when a user searches with a query, or the application generates a query from the interactions, that query will be fed into the database to retrieve the most similar items. Those items will be the papers recommended to the user. Let's run through sample code that achieves this!
1- Import all necessary modules and define the database's index and embedding model. The sentence-transformers library provides pre-trained embedding models. I chose all-MiniLM-L6-v2 because it's light and fast while still performing well.
import arxiv
import pandas as pd
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone, ServerlessSpec
from collections import defaultdict
from tqdm import tqdm
import time

pc_api = "YOUR_API_KEY"  # your VDB api key
pc = Pinecone(api_key=pc_api)
model = SentenceTransformer('all-MiniLM-L6-v2')  # embedding model
index_name = "papers"  # name of our index
2- Define the embedding function and create the index in the database. For this task I used Pinecone as the database. Pinecone is easy to use, and it offers $100 of free usage when you first sign up. For a small throughput, that should last a long time.
def embed_func(text):
    return model.encode(text).tolist()  # returns the embedding vector as a plain list

pc.create_index(
    name=index_name,
    dimension=384,  # vector size of the all-MiniLM-L6-v2 embedding model
    metric="euclidean",  # choose the distance metric according to the use-case
    spec=ServerlessSpec(
        cloud='aws',
        region='us-west-2'
    )
)
index = pc.Index(index_name)
time.sleep(1)
index.describe_index_stats()  # get the VDB index's properties
3- Define the functions that populate the index by pulling the data and vectorizing it, and that search the index with the database's internal query function. Pinecone also allows storing metadata with each item, so we can do hybrid search on other features, such as author.
def get_arxiv_df(embed_func, length=100):
    # Pull papers from arXiv (Client().results() is the arxiv>=2.0 API).
    results = arxiv.Client().results(arxiv.Search(
        query="cat:cs.AI OR cat:cs.CV OR cat:stat.ML",
        max_results=length,
        sort_by=arxiv.SortCriterion.Relevance,
        sort_order=arxiv.SortOrder.Descending,
    ))
    df = defaultdict(list)
    for result in tqdm(results, total=length):
        try:
            df["title"].append(result.title)
            df["summary"].append(result.summary)
            df["authors"].append(str(result.authors))
            df["updated"].append(result.updated)
            df["url"].append(result.entry_id)
            df["vector"].append(embed_func(result.summary))  # create the embedding vector
        except Exception as e:
            print("error: ", e)
    return pd.DataFrame(df)
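# The populate function called in step 4. A minimal sketch: it upserts each row's
# embedding together with its metadata (Pinecone metadata values must be strings,
# numbers, booleans, or lists of strings, so `updated` is stringified).
def embed_into_pinecone(df, batch_size=50):
    vectors = [
        {
            "id": str(i),
            "values": row["vector"],
            "metadata": {
                "title": row["title"],
                "authors": row["authors"],
                "url": row["url"],
                "updated": str(row["updated"]),
            },
        }
        for i, row in df.iterrows()
    ]
    for start in range(0, len(vectors), batch_size):  # upsert in small batches
        index.upsert(vectors=vectors[start:start + batch_size], namespace="papers")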
def search_table(query, embed_func=embed_func, lim=3):  # retrieve top 3 similar items
    db = pc.Index(index_name)
    embs = embed_func(query)
    results = db.query(vector=embs, top_k=lim, namespace="papers", include_metadata=True)  # search the VDB
    return results["matches"]
4- Now we can play around with the application by changing the query:
if __name__ == "__main__":
    df = get_arxiv_df(embed_func, length=100)
    embed_into_pinecone(df)
    query = 'self supervised learning'
    results = search_table(query, embed_func=embed_func, lim=3)
    print(results)
Closing Notes
In this post I tried to explain context-aware recommendation systems and their implementation using vector databases. We're still figuring out the use cases of AI, and it's definitely about more than chatbots: we can understand our users better and extract meaning from many kinds of data. I believe future recommendations will be more personalized and context aware.
I have realized that the success of AI implementations in businesses depends heavily on data management. The key challenges are making sure accurate, clean data is fed into these algorithms, fast accessibility, effective cost management for storage, and proper integrations.
If you're a business interested in
collaborating or an individual with feedback on the post, please contact me. Thank you for
reading!