Movie Finder Application with MongoDB Vector Search + RAG
This tutorial shows how to build a movie finder application using MongoDB Atlas Vector Search and React.
In this tutorial, I am going to walk through a project I created to show how Atlas Vector Search, Atlas Search, and a combined RAG system work for finding movies by plot.
We will take a step-by-step look at the three parts of this project to explain how it works and what is going on behind the scenes. Part 1 starts with the creation of vectors using OpenAI + LangChain + MongoDB Atlas.
The project lives in a GitHub repository, which is important to clone in advance, as we are going to explain what each part does under the hood.
Part 1: CreateAtlasVectorSearch
After cloning the GitHub repository, you will find a folder with the name createAtlasVectorSearch. This directory is where we are going to work in part 1 of this 3-part series. First, we need to duplicate the env.sample file and rename it to .env.
OPENAI_API_KEY=
ATLAS_CLUSTER_URI=
GROUP_ID=
CLUSTER_NAME=
PUBLIC_KEY=
PRIVATE_KEY=
ATLAS_EMBEDDING_NAMESPACE=sample_mflix.embedded_movies
ATLAS_NAMESPACE=sample_mflix.movies
MODEL_NAME=text-embedding-3-small
We need to fill out these variables with the data this portion of the application needs to work.
OPENAI_API_KEY: The OpenAI key needed to make requests to OpenAI. You can create yours in the official OpenAI key section.
ATLAS_CLUSTER_URI: The MongoDB Atlas connection URI used to connect to your cluster from an application. If your cluster uses a network access list, remember to add the IP address from which you are going to run this application.
GROUP_ID: The Project ID displayed at the top of the page in Atlas.
CLUSTER_NAME: The name of your cluster.
PUBLIC_KEY: The public key needed for programmatic access to your Atlas cluster.
PRIVATE_KEY: The private key needed for programmatic access to your Atlas cluster.
ATLAS_EMBEDDING_NAMESPACE: The namespace (database.collection) where the embedding vectors will be stored.
ATLAS_NAMESPACE: The namespace (database.collection) where the original documents are stored, that is, the ones we are going to use for generating the embedding vectors.
MODEL_NAME: The OpenAI model we are going to use for embedding.
EMBEDDING_KEY: The name of the collection field where the vector is located within each document.
An example of a finished .env file would look similar to the following:
OPENAI_API_KEY=sk-proj-5zSKg5QIK1-***
ATLAS_CLUSTER_URI=mongodb+srv://<username>:<password>@cluster1.****.mongodb.net
GROUP_ID=658d46ca7605526eb452****
CLUSTER_NAME=Cluster1
PUBLIC_KEY=eapm****
PRIVATE_KEY=5976f0f4-2304-4042-****
ATLAS_EMBEDDING_NAMESPACE=movies.embedded_movies
ATLAS_NAMESPACE=sample_mflix.movies
MODEL_NAME=text-embedding-3-small
EMBEDDING_KEY=plot_embedding
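Since every later step depends on these values, it can help to validate them once at startup. The following is a minimal sketch of such a check, not code from the repository: the names Config, REQUIRED_KEYS and loadConfig are hypothetical, and the environment is passed in as a plain object (in the application it would be process.env).

```typescript
// Hypothetical helper: Config, REQUIRED_KEYS and loadConfig are illustrative
// names and do not come from the repository.
interface Config {
  openAiApiKey: string;
  clusterUri: string;
  embeddingDb: string;
  embeddingCollection: string;
  modelName: string;
  embeddingKey: string;
}

const REQUIRED_KEYS = [
  'OPENAI_API_KEY',
  'ATLAS_CLUSTER_URI',
  'ATLAS_EMBEDDING_NAMESPACE',
  'MODEL_NAME',
  'EMBEDDING_KEY',
] as const;

function loadConfig(env: Record<string, string | undefined>): Config {
  // Collect every missing variable so the error reports all of them at once.
  const missing = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }

  // A namespace has the form "database.collection"; split on the first dot only,
  // because collection names may themselves contain dots.
  const namespace = env['ATLAS_EMBEDDING_NAMESPACE'] as string;
  const dot = namespace.indexOf('.');
  if (dot <= 0 || dot === namespace.length - 1) {
    throw new Error(`Invalid namespace "${namespace}", expected database.collection`);
  }

  return {
    openAiApiKey: env['OPENAI_API_KEY'] as string,
    clusterUri: env['ATLAS_CLUSTER_URI'] as string,
    embeddingDb: namespace.slice(0, dot),
    embeddingCollection: namespace.slice(dot + 1),
    modelName: env['MODEL_NAME'] as string,
    embeddingKey: env['EMBEDDING_KEY'] as string,
  };
}
```

Failing early with a list of the missing variables is usually easier to debug than a rejected request to OpenAI or Atlas later on.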
After having this set up, we can now move to the app.ts file. This is the main file of this application.
When running the app with:
npm run start
We will be prompted with several different options to choose from:
1 - Create an Embedding array for the plot field on the sample_mflix.movies collection in your Atlas Cluster.
2 - Create an euclidean type Atlas Vector Search Index in your Atlas Cluster.
3 - Create a cosine type Atlas Vector Search Index in your Atlas Cluster.
4 - Create a dotProduct type Atlas Vector Search Index in your Atlas Cluster.
5 - Create an Atlas Search index in your Atlas Cluster.
6 - Query using an Atlas Vector Search Index.
Let’s look at what these options do one by one:
1 - Create an Embedding array for the plot field on the sample_mflix.movies collection in your Atlas Cluster.
When this option is selected, the function createEmbeddings is called. It executes an aggregation pipeline that looks for the field plot or fullplot (as we want to get the plot of the movies in the sample_mflix.movies collection).
Please note that for this tutorial to work as is, the MongoDB sample dataset must have been loaded in your cluster, specifically the sample_mflix.movies collection.
The aggregation pipeline looks like:
[{
  "$match": {
    "$or": [
      {
        "plot": {
          "$exists": true
        }
      }, {
        "fullplot": {
          "$exists": true
        }
      }
    ]
  }
},{
  "$project": {
    "fullplot": {
      "$ifNull": [
        "$fullplot", "$plot"
      ]
    },
    "year": 1,
    "type": 1
  }
}]
This will return the fullplot of each movie, and if the fullplot field does not exist, it will return the plot in the fullplot field. Along with this, the year, type, and _id will also be returned. The year and type are useful for enabling pre-filtering when using Atlas Vector Search.
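To make the fallback behaviour easier to see, here is a small in-memory sketch of what the two stages do to a handful of documents. It is only an approximation written for illustration: MovieDoc, Projected and matchAndProject are hypothetical names, and a null value stands in for a field that the real $project stage would simply leave out.

```typescript
// Illustrative in-memory approximation of the $match + $project stages.
// MovieDoc, Projected and matchAndProject are hypothetical names.
interface MovieDoc {
  _id: string;
  plot?: string | null;
  fullplot?: string | null;
  year?: number;
  type?: string;
  title?: string;
}

interface Projected {
  _id: string;
  fullplot: string | null;
  year: number | undefined;
  type: string | undefined;
}

function matchAndProject(docs: MovieDoc[]): Projected[] {
  return docs
    // $match with $exists: true keeps documents where either field is present.
    .filter((doc) => 'plot' in doc || 'fullplot' in doc)
    // $ifNull falls back to plot when fullplot is missing or null.
    .map((doc) => ({
      _id: doc._id,
      fullplot: doc.fullplot ?? doc.plot ?? null,
      year: doc.year,
      type: doc.type,
    }));
}

const sampleMovies: MovieDoc[] = [
  { _id: 'a', plot: 'short plot', fullplot: 'long plot', year: 1903, type: 'movie' },
  { _id: 'b', plot: 'only a short plot', year: 1984, type: 'movie' },
  { _id: 'c', title: 'a document without any plot' },
];

// Document 'c' is dropped; 'b' gets its plot copied into fullplot.
console.log(matchAndProject(sampleMovies));
```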
This part of the code will look like:
const fullplot = [];
const ids = [];
for await (const doc of await collection.aggregate([
  {
    '$match': {
      '$or': [
        {
          'plot': {
            '$exists': true
          }
        }, {
          'fullplot': {
            '$exists': true
          }
        }
      ]
    }
  }, {
    '$project': {
      'fullplot': {
        '$ifNull': [
          '$fullplot', '$plot'
        ]
      },
      'year': 1,
      'type': 1
    }
  }
])
) {
  fullplot.push(doc.fullplot);
  ids.push({ _id: doc._id, namespace: namespace, type: doc.type, year: doc.year });
}
Therefore, we iterate over each movie document, push its fullplot text into the fullplot array, and push its _id, type, and year (together with the source namespace) into the ids array, which becomes the metadata of each vector.
Once we have these arrays, we use the static fromTexts method of LangChain's MongoDBAtlasVectorSearch class to create a vector store from the list of fullplot texts. This first converts the texts to vectors and then adds them to the MongoDB collection. We use OpenAIEmbeddings to create the embeddings from the fullplot text.
The full code will be:
const vectorstore = await MongoDBAtlasVectorSearch.fromTexts(
  fullplot,
  ids,
  new OpenAIEmbeddings({
    openAIApiKey: process.env.OPENAI_API_KEY,
    modelName: process.env.MODEL_NAME,
  }),
  {
    collection: tCollection,
    indexName: 'vs_euclidean', // The name of the Atlas Vector Search index (LangChain defaults to 'default')
    textKey: 'text', // The name of the collection field containing the raw content (LangChain defaults to 'text')
    embeddingKey: `${process.env.EMBEDDING_KEY}`, // The name of the collection field containing the embedded text, 'plot_embedding' in our .env (LangChain defaults to 'embedding')
  }
)
  .finally(() => {
    clearInterval(vectorStoreInterval);
    console.log('\nfinished creating vector store embeddings');
    logger('finished creating vector store embeddings');
  });
There are a few important things here that we are going to cover one by one:
modelName: This is the embedding model we are going to use to generate the vector embeddings. For this particular example, since we are going to embed text fields, I am using text-embedding-3-small from OpenAI. Please note that in part two of this tutorial, the input/question must be embedded using the same model.
indexName: This is the name of the Vector Search index that we are going to create later on in Atlas and that we will use to run the similarity search.
textKey: The name of the collection field containing the raw content. It corresponds to the plain text of 'pageContent'.
embeddingKey: The name of the collection field containing the embedded vector.
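To see how textKey, embeddingKey, and the metadata end up in a single stored document, here is a minimal sketch. It is not LangChain's code: toStoredDocument is a hypothetical helper, and the three-element vector is a placeholder for a real 1536-dimension embedding.

```typescript
// Hypothetical helper showing how one text, its metadata, and its embedding
// could be merged into a single document. Not taken from LangChain or the repository.
type Metadata = Record<string, unknown>;

function toStoredDocument(
  text: string,
  metadata: Metadata,
  embedding: number[],
  textKey: string,
  embeddingKey: string,
): Record<string, unknown> {
  return {
    [textKey]: text, // raw content, stored under the textKey field
    [embeddingKey]: embedding, // vector, stored under the embeddingKey field
    ...metadata, // _id, namespace, type, and year become top-level fields
  };
}

const storedExample = toStoredDocument(
  'Among the earliest existing films in American cinema',
  { _id: '573a1390f29313caabcd42e8', namespace: 'sample_mflix.movies', type: 'movie', year: 1903 },
  [0.1, 0.2, 0.3], // placeholder vector, a real one would have 1536 numbers
  'text',
  'plot_embedding',
);

console.log(Object.keys(storedExample));
```

Because the metadata carries the original _id, each embedded document can be traced back to its source movie.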
The rest of the code, which adds our newly created vectors to MongoDB Atlas, is the following:
const assignedIds = await vectorstore.addDocuments([
  { pageContent: "upsertable", metadata: {} },
]);
logger(`Nº assigned IDs: ${assignedIds.length}`);
const upsertedDocs = [{ pageContent: "overwritten", metadata: {} }];
logger(`Upserted Docs: ${upsertedDocs.length}`);
console.log('adding vector documents to the collection.\n');
logger('adding vector documents to the collection.\n');
vectorstore.addDocuments(upsertedDocs, { ids: assignedIds })
  .then(() => {
    console.log('finished adding vector documents to the collection');
    logger('finished adding vector documents to the collection');
    return
  })
  .catch(err => {
    throw new Error(`Promise rejected with error: ${err}`);
  })
  .finally(() => {
    client.close();
    return;
  });
We use the upsertable placeholder document, and pass the IDs it was assigned to the second addDocuments call, so that documents are overwritten rather than duplicated if this option is selected more than once and the vector embeddings already exist for a given document.
This part of our application will take a while to execute, but by the time it ends, you will find in your Atlas cluster a new database and a new collection containing documents with the following structure:
{
  "_id": ObjectId("573a1390f29313caabcd42e8"),
  "text": "Among the earliest existing films in American cinema - notable as the …",
  "plot_embedding": [-0.0117434, ..., 0.014351842](1536),
  "namespace": "sample_mflix.movies",
  "type": "movie",
  "year": 1903
}
2 - Create an euclidean type Atlas Vector Search Index in your Atlas Cluster.
We need to create a Vector Search index in our Atlas cluster so that we can query our embedding data. This option in the code does that for us through the createIndex function.
There are several ways to create an Atlas Vector Search index (UI, CLI, API). In this particular example, we are going to use the API. For the API call to work, we need to create the body of the request. An Atlas Vector Search index is composed of:
interface IndexBody {
  database: string;
  collectionName: string;
  name?: string;
  type?: string;
  definition?: object;
}
Thus, our body will have:
'database': `${embeddingNamespace.split('.')[0]}`, // The name of the database
'collectionName': `${embeddingNamespace.split('.')[1]}`, // The name of the collection,
'name': (type === 'euclidean') ? 'vs_euclidean' : `vs_${type}`, // The name of the index
'type': 'vectorSearch',
'definition': {
  'fields': [
    {
      'path': `${process.env.EMBEDDING_KEY}`, // Name of the field in the collection
      'similarity': `${type}`,
      'type': 'vector',
      'numDimensions': 1536, // This is for OpenAI Embeddings
    },
    {
      'path': 'year',
      'type': 'filter'
    },
    {
      'path': 'type',
      'type': 'filter'
    }
  ]
}
Let’s talk a little bit about the above fields:
database and collectionName: These tell the index which namespace contains the documents to query. In this case, when calling this function we pass the embeddingNamespace variable, which holds the value of ${process.env.ATLAS_EMBEDDING_NAMESPACE} that we defined in the .env file:
createIndex(embeddingNamespace, type)
  .then((res) => {
    logger(`Atlas Vector Search Index created: ${res}`)
  })
  .catch(handleError);
name: This defines the name of the index. At the time of writing, Atlas offers three similarity functions: euclidean, cosine, and dotProduct. If type === 'euclidean', we use the name vs_euclidean; if it is cosine, we use the name vs_cosine; and for dotProduct, we use vs_dotProduct.
euclidean - measures the distance between ends of vectors. This allows you to measure similarity based on varying dimensions.
cosine - measures similarity based on the angle between vectors. This allows you to measure similarity that isn’t scaled by magnitude.
dotProduct - measures similarity like cosine, but takes into account the magnitude of the vector. This allows you to efficiently measure similarity based on both angle and magnitude.
type: For this example, we are going to create an Atlas Vector Search index, and therefore we use the vectorSearch index type.
definition.fields: This is an array that contains, at least, the primary definition of our Vector Search index:
{
  'path': `${process.env.EMBEDDING_KEY}`, // Name of the field in the collection
  'similarity': `${type}`,
  'type': 'vector',
  'numDimensions': 1536, // This is for OpenAI Embeddings
}
path: This is the name of the field in the embedding collection where the vectors are stored.
similarity: One of the three available functions: euclidean, cosine, or dotProduct.
type: In this particular case, we are using vector, as this is an Atlas Vector Search example project.
numDimensions: The dimension of the vectors. This depends on the model used to generate the embeddings. For OpenAI's text-embedding-3-small, it is 1536.
The two additional entries of type filter (year and type) make those fields available for pre-filtering.
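Putting the fields above together, the request body can be produced by a small function. This is a sketch under assumptions rather than the repository's createIndex: buildVectorIndexBody and its parameters are hypothetical names, and the default of 1536 dimensions only holds for text-embedding-3-small.

```typescript
// Hypothetical builder for the index body described above.
// The function and type names are illustrative, not from the repository.
type Similarity = 'euclidean' | 'cosine' | 'dotProduct';

interface VectorIndexField {
  path: string;
  type: 'vector' | 'filter';
  similarity?: Similarity;
  numDimensions?: number;
}

interface VectorIndexBody {
  database: string;
  collectionName: string;
  name: string;
  type: 'vectorSearch';
  definition: { fields: VectorIndexField[] };
}

function buildVectorIndexBody(
  embeddingNamespace: string,
  similarity: Similarity,
  embeddingKey: string,
  numDimensions: number = 1536, // assumption: matches text-embedding-3-small
  filterPaths: string[] = ['year', 'type'],
): VectorIndexBody {
  // Split "database.collection" on the first dot only.
  const dot = embeddingNamespace.indexOf('.');
  if (dot <= 0 || dot === embeddingNamespace.length - 1) {
    throw new Error(`Invalid namespace "${embeddingNamespace}"`);
  }
  const filters: VectorIndexField[] = filterPaths.map((path) => ({ path, type: 'filter' }));
  return {
    database: embeddingNamespace.slice(0, dot),
    collectionName: embeddingNamespace.slice(dot + 1),
    name: `vs_${similarity}`, // vs_euclidean, vs_cosine or vs_dotProduct
    type: 'vectorSearch',
    definition: {
      fields: [{ path: embeddingKey, type: 'vector', similarity, numDimensions }, ...filters],
    },
  };
}

console.log(JSON.stringify(buildVectorIndexBody('movies.embedded_movies', 'cosine', 'plot_embedding'), null, 2));
```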
Finally, as we mentioned earlier, we are going to issue a POST request using the Atlas Admin API to create this index in our Atlas cluster.
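As a rough illustration of that request, the sketch below only assembles the URL, headers, and JSON payload. Everything in it should be treated as an assumption to verify against the Atlas Admin API documentation: the endpoint path and the versioned Accept header reflect my understanding of the v2 search index resource, buildCreateIndexRequest is a hypothetical name, and the HTTP digest authentication with PUBLIC_KEY and PRIVATE_KEY is deliberately left to whichever HTTP client you use.

```typescript
// Sketch only: assembles the pieces of the POST request without sending it.
// The endpoint path and Accept header are assumptions to check against the
// Atlas Admin API documentation; buildCreateIndexRequest is a hypothetical name.
interface AtlasRequest {
  method: 'POST';
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildCreateIndexRequest(groupId: string, clusterName: string, indexBody: object): AtlasRequest {
  const base = 'https://cloud.mongodb.com/api/atlas/v2';
  return {
    method: 'POST',
    // GROUP_ID and CLUSTER_NAME from the .env file identify the target cluster.
    url: `${base}/groups/${encodeURIComponent(groupId)}/clusters/${encodeURIComponent(clusterName)}/search/indexes`,
    headers: {
      'Content-Type': 'application/json',
      Accept: 'application/vnd.atlas.2024-05-30+json', // assumed API version
    },
    body: JSON.stringify(indexBody),
  };
}

// Digest authentication with the public/private key pair is not shown here.
const exampleRequest = buildCreateIndexRequest('658d46ca7605526eb452****', 'Cluster1', {
  database: 'movies',
  collectionName: 'embedded_movies',
  name: 'vs_euclidean',
  type: 'vectorSearch',
});
console.log(exampleRequest.url);
```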
3 - Create a cosine type Atlas Vector Search Index in your Atlas Cluster.
This is exactly the same as before, but in this case we are going to create a new index named vs_cosine, which is a Vector Search index using cosine similarity.
4 - Create a dotProduct type Atlas Vector Search Index in your Atlas Cluster.
This is exactly the same as before, but in this case we are going to create a new index named vs_dotProduct, which is a Vector Search index using dotProduct similarity.
Please note that an M0 Free tier cluster would only allow us to create a maximum of 3 indexes.
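To make the difference between the three similarity options more concrete, here is a small sketch that computes the raw quantity behind each one for two toy vectors. The helper names are my own, and Atlas turns these quantities into a normalized score internally, which this sketch does not attempt to reproduce.

```typescript
// Illustrative helpers; the function names are not part of Atlas or the repository.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  return a.reduce((sum, value, i) => sum + value * (b[i] ?? 0), 0);
}

function euclideanDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  // Straight-line distance between the ends of the two vectors.
  return Math.sqrt(a.reduce((sum, value, i) => sum + (value - (b[i] ?? 0)) ** 2, 0));
}

function cosineSimilarity(a: number[], b: number[]): number {
  // Dot product divided by both magnitudes, so only the angle matters.
  return dotProduct(a, b) / (Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b)));
}

const vectorA = [3, 4];
const vectorB = [6, 8]; // same direction as vectorA, twice the magnitude

console.log(euclideanDistance(vectorA, vectorB)); // 5
console.log(cosineSimilarity(vectorA, vectorB)); // 1
console.log(dotProduct(vectorA, vectorB)); // 50
```

The two vectors point in the same direction, so cosine treats them as identical, while euclidean and dotProduct both react to the difference in magnitude.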
5 - Create an Atlas Search index in your Atlas Cluster.
This project will also allow us to create an Atlas Search Index to compare results. Thus, we can use the same input query to evaluate the results using a Vector Search Index vs an Atlas Search Index.
The overall logic is very similar to what we have defined and used before, except that the body of the POST request is slightly different:
createIndex(namespace, '', false)
  .then((res) => {
    logger(`Atlas Search Index created: ${res}`)
  })
  .catch(handleError);
An Atlas Search index is defined with:
{
  'database': `${embeddingNamespace.split('.')[0]}`, // The name of the database
  'collectionName': `${embeddingNamespace.split('.')[1]}`, // The name of the collection,
  'name': 'default_search',
  'type': 'search',
  'definition': {
    'mappings': {
      'dynamic': true,
      'fields': {
        'plot': {
          'type': 'string'
        }
      }
    }
  }
}
The fields database, collectionName, name, type, and definition are all present again; however, there are some differences:
type: In this case we create a search type index instead of a vectorSearch type.
definition: mappings.dynamic is set to true, and mappings.fields.plot.type is string.
Once this Atlas Search index is created, we can use the plot field for full-text search and compare the results with a similarity vector search.
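For the comparison, a full-text query against this index could be expressed as an aggregation pipeline with a $search stage. The sketch below only builds that pipeline; buildTextSearchPipeline is a hypothetical helper, and the limit and projected fields are arbitrary choices made for illustration.

```typescript
// Hypothetical helper that builds an Atlas Search pipeline for the
// default_search index defined above. It does not run the query.
function buildTextSearchPipeline(
  query: string,
  limit: number = 5, // arbitrary default chosen for this sketch
  path: string = 'plot',
): Record<string, unknown>[] {
  return [
    // Full-text match of the query against the indexed string field.
    { $search: { index: 'default_search', text: { query, path } } },
    { $limit: limit },
    // Return the matched text together with its relevance score.
    { $project: { [path]: 1, score: { $meta: 'searchScore' } } },
  ];
}

console.log(JSON.stringify(buildTextSearchPipeline('War in outer space'), null, 2));
```

The resulting array can be passed to collection.aggregate in the same way as the pipeline used to create the embeddings.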
6 - Query using an Atlas Vector Search Index.
This last option allows us to test that the creation of the embeddings and the Vector Search index is working as expected. For this option, we use the hardcoded input War in outer space and LangChain's MongoDBAtlasVectorSearch to return the result with the highest vector score.
If all previous steps worked fine, the result we will get is the following:
[
  Document {
    pageContent: "Near the end of the 20th century, WMDs (weapons of mass destruction) are retired. However, certain factions plan to use a science space station as a weapon against each other. The astronauts inside will decide the world's fate.",
    metadata: {
      _id: new ObjectId("573a1398f29313caabce8f9e"),
      namespace: 'sample_mflix.movies',
      type: 'movie',
      year: 1984
    }
  }
]
If you are using an M0 cluster to test this, please note that a Vector Search index named vs_euclidean must be created.
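The project runs this query through LangChain, but it can be useful to see roughly what an equivalent aggregation with the $vectorSearch stage might look like, including a pre-filter on the year and type fields we indexed as filters. This is a sketch under assumptions: buildVectorSearchPipeline is a hypothetical helper, the numCandidates value is arbitrary, and the query vector must come from embedding the question with the same model used for the documents (a placeholder is used here).

```typescript
// Hypothetical builder for a $vectorSearch pipeline. It does not run the query,
// and the repository itself uses LangChain's wrapper instead.
interface VectorSearchOptions {
  index: string;
  path: string;
  limit: number;
  numCandidates: number;
  filter?: Record<string, unknown>;
}

function buildVectorSearchPipeline(
  queryVector: number[],
  options: VectorSearchOptions,
): Record<string, unknown>[] {
  const { index, path, limit, numCandidates, filter } = options;
  if (numCandidates < limit) {
    throw new Error('numCandidates must be greater than or equal to limit');
  }
  const vectorSearch: Record<string, unknown> = { index, path, queryVector, numCandidates, limit };
  // Pre-filtering only works on fields declared with type "filter" in the index.
  if (filter) vectorSearch['filter'] = filter;
  return [
    { $vectorSearch: vectorSearch },
    { $project: { text: 1, year: 1, type: 1, score: { $meta: 'vectorSearchScore' } } },
  ];
}

const placeholderVector = [0.1, 0.2, 0.3]; // stands in for the embedding of "War in outer space"
const vectorPipeline = buildVectorSearchPipeline(placeholderVector, {
  index: 'vs_euclidean',
  path: 'plot_embedding',
  limit: 1,
  numCandidates: 50, // arbitrary value for this sketch
  filter: { type: { $eq: 'movie' }, year: { $gte: 1980 } },
});
console.log(JSON.stringify(vectorPipeline, null, 2));
```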
In part 2, we are going to cover the backend application written in TypeScript that we are going to use to display the results in the frontend.
Stay tuned!