RAG - Baseline implementation#

Overview#

In this part, we will put together the building blocks of a RAG solution:

  1. Create a Search Index

  2. Upload the data

  3. Perform a search

  4. Create a prompt

  5. Wire everything together

Goal#

The goal of this section is to get familiar with RAG in a hands-on way, so that later on we can experiment with its different aspects.

This will also represent a baseline for our RAG application.

Setup#

%%capture --no-display
%run -i ./pre-requisites.ipynb
%run -i ./helpers/search.ipynb

Import required libraries and environment variables#

import os
import json
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SearchField,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
    VectorSearch,
    HnswParameters
)
from azure.search.documents.indexes import SearchIndexClient
import os.path

import openai

openai.api_key = os.getenv("AZURE_OPENAI_KEY")
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_type = "azure"
openai.api_version = "2023-07-01-preview"

1. Create a Search Index#

For those familiar with relational databases, you can imagine that:

  • a search index is similar to a table,

  • the documents in the index are similar to the rows of that table,

  • the fields of the index are similar to the columns.

In our case, we would like to represent the following fields:

| Field              | Type            | Description                                                            |
|--------------------|-----------------|------------------------------------------------------------------------|
| ChunkId            | SimpleField     | The id of the chunk, in the form of source_document_name+chunk_number  |
| Source             | SimpleField     | The path to the source document                                        |
| ChunkContent       | SearchableField | The content of the chunk                                               |
| ChunkContentVector | SearchField     | The vectorized content of the chunk                                    |
Run the cell below to define a function which creates an index with the schema described above:

def create_index(search_index_name, service_endpoint, key):
    client = SearchIndexClient(service_endpoint, AzureKeyCredential(key))

    # 1. Define the fields
    fields = [
        SimpleField(
            name="chunkId",
            type=SearchFieldDataType.String,
            sortable=True,
            filterable=True,
            key=True,
        ),
        SimpleField(
            name="source",
            type=SearchFieldDataType.String,
            sortable=True,
            filterable=True,
        ),
        SearchableField(name="chunkContent", type=SearchFieldDataType.String),
        SearchField(
            name="chunkContentVector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,  # dimension of the embedded vectors (text-embedding-ada-002 produces 1536-dimensional embeddings)
            vector_search_profile_name="my-vector-config",
        ),
    ]

    # 2. Configure the vector search configuration
    vector_search = VectorSearch(
        profiles=[
            VectorSearchProfile(
                name="my-vector-config",
                algorithm_configuration_name="my-algorithms-config"
            )
        ],
        algorithms=[
            # Configuration options specific to the HNSW approximate nearest neighbors algorithm used during indexing and querying
            HnswAlgorithmConfiguration(
                name="my-algorithms-config",
                kind="hnsw",
                # https://learn.microsoft.com/en-us/python/api/azure-search-documents/azure.search.documents.indexes.models.hnswparameters?view=azure-python-preview#variables
                parameters=HnswParameters(
                    m=4,
                    # The size of the dynamic list containing the nearest neighbors, which is used during index time.
                    # Increasing this parameter may improve index quality, at the expense of increased indexing time.
                    ef_construction=400,
                    # The size of the dynamic list containing the nearest neighbors, which is used during search time.
                    # Increasing this parameter may improve search results, at the expense of slower search.
                    ef_search=500,
                    # The similarity metric to use for vector comparisons.
                    # Known values are: "cosine", "euclidean", and "dotProduct"
                    metric="cosine",
                ),
            )
        ],
    )

    index = SearchIndex(
        name=search_index_name,
        fields=fields,
        vector_search=vector_search,
    )

    result = client.create_or_update_index(index)
    print(f"Index: {result.name} created or updated")

Run the cell below to create the index. If the index already exists, it will be updated. Make sure to update the search_index_name variable to a unique name.

search_index_name = "first_index"
create_index(search_index_name, service_endpoint, search_index_key)
Index: first_index created or updated
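
If you want to double-check the result, you can read the index definition back from the service. This is an optional, minimal sketch that assumes the same service_endpoint and search_index_key variables used in the cell above:

# Optional sanity check (sketch): read the index definition back from the service
verify_client = SearchIndexClient(service_endpoint, AzureKeyCredential(search_index_key))
index_definition = verify_client.get_index(search_index_name)
print(f"Index '{index_definition.name}' has fields: {[f.name for f in index_definition.fields]}")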

2. Upload the Data to the Index#

2.1 Chunking#

Data ingestion requires special attention, as it can impact the outcome of the RAG solution. Which chunking strategy to use and what AI enrichment to perform are just a few of the considerations. Further discussion and experimentation will be done in Chapter 3. Experimentation - Chunking.

In this baseline setup, we have previously chunked the data into fixed-size chunks of 180 tokens with an overlap of 30%.

The chunks can be found here. You can take a look at the content of the file.
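
For illustration only, the sketch below shows what fixed-size chunking with overlap can look like. The pre-generated chunks used in this workshop were not necessarily produced this way, and the use of tiktoken here is an assumption:

# Illustrative sketch of fixed-size chunking with overlap.
# tiktoken is an assumption; the workshop uses pre-generated chunks.
import tiktoken

def fixed_size_chunks(text, chunk_size=180, overlap_percent=30):
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)

    # e.g. 180 tokens with 30% overlap -> step of 126 tokens between chunk starts
    overlap_tokens = chunk_size * overlap_percent // 100
    step = max(1, chunk_size - overlap_tokens)

    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(encoding.decode(window))
    return chunks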

2.2 Embedding#

Embedding the chunks into vectors can also be done in various ways. Further discussion and experimentation will be done in Chapter 3. Experimentation - Embedding.

In this baseline setup, we take a vanilla approach, where:

  • We use the embedding model from OpenAI, text-embedding-ada-002, since it is an obvious choice to start with

The outcome can be found here. You can take a look at the content of the file.
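
To give an idea of what the embedding step looks like, here is a minimal sketch of embedding a piece of text with the Azure OpenAI client. The variable azure_openai_embedding_deployment is an assumed name for the text-embedding-ada-002 deployment; later in this notebook we rely on the oai_query_embedding helper loaded from helpers/search.ipynb rather than on this sketch:

# Minimal embedding sketch. azure_openai_embedding_deployment is an assumed
# variable name for a text-embedding-ada-002 deployment; the helper actually
# used later is oai_query_embedding from helpers/search.ipynb.
from openai import AzureOpenAI

def embed_text(text: str) -> list[float]:
    client = AzureOpenAI(
        api_key=azure_openai_key,
        api_version=azure_openai_api_version,
        azure_endpoint=azure_aoai_endpoint,
    )
    response = client.embeddings.create(input=text, model=azure_openai_embedding_deployment)
    return response.data[0].embedding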

Let’s define the path to the embedded chunks:

chunk_size = 180
chunk_overlap = 30
path_to_embedded_chunks = f"./output/pre-generated/embeddings/fixed-size-chunks-{chunk_size}-{chunk_overlap}-batch-engineering-mlops-ada.json"
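
Each entry in that file must match the index schema defined earlier. A hypothetical, truncated entry looks roughly like this (values are illustrative, not taken from the actual file):

# Hypothetical example of a single entry in the embeddings file (values illustrative)
{
    "chunkId": "some_document_chunk_0",
    "source": "path/to/some_document.md",
    "chunkContent": "Text of the first 180-token chunk...",
    "chunkContentVector": [0.0123, -0.0456],  # in reality, 1536 floats from text-embedding-ada-002
}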

2.3 Upload the data to the Index#

def upload_data(file_path, search_index_name):
    try:
        with open(file_path, "r") as file:
            documents = json.load(file)

        search_client = SearchClient(
            endpoint=service_endpoint,
            index_name=search_index_name,
            credential=credential,
        )
        search_client.upload_documents(documents)
        print(
            f"Uploaded {len(documents)} documents to Index: {search_index_name}")
    except Exception as e:
        print(f"Error uploading documents: {e}")
upload_data(path_to_embedded_chunks, search_index_name)
Uploaded 3236 documents to Index: first_index
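
3. Perform a search#

In this workshop, the retrieval step is handled by the search_documents helper loaded from helpers/search.ipynb in the Setup step, so we do not re-implement it here. To make the flow explicit, the sketch below shows roughly what a pure vector search against our index could look like; the function name, the k value, and the selected fields are illustrative assumptions and not necessarily identical to the helper's implementation:

# Illustrative sketch of a pure vector search against the index.
# The helper actually used later is search_documents from helpers/search.ipynb;
# this sketch only shows the general shape of such a query.
def search_documents_sketch(query_embeddings, k=5):
    search_client = SearchClient(
        endpoint=service_endpoint,
        index_name=search_index_name,
        credential=credential,
    )

    vector_query = VectorizedQuery(
        vector=query_embeddings,
        k_nearest_neighbors=k,
        fields="chunkContentVector",
    )

    results = search_client.search(
        search_text=None,
        vector_queries=[vector_query],
        select=["chunkId", "source", "chunkContent"],
    )

    # Keep only the fields we want to pass into the prompt
    return [
        {"chunkId": r["chunkId"], "source": r["source"], "chunkContent": r["chunkContent"]}
        for r in results
    ]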

4. Create a prompt#

def create_prompt(query, documents):
    system_prompt = f"""

    Instructions:

    "You are an AI assistant that helps users answer questions given a specific context.
    You will be given a context (Retrieved Documents) and asked a question (User Question) based on that context.
    Your answer should be as precise as possible and should only come from the context.
    Please add citation after each sentence when possible in a form "(Source: source+chunkId),
    where both 'source' and 'chunkId' are taken from the Retrieved Documents."
    """

    user_prompt = f"""
    ## Retrieved Documents:
    {documents}

    ## User Question
    {query}
    """

    final_message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt + "\nEND OF CONTEXT"},
    ]

    return final_message
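
As a quick illustration (with a made-up retrieved document), create_prompt returns the two-message list that the Chat Completions API expects:

# Quick illustration with a made-up retrieved document
example_messages = create_prompt(
    "What does the develop phase include?",
    [{"chunkId": "example_chunk_0", "source": "docs/example.md", "chunkContent": "The develop phase includes ..."}],
)
print(example_messages[0]["role"], "->", example_messages[1]["role"])  # system -> user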

Create a function to call the Chat Completion endpoint#

For this, we will use the OpenAI Python library:

from openai import AzureOpenAI


def call_llm(messages: list[dict]):
    client = AzureOpenAI(
        api_key=azure_openai_key,
        api_version=azure_openai_api_version,
        azure_endpoint=azure_aoai_endpoint
    )

    response = client.chat.completions.create(
        model=azure_openai_chat_deployment, messages=messages)
    return response.choices[0].message.content

5. Finally, put all the pieces together#

Note: Usually in a RAG solution there is an intent extraction step. However, since we are building a QA system rather than a chat application, in this workshop we assume that the intent is the query itself.

def custom_rag_solution(query):
    try:
        # 1. Embed the query using the same embedding model as your data in the Index
        query_embeddings = oai_query_embedding(query)

        # Intent recognition - skipped in our workshop

        # 2. Search for relevant documents
        search_response = search_documents(query_embeddings)

        # 3. Create the prompt with the query and the retrieved documents
        prompt_from_chunk_context = create_prompt(query, search_response)

        # 4. Call the Azure OpenAI GPT model
        response = call_llm(prompt_from_chunk_context)
        return response

    except Exception as e:
        print(f"Error: {e}")

Try it out#

query = "What does the develop phase include?"
print(f"User question: {query}")

response = custom_rag_solution(query)
print(f"Response: {response}")
User question: What does the develop phase include?
Response: The develop phase includes designing the interface, which involves creating method signatures and names, writing documentation for the methods, and making architecture decisions that would influence testing (Source: code-with-engineering/agile-development/advanced-topics/collaboration/virtual-collaboration.md, chunk16_3).

Perfect! This answer seems to make sense.

Now… what?#

  • Is this good enough?

  • What does good enough even mean?

  • How can I prove that this works as expected?

  • What does works as expected even mean?!

Let’s go to Chapter 3. Experimentation, to try to tackle these questions.