RAG - Baseline implementation#

Overview#

In this part, we will put together the building blocks of a RAG solution:

  1. Create a Search Index

  2. Upload the data

  3. Perform a search

  4. Create a prompt

  5. Wire everything together

Goal#

The goal of this section is to get familiar with RAG in a hands-on way, so that later on we can experiment with its different aspects.

This will also represent a baseline for our RAG application.

Setup#

%%capture --no-display
%run -i ./pre-requisites.ipynb
%run -i ./helpers/search.ipynb

Import required libraries and environment variables#

import os
import json
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SearchField,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
    VectorSearch,
    HnswParameters
)
from azure.search.documents.indexes import SearchIndexClient
import os.path

import openai

openai.api_key = os.getenv("AZURE_OPENAI_KEY")
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_type = "azure"
openai.api_version = "2023-07-01-preview"

1. Create a Search Index#

For those familiar with relational databases, you can imagine that:

  • a search index is similar to a table,

  • the documents in the index are similar to the rows of that table,

  • the fields of the index are similar to the columns.

In our case, we would like to represent the following fields:

| Field              | Type            | Description                                                            |
|--------------------|-----------------|------------------------------------------------------------------------|
| ChunkId            | SimpleField     | The id of the chunk, in the form of source_document_name+chunk_number  |
| Source             | SimpleField     | The path to the source document                                        |
| ChunkContent       | SearchableField | The content of the chunk                                               |
| ChunkContentVector | SearchField     | The vectorized content of the chunk                                    |
Run the cell below to define a function which creates an index with the schema described above:

def create_index(search_index_name, service_endpoint, key):
    client = SearchIndexClient(service_endpoint, AzureKeyCredential(key))

    # 1. Define the fields
    fields = [
        SimpleField(
            name="chunkId",
            type=SearchFieldDataType.String,
            sortable=True,
            filterable=True,
            key=True,
        ),
        SimpleField(
            name="source",
            type=SearchFieldDataType.String,
            sortable=True,
            filterable=True,
        ),
        SearchableField(name="chunkContent", type=SearchFieldDataType.String),
        SearchField(
            name="chunkContentVector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,  # dimension of the embedded vectors (text-embedding-ada-002 produces 1536-dimensional embeddings)
            vector_search_profile_name="my-vector-config",
        ),
    ]

    # 2. Configure the vector search configuration
    vector_search = VectorSearch(
        profiles=[
            VectorSearchProfile(
                name="my-vector-config",
                algorithm_configuration_name="my-algorithms-config"
            )
        ],
        algorithms=[
            # Configuration options specific to the HNSW approximate nearest neighbors algorithm used during indexing and querying
            HnswAlgorithmConfiguration(
                name="my-algorithms-config",
                kind="hnsw",
                # https://learn.microsoft.com/en-us/python/api/azure-search-documents/azure.search.documents.indexes.models.hnswparameters?view=azure-python-preview#variables
                parameters=HnswParameters(
                    m=4,
                    # The size of the dynamic list containing the nearest neighbors, which is used during index time.
                    # Increasing this parameter may improve index quality, at the expense of increased indexing time.
                    ef_construction=400,
                    # The size of the dynamic list containing the nearest neighbors, which is used during search time.
                    # Increasing this parameter may improve search results, at the expense of slower search.
                    ef_search=500,
                    # The similarity metric to use for vector comparisons.
                    # Known values are: "cosine", "euclidean", and "dotProduct"
                    metric="cosine",
                ),
            )
        ],
    )

    index = SearchIndex(
        name=search_index_name,
        fields=fields,
        vector_search=vector_search,
    )

    result = client.create_or_update_index(index)
    print(f"Index: {result.name} created or updated")

Run the cell below to create the index. If the index already exists, it will be updated. Make sure to update the search_index_name variable to a unique name.

search_index_name = "first_index"
create_index(search_index_name, service_endpoint, search_index_key)
Index: first_index created or updated
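
If you want to double-check the result, you can read the index definition back from the service. This is an optional, minimal sketch that assumes the same service_endpoint and search_index_key variables used in the cell above:

# Optional sanity check (sketch): read the index definition back from the service
verify_client = SearchIndexClient(service_endpoint, AzureKeyCredential(search_index_key))
index_definition = verify_client.get_index(search_index_name)
print(f"Index '{index_definition.name}' has fields: {[f.name for f in index_definition.fields]}")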

2. Upload the Data to the Index#

2.1 Chunking#

Data ingestion requires special attention, as it can impact the outcome of the RAG solution. Which chunking strategy to use and what AI enrichment to perform are just a few of the considerations. Further discussion and experimentation will be done in Chapter 3. Experimentation - Chunking.

In this baseline setup, we have previously chunked the data into fixed-size chunks of 180 tokens with an overlap of 30%.

The chunks can be found here. You can take a look at the content of the file.
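
For illustration only, the sketch below shows what fixed-size chunking with overlap can look like. The pre-generated chunks used in this workshop were not necessarily produced this way, and the use of tiktoken here is an assumption:

# Illustrative sketch of fixed-size chunking with overlap.
# tiktoken is an assumption; the workshop uses pre-generated chunks.
import tiktoken

def fixed_size_chunks(text, chunk_size=180, overlap_percent=30):
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)

    # e.g. 180 tokens with 30% overlap -> step of 126 tokens between chunk starts
    overlap_tokens = chunk_size * overlap_percent // 100
    step = max(1, chunk_size - overlap_tokens)

    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(encoding.decode(window))
    return chunks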

2.2 Embedding#

Embedding the chunks into vectors can also be done in various ways. Further discussion and experimentation will be done in Chapter 3. Experimentation - Embedding.

In this baseline setup, we take a vanilla approach, where:

  • We use the embedding model from OpenAI, text-embedding-ada-002, since it is an obvious choice to start with

The outcome can be found here. You can take a look at the content of the file.
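
To give an idea of what the embedding step looks like, here is a minimal sketch of embedding a piece of text with the Azure OpenAI client. The variable azure_openai_embedding_deployment is an assumed name for the text-embedding-ada-002 deployment; later in this notebook we rely on the oai_query_embedding helper loaded from helpers/search.ipynb rather than on this sketch:

# Minimal embedding sketch. azure_openai_embedding_deployment is an assumed
# variable name for a text-embedding-ada-002 deployment; the helper actually
# used later is oai_query_embedding from helpers/search.ipynb.
from openai import AzureOpenAI

def embed_text(text: str) -> list[float]:
    client = AzureOpenAI(
        api_key=azure_openai_key,
        api_version=azure_openai_api_version,
        azure_endpoint=azure_aoai_endpoint,
    )
    response = client.embeddings.create(input=text, model=azure_openai_embedding_deployment)
    return response.data[0].embedding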

Let’s define the path to the embedded chunks:

chunk_size = 180
chunk_overlap = 30
path_to_embedded_chunks = f"./output/pre-generated/embeddings/fixed-size-chunks-{chunk_size}-{chunk_overlap}-batch-engineering-mlops-ada.json"
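
Each entry in that file must match the index schema defined earlier. A hypothetical, truncated entry looks roughly like this (values are illustrative, not taken from the actual file):

# Hypothetical example of a single entry in the embeddings file (values illustrative)
{
    "chunkId": "some_document_chunk_0",
    "source": "path/to/some_document.md",
    "chunkContent": "Text of the first 180-token chunk...",
    "chunkContentVector": [0.0123, -0.0456],  # in reality, 1536 floats from text-embedding-ada-002
}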

2.3 Upload the data to the Index#

def upload_data(file_path, search_index_name):
    try:
        with open(file_path, "r") as file:
            documents = json.load(file)

        search_client = SearchClient(
            endpoint=service_endpoint,
            index_name=search_index_name,
            credential=credential,
        )
        search_client.upload_documents(documents)
        print(
            f"Uploaded {len(documents)} documents to Index: {search_index_name}")
    except Exception as e:
        print(f"Error uploading documents: {e}")
upload_data(path_to_embedded_chunks, search_index_name)
Uploaded 3236 documents to Index: first_index
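
3. Perform a search#

In this workshop, the retrieval step is handled by the search_documents helper loaded from helpers/search.ipynb in the Setup step, so we do not re-implement it here. To make the flow explicit, the sketch below shows roughly what a pure vector search against our index could look like; the function name, the k value, and the selected fields are illustrative assumptions and not necessarily identical to the helper's implementation:

# Illustrative sketch of a pure vector search against the index.
# The helper actually used later is search_documents from helpers/search.ipynb;
# this sketch only shows the general shape of such a query.
def search_documents_sketch(query_embeddings, k=5):
    search_client = SearchClient(
        endpoint=service_endpoint,
        index_name=search_index_name,
        credential=credential,
    )

    vector_query = VectorizedQuery(
        vector=query_embeddings,
        k_nearest_neighbors=k,
        fields="chunkContentVector",
    )

    results = search_client.search(
        search_text=None,
        vector_queries=[vector_query],
        select=["chunkId", "source", "chunkContent"],
    )

    # Keep only the fields we want to pass into the prompt
    return [
        {"chunkId": r["chunkId"], "source": r["source"], "chunkContent": r["chunkContent"]}
        for r in results
    ]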

4. Create a prompt#

def create_prompt(query, documents):
    system_prompt = f"""

    Instructions:

    "You are an AI assistant that helps users answer questions given a specific context.
    You will be given a context (Retrieved Documents) and asked a question (User Question) based on that context.
    Your answer should be as precise as possible and should only come from the context.
    Please add citation after each sentence when possible in a form "(Source: source+chunkId),
    where both 'source' and 'chunkId' are taken from the Retrieved Documents."
    """

    user_prompt = f"""
    ## Retrieved Documents:
    {documents}

    ## User Question
    {query}
    """

    final_message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt + "\nEND OF CONTEXT"},
    ]

    return final_message
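
As a quick illustration (with a made-up retrieved document), create_prompt returns the two-message list that the Chat Completions API expects:

# Quick illustration with a made-up retrieved document
example_messages = create_prompt(
    "What does the develop phase include?",
    [{"chunkId": "example_chunk_0", "source": "docs/example.md", "chunkContent": "The develop phase includes ..."}],
)
print(example_messages[0]["role"], "->", example_messages[1]["role"])  # system -> user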

Create a function to call the Chat Completion endpoint#

For this, we will use the OpenAI Python library:

from openai import AzureOpenAI


def call_llm(messages: list[dict]):
    client = AzureOpenAI(
        api_key=azure_openai_key,
        api_version=azure_openai_api_version,
        azure_endpoint=azure_aoai_endpoint
    )

    response = client.chat.completions.create(
        model=azure_openai_chat_deployment, messages=messages)
    return response.choices[0].message.content

5. Finally, put all the pieces together#

Note: Usually in a RAG solution there is an intent extraction step. However, since we are building a QA system rather than a chat application, in this workshop we assume that the intent is the query itself.

def custom_rag_solution(query):
    try:
        # 1. Embed the query using the same embedding model as your data in the Index
        query_embeddings = oai_query_embedding(query)

        # Intent recognition - skipped in our workshop

        # 2. Search for relevant documents
        search_response = search_documents(query_embeddings)

        # 3. Create the prompt with the query and the retrieved documents
        prompt_from_chunk_context = create_prompt(query, search_response)

        # 4. Call the Azure OpenAI GPT model
        response = call_llm(prompt_from_chunk_context)
        return response

    except Exception as e:
        print(f"Error: {e}")

Try it out#

query = "What does the develop phase include?"
print(f"User question: {query}")

response = custom_rag_solution(query)
print(f"Response: {response}")
User question: What does the develop phase include?
Response: The develop phase includes designing the interface, which involves creating method signatures and names, writing documentation for the methods, and making architecture decisions that would influence testing (Source: code-with-engineering/agile-development/advanced-topics/collaboration/virtual-collaboration.md, chunk16_3).

Perfect! This answer seems to make sense.

Now… what?#

  • Is this good enough?

  • What does good enough even mean?

  • How can I prove that this works as expected?

  • What does works as expected even mean?!

Let’s go to Chapter 3. Experimentation, to try to tackle these questions.