# RAG - Baseline implementation

## Overview

In this part, we will assemble the building blocks of a RAG solution:

1. Create a Search Index
2. Upload the data
3. Perform a search
4. Create a prompt
5. Wire everything together

<!-- To create the index we need the following objects:

- Data Source - a `link` to some data storage
- Azure Index - defines the data structure over which to search
  - Create an empty index based on an index schema
  - Fill in the data using the Search Indexer (below\_)
- Azure Search Indexer - which acts as a crawler that retrieves data from external sources, can also trigger skillsets (Optical Character Recognition) -->

## Goal

The goal of this section is to familiarize yourself with RAG in a hands-on way, so that later on we can experiment with different aspects.

This will also represent a baseline for our RAG application.

## Setup

<!-- First, we install the necessary dependencies.
https://github.com/openai/openai-cookbook/blob/main/examples/azure/chat_with_your_own_data.ipynb -->


In [1]:
%%capture --no-display
%run -i ./pre-requisites.ipynb
%run -i ./helpers/search.ipynb

### Import required libraries and environment variables


In [2]:
import os
import json
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SearchField,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
    VectorSearch,
    HnswParameters
)
from azure.search.documents.indexes import SearchIndexClient

import openai

openai.api_key = os.getenv("AZURE_OPENAI_KEY")
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_type = "azure"
openai.api_version = "2023-07-01-preview"

## 1. Create a Search Index

<!-- https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/search/azure-search-documents/samples/sample_index_crud_operations.py

https://github.com/microsoft/rag-experiment-accelerator/blob/development/rag_experiment_accelerator/init_Index/create_index.py

Used for overall Fields and Semantic Settings inspiration - https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/azure-search-vector-python-huggingface-model-sample.ipynb

Used for SearchField inspiration - https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/search/azure-search-documents/samples/sample_vector_search.py -->

For those familiar with relational databases, you can imagine that:

- A (search) index ~= A table
  - it describes the [schema of your data](https://learn.microsoft.com/en-us/azure/search/search-what-is-an-index#schema-of-a-search-index)
  - it consists of [`field definitions`](https://learn.microsoft.com/en-us/azure/search/search-what-is-an-index#field-definitions) described by [`field attributes`](https://learn.microsoft.com/en-us/azure/search/search-what-is-an-index#field-attributes) (searchable, filterable, sortable etc)
- A (search) document ~= A row in your table

In our case, we would like to represent the following:

| Field                | Field class     | Description                                                             |
| -------------------- | --------------- | ----------------------------------------------------------------------- |
| `chunkId`            | SimpleField     | The id of the chunk, in the form of `source_document_name+chunk_number` |
| `source`             | SimpleField     | The path to the source document                                         |
| `chunkContent`       | SearchableField | The content of the chunk                                                |
| `chunkContentVector` | SearchField     | The vectorized content of the chunk                                     |
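
To make the "document ~= row" analogy concrete, here is a minimal sketch of what a single document for this schema could look like as a Python dictionary. The field names follow the camelCase names used in the index definition code; the values are made up for illustration and do not come from the workshop data.

```python
# Hypothetical sample document; all values are invented for illustration.
EMBEDDING_DIMENSIONS = 1536  # same value as vector_search_dimensions in the index

sample_document = {
    "chunkId": "chunk0_0",                               # key field, unique per chunk
    "source": "docs/example.md",                         # path to the source document
    "chunkContent": "Example chunk text.",               # full-text searchable content
    "chunkContentVector": [0.0] * EMBEDDING_DIMENSIONS,  # placeholder embedding
}

print(sorted(sample_document))
print(len(sample_document["chunkContentVector"]))
```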


Run the cell below to define a function that creates an index with the schema described above:


In [3]:
def create_index(search_index_name, service_endpoint, key):
    client = SearchIndexClient(service_endpoint, AzureKeyCredential(key))

    # 1. Define the fields
    fields = [
        SimpleField(
            name="chunkId",
            type=SearchFieldDataType.String,
            sortable=True,
            filterable=True,
            key=True,
        ),
        SimpleField(
            name="source",
            type=SearchFieldDataType.String,
            sortable=True,
            filterable=True,
        ),
        SearchableField(name="chunkContent", type=SearchFieldDataType.String),
        SearchField(
            name="chunkContentVector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,  # the dimension of the embedded vector
            vector_search_profile_name="my-vector-config",
        ),
    ]

    # 2. Configure the vector search configuration
    vector_search = VectorSearch(
        profiles=[
            VectorSearchProfile(
                name="my-vector-config",
                algorithm_configuration_name="my-algorithms-config"
            )
        ],
        algorithms=[
            # Configuration options specific to the HNSW approximate nearest neighbors
            # algorithm, used during indexing and querying
            HnswAlgorithmConfiguration(
                name="my-algorithms-config",
                kind="hnsw",
                # https://learn.microsoft.com/en-us/python/api/azure-search-documents/azure.search.documents.indexes.models.hnswparameters?view=azure-python-preview#variables
                parameters=HnswParameters(
                    m=4,
                    # The size of the dynamic list containing the nearest neighbors, which is used during index time.
                    # Increasing this parameter may improve index quality, at the expense of increased indexing time.
                    ef_construction=400,
                    # The size of the dynamic list containing the nearest neighbors, which is used during search time.
                    # Increasing this parameter may improve search results, at the expense of slower search.
                    ef_search=500,
                    # The similarity metric to use for vector comparisons.
                    # Known values are: "cosine", "euclidean", and "dotProduct"
                    metric="cosine",
                ),
            )
        ],
    )

    index = SearchIndex(
        name=search_index_name,
        fields=fields,
        vector_search=vector_search,
    )

    result = client.create_or_update_index(index)
    print(f"Index: {result.name} created or updated")

Run the cell below to create the index. If the index already exists, it will be updated. Make sure to set the `search_index_name` variable to a unique name.


In [4]:
search_index_name = "first_index"
create_index(search_index_name, service_endpoint, search_index_key)

Index: first_index created or updated


## 2. Upload the Data to the Index

### 2.1 Chunking

Data ingestion requires special attention, as it can impact the outcome of the RAG solution. Which chunking strategy to use and which AI enrichment to perform are just a few of the considerations. Further discussion and experimentation will be done in `Chapter 3. Experimentation - Chunking`.

In this baseline setup, we have previously chunked the data based on a fixed size (180 tokens) and overlap of 30%.

The chunks can be found [here](./output/pre-generated/chunking/fixed-size-chunks-engineering-mlops-180-30.json). You can take a look at the content of the file.
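
For intuition, fixed-size chunking with overlap can be sketched as a sliding window over a token list. This is not the code that produced the pre-generated file: the function name `fixed_size_chunks` is hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
def fixed_size_chunks(tokens, chunk_size, overlap_ratio):
    """Split a token list into windows of chunk_size that share overlap_ratio of their tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = round(chunk_size * overlap_ratio)  # e.g. 180 * 0.3 -> 54 tokens
    step = max(1, chunk_size - overlap)          # how far the window advances

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Assumption: words approximate tokens for this toy example.
tokens = "one two three four five six seven eight nine ten".split()
chunks = fixed_size_chunks(tokens, chunk_size=4, overlap_ratio=0.5)
for chunk in chunks:
    print(" ".join(chunk))
```

With the baseline settings (180 tokens, 30% overlap), this sketch would advance the window by 126 tokens each step.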


### 2.2 Embedding

Embedding the chunks into vectors can also be done in various ways. Further discussion and experimentation will be done in `Chapter 3. Experimentation - Embedding`.

In this baseline setup, we take a vanilla approach:

- We use the embedding model from OpenAI, [`text-embedding-ada-002`](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings), since it is an obvious choice to start with

The outcome can be found [here](./output/pre-generated/embeddings/fixed-size-chunks-180-30-batch-engineering-mlops-ada.json). You can take a look at the content of the file.
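
Before uploading, it can be useful to check that every record carries the fields the index expects and that each vector has the dimension configured in the index (1536). The sketch below is one possible check; the helper name `find_invalid_chunks` is hypothetical, and it assumes the JSON records use the same field names as the index, since they are uploaded as-is.

```python
REQUIRED_FIELDS = ("chunkId", "source", "chunkContent", "chunkContentVector")


def find_invalid_chunks(documents, expected_dimensions=1536):
    """Return (position, reason) pairs for records that do not match the index schema."""
    problems = []
    for position, document in enumerate(documents):
        missing = [field for field in REQUIRED_FIELDS if field not in document]
        if missing:
            problems.append((position, f"missing fields: {missing}"))
            continue
        vector_length = len(document["chunkContentVector"])
        if vector_length != expected_dimensions:
            problems.append((position, f"vector has {vector_length} dimensions"))
    return problems


# Toy records with 3-dimensional vectors, for illustration only.
toy_documents = [
    {"chunkId": "chunk0_0", "source": "a.md", "chunkContent": "ok", "chunkContentVector": [0.1, 0.2, 0.3]},
    {"chunkId": "chunk0_1", "source": "a.md", "chunkContent": "short vector", "chunkContentVector": [0.1]},
    {"chunkId": "chunk0_2", "source": "a.md"},
]
problems = find_invalid_chunks(toy_documents, expected_dimensions=3)
print(problems)
```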


Let's define the path to the embedded chunks:


In [5]:
chunk_size = 180
chunk_overlap = 30
path_to_embedded_chunks = f"./output/pre-generated/embeddings/fixed-size-chunks-{chunk_size}-{chunk_overlap}-batch-engineering-mlops-ada.json"

### 2.3 Upload the data to the Index

<!-- https://github.com/microsoft/rag-experiment-accelerator/blob/development/rag_experiment_accelerator/ingest_data/acs_ingest.py -->


In [6]:
def upload_data(file_path, search_index_name):
    try:
        with open(file_path, "r") as file:
            documents = json.load(file)

        search_client = SearchClient(
            endpoint=service_endpoint,
            index_name=search_index_name,
            credential=credential,
        )
        search_client.upload_documents(documents)
        print(
            f"Uploaded {len(documents)} documents to Index: {search_index_name}")
    except Exception as e:
        print(f"Error uploading documents: {e}")

In [7]:
upload_data(path_to_embedded_chunks, search_index_name)

Uploaded 3236 documents to Index: first_index
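
The cell above sends all documents in a single call. If you prefer smaller requests, for example to report progress or to retry only a failed part, the list can be split into batches first. This is a sketch under that assumption; `iter_batches` is a hypothetical helper, not part of the Azure SDK.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy documents for illustration only.
documents = [{"chunkId": f"chunk0_{i}"} for i in range(7)]
batch_sizes = [len(batch) for batch in iter_batches(documents, batch_size=3)]
print(batch_sizes)
```

Each batch could then be passed to `search_client.upload_documents(batch)`, and the `succeeded` flag of the returned results inspected to spot documents that were rejected.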


## 3. Perform a vector search

<!-- https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167 -->

<!-- There are two layers of execution: retrieval and ranking.

- Retrieval - also called L1, has the goal to quickly find all the documents from the index that satisfy the search criteria (possibly across millions or billions of documents). These are scored to pick the top few (typically in order of 50) to return to the user or to feed the next layer. Azure AI Search supports three different models:

  - Keyword: Uses traditional full-text search methods – content is broken into terms through language-specific text analysis, inverted indexes are created for fast retrieval, and the BM25 probabilistic model is used for scoring.

  - Vector: Documents are converted from text to vector representations using an embedding model. Retrieval is performed by generating a query embedding and finding the documents whose vectors are closest to the query’s. We used Azure Open AI text-embedding-ada-002 (Ada-002) embeddings and cosine similarity for all our tests in this post.
  - Hybrid: Performs both keyword and vector retrieval and applies a fusion step to select the best results from each technique. Azure AI Search currently uses Reciprocal Rank Fusion (RRF) to produce a single result set.

- Ranking – also called L2, takes a subset of the top L1 results and computes higher quality relevance scores to reorder the result set. The L2 can improve the L1's ranking because it applies more computational power to each result. The L2 ranker can only reorder what the L1 already found – if the L1 missed an ideal document, the L2 can't fix that. L2 ranking is critical for RAG applications to make sure the best results are in the top positions.
  - Semantic ranking is performed by Azure AI Search's L2 ranker which utilizes multi-lingual, deep learning models adapted from Microsoft Bing. The Semantic ranker can rank the top 50 results from the L1.

https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167 -->

There are [various types of search](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/use-your-data?tabs=ai-search#search-options) that one can perform, such as keyword, semantic, vector, and hybrid search. Since we generated embeddings for our chunks and would like to leverage the power of vector search, this baseline solution performs a simple vector search.
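
Conceptually, a vector search ranks documents by how close their vectors are to the query vector. The index configured earlier uses HNSW, an approximate nearest neighbors algorithm, with the cosine metric. The sketch below is not what the service runs: it is an exact, brute-force version on toy 2-dimensional vectors, meant only to show what "closest by cosine similarity" means. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=3):
    """Score every document against the query and return the k best (score, chunkId) pairs."""
    scored = [
        (cosine_similarity(query_vector, document["chunkContentVector"]), document["chunkId"])
        for document in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors for illustration; real embeddings have 1536 dimensions.
toy_documents = [
    {"chunkId": "a", "chunkContentVector": [1.0, 0.0]},
    {"chunkId": "b", "chunkContentVector": [0.0, 1.0]},
    {"chunkId": "c", "chunkContentVector": [1.0, 1.0]},
]
best = top_k([1.0, 0.1], toy_documents, k=2)
print([chunk_id for _, chunk_id in best])
```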

<!-- Further discussion and experimentation will be done in `Chapter 3. Experimentation - Search` -->


### Perform a vector similarity search


In [8]:
def search_documents(query_embeddings):
    search_client = SearchClient(
        service_endpoint, search_index_name,
        credential=credential
    )

    vector_query = VectorizedQuery(
        vector=query_embeddings, k_nearest_neighbors=3, fields="chunkContentVector"
    )

    results = search_client.search(
        search_text=None,
        vector_queries=[vector_query],
        select=["chunkContent", "chunkId", "source"],
    )

    documents = []
    for document in results:
        item = {}
        item["chunkContent"] = document["chunkContent"]
        item["source"] = document["source"]
        item["chunkId"] = document["chunkId"]
        documents.append(item)

    return documents

Run the `search_documents` function to find the documents most similar to a given query.


In [9]:
query = "What does the develop phase include"
embedded_query = oai_query_embedding(query)
search_documents(embedded_query)

[{'chunkContent': 'Steps\n\nDesign Phase: Both developers design the interface together. This includes:\n\nMethod signatures and names\nWriting documentation or docstrings for what the methods are intended to do.\nArchitecture decisions that would influence testing (Factory patterns, etc.)\n\nImplementation Phase: The developers separate and parallelize work, while continuing to communicate.\n\nDeveloper A will design the implementation of the methods, adhering to the previously decided design.\nDeveloper B will concurrently write tests for the same method signatures, without knowing details of the implementation.\n\nIntegration & Testing Phase: Both developers commit their code and run the tests.',
  'source': '..\\data\\docs\\code-with-engineering\\agile-development\\advanced-topics\\collaboration\\virtual-collaboration.md',
  'chunkId': 'chunk16_3'},
 {'chunkContent': 'In order to minimize the risk and set the expectations on the right way for all parties, an identification phase is

## 4. Create a prompt


In [10]:
def create_prompt(query, documents):
    system_prompt = """

    Instructions:

    You are an AI assistant that helps users answer questions given a specific context.
    You will be given a context (Retrieved Documents) and asked a question (User Question) based on that context.
    Your answer should be as precise as possible and should only come from the context.
    Please add a citation after each sentence when possible, in the form "(Source: source+chunkId)",
    where both 'source' and 'chunkId' are taken from the Retrieved Documents.
    """

    user_prompt = f"""
    ## Retrieved Documents
    {documents}

    ## User Question
    {query}
    """

    final_message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt + "\nEND OF CONTEXT"},
    ]

    return final_message
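
Note that `create_prompt` interpolates the Python list directly, so the model receives the list's printed representation, including escaped newlines. One possible alternative, shown here only as a sketch, is to render each retrieved document as a labeled text block first. The helper name `format_documents` and the separator are assumptions, not part of the workshop code.

```python
def format_documents(documents):
    """Render retrieved documents as labeled text blocks separated by a divider."""
    blocks = []
    for document in documents:
        blocks.append(
            f"Source: {document['source']}\n"
            f"ChunkId: {document['chunkId']}\n"
            f"Content: {document['chunkContent']}"
        )
    return "\n\n---\n\n".join(blocks)


# Toy search results for illustration only.
toy_results = [
    {"chunkId": "chunk0_0", "source": "a.md", "chunkContent": "First line\nSecond line"},
    {"chunkId": "chunk0_1", "source": "b.md", "chunkContent": "Another chunk"},
]
context = format_documents(toy_results)
print(context)
```

Whether this formatting improves answers is something to measure rather than assume.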

### Create a function to call the Chat Completion endpoint

For this, we will use [OpenAI library for Python](https://github.com/openai/openai-python):

<!-- https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/migration?tabs=python-new%2Cdalle-fix#chat-completions
https://learn.microsoft.com/en-us/azure/ai-services/openai/reference?WT.mc_id=AZ-MVP-5004796#completions -->


In [11]:
from openai import AzureOpenAI


def call_llm(messages: list[dict]):
    client = AzureOpenAI(
        api_key=azure_openai_key,
        api_version=azure_openai_api_version,
        azure_endpoint=azure_aoai_endpoint
    )

    response = client.chat.completions.create(
        model=azure_openai_chat_deployment, messages=messages)
    return response.choices[0].message.content

## 5. Finally, put all the pieces together

Note: A RAG solution usually includes an intent extraction step.
However, since we are building a QA system rather than a chat, in this workshop we assume that the query itself is the intent.


In [12]:
def custom_rag_solution(query):
    try:
        # 1. Embed the query using the same embedding model as your data in the Index
        query_embeddings = oai_query_embedding(query)

        # Intent recognition - skipped in our workshop

        # 2. Search for relevant documents
        search_response = search_documents(query_embeddings)

        # 3. Create prompt with the query and retrieved documents
        prompt_from_chunk_context = create_prompt(query, search_response)

        # 4. Call the Azure OpenAI GPT model
        response = call_llm(prompt_from_chunk_context)
        return response

    except Exception as e:
        print(f"Error: {e}")

## Try it out


In [14]:
query = "What does the develop phase include?"
print(f"User question: {query}")

response = custom_rag_solution(query)
print(f"Response: {response}")

User question: What does the develop phase include?
Response: The development phase includes designing the interface together, which involves creating method signatures and names, writing documentation or docstrings for the methods, and making architecture decisions that would influence testing (Factory patterns, etc.). It also involves separating and parallelizing the work, with one developer designing the implementation of the methods and the other writing tests for the same method signatures without knowing the implementation details. Finally, in the integration and testing phase, both developers commit their code and run the tests. (Source: ..\data\docs\code-with-engineering\agile-development\advanced-topics\collaboration\virtual-collaboration.md, chunkId: chunk16_3)


Perfect! **This answer** seems to make sense.
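
One lightweight sanity check is to confirm that every chunk id cited in the answer was actually among the retrieved documents. The sketch below assumes chunk ids look like `chunk16_3`, as in the output above; the pattern and the function names are hypothetical and would need adjusting for other id formats.

```python
import re

# Assumption: chunk ids follow the "chunk<number>_<number>" form seen in the output.
CHUNK_ID_PATTERN = re.compile(r"chunk\d+_\d+")


def cited_chunk_ids(response):
    """Return the set of chunk ids mentioned anywhere in the model response."""
    return set(CHUNK_ID_PATTERN.findall(response))


def unknown_citations(response, documents):
    """Return cited chunk ids that are not among the retrieved documents."""
    retrieved = {document["chunkId"] for document in documents}
    return cited_chunk_ids(response) - retrieved


# Toy data for illustration only.
retrieved_documents = [{"chunkId": "chunk16_3"}, {"chunkId": "chunk2_1"}]
grounded = "Both developers commit their code. (Source: a.md, chunkId: chunk16_3)"
suspicious = "Something else entirely. (Source: b.md, chunkId: chunk99_1)"

print(unknown_citations(grounded, retrieved_documents))
print(unknown_citations(suspicious, retrieved_documents))
```

An empty result only shows that the citations point at retrieved chunks; it does not show that the answer is correct.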

## Now... what?

- Is this _good enough_?
- What does _good enough_ even mean?
- How can I prove that this works _as expected_?
- What does _works as expected_ even mean?!

Let's go to `Chapter 3. Experimentation`, to try to tackle these questions.
