# 3.1. Embeddings Experiment

According to [multiple estimates](https://mitsloan.mit.edu/ideas-made-to-matter/tapping-power-unstructured-data), 80% of the data generated by businesses today is unstructured: text, images, or audio. This data has enormous potential for machine learning applications, but there is _some_ work to be done before it can be used directly. [Embeddings](https://medium.com/analytics-vidhya/introduction-to-word-embeddings-c2ba135dce2f) are the backbone of our system. Our goal is to understand how different embedding models affect the results returned for a given query.

Which embedding model should you use? Glad you asked! There are several options available:

1. [OpenAI models](https://openai.com/blog/new-embedding-models-and-api-updates?ref=haihai.ai), such as: [text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings), text-embedding-3-small, text-embedding-3-large
2. Open source models, which you can find at [Hugging Face](https://huggingface.co/models). The [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) ranks the performance of embedding models along several axes, though not all models can be run locally.

<!-- relevancy of -->

## Experiment Overview

| **Topic**                 | Description |
| ------------------------- | ----------- |
| 📝 **Hypothesis**         | Exploratory hypothesis: "Can introducing a new word embedding method improve the system's performance?" |
| ⚖️ **Comparison**         | We will compare **text-embedding-ada-002** (from OpenAI) and **intfloat/e5-small-v2** (open source). |
| 🎯 **Evaluation Metrics** | We will look at Accuracy and Cosine Similarity to compare the performance. |
| 📊 **Data**               | The data consists of the [code-with-engineering](../data/docs/code-with-engineering/) and [code-with-mlops](../data/docs/code-with-mlops/) sections of the Solution Ops repository, pre-chunked into chunks of 180 tokens with 30% overlap: [fixed-size-chunks-engineering-mlops-180-30.json](./output/pre-generated/chunking/fixed-size-chunks-engineering-mlops-180-30.json). |
| 📊 **Evaluation Dataset** | [300 question-answer](./output/qa/evaluation/qa_pairs_solutionops.json) pairs generated from the [code-with-engineering](../data/docs/code-with-engineering/) and [code-with-mlops](../data/docs/code-with-mlops/) sections of the Solution Ops repository. See [Generation QA Notebook](./5.1.generation-qa.ipynb) for insights on how they were generated. |

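To make the chunking parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the code that produced the pre-generated file: `fixed_size_chunks` is a hypothetical helper, tokens are simulated as integers, and reading "30% overlap" as 54 shared tokens between consecutive 180-token chunks is an assumption.

```python
def fixed_size_chunks(tokens, chunk_size=180, overlap_percent=30):
    """Split a token list into fixed-size chunks that overlap (hypothetical helper)."""
    overlap = chunk_size * overlap_percent // 100  # 54 tokens for 180 and 30%
    stride = chunk_size - overlap                  # each chunk starts 126 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Simulated document of 400 tokens; a real pipeline would use a tokenizer.
tokens = list(range(400))
chunks = fixed_size_chunks(tokens)

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: tokens {chunk[0]}..{chunk[-1]} ({len(chunk)} tokens)")
```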
<!-- 📝 **Hypothesis**

Exploratory hypothesis: "Can introducing a new word embedding method improve the system's performance?"

🎯 **Evaluation Metrics**

For this experiment we will look at Accuracy and Cosine Similarity to compare the performance. -->

<!-- As we highlighted in the `Chapter 3. Experiments`, our system has two components: the retrieval and the generative one. Take a moment to think what would be the part that would be impacted if we change the embedding model? <details markdown="1">

<summary> Hint:</summary>

Embeddings are used for transforming the input query from plain text into a vector, as well as for vectorizing the documents we have in our index. Therefore, it contributes to how well the system can retrieve relevant documents based on the input query and the documents. As mentioned in `Chapter 3. Experiments`, the evaluation metrics for this case will be accuracy, cosine similarity and Discounted cumulative gain.

</details> -->

<!-- 📊 **Data**

In this experiment, the data that we would like to embed consists of the first 200 documents from the Solution Ops Playbook, which were previously chunked in size of 300. The dataset can be found at [chunks-solution-ops-200-300-0.json](./output/chunks-solution-ops-200-300-0.json). -->


<!-- ## 👀 Get to know the data

Before we try out different embedding models, let's first try to understand the data. In what follows, you will see the data being clustered and keywords extracted from each cluster. To accomplish this, we performed Dimensionality Reduction, using [t-SNE](https://towardsdatascience.com/what-why-and-how-of-t-sne-1f78d13e224d). If you want to see the code we've been using to accomplish this, go to [t-SNE.ipynb](./helpers/t-SNE.ipynb). -->

<!-- # %run -i ./helpers/t-SNE.ipynb -->

<!-- As we have seen from the cluster from above, the data `can` be clustered, and the clusters seem to be different from one another. One is centered on data (sql, databricks) vs backlog related (stories, sprint, team) vs engineering fundamentals (security, testing, code). However, if we think about these clusters on a broader sense, they are part of one big cluster, which is IT. -->


## Setup

Import necessary libraries


In [8]:
%run -i ./pre-requisites.ipynb

print(f"Evaluation dataset: {path_to_evaluation_dataset}")
print(f"Pre-chunked documents: {pregenerated_fixed_size_chunks}")

Evaluation dataset: ./output/qa/evaluation/qa_pairs_solutionops.json
Pre-chunked documents: ./output/pre-generated/chunking/fixed-size-chunks-engineering-mlops-180-30.json


## 1. Use `text-embedding-ada-002` from OpenAI

This model has a maximum token limit of [8191](https://platform.openai.com/docs/guides/embeddings/embedding-models). Usage is priced per input token, and the model is available either as Pay-As-You-Go or as Provisioned Throughput Units (PTUs). More pricing information can be found [here](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/).

Let's create a function that embeds an input query using `text-embedding-ada-002`. We will use the REST API; see the [documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference?WT.mc_id=AZ-MVP-5004796#embeddings) for details.


In [9]:
import requests


def oai_query_embedding(
    query,
    endpoint=azure_aoai_endpoint,
    api_key=azure_openai_key,
    api_version="2023-07-01-preview",
    embedding_model_deployment=azure_openai_embedding_deployment,
):
    """
    Query the OpenAI embedding model to get the embedding for the given query.

    Args:
        query (str): The query for which to get the embedding.
        endpoint (str): The endpoint for the OpenAI service.
        api_key (str): The API key for the OpenAI service.
        api_version (str): The API version for the OpenAI service.
        embedding_model_deployment (str): The deployment for the OpenAI embedding model.

    Returns:
        list: The embedding vector for the given query.
    """
    request_url = f"{endpoint}/openai/deployments/{embedding_model_deployment}/embeddings?api-version={api_version}"
    headers = {"Content-Type": "application/json", "api-key": api_key}
    request_payload = {"input": query}
    # timeout=None waits indefinitely; pass a number of seconds to fail fast instead
    embedding_response = requests.post(
        request_url, json=request_payload, headers=headers, timeout=None
    )

    if embedding_response.status_code == 200:
        # The response holds one item per input; we sent a single query, so take the first
        data_values = embedding_response.json()["data"]
        return data_values[0]["embedding"]

    # Use the raw body: an error response is not guaranteed to be valid JSON
    raise Exception(f"Failed to get embedding: {embedding_response.text}")

👩‍💻 Try it out. Feel free to pass another query:


In [10]:
query = "Hello"

query_vectors = oai_query_embedding(query)

print(f"The query is: {query}")
print(f"The embedded vector is: {query_vectors}")
print(f"The length of the embedding is: {len(query_vectors)}")

The query is: Hello
The embedded vector is: [-0.021819873, -0.0072516315, -0.02838273, -0.02452299, -0.023587296, ... (output truncated)

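`oai_query_embedding` embeds one query per request and reads the first item of the `data` list. As a sketch of how the same response shape could be handled when several texts are sent in one request, the helper below extracts all vectors from a response body. It is an assumption and not part of the workshop code: `extract_embeddings` is a hypothetical name, the sample response is hand-written with tiny made-up vectors, and it relies on each item in `data` carrying an `index` field that identifies the input it belongs to.

```python
def extract_embeddings(response_json):
    """Return the embedding vectors from a response body, ordered by input position."""
    items = response_json["data"]
    # Sort by "index" so the vectors line up with the order of the inputs.
    ordered = sorted(items, key=lambda item: item.get("index", 0))
    return [item["embedding"] for item in ordered]


# Hand-written stand-in for a response to {"input": ["Hello", "World"]}.
sample_response = {
    "data": [
        {"index": 1, "embedding": [0.3, 0.4]},
        {"index": 0, "embedding": [0.1, 0.2]},
    ]
}

vectors = extract_embeddings(sample_response)
print(vectors)
```

Batching like this reduces the number of HTTP round trips, which matters most when embedding a whole document set rather than a single query.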
## 2. Use `intfloat/e5-small-v2` from Hugging Face

This model is open source and is about 0.13 GB in size. It is limited to English text and handles inputs of at most 512 tokens. Because it is open source, there is no usage fee: you can download it, run it locally, fine-tune it, and so on.
[The embedding size is 384](https://huggingface.co/intfloat/e5-small-v2#e5-small-v2).

**👩‍💻 Embed an input query using the `e5-small-v2` model**

Look at the [</> Use in sentence-transformers](https://huggingface.co/intfloat/e5-small-v2) section from Hugging Face.

<details markdown="1">
<summary> 🔍 Solution. Expand this only if you got stuck: </summary>

```python
from sentence_transformers import SentenceTransformer

query = "Hello"

model = SentenceTransformer("intfloat/e5-small-v2")
# Note: the model card recommends prefixing inputs with "query: " or "passage: "
embedded_input = model.encode(query, normalize_embeddings=True)

print(f"The query is: {query}")
print(f"The embedded vector is: {embedded_input}")
print(f"The length of the embedding is: {len(embedded_input)}")
```

</details>

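The solution passes `normalize_embeddings=True`, which returns unit-length vectors. A small sketch with plain Python lists (toy vectors, not real model output) shows why that is convenient: once two vectors are L2-normalized, their dot product equals their cosine similarity. `l2_normalize` and `dot` are hypothetical helper names.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


a = [3.0, 4.0]
b = [4.0, 3.0]

cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))

print(f"cosine similarity:         {cosine:.4f}")
print(f"dot of normalized vectors: {dot_of_normalized:.4f}")
```

Both printed values agree, so a search index that stores normalized vectors can rank by dot product and obtain the same ordering as ranking by cosine similarity.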

## Analysis

So far we have been looking at two different embedding models and we've listed some of their characteristics. Let's try now to evaluate how well each model performs in our context. For this, we should first create embeddings from our data.


```{note}
In the interest of time, we have pre-generated the embeddings for you:
- The pre-generated embeddings using `intfloat/e5-small-v2` can be found at [fixed-size-chunks-180-30-engineering-mlops-e5-small-v2.json](./output/pre-generated/embeddings/fixed-size-chunks-180-30-engineering-mlops-e5-small-v2.json)
- The pre-generated embeddings using `text-embedding-ada-002` can be found at [fixed-size-chunks-180-30-engineering-mlops-ada.json](./output/pre-generated/embeddings/fixed-size-chunks-180-30-engineering-mlops-ada.json)
```

Let's load the path to each file. Note the variable names:


In [11]:
%run -i ./pre-requisites.ipynb

print(f"Pre-generated embeddings using intfloat/e5-small-v2: {pregenerated_fixed_size_chunks_embeddings_os}")
print(f"Pre-generated embeddings using text-embedding-ada-002: {pregenerated_fixed_size_chunks_embeddings_ada}")


Pre-generated embeddings using intfloat/e5-small-v2: ./output/pre-generated/embeddings/fixed-size-chunks-180-30-engineering-mlops-e5-small-v2.json
Pre-generated embeddings using text-embedding-ada-002: ./output/pre-generated/embeddings/fixed-size-chunks-180-30-engineering-mlops-ada.json


## 📈 Evaluation

In this workshop, to separate our experiments, we will take the **Full Reindex** strategy and we will create a new index per embedding model.
Therefore, for each embedding model we will:

1. Create a new index. Note: make sure to give a relevant name.
2. Populate the index with the embeddings that you generated in the previous steps.

```{note}
You can reuse available functions from [./helpers/search.ipynb](./helpers/search.ipynb), such as: *create_index* and *upload_data*.
```


### 👩‍💻 Create two indexes

By running the next cell, all the functions from search.ipynb will become available:


In [12]:
%%capture --no-display
%run -i ./helpers/search.ipynb

Sample code for creating a new index and uploading the data that was previously embedded using the Azure OpenAI (AOAI) model:


In [None]:
# 1. Create a new index
# TODO: Replace the prefix with a relevant name given your embedding model
new_index_name = "fixed-size-chunks-180-30-engineering-mlops-ada"
vector_size = 1536  # TODO: Replace with the vector size of your embedding model
create_index(new_index_name, vector_size)

# Uncomment the following when running the cell:
# # 2. Upload the embeddings to the new index
# # TODO: Replace the embeddings_file_path to point to the right file path
# embeddings_file_path = pregenerated_fixed_size_chunks_embeddings_ada
# upload_data(file_path=embeddings_file_path,
#             search_index_name=new_index_name)

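Because each index is created for a specific vector size, it can help to check the embeddings before uploading them. The sketch below is an assumption about how such a check could look: `find_dimension_mismatches` is a hypothetical helper, the records are toy data, the `chunkContentVector` field name is borrowed from the evaluation code in this notebook, and the `id` field is assumed.

```python
def find_dimension_mismatches(records, vector_size, vector_field="chunkContentVector"):
    """Return the ids of records whose embedding length differs from the index vector size."""
    return [
        record.get("id")
        for record in records
        if len(record[vector_field]) != vector_size
    ]


# Toy records with 3-dimensional vectors; real ones would have 1536 or 384 dimensions.
records = [
    {"id": "chunk-0", "chunkContentVector": [0.1, 0.2, 0.3]},
    {"id": "chunk-1", "chunkContentVector": [0.1, 0.2]},  # wrong length
]

mismatches = find_dimension_mismatches(records, vector_size=3)
print(f"Records with unexpected vector size: {mismatches}")
```

An empty list means every embedding matches the index definition, which is a cheap way to catch uploading the e5 embeddings into the ada index or the other way around.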
#### 👩‍💻 Create a new index and upload the embeddings created with the `intfloat/e5-small-v2` model

<details markdown="1">
<summary> 🔍 Solution. Expand this only if you got stuck: </summary>

```python
from sentence_transformers import SentenceTransformer

# 1. Create a new index
new_index_name = "fixed-size-chunks-180-30-engineering-mlops-e5-small-v2"
vector_size = 384  # Embedding size of intfloat/e5-small-v2
create_index(new_index_name, vector_size)

# 2. Upload the embeddings to the new index
embeddings_file_path = pregenerated_fixed_size_chunks_embeddings_os
upload_data(file_path=embeddings_file_path, search_index_name=new_index_name)
```

</details>


### 📊 Evaluation Dataset

Note: The evaluation dataset can be found at [qa_pairs_solutionops.json](./output/qa/evaluation/qa_pairs_solutionops.json). The format is:

```json
"user_prompt": "", # The question
"output_prompt": "", # The answer
"context": "", # The relevant piece of information from a document
"chunk_id": "", # The ID of the chunk
"source": "" # The path to the document, i.e. "..\\data\\docs\\code-with-dataops\\index.md"
```
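
Before running an evaluation, a quick sanity check of the dataset can catch missing fields early. This is a minimal sketch, assuming the file holds a list of objects with the five keys shown above; `missing_keys_report` is a hypothetical helper and the sample records are made up.

```python
REQUIRED_KEYS = {"user_prompt", "output_prompt", "context", "chunk_id", "source"}


def missing_keys_report(qa_pairs):
    """Map the position of each incomplete record to the sorted list of keys it lacks."""
    report = {}
    for position, pair in enumerate(qa_pairs):
        missing = REQUIRED_KEYS - set(pair)
        if missing:
            report[position] = sorted(missing)
    return report


# Made-up sample in the same shape as the evaluation dataset.
sample = [
    {
        "user_prompt": "q1",
        "output_prompt": "a1",
        "context": "c1",
        "chunk_id": "chunk-0",
        "source": "../data/docs/example.md",
    },
    {"user_prompt": "q2", "output_prompt": "a2"},  # incomplete record
]

print(missing_keys_report(sample))
```

For the real dataset, the list would come from `json.load` on the evaluation file instead of the inline sample.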


### 🎯 Evaluation metrics

Let us evaluate our baseline model. We will use two metrics:

- Cosine similarity:

  Using cosine similarity, we measure how close in meaning the first text retrieved from the search index is to the text that was used to formulate the question (and hence to answer it). Note: our search index returns the top 3 nearest neighbors, but we only look at the first one. We then calculate the mean and median cosine similarity across our evaluation dataset.

- Accuracy:

  By accuracy we mean how often the search returned the document (identified by its file path) that we expected according to our evaluation dataset. A retrieval counts as successful if the expected document appears anywhere in the returned results. We report the percentage of successfully retrieved documents across our evaluation dataset.

  <!-- `Retrieval_evaluation` function is going through the evaluation dataset and, for each `user_prompt`, it embeds it using the `embedding_function` passed as parameter and then it does a vector search in the Index with name `search_index_name`. If the retrieved documents includes the `source` from the evaluation dataset, then it is considered a success. Note: This can also be adapted to ensure that the `first` retrieved document is the expected one. -->


In [16]:
import numpy as np
from numpy.linalg import norm


def calculate_cosine_similarity(expected_document_vector, retrieved_document_vector):
    cosine_sim = np.dot(expected_document_vector, retrieved_document_vector) / \
        (norm(expected_document_vector)*norm(retrieved_document_vector))
    return float(cosine_sim)

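To build intuition for the scores, here is a dependency-free sketch of the same formula applied to toy 2-dimensional vectors, followed by the mean and median aggregation described above. The vectors are made up for illustration; real embeddings have hundreds of dimensions and rarely produce scores this extreme.

```python
import math
from statistics import mean, median


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)


expected = [1.0, 0.0]
retrieved_vectors = {
    "same direction": [2.0, 0.0],  # scaling a vector does not change the score
    "orthogonal": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}

scores = {name: cosine_similarity(expected, v) for name, v in retrieved_vectors.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")

all_scores = list(scores.values())
print(f"mean: {mean(all_scores):.2f}, median: {median(all_scores):.2f}")
```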
In [24]:
import json
import ntpath
import os


def calculate_metrics(evaluation_data_path, embedding_function, search_index_name):
    """ Evaluate the retrieval performance of the search index using the evaluation data set.
    Args:
    evaluation_data_path (str): The path to the evaluation data set.
    embedding_function (function): The function to use for embedding the question.
    search_index_name (str): The name of the search index to use for retrieval.

    Returns:
    list: The cosine similarities between the expected documents and the top retrieved documents.
    """
    if not os.path.exists(evaluation_data_path):
        print(
            f"The path to the evaluation data set {evaluation_data_path} does not exist. Please check the path and try again.")
        return
    nr_correctly_retrieved_documents = 0
    nr_qa = 0
    cosine_similarities = []

    with open(evaluation_data_path, "r", encoding="utf-8") as file:
        evaluation_data = json.load(file)
        for data in evaluation_data:
            user_prompt = data["user_prompt"]
            expected_document = data["source"]
            expected_document_vector = embedding_function(data["context"])

            # 1. Search in the index
            search_response = search_documents(
                search_index_name=search_index_name,
                input=user_prompt,
                embedding_function=embedding_function,
            )

            retrieved_documents = [ntpath.normpath(response["source"])
                                   for response in search_response]
            top_retrieved_document = search_response[0]["chunkContentVector"]

            # 2. Calculate cosine similarity between the expected document and the top retrieved document
            cosine_similarity = calculate_cosine_similarity(
                expected_document_vector, top_retrieved_document)
            cosine_similarities.append(cosine_similarity)

            # 3. If the expected document is part of the retrieved documents,
            # we will consider it correctly retrieved
            if ntpath.normpath(expected_document) in retrieved_documents:
                nr_correctly_retrieved_documents += 1

            nr_qa += 1
    accuracy = (nr_correctly_retrieved_documents / nr_qa)*100
    print(f"Accuracy: {accuracy}% of the documents were correctly retrieved from Index {search_index_name}.")

    return cosine_similarities

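The accuracy part of `calculate_metrics` can be illustrated in isolation. The sketch below uses made-up paths instead of a search index and shows why the sources are passed through `ntpath.normpath`: it makes paths written with forward slashes and with backslashes compare equal. `top_k_accuracy` is a hypothetical helper name.

```python
import ntpath


def top_k_accuracy(expected_sources, retrieved_sources_per_query):
    """Percentage of queries whose expected source appears among the retrieved sources."""
    hits = 0
    for expected, retrieved in zip(expected_sources, retrieved_sources_per_query):
        normalized_retrieved = {ntpath.normpath(path) for path in retrieved}
        if ntpath.normpath(expected) in normalized_retrieved:
            hits += 1
    return hits / len(expected_sources) * 100


# Made-up evaluation data: one expected source per question, top-3 results per question.
expected_sources = [
    "..\\data\\docs\\a.md",
    "..\\data\\docs\\b.md",
]
retrieved_sources_per_query = [
    ["../data/docs/a.md", "../data/docs/c.md", "../data/docs/d.md"],  # hit despite the slashes
    ["../data/docs/c.md", "../data/docs/d.md", "../data/docs/e.md"],  # miss
]

accuracy = top_k_accuracy(expected_sources, retrieved_sources_per_query)
print(f"Accuracy: {accuracy}%")
```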
In [21]:
%run -i ./pre-requisites.ipynb

### 👩‍💻 1. Evaluate the system using _text-embedding-ada-002_ model

<details markdown="1">
<summary> 🔍 Sample code. Feel free to expand it. It may take up to 4 minutes to run: </summary>

```python
from statistics import mean, median

# TODO: Replace the prefix with a relevant name given your embedding model
index_name = "fixed-size-chunks-180-30-engineering-mlops-ada"

cosine_similarities = calculate_metrics(
    evaluation_data_path=path_to_evaluation_dataset,
    embedding_function=oai_query_embedding,
    search_index_name=index_name,
)
avg_score = mean(cosine_similarities)
print(f"Avg cosine similarity score: {avg_score}")
median_score = median(cosine_similarities)
print(f"Median cosine similarity score: {median_score}")
```

</details>


### 👩‍💻 2. Evaluate the system using _intfloat/e5-small-v2_ model

Using the `calculate_metrics` function, calculate the metrics for the intfloat/e5-small-v2 open source model.

<details markdown="1">
<summary> 🔍 Sample code. It may take up to 4 minutes to run:</summary>

```python
from statistics import mean, median

# TODO: Replace the prefix with a relevant name given your embedding model
index_name = "fixed-size-chunks-180-30-engineering-mlops-e5-small-v2"
# `embed_chunk` is not defined in this section; it is assumed to be a helper that
# embeds a text with intfloat/e5-small-v2 and returns the vector
cosine_similarities = calculate_metrics(
    evaluation_data_path=path_to_evaluation_dataset,
    embedding_function=embed_chunk,
    search_index_name=index_name,
)

avg_score = mean(cosine_similarities)
print(f"Avg cosine similarity score: {avg_score}")
median_score = median(cosine_similarities)
print(f"Median cosine similarity score: {median_score}")
```

</details>


## 💡 Conclusions

What conclusions can you reach? Are you surprised by the results? In what cases would you find it useful to use the open source model? Discuss these questions and any other ideas you may have with your colleagues.

<details markdown="1">

<summary> Possible conclusions. Expand this only after you have reached your own conclusions: </summary>

![results.png](./images/results-embedding.png)

Open source models can be useful when you need more control over the model, such as running it offline, fine-tuning it, or customizing it for your specific needs. Our experiment suggests that the trade-off can be excellent. However, open source models may require more engineering effort, may perform worse on some tasks, and may offer fewer safety and content filtering features than closed source models.

</details>
