# [Experiments](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingExperimentation.md)

It sounds like a simplistic statement, but "the best Search solution is the one that returns the best results".

The only way to find the best and most relevant results is through experimentation. The experimentation phase is therefore crucial, and effort should be invested in creating experiments that are _tracked_, _evaluated_ against a consistent set of metrics, and _repeatable_.

We will follow the [experiment template](https://github.com/cse-labs/ai-garden/blob/main/docs/templates/using-the-templates.md#using-the-experiment-template) from ai-garden.

```{note}
Each situation is different, and the techniques you use will depend on the type of documents you have, the type of queries you expect to receive, and the type of results you want to return. Discuss with your data scientist to work this out for a specific engagement.
```

<!--
```{seealso}
As you build out your experimentation process for search, reference the [Things to Consider](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_ThingsToConsider.md) document which will highlight some important features to include in your experiments. You can see the [existing learnings from engagements ](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingSearch.md#learnings-from-our-engagements-2)
``` -->

## 💡 The Role of Experimentation

When creating/running search experiments, there are multiple factors that shape the outcome of each experiment. These are small changes that add up over time and change the functionality and effectiveness of your search experience.

<!-- These tweaks should help you determine which combination of document shaping and indexing techniques will provide the most relevant set of documents returned for the set of queries that you care about. -->

Creating an effective solution is a delicate balance of several factors, such as:

- Data ingestion: Optical Character Recognition (OCR), data conversion, use of Azure Form Recognizer to extract information from the documents, summarization, chunking, speech to text, tagging, etc. are all methods that can be considered and experimented with as part of ingestion. See [learnings from engagements](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingSearch.md#learnings-from-engagements-1).
- Which [search mechanism](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingSearch.md) to use - whether it is vector search, semantic search, or other.

- Which model to use - [GPT4, GPT3 (Ada, Curie, Davinci), GPT Turbo](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models) etc.

- The prompt - the instruction given to the model in order to produce the desired result. Writing an effective prompt is referred to as "[Prompt Engineering](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/prompt-engineering)". It is an empirical and iterative process.

<!-- - Pre and pos-processing: Optical Character Recognition, data conversation, use of Azure Form Recognizer to extract information from the documents, chunking, summarization, post-processing to make data more "human like", video captioning, speech to text, tagging, etc. are all methods that need to be considered and experimented with as part of the ingestion pipeline experimentation.  -->

<!-- - A series of arguments, [passed to the OpenAI APIs](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/reference) also impact the result. These are arguments such as "temperature", "logit_bias", "presence_penalty" etc. -->
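
The factors listed above multiply quickly, so it can help to enumerate the combinations explicitly and give each one a stable identifier. The following is a minimal sketch, not a prescribed setup: the `ExperimentConfig` fields, the `build_grid` helper, and all values (`"model-a"`, `"v1"`, the chunk sizes) are hypothetical placeholders chosen for illustration.

```python
import hashlib
import itertools
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """One combination of the factors under experimentation (hypothetical fields)."""

    chunk_size: int        # ingestion: characters per chunk
    search_mechanism: str  # e.g. "vector" or "semantic"
    model: str             # name of the generative model deployment
    prompt_version: str    # identifier of the prompt template

    def experiment_id(self) -> str:
        # Deterministic ID: the same configuration always yields the same ID,
        # which makes runs easier to track and repeat.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def build_grid(chunk_sizes, search_mechanisms, models, prompt_versions):
    """Return every combination of the supplied factor values."""
    combos = itertools.product(chunk_sizes, search_mechanisms, models, prompt_versions)
    return [ExperimentConfig(*combo) for combo in combos]


# Placeholder values; replace them with the options relevant to your engagement.
grid = build_grid(
    chunk_sizes=[500, 1000],
    search_mechanisms=["vector", "semantic"],
    models=["model-a"],
    prompt_versions=["v1", "v2"],
)
for config in grid:
    print(config.experiment_id(), asdict(config))
```

Storing the ID alongside the metrics of each run is one way to compare results across runs that used the same configuration.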

```{note}
At this point, the final mechanism for ingesting data does not need to be chosen. Native indexers, durable functions, and Azure Machine Learning pipelines are all viable options, but they are out of scope for experimentation. That decision can, and should, wait until experimentation is done and the pre- and post-processing operations, including tagging and other document enrichment, are well defined.

Ideally, during the experimentation phase, there should be a simple method to load data, with all the pre- and post-processing applied. For instance: Python code in a Jupyter notebook, or a Postman call. Only once the best search design is known can a commitment be made to deliver the ingestion solution.
```
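
As an illustration of such a "simple method to load data", the sketch below pre-processes raw text and splits it into overlapping chunks using plain Python that could run in a notebook cell. It is an assumption-laden example rather than the workshop's actual pipeline: the function names, the whitespace normalization step, the character-based chunking, and the record fields (`id`, `source`, `content`) are all hypothetical.

```python
def preprocess(text: str) -> str:
    """Collapse runs of whitespace; a stand-in for real pre-processing."""
    return " ".join(text.split())


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list:
    """Split text into character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


def prepare_documents(raw_documents: dict, chunk_size: int = 200, overlap: int = 50) -> list:
    """Turn {doc_id: raw_text} into a flat list of chunk records."""
    records = []
    for doc_id, raw_text in raw_documents.items():
        cleaned = preprocess(raw_text)
        for index, chunk in enumerate(chunk_text(cleaned, chunk_size, overlap)):
            records.append({"id": f"{doc_id}-{index}", "source": doc_id, "content": chunk})
    return records


# Tiny usage example with made-up text.
records = prepare_documents({"doc": "abcdefghij"}, chunk_size=4, overlap=1)
print([r["content"] for r in records])
```

Keeping `chunk_size` and `overlap` as parameters makes them easy to vary from one experiment to the next without committing to a production ingestion mechanism.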

## 📈 Evaluation

One of the most important aspects of developing a system for production is regular evaluation and iterative experimentation. This process allows you to measure performance, troubleshoot issues, and fine-tune the system to improve it.

**[Evaluation Strategy](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingEvaluation.md)**

A RAG solution may involve different components that will work together to make up the core functionalities. For example, we could have:

- An agent that takes care of the conversation part of the solution.
- An agent that detects intent within a conversation.
- An agent that extracts vehicle information from a conversation.
- In addition, the system may need to be integrated with live services. For example, an agent may need to call a hosted API to verify that the customer has fully specified the vehicle's manufacturer and model before the conversation can progress.

When evaluating the system, we need to evaluate every component individually, not only the ones that use LLMs.

<!-- A RAG solution is a system that combines retrieval and generation to produce natural language outputs based on relevant information from a large corpus of documents.  -->

<!-- To evaluate the performance of a RAG solution, you need to consider both the quality of the retrieved documents and the quality of the generated outputs. These components (retrieval & generative part) must be evaluated individually before evaluating the end-to-end solution. -->

```{note}
In our workshop, we have the following components:

- **Retrieval Component** - responsible for retrieving the most relevant documents from the index, based on the input query.
- **Generative Component** - responsible for returning a response, based on the input query and the retrieved documents.
- **End-to-end Solution** - given a question, the RAG solution is responsible for formulating an answer that is grounded in relevant data.
```

### 📊 Evaluation Dataset

A key part of this process is creating an evaluation dataset for each component.

<!-- - Evaluate each component of the solution - Develop Dataset for each
- Evaluate the end-to-end output flow (User input to UI output) - Develop Dataset for output flow
- Evaluate the whole conversation with the chatbot - Develop Dataset for whole chat session - Note: since our scenario is a QA and not an actual chat, we will skip this part. -->

When developing a dataset, there are a few things to keep in mind, such as making sure the evaluation set is representative of the data the system will be used on in the real world and regularly updating the evaluation dataset to ensure that it stays relevant to edge cases that are discovered throughout experimentation.

There are three potential approaches:

- We already have an evaluation dataset - **ideal, but not always feasible**
- We can use humans to create an evaluation dataset - **laborious and not always possible**
- We generate synthetic data - **dangerous, but sometimes the only method available**

```{note}
In our workshop, we followed the third option and generated the data using an LLM (GPT4). You can find the generated data in [qa_pairs_solutionops](./output/qa/evaluation/qa_pairs_solutionops.json). See [generate-qa.ipynb](./5.1.generation-qa.ipynb) for how we generated it.

The evaluation data consists of:
- user_prompt: the input question
- context: the piece(s) of text that contains the relevant content to answer the input question
- output_prompt: the final answer in a human readable/friendly format

For the **Retrieval Component**, the dataset is composed of the question and the citation (context). Evaluating the Retrieval Component means checking whether, for a given query (user question), the search engine returns the relevant citation(s).

For the **Generative Component**, the dataset is composed of the question, the citation (context) and the answer (output_prompt). Evaluating the Generative Component means checking whether, for a given query (user question) and a given set of retrieved documents, the engine returns a good answer.

For the **End-to-end Solution**, we evaluate whether, for a given question, the system is able to retrieve relevant documents and formulate an answer.
```
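
To make the mapping from evaluation records to per-component datasets concrete, here is a minimal sketch. It assumes the evaluation data is a JSON list of objects with the three fields described above (`user_prompt`, `context`, `output_prompt`); the actual layout of the workshop file may differ. The sample record is a made-up placeholder and `split_by_component` is a hypothetical helper.

```python
import json

# Made-up placeholder record, used instead of reading the real file.
SAMPLE_JSON = """
[
  {
    "user_prompt": "What does the retrieval component do?",
    "context": "The retrieval component returns the most relevant documents.",
    "output_prompt": "It returns the most relevant documents for a query."
  }
]
"""


def split_by_component(qa_pairs):
    """Derive the retrieval and generative evaluation sets from QA records."""
    retrieval = [
        {"question": pair["user_prompt"], "citation": pair["context"]}
        for pair in qa_pairs
    ]
    generative = [
        {
            "question": pair["user_prompt"],
            "citation": pair["context"],
            "answer": pair["output_prompt"],
        }
        for pair in qa_pairs
    ]
    return retrieval, generative


qa_pairs = json.loads(SAMPLE_JSON)
retrieval_set, generative_set = split_by_component(qa_pairs)
print(len(retrieval_set), len(generative_set))
```

In this sketch the end-to-end evaluation would reuse the question and answer fields, since only the question is given to the system and the answer serves as the reference.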

### 🎯 Evaluation Metrics

<!-- Quality measurement guides the entire development process of the Search solution. The team must know what performance metric they are chasing.

This is easier said than done. We are developing systems where the results are fuzzy by nature. What documents do you expect to be retrieved when you ask a question? -->

There are different types of metrics and ways to evaluate a system.

- There is an end-to-end business evaluation, where we measure whether the deployed system has met certain business metrics (e.g., an increase in sales).
- There is a technical evaluation, where we verify that each functional part of the system meets certain technical requirements or baseline metrics, so that we push a quality system to production.

<!-- In RAG, there are two main components involved: the retrieval part and the generative part. -->

<!-- A RAG solution is a system that combines retrieval and generation to produce natural language outputs based on relevant information from a large corpus of documents. To evaluate the performance of a RAG solution, you need to consider both the quality of the retrieved documents and the quality of the generated outputs. These components (retrieval & generative part) must be evaluated individually before evaluating the end-to-end solution. -->

```{note}
In our workshop, we will focus on the technical evaluation for two components:

- For the [Retrieval Component](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingEvaluation.md#evaluating-the-retrieval-component)
  - Various metrics can be used; you can read more [here](https://medium.com/@prateekgaurav/evaluating-information-retrieval-models-a-comprehensive-guide-to-performance-metrics-78aadacb73b4#:~:text=Evaluating%20Information%20Retrieval%20Models%3A%20A%20Comprehensive%20Guide%20to,...%206%206.%20Mean%20Reciprocal%20Rank%20%28MRR%29%20).
  - We will use accuracy, together with the mean and median [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
    <!-- defined by verifying whether the source document, present in our evaluation dataset, is part of the retrieved documents. TODO: Or the first retrieved document?! -->
- For the [End-to-end Solution](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingEvaluation.md#evaluating-the-generative-component)
  - We will look at human-centric metrics ([Groundedness, Relevance, Coherence, Similarity, Fluency](https://learn.microsoft.com/en-us/azure/machine-learning/prompt-flow/concept-model-monitoring-generative-ai-evaluation-metrics?view=azureml-api-2)) using an [LLM-as-a-judge approach](https://arxiv.org/pdf/2311.09476.pdf).
```
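
The retrieval metrics mentioned above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the workshop's implementation: accuracy is taken here to mean "the expected source document appears anywhere in the retrieved set", and cosine similarity is computed between pre-computed embedding vectors of the expected and retrieved text. All function names are hypothetical.

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_a * norm_b)


def retrieval_accuracy(expected_ids, retrieved_ids_per_query):
    """Fraction of queries whose expected document is among the retrieved ones."""
    if not expected_ids:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for expected, retrieved in zip(expected_ids, retrieved_ids_per_query)
        if expected in retrieved
    )
    return hits / len(expected_ids)


def similarity_summary(expected_vectors, retrieved_vectors):
    """Mean and median cosine similarity across all queries."""
    scores = [
        cosine_similarity(expected, retrieved)
        for expected, retrieved in zip(expected_vectors, retrieved_vectors)
    ]
    return {"mean": statistics.mean(scores), "median": statistics.median(scores)}


# Toy inputs: two queries, with the expected document found for only one of them.
print(retrieval_accuracy(["a", "b"], [["a", "c"], ["c"]]))
print(similarity_summary([[1, 0], [1, 0], [1, 0]], [[1, 0], [0, 1], [1, 0]]))
```

Reporting the median next to the mean helps show whether a few very poor retrievals are pulling the average down.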
