Experiments#

It sounds like a simplistic statement, but “the best Search solution is the one that returns the best results”.

The only way to get the best and most relevant results is via experimentation. As a result, the experimentation phase is crucial, and effort should be invested in creating experiments that can be tracked, evaluated against a consistent set of metrics, and repeated.

We will follow the experiment template from ai-garden.

Note

Each situation is different, and the techniques you use will depend on the type of documents you have, the type of queries you expect to receive, and the type of results you want to return. Discuss with your data scientist to work this out for a specific engagement.

💡 The Role of Experimentation#

When creating and running search experiments, there are multiple factors that shape the outcome of each experiment. These are small changes that add up over time and alter the functionality and effectiveness of your search experience.

Creating an effective solution is a delicate balance of several factors, such as:

  • Data ingestion: Optical Character Recognition (OCR), data conversion, usage of Azure Form Recognizer to extract information from the documents, summarization, chunking, speech to text, tagging, etc. are all methods that can be considered and experimented with as part of the ingestion experimentation (see the chunking sketch after this list). See learnings from engagements

  • Which search mechanism to use - whether it is vector search, semantic search, or other.

  • Which model to use - GPT-4, GPT-3 (Ada, Curie, Davinci), GPT-3.5 Turbo, etc.

  • The prompt - the instruction given to the model in order to produce the desired result. Writing an effective prompt is referred to as “Prompt Engineering”. It is an empirical and iterative process.
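As a simple illustration of one ingestion choice, the sketch below splits raw document text into overlapping chunks before indexing. The chunk size, overlap, and splitting strategy are illustrative assumptions; in practice they are exactly the kind of parameters you would vary between experiments.

```python
# A minimal chunking sketch: fixed-size character windows with overlap.
# The sizes below are illustrative assumptions, not recommended values.

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so context is preserved across boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back to create the overlap
    return chunks

document = "..."  # text produced by OCR / Form Recognizer / conversion
chunks = chunk_text(document)
```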

Note

At this point, choosing the final mechanism to ingest data is a decision that can wait until experimentation is complete: using native indexers, building Durable Functions, or using Azure Machine Learning pipelines are all viable options, but they are out of scope for the experimentation itself. These decisions can, and should, wait until the experimentation is done and the pre- and post-processing operations, including tagging and other document enrichment, are well defined.

Ideally, during the experimentation phase, there should be a simple method to load data, including all the pre- and post-processing: for instance, Python code in a Jupyter notebook, or a Postman call. Only once the best search design is known should a commitment be made to deliver the ingestion solution.
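As an example of such a lightweight loading path, the sketch below uploads pre-processed chunks to an Azure Cognitive Search index from a notebook using the azure-search-documents SDK. The endpoint, key, index name, and field names are placeholders, and the index schema is assumed to already exist.

```python
# A minimal, notebook-friendly ingestion sketch (assumes the index already exists).
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",  # placeholder
    index_name="experiment-index",                                # placeholder
    credential=AzureKeyCredential("<your-admin-key>"),            # placeholder
)

# `chunks` comes from whatever pre-processing you are experimenting with
# (OCR, conversion, chunking, tagging, ...).
documents = [
    {"id": str(i), "content": chunk}  # field names depend on your index schema
    for i, chunk in enumerate(chunks)
]

results = search_client.upload_documents(documents=documents)
print(f"Uploaded {sum(1 for r in results if r.succeeded)} of {len(documents)} documents")
```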

📈 Evaluation#

One of the most important aspects of developing a system for production is regular evaluation and iterative experimentation. This process allows you to measure performance, troubleshoot issues, and fine-tune the system to improve it.

Evaluation Strategy

A RAG solution may involve different components that will work together to make up the core functionalities. For example, we could have:

  • An agent that takes care of the conversation part of the solution.

  • An agent that detects intent within a conversation.

  • An agent that extracts vehicle information from a conversation.

  • In addition, the system may need to be integrated with live services. For example, an agent may need to call a hosted API to verify that the customer has fully specified the vehicle’s manufacturer and model before the conversation can progress.

When evaluating the system, we need to evaluate all of its components individually, not only the ones that use LLMs.

Note

In our workshop, we have the following components:

  • Retrieval Component - responsible for retrieving the most relevant documents from the index, based on the input query.

  • Generative Component - responsible for returning a response, based on the input query and the extracted documents.

  • End-to-end Solution - given a question, the RAG solution is responsible for formulating an answer that is grounded in relevant data.

📊 Evaluation Dataset#

A key part of this process is creating an evaluation dataset for each component.

When developing a dataset, there are a few things to keep in mind: make sure the evaluation set is representative of the data the system will see in the real world, and regularly update it so that it stays relevant to the edge cases discovered throughout experimentation.

There are three potential approaches:

  • We already have an evaluation dataset - ideal, but not always feasible

  • We can use humans to create an evaluation dataset - laborious and not always possible

  • We generate synthetic data - dangerous, but sometimes the only method available

Note

In our workshop, we followed the third option and generated the data using an LLM (GPT-4). You can find the generated data in qa_pairs_solutionops, and generate-qa.ipynb shows how it was generated.
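As an illustration of the general idea (not the exact contents of generate-qa.ipynb), the sketch below asks a chat model to produce a question/answer pair for each document chunk. The model name, prompt wording, and output parsing are assumptions.

```python
# A minimal synthetic QA-generation sketch using the OpenAI Python SDK.
# Model name, prompt, and output parsing are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_qa_pair(chunk: str) -> dict:
    """Ask the model for one question answerable only from `chunk`, plus its answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": (
                    "Generate one question that can be answered solely from the text below, "
                    "plus its answer. Respond as JSON with keys 'user_prompt' and 'output_prompt'.\n\n"
                    f"Text:\n{chunk}"
                ),
            }
        ],
    )
    pair = json.loads(response.choices[0].message.content)
    pair["context"] = chunk  # keep the source chunk as the citation
    return pair

qa_pairs = [generate_qa_pair(chunk) for chunk in chunks]
```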

The evaluation data consists of:

  • user_prompt: the input question

  • context: the piece(s) of text that contains the relevant content to answer the input question

  • output_prompt: the final answer in a human readable/friendly format
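Putting those fields together, a single evaluation record might look like the following (the values are invented purely for illustration):

```python
# An illustrative evaluation record (the values are made up for illustration only).
qa_pair = {
    "user_prompt": "What warranty period applies to the battery?",
    "context": "The traction battery is covered by an 8-year or 160,000 km warranty, whichever comes first.",
    "output_prompt": "The battery is covered for 8 years or 160,000 km, whichever comes first.",
}
```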

For the Retrieval Component, the dataset is composed of question and citation (context). Evaluating the Retrieval Component means checking whether, for a given query (user question), the search engine returns the relevant citation(s).
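A common way to score this is recall@k (alongside rank-based metrics): did the expected citation appear in the top k retrieved results? The sketch below assumes a `search(query, k)` helper that returns chunk identifiers and that each evaluation record carries the id of its ground-truth context; both are assumptions for illustration.

```python
# A minimal recall@k sketch for the Retrieval Component.
# `search(query, k)` is an assumed helper returning the ids of the top-k retrieved chunks;
# each record is assumed to carry the id of its ground-truth context as "context_id".

def recall_at_k(eval_set: list[dict], search, k: int = 5) -> float:
    """Fraction of queries whose ground-truth context appears in the top-k results."""
    hits = 0
    for record in eval_set:
        retrieved_ids = search(record["user_prompt"], k=k)
        if record["context_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)

# Example usage (hypothetical search function and dataset):
# score = recall_at_k(qa_pairs, search=my_search_fn, k=5)
# print(f"recall@5 = {score:.2f}")
```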

For the Generative Component, the dataset is composed of question, citation (context), and answer (output_prompt). Evaluating the Generative Component means checking whether, for a given query (user question) and a given set of retrieved documents, the engine returns a good answer.
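One simple baseline for "is this a good answer?" is token-overlap F1 between the generated answer and the reference output_prompt; LLM-based judging or semantic similarity are common alternatives. The sketch below covers only that baseline, with naive whitespace tokenisation as an assumption.

```python
# A minimal token-overlap F1 sketch for comparing a generated answer
# against the reference output_prompt. Tokenisation is deliberately naive.
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(gen_tokens) & Counter(ref_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: token_f1(model_answer, record["output_prompt"])
```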

For the End-to-end Solution, we evaluate whether, for a given question, the system is able to retrieve relevant documents and formulate an answer.

🎯 Evaluation Metrics#

There are different types of metrics and ways to evaluate a system.

  • There is an end-to-end business evaluation, where we want to measure whether the deployed system has met certain business metrics (e.g., an increase in sales).

  • There is a technical evaluation, where we want to ensure that each functional part of the system meets certain technical requirements or baseline metrics, so that we push a quality system into production.

Note

In our workshop, we will focus on the technical evaluation for two components: