Generation of synthetic data#

The goal of this notebook is to create a synthetic evaluation dataset using GPT-4 on the Microsoft SolutionOps repo.

Imports and credentials#

import pandas as pd
import glob
import os
import openai
import asyncio
import json
import nltk
import tqdm
import backoff
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter, MarkdownTextSplitter, RecursiveCharacterTextSplitter
from openai import AsyncAzureOpenAI
from typing import Coroutine, Any, Tuple
import plotly.express as px
import tracemalloc

from plotly import colors
from plotly.subplots import make_subplots
import plotly.graph_objects as go 
aoai_key = ""
aoai_endpoint = ""
os.environ.setdefault("AZURE_OPENAI_KEY", aoai_key)
os.environ.setdefault("AZURE_OPENAI_ENDPOINT", aoai_endpoint)
tracemalloc.start()

Load documents#

We aim to create an evaluation dataset that contains 300 samples. Each sample is represented by a question/answer pair that will be used to evaluate how well our QA (Question Answering) system is doing. The process is the following:

  1. Load the documents from the repo

  2. Chunk the documents into smaller pieces

  3. Given a chunk, ask GPT-4 to generate a question/answer pair

However, the repo contains a lot of documents, and chunks can vary in length, informativeness, etc. Thus, to create samples that have substance, we first want to gain some insights on the documents and chunks (a small sketch of how these statistics can be computed follows the list below):

  1. The distribution of the document lengths

  2. The distribution of the number of chunks per document

  3. The distribution of the chunk lengths
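As a quick reference, the hypothetical helper below (not used later in the notebook) shows how these three distributions can be summarized with pandas once the document and chunk DataFrames built further down (with document_length, chunk_text_length and source columns) are available.

def summarize_corpus(df_documents: pd.DataFrame, df_chunks: pd.DataFrame) -> None:
    # 1. Distribution of the document lengths (in words)
    print(df_documents["document_length"].describe())

    # 2. Distribution of the number of chunks per document
    print(df_chunks.groupby("source").size().describe())

    # 3. Distribution of the chunk lengths (in words)
    print(df_chunks["chunk_text_length"].describe())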

Documents in the repo#

Let’s first load all the documents in the repo, and look at how many documents we have.

def load_documents_from_folder(path: str) -> dict[str, list]:
    """Load all markdown documents matching `path` and collect basic metadata."""
    print("Loading documents...")
    data_documents = {
        "solutionops_section": [],
        "document_object": [],
        "document_text": [],
        "document_length": [],
        "document_path": []
    }

    for file in tqdm.tqdm(glob.glob(path, recursive=True)):
        loader = UnstructuredFileLoader(file)
        document = loader.load()
        # The section name is the 4th path component, e.g. ../data/solutionops/<section>/...
        data_documents["solutionops_section"].append(file.split("/")[3])
        data_documents["document_object"].append(document)
        data_documents["document_text"].append(document[0].page_content)
        data_documents["document_length"].append(len(document[0].page_content.split()))
        data_documents["document_path"].append(file)

    return data_documents
markdown_documents = load_documents_from_folder("../data/solutionops/**/*.md")
Loading documents...
0it [00:00, ?it/s]

df_markdown_documents = pd.DataFrame(markdown_documents)
df_markdown_documents.head()
solutionops_section document_object document_text document_length document_path
unique_sections = list(df_markdown_documents['solutionops_section'].unique())
# `colors` is a module; iterate over its default color list instead of the module itself
color_mapping = {section: color for section, color in zip(unique_sections, colors.DEFAULT_PLOTLY_COLORS)}
color_mapping
{'code-with-dataops': 'rgb(31, 119, 180)',
 'playbook': 'rgb(255, 127, 14)',
 'code-with-platformops': 'rgb(44, 160, 44)',
 'code-with-devsecops': 'rgb(214, 39, 40)',
 'industryops': 'rgb(148, 103, 189)',
 'code-with-engineering': 'rgb(140, 86, 75)',
 'code-with-mlops': 'rgb(227, 119, 194)',
 'code-with-fusionops': 'rgb(127, 127, 127)'}
fig = px.pie(df_markdown_documents, names='solutionops_section', title='Percentage of each section in the SolutionOps repo', color='solutionops_section', color_discrete_map=color_mapping)
fig.update_traces(opacity=0.9)
fig.show()
for i, _ in enumerate(["hey", "hello"]):
    print(i)
0
1
[[{'type': 'xy'}] for _ in range(8)]
[[{'type': 'xy'}],
 [{'type': 'xy'}],
 [{'type': 'xy'}],
 [{'type': 'xy'}],
 [{'type': 'xy'}],
 [{'type': 'xy'}],
 [{'type': 'xy'}],
 [{'type': 'xy'}]]
fig = make_subplots(8, 1, specs=[[{'type': 'xy'}] for _ in range(8)],
                subplot_titles=unique_sections,
                shared_xaxes=True,
                shared_yaxes=True)

n_sections = len(df_markdown_documents['solutionops_section'].unique())

# Create a color mapping for each unique 'solutionops_section'
color_mapping = {section: color for section, color in zip(df_markdown_documents['solutionops_section'].unique(), colors.DEFAULT_PLOTLY_COLORS)}

# fig.add_trace(go.Pie(labels=df_markdown_documents['solutionops_section'], marker=dict(colors=df_markdown_documents['solutionops_section'].map(color_mapping)), values=df_markdown_documents['document_length'],name="\% docs"), 1, 1)

for i, section in enumerate(df_markdown_documents['solutionops_section'].unique()):
    fig.add_trace(go.Histogram(x=df_markdown_documents[df_markdown_documents['solutionops_section'] == section]['document_length'], marker_color=color_mapping[section], name=section, showlegend=True), i+1, 1)
    fig['layout'][f'xaxis{i+1}'].update(showticklabels=True)
    fig['layout'][f'yaxis{i+1}'].update(showticklabels=True)

fig.update_layout(title_text='Distribution of document lengths for each section', showlegend=True, height=2000, width=1000)
fig.update_yaxes(range=[0, 70]) 
fig.update_traces(opacity=0.9)
fig.show()
fig = px.histogram(df_markdown_documents, x="document_length", labels='solutionops_section', color='solutionops_section', color_discrete_map=color_mapping, marginal="box")
fig.update_layout(title_text='Distribution of document lengths for each section', showlegend=True, height=800)
# fig.update_yaxes(range=[0, 70]) 
fig.update_traces(opacity=0.9)

fig.show()

All sections seem to have distributions with a similar shape. We can see that mlops and engineering are the two sections that contain the largest volume of documents with a reasonable length. For engineering, more than half of the documents are above 457 words, and for mlops, more than half are above 734 words.

Hence, we decide to create our subset from these two sections, while taking into account the document length when sampling documents.

We can see that almost 50% of the documents are in the “code-with-engineering” and “code-with-mlops” folders. Since we’re looking to create a smaller dataset, we decide to create a dataset only out of these two folders.
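The sampling step itself is not shown in this notebook. As a rough illustration of what "taking into account the document length when sampling" could look like, the hypothetical snippet below draws documents with probability proportional to their word count (the target size of 300 and the variable names are assumptions).

# Hypothetical sketch: longer (usually more informative) documents are more likely to be drawn.
candidates = df_markdown_documents[
    df_markdown_documents["solutionops_section"].isin(["code-with-engineering", "code-with-mlops"])
]
sampled_documents = candidates.sample(
    n=min(300, len(candidates)),            # target dataset size (assumption)
    weights=candidates["document_length"],  # weight by document length
    random_state=42,
)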

n_docs_engineering = len(df_markdown_documents[df_markdown_documents['solutionops_section'] == 'code-with-engineering'])
n_docs_mlops = len(df_markdown_documents[df_markdown_documents['solutionops_section'] == 'code-with-mlops'])
n = n_docs_engineering + n_docs_mlops
print(f"Number of documents in the code-with-engineering section: {n_docs_engineering}")
print(f"Number of documents in the code-with-mlops section: {n_docs_mlops}")
print(f"Number of documents in the code-with-engineering + code-with-mlops sections: {n}/{len(df_markdown_documents)}")
Number of documents in the code-with-engineering section: 251
Number of documents in the code-with-mlops section: 146
Number of documents in the code-with-engineering + code-with-mlops sections: 397/776
subset_df_markdown_documents = df_markdown_documents[df_markdown_documents['solutionops_section'].isin(['code-with-engineering', 'code-with-mlops'])].copy()
subset_df_markdown_documents.head()
solutionops_section document_object document_text document_length document_path
346 code-with-engineering [page_content='Structure of a Sprint\n\nThe pu... Structure of a Sprint\n\nThe purpose of this d... 468 ../data/solutionops/code-with-engineering/SPRI...
347 code-with-engineering [page_content='Who We Are\n\nOur team, ISE (In... Who We Are\n\nOur team, ISE (Industry Solution... 263 ../data/solutionops/code-with-engineering/ISE.md
348 code-with-engineering [page_content="Engineering Fundamentals Checkl... Engineering Fundamentals Checklist\n\nThis che... 793 ../data/solutionops/code-with-engineering/ENG-...
349 code-with-engineering [page_content='ISE Code-With Engineering Playb... ISE Code-With Engineering Playbook\n\nAn engin... 357 ../data/solutionops/code-with-engineering/inde...
350 code-with-engineering [page_content="Work Item ID\n\nFor more inform... Work Item ID\n\nFor more information about how... 325 ../data/solutionops/code-with-engineering/code...
final_markdown_documents = subset_df_markdown_documents.to_dict('list')

Chunking documents into smaller pieces#

We will now chunk the documents into smaller pieces. We can think of two strategies:

  1. Consider that the markdown pages are well structured, and split using the markdown headers (a small standalone example follows this list)

  2. Split the documents into chunks of a fixed tokens length
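To make the header-based strategy concrete, here is a tiny standalone example (the markdown string is made up) showing how MarkdownHeaderTextSplitter produces one chunk per header section, with the headers recorded in the chunk metadata.

# Toy example of header-based splitting on a made-up markdown page.
toy_md = """# Title
Intro paragraph.

## Setup
Install the tools.

## Usage
Run the pipeline.
"""

toy_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")],
    strip_headers=False,
)
for doc in toy_splitter.split_text(toy_md):
    print(doc.metadata, "->", doc.page_content[:40])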

Splitting using Markdown headers#

def create_chunks_md_headers(documents: list) -> dict:
    """Split each document on Markdown headers and collect the resulting chunks."""
    print("Creating chunks...")
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]

    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)

    chunk_id = 0
    chunks = {
        "chunk_id": [],
        "chunk_text": [],
        "source": []
    }

    for document in tqdm.tqdm(documents):
        current_chunks_text_list = markdown_splitter.split_text(document[0].page_content)

        for i, chunk in enumerate(current_chunks_text_list):
            chunks['chunk_id'].append(f"chunk{chunk_id}_{i}")
            chunks['chunk_text'].append(chunk.page_content)
            chunks['source'].append(document[0].metadata['source'])

        chunk_id += 1
    return chunks
chunks_with_md_headers = create_chunks_md_headers(final_markdown_documents["document_object"])
Creating chunks...
100%|██████████| 397/397 [00:00<00:00, 1792.54it/s]
df_chunks_with_md_headers = pd.DataFrame(chunks_with_md_headers)
df_chunks_with_md_headers.head()
chunk_id chunk_text source
0 chunk0_0 Structure of a Sprint \nThe purpose of this d... ../data/solutionops/code-with-engineering/SPRI...
1 chunk1_0 Who We Are \nOur team, ISE (Industry Solution... ../data/solutionops/code-with-engineering/ISE.md
2 chunk2_0 Engineering Fundamentals Checklist \nThis che... ../data/solutionops/code-with-engineering/ENG-...
3 chunk3_0 ISE Code-With Engineering Playbook \nAn engin... ../data/solutionops/code-with-engineering/inde...
4 chunk4_0 Work Item ID \nFor more information about how... ../data/solutionops/code-with-engineering/code...
df_chunks_with_md_headers['source'][0]
'../data/solutionops/code-with-engineering/SPRINT-STRUCTURE.md'

Let’s look at the distribution of lengths in the chunks. We create a new column called chunk_text_length which contains the length (in words) of the chunk_text column.

df_chunks_with_md_headers['chunk_text_length'] = df_chunks_with_md_headers['chunk_text'].apply(lambda x: len(x.split()))
df_chunks_with_md_headers.head(100)
chunk_id chunk_text source chunk_text_length
0 chunk0_0 Structure of a Sprint \nThe purpose of this d... ../data/solutionops/code-with-engineering/SPRI... 468
1 chunk1_0 Who We Are \nOur team, ISE (Industry Solution... ../data/solutionops/code-with-engineering/ISE.md 263
2 chunk2_0 Engineering Fundamentals Checklist \nThis che... ../data/solutionops/code-with-engineering/ENG-... 793
3 chunk3_0 ISE Code-With Engineering Playbook \nAn engin... ../data/solutionops/code-with-engineering/inde... 357
4 chunk4_0 Work Item ID \nFor more information about how... ../data/solutionops/code-with-engineering/code... 325
... ... ... ... ...
95 chunk90_0 Test-Driven Development Example \nWith this m... ../data/solutionops/code-with-engineering/auto... 1196
96 chunk91_0 Custom Connector Testing \nWhen developing Cu... ../data/solutionops/code-with-engineering/auto... 154
97 chunk92_0 Example: Authoring a unit test \nTo illustrat... ../data/solutionops/code-with-engineering/auto... 1402
98 chunk93_0 ~Customer Project~ Case Study \nBackground \... ../data/solutionops/code-with-engineering/auto... 218
99 chunk94_0 Templates \ncase-study-template \ntest-type-... ../data/solutionops/code-with-engineering/auto... 3

100 rows × 4 columns

fig = px.histogram(df_chunks_with_md_headers, x='chunk_text_length', title='Histogram of Chunk Text Length', nbins=80, marginal='box')
fig.show()

Chunking with Markdown headers + recursive text splitter#

def create_chunks_md_headers_and_recsplitter(documents: list) -> dict:
    """Split on Markdown headers, then further split each piece with a recursive character splitter."""
    print("Creating chunks...")
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]

    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)

    chunk_id = 0
    chunks = {
        "chunk_id": [],
        "chunk_text": [],
        "source": []
    }

    chunk_size = 250
    chunk_overlap = 30

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )

    for document in tqdm.tqdm(documents):
        current_chunks_text_list_with_md_headers = markdown_splitter.split_text(document[0].page_content)
        current_chunks_text_list = text_splitter.split_documents(current_chunks_text_list_with_md_headers)

        for i, chunk in enumerate(current_chunks_text_list):
            chunks['chunk_id'].append(f"chunk{chunk_id}_{i}")
            chunks['chunk_text'].append(chunk.page_content)
            chunks['source'].append(document[0].metadata['source'])

        chunk_id += 1
    return chunks
chunks_with_md_headers_and_recsplitter = create_chunks_md_headers_and_recsplitter(final_markdown_documents["document_object"])
Creating chunks...
100%|██████████| 397/397 [00:01<00:00, 249.19it/s]
df_chunks_with_md_headers_and_recsplitter= pd.DataFrame(chunks_with_md_headers_and_recsplitter)
df_chunks_with_md_headers_and_recsplitter.head()
chunk_id chunk_text source
0 chunk0_0 Structure of a Sprint \nThe purpose of this d... ../data/solutionops/code-with-engineering/SPRI...
1 chunk0_1 Extensible hierarchy to allow teams to share d... ../data/solutionops/code-with-engineering/SPRI...
2 chunk0_2 Before starting the project \n[ ] Discuss and... ../data/solutionops/code-with-engineering/SPRI...
3 chunk0_3 Estimation \n[ ] Set up the repository/reposi... ../data/solutionops/code-with-engineering/SPRI...
4 chunk0_4 [ ] Build a Product Backlog \nSet up a projec... ../data/solutionops/code-with-engineering/SPRI...
df_chunks_with_md_headers_and_recsplitter['chunk_text_length'] = df_chunks_with_md_headers_and_recsplitter['chunk_text'].apply(lambda x: len(x.split()))
fig = px.histogram(df_chunks_with_md_headers_and_recsplitter, x='chunk_text_length', title='Histogram of Chunk Text Length', nbins=80, marginal='box')
fig.show()

Doing both is too restrictive, and we end up with a lot of chunks that are too small.
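To put a rough number on that claim, one can count how many chunks fall below a small word threshold (the 30-word cut-off below is an arbitrary illustration).

# Rough check of how many chunks end up very short (threshold is arbitrary).
short = (df_chunks_with_md_headers_and_recsplitter["chunk_text_length"] < 30).sum()
total = len(df_chunks_with_md_headers_and_recsplitter)
print(f"{short}/{total} chunks ({short / total:.1%}) have fewer than 30 words")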

Create chunks using the MarkdownTextSplitter with the tiktoken encoder#

To decide the length of each chunk, we will look at the distribution of the document lengths and the distribution of the chunk lengths. We will then decide on a chunk length that will give us a good distribution of chunk lengths. We will use a splitter that uses the tiktoken tokenizer to split the documents into chunks. Hence we do the following:

  1. Tokenize all the documents, and look at the distribution of the document lengths

  2. Based on the above distribution, decide on a chunk length

  3. Split the documents into chunks of the decided length using MarkdownTextSplitter.from_tiktoken_encoder()

import tiktoken


encoding = tiktoken.get_encoding("cl100k_base")

subset_df_markdown_documents['tokens'] = subset_df_markdown_documents['document_text'].apply(lambda x: encoding.encode(x))
subset_df_markdown_documents['n_tokens'] = subset_df_markdown_documents['tokens'].apply(len)
subset_df_markdown_documents.head()
solutionops_section document_object document_text document_length document_path tokens n_tokens
346 code-with-engineering [page_content='Structure of a Sprint\n\nThe pu... Structure of a Sprint\n\nThe purpose of this d... 468 ../data/solutionops/code-with-engineering/SPRI... [23807, 315, 264, 45912, 271, 791, 7580, 315, ... 615
347 code-with-engineering [page_content='Who We Are\n\nOur team, ISE (In... Who We Are\n\nOur team, ISE (Industry Solution... 263 ../data/solutionops/code-with-engineering/ISE.md [15546, 1226, 8886, 271, 8140, 2128, 11, 358, ... 302
348 code-with-engineering [page_content="Engineering Fundamentals Checkl... Engineering Fundamentals Checklist\n\nThis che... 793 ../data/solutionops/code-with-engineering/ENG-... [87100, 13492, 78114, 96069, 271, 2028, 53673,... 994
349 code-with-engineering [page_content='ISE Code-With Engineering Playb... ISE Code-With Engineering Playbook\n\nAn engin... 357 ../data/solutionops/code-with-engineering/inde... [9311, 6247, 84256, 17005, 7199, 2239, 271, 21... 454
350 code-with-engineering [page_content="Work Item ID\n\nFor more inform... Work Item ID\n\nFor more information about how... 325 ../data/solutionops/code-with-engineering/code... [6919, 5858, 3110, 271, 2520, 810, 2038, 922, ... 402

[TODO: ADD CORRELATION between n_tokens and document_lengths]

fig = px.histogram(subset_df_markdown_documents, x='n_tokens', title='Distribution of tokenization lengths', nbins=80, marginal='box')
fig.show()

We decide to use a chunk size of Q1/2 = 360/2 = 180 tokens, so that 75%+ of the documents will be represented by at least 2 chunks.
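The 360-token first quartile is read off the box plot above; it can also be checked programmatically with a small sketch like the one below (the exact values depend on the repo snapshot).

# Quartiles of the tokenized document lengths; Q1 (~360 tokens here) drives the chunk size.
quartiles = subset_df_markdown_documents["n_tokens"].quantile([0.25, 0.5, 0.75])
print(quartiles)
print(f"Chosen chunk size: {int(quartiles.loc[0.25] // 2)} tokens")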

def create_chunks_tokens(documents: list) -> dict:
    """Split each document into ~180-token chunks using the tiktoken-aware Markdown splitter."""
    print("Creating chunks...")

    markdown_splitter = MarkdownTextSplitter.from_tiktoken_encoder(
        chunk_size=180, chunk_overlap=30
    )

    chunk_id = 0
    chunks = {
        "chunk_id": [],
        "chunk_text": [],
        "source": []
    }

    for document in tqdm.tqdm(documents):
        current_chunks_text_list = markdown_splitter.split_text(document[0].page_content)
        for i, chunk in enumerate(current_chunks_text_list):
            chunks['chunk_id'].append(f"chunk{chunk_id}_{i}")
            chunks['chunk_text'].append(chunk)
            chunks['source'].append(document[0].metadata['source'])

        chunk_id += 1
    return chunks
chunks_with_tokens = create_chunks_tokens(final_markdown_documents["document_object"])
Creating chunks...
100%|██████████| 397/397 [00:03<00:00, 128.19it/s]
df_chunks_with_tokens = pd.DataFrame(chunks_with_tokens)
df_chunks_with_tokens.head()
chunk_id chunk_text source
0 chunk0_0 Structure of a Sprint\n\nThe purpose of this d... ../data/solutionops/code-with-engineering/SPRI...
1 chunk0_1 [ ] Build a Product Backlog\n\nSet up a projec... ../data/solutionops/code-with-engineering/SPRI...
2 chunk0_2 Design the first test cases\n\n[ ] Decide on b... ../data/solutionops/code-with-engineering/SPRI...
3 chunk0_3 Day 3\n\n[ ] Agree on code style and on how to... ../data/solutionops/code-with-engineering/SPRI...
4 chunk0_4 [ ] Agree on how to Design a feature and condu... ../data/solutionops/code-with-engineering/SPRI...
df_chunks_with_tokens['chunk_text_length'] = df_chunks_with_tokens['chunk_text'].apply(lambda x: len(x.split()))
fig = px.histogram(df_chunks_with_tokens, x='chunk_text_length', title='Histogram of Chunk Text Length', nbins=80, marginal='box')
fig.show()

QA generation#

chunks = create_chunks_tokens(final_markdown_documents["document_object"])
Creating chunks...
100%|██████████| 397/397 [00:03<00:00, 129.98it/s]
def load_documents_from_folder_and_reduce(path: str, subset: list[str]) -> dict:
    """Load all markdown documents matching `path` and keep only the sections in `subset`."""
    print("Loading documents...")
    data_documents = {
        "solutionops_section": [],
        "document_object": [],
        "document_text": [],
        "document_length": [],
        "document_path": []
    }

    for file in tqdm.tqdm(glob.glob(path, recursive=True)):
        loader = UnstructuredFileLoader(file)
        document = loader.load()
        data_documents["solutionops_section"].append(file.split("/")[3])
        data_documents["document_object"].append(document)
        data_documents["document_text"].append(document[0].page_content)
        data_documents["document_length"].append(len(document[0].page_content.split()))
        data_documents["document_path"].append(file)

    # Build the DataFrame once, after the loop, and filter to the requested sections
    df_documents = pd.DataFrame(data_documents)
    subset_df_documents = df_documents[df_documents['solutionops_section'].isin(subset)]
    return subset_df_documents.to_dict('list')
def process_documents_from_path(path: str, subset:list[str]) -> dict:
    documents = load_documents_from_folder_and_reduce(path, subset)
    chunks = create_chunks_tokens(documents['document_object'])
    return chunks
def create_qa_generation_prompt_from_chunk_context(chunk_text: str) -> list[dict]:
    system_prompt = """you are a prompt creator and have ability to generate new JSON prompts based on the given CONTEXT.
Generate 1 most relevant new prompt in valid json format according to "RESPONSE SCHEMA EXAMPLE" completely from scratch.
"RESPONSE SCHEMA EXAMPLE":
[
    {
        "role: "user",
        "content": "This is the generated prompt text",
    },
    {
        "role: "assistant",
        "content": "the expected, rightful and correct answer for the question"
    },
]
"""
    user_prompt = """The response must be valid JSON array containing two objects. The first object must contain the keys "role" and "content". The second object must also contain the keys "role" and "content".
    The response must follow the "RESPONSE SCHEMA EXAMPLE".
    The most important thing is that the response must be valid JSON ARRAY. DO NOT include anything other than valid schema.

    CONTEXT:

    """
    final_messages = [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": user_prompt + chunk_text + "\nEND OF CONTEXT"
        }
    ]
    return final_messages
def create_qa_generation_prompt_from_chunk_context_mlflow(chunk_text: str) -> list[dict]:
    prompt = f"""Please generate a question asking for the key information in the given paragraph.
    Also answer the questions using the information in the given paragraph.
    Please ask the specific question instead of the general question, like
    'What is the key information in the given paragraph?'.
    Please generate the answer using as much information as possible.
    If you are unable to answer it, please generate the answer as 'I don't know.'
    The answer should be informative and should be more than 3 sentences.

    Paragraph: {chunk_text}

    Please call the submit_function function to submit the generated question and answer.
    """

    messages = [{"role": "user", "content": prompt}]
    return messages
def generate_qa_from_chunk_text(chunk_text:str) -> dict:
    final_prompt = create_qa_generation_prompt_from_chunk_context(chunk_text)
    return final_prompt
# NOTE: this cell configures the legacy (pre-1.0) openai module-level client;
# the async calls further below use the 1.x AsyncAzureOpenAI client instead.
openai.api_key = os.getenv("AZURE_OPENAI_KEY")
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT") # your endpoint should look like the following https://YOUR_RESOURCE_NAME.openai.azure.com/
openai.api_type = 'azure'
openai.api_version = '2023-05-15' # this might change in the future

def call_aoai_gpt4(messages: list[dict]):
    response = openai.ChatCompletion.create(
        engine="dep-gpt4", # engine = "deployment_name".
        messages=messages
    )
    
    return response.choices[0].message.content
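Note that call_aoai_gpt4 above relies on the legacy pre-1.0 openai interface. With the 1.x SDK already imported in this notebook, a roughly equivalent synchronous call would look like the sketch below (same deployment name, credentials taken from the environment variables set earlier).

from openai import AzureOpenAI

# Sketch of the same call using the openai>=1.0 client interface.
sync_client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2023-05-15",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

def call_aoai_gpt4_v1(messages: list[dict]) -> str:
    response = sync_client.chat.completions.create(
        model="dep-gpt4",  # deployment name, as above
        messages=messages,
    )
    return response.choices[0].message.content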
# @backoff.on_exception(backoff.expo, openai.RateLimitError)
async def call_aoai_gpt4_async(
    client: AsyncAzureOpenAI,
    list_pairs: list,
    chunk_id: str,
    chunk_text: str,
    source: str):
    

    try:

        prompt_from_chunk_context = generate_qa_from_chunk_text(chunk_text)

        response = await client.chat.completions.create(model="gpt-4-1106", messages=prompt_from_chunk_context)
        response_dict = json.loads(response.choices[0].message.content)
        for item in response_dict:
            if item["role"] == "user":
                user_prompt = item["content"]
            if item["role"] == "assistant":
                output_prompt = item["content"]

        data = {
            "user_prompt": user_prompt,
            "output_prompt": output_prompt,
            "context": chunk_text,
            "chunk_id": chunk_id,
            "source": source
        }
        list_pairs.append(data)
            # final_df = final_df._append(data, ignore_index=True) #logic needs to be rewritten here to write better code to create DF (lambd/map)
    except Exception as e:
        print(f"Error: {e}")
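The backoff decorator at the top of this cell is commented out. If rate limits become an issue, retrying the raw completion call with exponential backoff could look roughly like the sketch below (it uses the already-imported backoff package; max_tries is an arbitrary choice).

# Sketch: retry the chat completion on rate-limit errors with exponential backoff.
@backoff.on_exception(backoff.expo, openai.RateLimitError, max_tries=6)
async def chat_completion_with_retry(client: AsyncAzureOpenAI, messages: list[dict]):
    return await client.chat.completions.create(model="gpt-4-1106", messages=messages)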

Creation of QA Pairs#

Creation of jsonl files containing messages#

def export_messages_to_jsonl(model:str, list_messages: list[str]) -> None:
    with open("../data/jsonl/new_messages_solutionops.jsonl", "w") as f:
        for message in list_messages:
            data = {
                "model": model,
                "messages": message
            }
            json_string = json.dumps(data)
            f.write(json_string + "\n")
chunks = process_documents_from_path("../data/solutionops/**/*.md", subset=['code-with-engineering', 'code-with-mlops'])
Loading documents...
100%|██████████| 776/776 [02:30<00:00,  5.14it/s]
Creating chunks...
100%|██████████| 397/397 [00:02<00:00, 134.37it/s]
list_messages = [generate_qa_from_chunk_text(chunks['chunk_text'][i]) for i in range(len(chunks['chunk_text']))]
list_messages[0]
[{'role': 'system',
  'content': 'you are a prompt creator and have ability to generate new JSON prompts based on the given CONTEXT.\nGenerate 1 most relevant new prompt in valid json format according to "RESPONSE SCHEMA EXAMPLE" completely from scratch.\n"RESPONSE SCHEMA EXAMPLE":\n[\n    {\n        "role: "user",\n        "content": "This is the generated prompt text",\n    },\n    {\n        "role: "assistant",\n        "content": "the expected, rightful and correct answer for the question"\n    },\n]\n'},
 {'role': 'user',
  'content': 'The response must be valid JSON array containing two objects. The first object must contain the keys "role" and "content". The second object must also contain the keys "role" and "content".\n    The response must follow the "RESPONSE SCHEMA EXAMPLE".\n    The most important thing is that the response must be valid JSON ARRAY. DO NOT include anything other than valid schema.\n\n    CONTEXT:\n\n    Structure of a Sprint\n\nThe purpose of this document is to:\n\nOrganize content in the playbook for quick reference and discoverability\n\nProvide content in a logical structure which reflects the engineering process\n\nExtensible hierarchy to allow teams to share deep subject-matter expertise\n\nThe first week of an ISE Project\n\nBefore starting the project\n\n[ ] Discuss and start writing the Team Agreements. Update these documents with any process decisions made throughout the project\n\nWorking Agreement\n\nDefinition of Ready\n\nDefinition of Done\n\nEstimation\n\n[ ] Set up the repository/repositories\n\nDecide on repository structure/s\n\nAdd README.md, LICENSE, CONTRIBUTING.md, .gitignore, etc\n\n[ ] Build a Product Backlog\nEND OF CONTEXT'}]
export_messages_to_jsonl("gpt-4-1106", list_messages)
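A quick sanity check of the exported file (hypothetical snippet) is to read the first line back and confirm that it parses and carries the expected keys.

# Sanity-check the exported JSONL: the first line should parse and contain "model" and "messages".
with open("../data/jsonl/new_messages_solutionops.jsonl") as f:
    first = json.loads(f.readline())
print(first["model"], len(first["messages"]), "messages")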

Use parallel script#

async def generate_qa_pairs(path: str, subset: list[str]) -> Tuple[pd.DataFrame, list]:
    chunks_dict = process_documents_from_path(path, subset)

    print("Got chunks")

    client = AsyncAzureOpenAI(
        api_key=aoai_key,
        api_version="2023-05-15",
        azure_endpoint=aoai_endpoint
    )

    semaphore = asyncio.Semaphore(130)
    list_pairs = []
    tasks = [call_aoai_gpt4_async(client, list_pairs, chunks_dict['chunk_id'][i], chunks_dict['chunk_text'][i], chunks_dict['source'][i]) for i in range(len(chunks_dict['chunk_id']))]
    with tqdm.tqdm(total=len(tasks), bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt}" +
     " [{elapsed}<{remaining}, {rate_noinv_fmt}]") as pbar:
        for task in asyncio.as_completed(tasks):
            async with semaphore:
                await task
                pbar.update()

    dataframe = pd.DataFrame(list_pairs)
    return (dataframe, list_pairs)
final_df, list_qa_pairs = await generate_qa_pairs("../data/solutionops/**/*.md", ['code-with-engineering', 'code-with-mlops'])
final_df
list_qa_pairs
[{'user_prompt': 'Can you explain the Transactional Systems pillar in the context of the capability hierarchy?',
  'output_prompt': "I'm sorry, I don't have that specific information about the 'Transactional Systems' pillar in the context of the capability hierarchy at the moment. However, in general, it would likely involve capabilities related to managing systems that support real-time, transaction-based data processing.",
  'context': 'Capabilities are the most granular items in the hierarchy that describe a generalized functionality and it\'s where bulk of the content resides.\n\nIn summary, this hierarchy can be represented as below:\n\ntext\nCapability Pillars (has Capability Maps)\n  - Capability Groups\n    - Capabilities\n\nCapability Pillars Description\n\nDevOps for Data\n\n"DevOps for Data" is the core functional area that sets the foundation for implementing DevOps capabilities for data solutions. This pillar is fundamental and supports all other pillars and capabilities.\n\nFor detailed information, see DataOps: DevOps for Data.\n\nAnalytical Systems\n\nThe "Analytical Systems" pillar contains guidance in relation to analytical systems. The primary use case is to support data-driven business decisions in the form of data products, reports, interactive dashboards, etc. Common technical patterns include traditional ETL and Business Intelligence, Modern Data Warehouse (ingest, process, and serve) and Lambda / Kappa architectures for real-time data systems. Most of the data sharing scenarios will sit under this pillar.\n\nFor detailed information, see DataOps: Analytical Systems.\n\nTransactional Systems',
  'chunk_id': 'chunk1_1',
  'source': '../data/solutionops/code-with-dataops/capabilities/index.md'},
 {'user_prompt': 'What are some fundamental principles I need to keep in mind while ensuring my software solution is secure, portable, and integrates well with various platforms?',
  'output_prompt': "You need to follow certain best practices such as always writing code keeping security in mind. This should be done at all stages of the engineering lifecycle. Try to make your solutions portable so that they can run on various platforms like edge devices, on-premises data centers, or different clouds. Always aim for integrated solutions incorporating various products and services to create end-to-end solutions. It's also important to ensure that your code is proven with real customers and has generalized documentation to remove confidential details.",
  'context': "Making application deployments and operations secure and observable.\n\nSecuring applications by following security best practices at all stages of the engineering lifecycle.\n\nWriting portable solutions that run in multiple locations, including edge devices, on-premises data centers, the Microsoft cloud, and competitor's clouds.\n\nIntegrated solutions\n\nPlaybook solutions span multiple Microsoft products and services and focus on creating integrated end-to-end solutions often using a range of open-source software libraries.\n\nProven with real customers\n\nAll code linked from playbook solutions and capabilities was created working with our customers to develop production solutions. This documentation and code is generalized to remove confidential details.\n\nReferences\n\nEngineering Fundamentals\n\nAzure Architecture Center: Data Checklist",
  'chunk_id': 'chunk0_1',
  'source': '../data/solutionops/code-with-dataops/index.md'},
 {'user_prompt': 'Can you provide an overview of the Data Governance and Protection process?',
  'output_prompt': "Data Governance and Protection is concerned with the governance process for an organization's data, focusing on enterprise systems. It embraces security and compliance to ensure the data is reliable and meets preset standards. This includes adherence to both internal and external regulations and policies. A high-level capability map can be created to visualize and understand these processes better.",
  'context': "rings:\n  - public\n\nData Governance and Protection\n\nThis pillar describes the governance process of the data for enterprise systems. It also covers the security and compliance aspects of managing data so that it's reliable and meets standards. Additionally, it covers regulations and policies of the organization.\n\nCapability Map\n\nHere is a high-level capability map for Data Governance and Protection:",
  'chunk_id': 'chunk2_0',
  'source': '../data/solutionops/code-with-dataops/capabilities/data-governance-and-protection/index.md'},
 {'user_prompt': 'Could you explain more about the importance of DataOps in Data Governance and how it contributes to the overall data strategy of an organization?',
  'output_prompt': "DataOps is crucial in Data Governance as it ensures high-quality, reliable data is available when and where it's needed. It lets organizations map, catalog, classify, and label data with a common business glossary, fostering better data discovery. By capturing data's lineage and versioning, it provides visibility into data modifications and history. This improves transparency and trust in data, aiding in data-based decision making and supporting overall business objectives.",
  'context': 'rings:\n  - public\n\nData Governance\n\nData Governance is a set of processes, tools, technologies, people, standards/policies that enable an organization implement control over data/data assets effectively. It permits the support of its business objectives. Data Governance defines the processes, tools and responsibilities to discover and ensure the quality and security of data.\n\nImplementing unified data governance across any organization comes with its own challenges and the complexity multiplies as the size of the organization grows. So, it becomes crucial to think about data governance as a fundamental block of an enterprise’s data strategy.\n\nThere are several motivations for implementing a data governance strategy. It helps in creating a unified map of data across the entire data estate, which facilitates easy data discovery. Once the data is visible, it can be cataloged, classified, labeled, and have a common business glossary applied to it. Integrated with the catalog, the data systems can then capture the lineage of data as it flows through different systems in the data estate.\n\nThe following characteristics of Data Governance are discussed in context of "data":\n\nDataOps: Data Discovery\n\nDataOps: Data Catalog\n\nData Classification\n\nGlossary\n\nDataOps: Data Lineage\n\nDataOps: Data Versioning\n\nDataOps: Data Quality',
  'chunk_id': 'chunk3_0',
  'source': '../data/solutionops/code-with-dataops/capabilities/data-governance-and-protection/data-governance/index.md'},
 {'user_prompt': 'What is the purpose of Data Governance and Protection pillar in DataOps?',
  'output_prompt': "The 'Data Governance and Protection' pillar in DataOps refers to the guidelines to manage, govern, and protect your complete data estate, which includes both OLTP (Online Transaction Processing systems) and OLAP (Online Analytical Processing systems). If there is any technical guidance specific to individual tasks, it would be added to the individual capabilities within other pillars. Additionally, the content is linked back to this pillar in order to refer to common governance practices.",
  'context': 'For detailed information, see DataOps: Analytical Systems.\n\nTransactional Systems\n\nThe "Transactional Systems" pillar contains guidance in relation to systems that underpins operational business transactions. Traditionally, transactional systems primarily encompassed data systems used for real commercial transactions like sales, online banking, or stock trading. However, the scope has been expanded to include any data system that requires low latency read and writes.\n\nIn most cases, OLTP systems provide ACID guarantees; however, certain systems may prioritize lower latency and intentionally relax one of the ACID properties.\n\n{% if extra.ring == \'internal\' %}For detailed information, see DataOps: Transactional Systems.{% endif %}\n{% if extra.ring == \'public\' %}This capability pillar is not currently covered in the playbook. It will be added in the future.{% endif %}\n\nData Governance and Protection\n\nThe "Data Governance and Protection" pillar contains all related guidance to manage, govern and protect the entire data estate including both OLTP and OLAP data systems. If there is specific technical guidance, it would be added to individual capabilities within other pillars. The content would then be linked back to this pillar to refer to common governance practices.\n\nFor detailed information, see DataOps: Data Governance and Protection.\n\nData Platform Infrastructure',
  'chunk_id': 'chunk1_2',
  'source': '../data/solutionops/code-with-dataops/capabilities/index.md'},
 {'user_prompt': "Can you explain more about the Data Platform Infrastructure pillar in the context of Azure's data estate?",
  'output_prompt': "The 'Data Platform Infrastructure' is a pillar in Azure's data estate that provides guidelines for designing and building the platform or infrastructure layer. It heavily draws upon the existing guidance of the Cloud Adoption Framework (CAF), specifically the CAF: Cloud-Scale Analytics scenario. The Data Platform Infrastructure pillar assists in organizing Azure services into a logical architecture and helps in implementing central security and governance.",
  'context': 'For detailed information, see DataOps: Data Governance and Protection.\n\nData Platform Infrastructure\n\nThe "Data Platform Infrastructure" pillar contains guidance as it relates to designing and building the platform / infrastructure layer of a data estate on Azure. This pillar will draw heavily on the existing guidance of the Cloud Adoption Framework (CAF) specifically the CAF: Cloud-Scale Analytics scenario. It includes guidance on how to organize Azure services into a logical architecture and implement central security and governance.\n\n{% if extra.ring == \'internal\' %}For detailed information, see DataOps: Data Platform Infrastructure.{% endif %}\n{% if extra.ring == \'public\' %}This capability pillar is not currently covered in the playbook. It will be added in the future.{% endif %}',
  'chunk_id': 'chunk1_3',
  'source': '../data/solutionops/code-with-dataops/capabilities/index.md'},
 {'user_prompt': 'What are the key components of a Data Catalog?',
  'output_prompt': "A Data Catalog is a central collection of metadata of the organization's data assets and includes related search and data management tools. Its key components contain both technical properties and business metadata of the dataset. Technical properties include data schema, structure, physical location, type, format, approximate size, and lineage. On the other hand, business metadata can include related glossary terms, data owners, data stewards, data classifications, data sensitivity, and more.",
  'context': 'rings:\n  - public\n\nData Catalog\n\nA core component of an effective Data Governance strategy is having a well-curated and complete Data Catalog. A data catalog is a central collection of metadata of the organizations data assets including related search and data management tools. Key metadata consists both technical properties of the dataset and business metadata. Technical properties include data schema and structure, physical location, type, format, approximate size, and lineage. Business metadata, on the other hand, can include:\n    - related glossary terms\n    - data owners\n    - data stewards\n    - data classifications\n    - data sensitivity\nand more.\n\nImportance of having a Data Catalog\n\nFor data governance and policy officers, a data catalog allows them to have a single pane of glass to govern the organization’s data estate. This fact is important for datasets that must adhere to external regulatory requirements such as GDPR. Furthermore, it can provide other insights to understand how data is being created and used across the data estate.',
  'chunk_id': 'chunk4_0',
  'source': '../data/solutionops/code-with-dataops/capabilities/data-governance-and-protection/data-governance/data-catalog/index.md'},
 {'user_prompt': 'What are the key features a data catalog should enable for efficient data governance and management?',
  'output_prompt': 'A data catalog should enable the following key features for effective data governance and management: 1. Onboarding dataset metadata that ideally updates regularly, automatically and at scale, generally through data source scanning. This would capture all technical metadata associated with each dataset including data lineage. 2. It should allow data owners and stewards to curate and enrich metadata with business-specific metadata such as associated glossary terms, enabling automatic classification of dataset at scale. 3. Lastly, it should enable data discoverability and evaluation, allowing users to efficiently search and browse for datasets across technical metadata dimensions and business semantics. Also, it should permit users to quickly and easily evaluate datasets to determine if they are fit for purpose.',
  'context': 'For data engineers and business users, it enables better discoverability of datasets, reduces dataset duplication, and clarifies dataset ownership and stewardship. This feature also allows users to quickly identify any associated considerations that are needed for appropriate dataset usage. For example, the dataset’s lineage, data sensitivity, and any associated data policies, guardrails, and controls. This feature enables better cross-team collaboration for data engineering efforts across the organization.\n\nFor MLOps perspective on Data Catalogs, see MLOps: Data Catalog.\n\nFunctions of a Data Catalog\n\nAt a minimum, a data catalog should enable the following features:\n\nOnboard dataset metadata – Ideally metadata updates happens regularly, automatically and at scale, generally through data source scanning, which would capture all technical metadata associated with each dataset including data lineage. Check DataOps: Data Lineage for more information.\n\nCurate and enrich – Once the dataset metadata have been onboarded, the catalog should enable dataset owners and stewards to enrich metadata with any business-specific metadata such as associated glossary terms. It should also ideally enable automatic classification of dataset at scale for effective data governance.\n\nEnable data discoverability and evaluation – Finally, the data catalog should enable users to efficiently search and browse for datasets across technical metadata dimensions and business semantics. It should also allow users to quickly and easily evaluate datasets to determine if they are fit for purpose. This check can be done by:',
  'chunk_id': 'chunk4_1',
  'source': '../data/solutionops/code-with-dataops/capabilities/data-governance-and-protection/data-governance/data-catalog/index.md'},
 {'user_prompt': 'What are the five capability pillars in the DataOps section of the playbook?',
  'output_prompt': 'The five capability pillars in the DataOps section of the playbook are: DevOps for Data, Transactional Systems, Analytical Systems, Data Governance and Protection, and Data Platform Infrastructure.',
  'context': "rings:\n  - public\n\nCapabilities\n\nCapabilities define the fundamental conceptual building blocks within a technical area. This section not only describes the core concept behind each capability, but also provides specific implementations using various Azure services, programming languages and usage.\n\nUnderstanding the Structure\n\nThe playbook uses Capability Maps to organize different capabilities.\n\nCapability Pillars\n\nThe capability pillars are the highest level grouping with DataOps section of playbook. Currently, there are five such pillars:\n\nDevOps for Data\n\nTransactional Systems\n\nAnalytical Systems\n\nData Governance and Protection\n\nData Platform Infrastructure\n\n{% if extra.ring == 'internal' %}\n\n{% endif %}\n{% if extra.ring == 'public' %}\n\n{% endif %}\n\nEach capability pillar also has a capability map that provides a depiction of how different capabilities are grouped together and interact with each other. The different sections of the capability maps are clickable and takes the reader directly to the specific capability or group.\n\nCapability Groups\n\nThe capability groups are the next level grouping within each capability pillar. These groups logically put together related capabilities. And all the capability groups together cover the overall capability pillar.\n\nIndividual Capabilities\n\nCapabilities are the most granular items in the hierarchy that describe a generalized functionality and it's where bulk of the content resides.",
  'chunk_id': 'chunk1_0',
  'source': '../data/solutionops/code-with-dataops/capabilities/index.md'}]

Other#

system_prompt = """you are a prompt creator and have ability to generate new JSON prompts based on the given CONTEXT.
Generate 1 most relevant new prompt in valid json format according to "RESPONSE SCHEMA EXAMPLE" completely from scratch.
"RESPONSE SCHEMA EXAMPLE":
[
    {
        "role: "user",
        "content": "This is the generated prompt text",
    },
    {
        "role: "assistant",
        "content": "the expected, rightful and correct answer for the question"
    },
]
"""
user_prompt = """The response must be valid JSON array containing two objects. The first object must contain the keys "role" and "content". The second object must also contain the keys "role" and "content".
The response must follow the "RESPONSE SCHEMA EXAMPLE".
The most important thing is that the response must be valid JSON ARRAY. DO NOT include anything other than valid schema.

CONTEXT:

"""
context = "\n\n    rings:\n  - public\n\nData Playbook\n\nThe Data Playbook provides enterprise software engineers with solutions, capabilities, and code developed to solve real-world problems. Everything in the playbook is developed with, and validated by, some of Microsoft\'s largest and most influential customers and partners.\n\n{% if extra.ring == \'internal\' %}\nYou are invited to share your enterprise-grade production solutions as well. Refer to Contributing to the Solutions Playbook.\n\n{% endif %}\n\nData Solutions\n\nModern Data Warehouse solution\n{% if extra.ring == \'internal\' %}\n\nData Mesh solution\n\nAnalytics and ML for enterprise business applications solution\n\nEnterprise Data Sharing solution\n{% endif %}\n\n{% if extra.ring == \'internal\' %}\n\n{% else %}\n\n{% endif %}\n\nAbout the Data Playbook\n\nThese Playbook solutions employ good engineering practices to accelerate real-world application development. Common themes include:\n\nImproving application design and developer productivity by sharing code and knowledge developed by experts for Microsoft customers.\n\nUsing automation to make repetitive tasks faster, more reliable, and auditable\n\nMaking application deployments and operations secure and observable.\n\nSecuring applications by following security best practices at all stages of the engineering lifecycle.\nEND OF CONTEXT"
messages = create_qa_generation_prompt_from_chunk_context(context)
client = AsyncAzureOpenAI(
    api_key=aoai_key,
    api_version="2023-05-15",
    azure_endpoint=aoai_endpoint
)
response = await client.chat.completions.create(model="dep-gpt4", messages=messages)
print(response.choices[0].message.content)
[
    {
        "role": "user",
        "content": "Can you tell me more about the Modern Data Warehouse solution in the Data Playbook?"
    },
    {
        "role": "assistant",
        "content": "The Modern Data Warehouse solution in the Data Playbook is designed to help enterprise software engineers develop efficient and robust data warehouse systems. It provides solutions, capabilities, and code to solve real-world data problems that are validated by Microsoft's largest customers. Also, it adheres to good engineering practices to enhance application development, productivity, and security across all stages of the engineering lifecycle."
    }
]
response_dict = json.loads(response.choices[0].message.content)
for item in response_dict:
    if item["role"] == "user":
        print(item["content"])
        # user_prompt = item["content"]
    if item["role"] == "assistant":
        print(item["content"])
        # output_prompt = item["content"]
Can you tell me more about the Modern Data Warehouse solution in the Data Playbook?
The Modern Data Warehouse solution in the Data Playbook is designed to help enterprise software engineers develop efficient and robust data warehouse systems. It provides solutions, capabilities, and code to solve real-world data problems that are validated by Microsoft's largest customers. Also, it adheres to good engineering practices to enhance application development, productivity, and security across all stages of the engineering lifecycle.
asyncio.run(generate_qa_pairs("../data/solutionops/**/*.md", ['code-with-engineering', 'code-with-mlops']))

Creating df from output jsonl file#

import json 
import pandas as pd 
import re
import tqdm

with open("../data/jsonl/new_messages_solutionops_result.jsonl", "r") as file:

    data_final = {
        'question': [],
        'answer': [],
        'source': []
    }
    n_failures = 0
    print("exporting...")
    for line in tqdm.tqdm(file):
        try:
            dict_line = json.loads(line)
            qa_pair_dict = json.loads(dict_line[1]['choices'][0]['message']['content'])
            for item in qa_pair_dict:
                if item["role"] == "user":
                    question = item["content"]
                if item["role"] == "assistant":
                    answer = item["content"]
            # print(dict_line[0]['messages'][1]['content'])
            source = re.search(r'CONTEXT:(.*)END OF CONTEXT', dict_line[0]['messages'][1]['content'], re.DOTALL).group(1).strip()
            # source = re.search(r'CONTEXT:(.*)+', dict_line[0]['messages'][1]['content']).group(1)
            
            if question and answer and source:
                data_final['question'].append(question)
                data_final['answer'].append(answer)
                data_final['source'].append(source)

        except Exception as e:
            n_failures += 1
    print(f"Number of failures: {n_failures}")
exporting...
679it [00:00, 21684.82it/s]
Number of failures: 480

len(data_final['question'])
199
df_final = pd.DataFrame(data_final)
df_final
question answer source
0 What are the steps one should follow before cr... Before creating a new pull request, make sure ... Pull Requests\n\nChanges to any main codebase ...
1 Can you summarize the key points to enhance ou... Certainly! To boost the team's efficiency, adh... To increase overall efficiency for team member...
2 What are some best practices for submitting a ... Best practices for submitting a PR include ens... be consistent,\n\nnot break the build, and\n\n...
3 Why do we need code reviews? Code reviews are essential for ensuring that t... Code Review Pull Request Source code focused I...
4 I heard that golint is no longer being maintai... Yes, that's correct. The golint library has be... golint\n\n:exclamation: NOTICE: The golint lib...
... ... ... ...
194 How do I use Postman to create a Collection fo... To create a Collection in Postman for API test... Use Case - Hands-on Functional Testing Of Endp...
195 Can you explain the concept of Consumer-driven... Certainly! Consumer-Driven Contract Testing (C... Consumer-driven Contract Testing Design Blocks...
196 Considering the provided context about mocking... Mocks are objects pre-programmed with expectat... Some would argue that in the example above, th...
197 How would I go about creating a YAML file for ... To create a YAML file for a continuous integra... ```\n{% endraw %}\n\nCreate a yaml file and de...
198 Why is E2E testing important for commercial so... E2E testing is crucial for commercial software... For any commercial release of the software, E2...

199 rows × 3 columns
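Finally, the resulting question/answer pairs can be persisted for the evaluation runs (the output path below is an assumption).

# Persist the generated QA pairs as JSONL (output path is an assumption).
df_final.to_json("../data/jsonl/solutionops_qa_pairs.jsonl", orient="records", lines=True)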