Evaluating RAG LLMs using the RAGAS Python + Pytest Framework

  1. Data is fed to an embedding model, which stores it in a vector database based on semantic understanding.
  2. An LLM (like ChatGPT, Gemini, or Anthropic's Claude) queries this vector database and replies based on the user's prompt.
  3. RAG (Retrieval Augmented Generation)
    1. Retrieval: relevant data is fetched from the vector database
    2. Augmentation: the question is articulated together with the necessary retrieved context
    3. Generation: the LLM produces the final answer
  4. In traditional software testing we assert on exact system responses, but with LLMs we evaluate the quality of the response using benchmark metrics.
  5. Here we make use of the RAGAS framework.
  6. Benchmark metrics for the document retrieval system (see the small sketch after this item):
    1. context precision = number of relevant docs retrieved / total number of docs retrieved => e.g. 3/6 = 50%
    2. context recall = number of relevant docs retrieved / total number of relevant docs that exist (ground truth)
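
A quick worked example of the arithmetic behind these two metrics (plain Python, not part of the RAGAS API; the document names are made up):

retrieved_docs = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]  # what the retriever returned
relevant_docs = ["doc1", "doc3", "doc5", "doc7"]                   # all relevant docs that exist (ground truth)

relevant_retrieved = [d for d in retrieved_docs if d in relevant_docs]

context_precision = len(relevant_retrieved) / len(retrieved_docs)  # 3/6 = 0.5
context_recall = len(relevant_retrieved) / len(relevant_docs)      # 3/4 = 0.75
print(context_precision, context_recall)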
  7. Install Python
  8. Install the PyCharm IDE
    1. make sure to configure the path of the Python interpreter while importing the project (use "which python3" to find the installation path)
  9. Install the RAGAS framework package in the Python interpreter (pip install ragas)

  10. Install LangChain, as RAGAS communicates with the LLM through it (pip install langchain-openai for the OpenAI integration)
  11. Install pytest for testing, plus pytest-asyncio since the RAGAS scoring calls are async (pip install pytest pytest-asyncio)
  12. Install the requests library for invoking API calls (pip install requests)
  13. Let's understand this code
import os

import pytest
import requests
from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall

os.environ["OPENAI_API_KEY"] = "provide valid key"


@pytest.mark.asyncio
async def test_context_recall():
    question = "how many env docs are there?"
    llm = ChatOpenAI(model="gpt-4", temperature=0)  # create a LangChain chat model
    langchain_llm = LangchainLLMWrapper(llm)  # wrap it so the RAGAS framework can use it
    context_recall = LLMContextRecall(llm=langchain_llm)
    responseDict = requests.post("https://endpoint_url",
                                 json={
                                     "question": question,
                                     "chat_history": []
                                 }).json()
    sample = SingleTurnSample(
        user_input=question,
        retrieved_contexts=[responseDict["retrieved_docs"][0]["page_content"]],
        reference="23"
    )

    score = await context_recall.single_turn_ascore(sample)  # async scoring call
    print(score)
    assert score > 0.7

Let's walk through this Python test code step by step. It tests the context recall of a retrieval-augmented generation (RAG) system using the RAGAS evaluation framework.


πŸ”§ Imports and Setup

import os
import pytest
import requests
from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall
  • os: Used to set environment variables (like your API key).

  • pytest: A popular Python testing framework.

  • requests: To make HTTP POST requests (e.g., to a RAG API endpoint).

  • ChatOpenAI: Langchain's wrapper for OpenAI's models.

  • RAGAS imports: Used for RAG evaluation. SingleTurnSample models a single query-answer pair. LLMContextRecall is a metric that evaluates how well the retrieved context supports the reference answer.


πŸ”‘ Set API Key

os.environ["OPENAI_API_KEY"] = "provide valid key"

This sets your OpenAI API key as an environment variable so that Langchain can use it.


πŸ§ͺ Test Function: test_context_recall

@pytest.mark.asyncio
async def test_context_recall():
  • Marked with @pytest.mark.asyncio because it uses async/await.

  • This is a test case meant to be run with pytest.


Setup: Question & LLM

question = "how many env docs are there?"
llm = ChatOpenAI(model="gpt-4", temperature=0)
  • Defines the user's question.

  • Creates a GPT-4 model instance with temperature=0 for deterministic output.


πŸ” Wrap the LLM for RAGAS

langchain_llm = LangchainLLMWrapper(llm)

Wraps the Langchain LLM for compatibility with RAGAS.


πŸ“ Context Recall Metric

context_recall = LLMContextRecall(llm=langchain_llm)

Creates a context recall metric object using the wrapped LLM.


🌐 Call RAG API to Get Context

responseDict = requests.post("https://endpoint_url",
                             json={
                                 "question": question,
                                 "chat_history": []
                             }).json()
  • Makes a POST request to a RAG system endpoint (placeholder URL).

  • Sends the question, gets a response containing retrieved documents.

  • Expected JSON structure includes a key "retrieved_docs".
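
For illustration only, the endpoint is assumed to return JSON shaped roughly like this; only the retrieved_docs / page_content fields are used by the test, and all values here are made up:

{
  "retrieved_docs": [
    {"page_content": "There are 23 env docs in total ..."},
    {"page_content": "..."}
  ]
}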


πŸ“¦ Create RAGAS Sample

sample = SingleTurnSample(
    user_input=question,
    retrieved_contexts=[responseDict["retrieved_docs"][0]["page_content"]],
    reference="23"
)

Creates a SingleTurnSample with:

  • user_input: the original question.

  • retrieved_contexts: the context(s) returned by the RAG system.

  • reference: the expected/true answer (here it's "23").


πŸ“Š Score Calculation

score = await context_recall.single_turn_ascore(sample)
print(score)
assert score > 0.7
  • Asynchronously computes the context recall score (how well the context supports the reference).

  • Asserts that the score is > 0.7 (passes the test if context is good enough).


✅ Summary

This code tests a RAG system's ability to retrieve relevant context for a given question, using the LLMContextRecall metric. It fails the test if the retrieved context does not provide enough support for the expected answer.

14. Let's make some modifications

You can move common setup blocks into pytest fixtures:

import os

import pytest
import requests
from ragas import SingleTurnSample
from ragas.metrics import LLMContextRecall

os.environ["OPENAI_API_KEY"] = "provide valid key"


@pytest.mark.asyncio
async def test_context_recall(llm_wrapper, getdata):
    context_recall = LLMContextRecall(llm=llm_wrapper)
    score = await context_recall.single_turn_ascore(getdata)
    print(score)
    assert score > 0.7


@pytest.fixture
def getdata():
    question = "how many env docs are there?"
    responseDict = requests.post("https://endpoint_url",
                                 json={
                                     "question": question,
                                     "chat_history": []
                                 }).json()
    sample = SingleTurnSample(
        user_input=question,
        retrieved_contexts=[responseDict["retrieved_docs"][0]["page_content"]],
        reference="23"
    )
    return sample

If a fixture is common to all tests, we can keep it in conftest.py so that it is available to every test:

# conftest.py
import pytest
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper


@pytest.fixture
def llm_wrapper():
    llm = ChatOpenAI(model="gpt-4", temperature=0)  # create a LangChain chat model
    langchain_llm = LangchainLLMWrapper(llm)  # wrap it for the RAGAS framework
    return langchain_llm

15. Let's parameterize this

import os

import pytest
import requests
from ragas import SingleTurnSample
from ragas.metrics import LLMContextRecall

os.environ["OPENAI_API_KEY"] = "provide valid key"


@pytest.mark.asyncio
@pytest.mark.parametrize("getdata", [{
    "question": "give a question",
    "reference": "24"
}], indirect=True)
async def test_context_recall(llm_wrapper, getdata):
    context_recall = LLMContextRecall(llm=llm_wrapper)
    score = await context_recall.single_turn_ascore(getdata)
    print(score)
    assert score > 0.7


@pytest.fixture
def getdata(request):
    testdata = request.param
    responseDict = requests.post("https://endpoint_url",
                                 json={
                                     "question": testdata["question"],
                                     "chat_history": []
                                 }).json()
    sample = SingleTurnSample(
        user_input=testdata["question"],
        retrieved_contexts=[responseDict["retrieved_docs"][0]["page_content"]],
        reference=testdata["reference"]
    )
    return sample

Here we are parameterizing the getdata fixture. Notice that the test data is supplied in the parametrize decorator and the fixture is changed to read it. indirect=True tells pytest to pass each parameter set to the fixture first; the fixture's return value is then injected into the test.
16. We can optimize further by driving the test data from a JSON file, moving helpers into a utils module, and using a loop for the retrieved contexts. Observe how it's implemented below (test file, then utils.py, then dataseeder.json):

# test file
import os

import pytest
from ragas import SingleTurnSample
from ragas.metrics import LLMContextRecall

from utils import get_llm_response, load_test_data

os.environ["OPENAI_API_KEY"] = "provide valid key"


@pytest.mark.asyncio
@pytest.mark.parametrize("getdata", load_test_data(), indirect=True)
async def test_context_recall(llm_wrapper, getdata):
    context_recall = LLMContextRecall(llm=llm_wrapper)
    score = await context_recall.single_turn_ascore(getdata)
    print(score)
    assert score > 0.7


@pytest.fixture
def getdata(request):
    testdata = request.param
    responseDict = get_llm_response(testdata)
    sample = SingleTurnSample(
        user_input=testdata["question"],
        retrieved_contexts=[doc["page_content"] for doc in responseDict.get("retrieved_docs")],
        reference=testdata["reference"]
    )
    return sample
# utils.py
import json
import os

import requests

os.environ["OPENAI_API_KEY"] = "provide valid key"


def load_test_data():
    test_data_path = "/Users/vraghumanda/PycharmProjects/PythonProject/dataseeder.json"
    with open(test_data_path) as f:
        return json.load(f)


def get_llm_response(testdata):
    responseDict = requests.post("https://endpoint_url",
                                 json={
                                     "question": testdata["question"],
                                     "chat_history": []
                                 }).json()
    return responseDict
dataseeder.json:

[{
  "question": "give a question",
  "reference": "24"
}]

17. Faithfulness score = number of claims in the generated answer that can be inferred from the given context / total number of claims in the generated answer
18. Response Relevancy: how pertinent the generated answer is to the given prompt.
19. Factual Correctness: compares how well the response matches the ground-truth (reference) answer.
20. You can calculate multiple metrics in a single test by using an evaluation dataset, as in the sketch below.
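
Here is a minimal sketch of that idea, reusing the llm_wrapper fixture from conftest.py. The metric and class names come from ragas (check your installed version), and the sample values are placeholders:

from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import FactualCorrectness, Faithfulness, LLMContextRecall


def test_multiple_metrics(llm_wrapper):
    # One sample carrying everything the metrics need: question, retrieved
    # contexts, the generated answer, and the ground-truth reference.
    sample = SingleTurnSample(
        user_input="how many env docs are there?",
        retrieved_contexts=["There are 23 env docs in the repository."],  # placeholder context
        response="There are 23 env docs.",                                # placeholder LLM answer
        reference="23"
    )
    dataset = EvaluationDataset(samples=[sample])

    # evaluate() runs every metric against every sample in the dataset.
    result = evaluate(
        dataset=dataset,
        metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness()],
        llm=llm_wrapper
    )
    print(result)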
21. Let's see how results are uploaded to the ragas app dashboard
https://apps.ragas.io/dashboard
create a token
add the token to your test as an environment variable
upload the results at the end of the test, as in the sketch below
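
A minimal sketch of that flow. The environment variable name RAGAS_APP_TOKEN and the result.upload() call are taken from current RAGAS documentation, so treat them as assumptions and verify against your installed version:

import os

from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import Faithfulness

# Token generated at https://apps.ragas.io/dashboard (assumed variable name).
os.environ["RAGAS_APP_TOKEN"] = "provide valid app token"


def test_and_upload_results(llm_wrapper):
    sample = SingleTurnSample(
        user_input="how many env docs are there?",
        retrieved_contexts=["There are 23 env docs in the repository."],  # placeholder
        response="There are 23 env docs.",                                # placeholder
        reference="23"
    )
    result = evaluate(
        dataset=EvaluationDataset(samples=[sample]),
        metrics=[Faithfulness()],
        llm=llm_wrapper
    )
    result.upload()  # pushes the scores to the ragas app dashboard at the end of the test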
22. Topic Adherence is one more metric; it measures how strictly the model sticks to the original topic/question you asked.
23. In a multi-turn conversation endpoint, the message types differ (human, AI, system).
24. We use the metric's multi-turn scoring call (multi_turn_ascore) when calculating topic adherence, as in the sketch below.
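
A minimal sketch of a topic adherence check with a multi-turn sample, again assuming the llm_wrapper fixture from conftest.py; the conversation and reference topics are made up for illustration:

import pytest
from ragas import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage
from ragas.metrics import TopicAdherenceScore


@pytest.mark.asyncio
async def test_topic_adherence(llm_wrapper):
    # A multi-turn conversation is modelled as a list of typed messages.
    sample = MultiTurnSample(
        user_input=[
            HumanMessage(content="how many env docs are there?"),
            AIMessage(content="There are 23 env docs."),
            HumanMessage(content="what is the weather today?"),             # off-topic turn
            AIMessage(content="I can only answer questions about the docs."),
        ],
        reference_topics=["environment documentation"],  # topics the assistant should stick to
    )

    topic_adherence = TopicAdherenceScore(llm=llm_wrapper, mode="precision")
    score = await topic_adherence.multi_turn_ascore(sample)  # multi-turn scoring call
    print(score)
    assert score > 0.7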



