Evaluate RAG LLMs using the RAGAS Python pytest framework
- Data is fed to an embedding model, which stores it in a vector database based on semantic understanding.
- An LLM (like ChatGPT, Gemini, or Anthropic's Claude) interacts with this vector database and replies based on the user's prompts.
- RAG (Retrieval Augmented Generation)
  - Retrieval: data is retrieved from the vector database
  - Augmentation: articulating the whole question with the necessary context
  - Generation: producing the final answer
- In traditional software testing we assert on exact system responses, but with LLMs we evaluate the quality of the response using benchmark metrics.
- Here we make use of the RAGAS framework.
- Benchmark metrics for the document retrieval system:
  - Context precision = number of relevant docs retrieved / total number of docs retrieved (e.g., 3/6 = 50%)
  - Context recall = number of relevant docs retrieved / total number of relevant docs that exist (see the worked example after the setup steps below)
- Install Python
- Install the PyCharm IDE
- Make sure to configure the Python path while importing the project (use which python3 to find the installation path)
- Install the RAGAS framework package in the Python interpreter
- Install langchain (and langchain-openai), as RAGAS communicates with the LLM through it
- Install the pytest library for testing, plus pytest-asyncio since the tests below are async
- Install the requests library for invoking API calls
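To make the two retrieval metrics above concrete, here is a tiny worked sketch in plain Python. The document IDs are made up for illustration; in RAGAS these scores are computed by an LLM judge rather than by exact ID matching.

# Worked example of context precision and context recall (illustrative only).
retrieved_docs = ["d1", "d2", "d3", "d4", "d5", "d6"]   # what the retriever returned
relevant_docs = {"d1", "d3", "d5", "d7", "d8"}          # every doc that is actually relevant

relevant_retrieved = [d for d in retrieved_docs if d in relevant_docs]

context_precision = len(relevant_retrieved) / len(retrieved_docs)  # 3 / 6 = 0.5
context_recall = len(relevant_retrieved) / len(relevant_docs)      # 3 / 5 = 0.6

print(context_precision, context_recall)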
- Let's understand this code
import os
import pytest
import requests
from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall
os.environ["OPENAI_API_KEY"] = "provide valid key"
@pytest.mark.asyncio
async def test_context_recall():
    question = "how many env docs are there?"
    llm = ChatOpenAI(model="gpt-4", temperature=0)  # create a langchain chat model
    langchain_llm = LangchainLLMWrapper(llm)  # wrap it for the RAGAS framework
    context_recall = LLMContextRecall(llm=langchain_llm)
    # call the RAG application endpoint to get the retrieved documents
    responseDict = requests.post("https://endpoint_url",
                                 json={
                                     "question": question,
                                     "chat_history": []
                                 }).json()
    sample = SingleTurnSample(
        user_input=question,
        retrieved_contexts=[responseDict["retrieved_docs"][0]["page_content"]],
        reference="23"
    )
    score = await context_recall.single_turn_ascore(sample)
    print(score)
    assert score > 0.7
Let's walk through this Python test code step by step. It tests the context recall of a language model using the RAGAS evaluation framework, which is designed to assess retrieval-augmented generation (RAG) systems.
Imports and Setup
import os
import pytest
import requests
from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall
- os: used to set environment variables (like your API key).
- pytest: a popular Python testing framework.
- requests: to make HTTP POST requests (e.g., to a RAG API endpoint).
- ChatOpenAI: Langchain's wrapper for OpenAI's models.
- RAGAS imports: used for RAG evaluation. SingleTurnSample models a single query-answer pair; LLMContextRecall is a metric that evaluates how well the retrieved context supports the reference answer.
Set API Key
os.environ["OPENAI_API_KEY"] = "provide valid key"
This sets your OpenAI API key as an environment variable so that Langchain can use it.
Test Function: test_context_recall
@pytest.mark.asyncio
async def test_context_recall():
- Marked with @pytest.mark.asyncio because it uses async/await (this marker comes from the pytest-asyncio plugin).
- This is a test case meant to be run with pytest.
❓ Setup: Question & LLM
question = "how many env docs are there?"
llm = ChatOpenAI(model="gpt-4", temperature=0)
- Defines the user's question.
- Creates a GPT-4 model instance with temperature=0 for deterministic output.
Wrap the LLM for RAGAS
langchain_llm = LangchainLLMWrapper(llm)
Wraps the Langchain LLM for compatibility with RAGAS.
Context Recall Metric
context_recall = LLMContextRecall(llm=langchain_llm)
Creates a context recall metric object using the wrapped LLM.
Call the RAG API to Get Context
responseDict = requests.post("https://endpoint_url",
                             json={
                                 "question": question,
                                 "chat_history": []
                             }).json()
- Makes a POST request to a RAG system endpoint (placeholder URL).
- Sends the question and gets back a response containing the retrieved documents.
- The expected JSON structure includes a key "retrieved_docs" (see the assumed response shape sketched below).
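For reference, the indexing responseDict["retrieved_docs"][0]["page_content"] assumes the endpoint returns roughly this shape. This is a hypothetical example; your API's actual field names may differ.

# Hypothetical response body from the RAG endpoint; only "retrieved_docs" and
# "page_content" are actually used by the test, the other fields are illustrative.
responseDict = {
    "answer": "There are 23 env docs.",
    "retrieved_docs": [
        {"page_content": "...text of the first retrieved document...", "metadata": {"source": "doc1.pdf"}},
        {"page_content": "...text of the second retrieved document...", "metadata": {"source": "doc2.pdf"}}
    ]
}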
Create the RAGAS Sample
sample = SingleTurnSample(
    user_input=question,
    retrieved_contexts=[responseDict["retrieved_docs"][0]["page_content"]],
    reference="23"
)
Creates a SingleTurnSample with:
- user_input: the original question.
- retrieved_contexts: the context(s) returned by the RAG system.
- reference: the expected/true answer (here it's "23").
Score Calculation
score = await context_recall.single_turn_ascore(sample)
print(score)
assert score > 0.7
- Asynchronously computes the context recall score (how well the retrieved context supports the reference).
- Asserts that the score is > 0.7 (the test passes if the context is good enough).
✅ Summary
This code tests a RAG system's ability to retrieve relevant context for a given question, using the LLMContextRecall metric. It fails the test if the retrieved context does not provide enough support for the expected answer.
14. Let's make some modifications
You can add fixtures for common usage blocks (see llm_wrapper and getdata below):
import os
import pytest
import requests
from ragas import SingleTurnSample
from ragas.metrics import LLMContextRecall
os.environ["OPENAI_API_KEY"] = "provide valid key"
@pytest.mark.asyncio
async def test_context_recall(llm_wrapper, getdata):
    context_recall = LLMContextRecall(llm=llm_wrapper)
    score = await context_recall.single_turn_ascore(getdata)
    print(score)
    assert score > 0.7
@pytest.fixture
def getdata():
    question = "how many env docs are there?"
    responseDict = requests.post("https://endpoint_url",
                                 json={
                                     "question": question,
                                     "chat_history": []
                                 }).json()
    sample = SingleTurnSample(
        user_input=question,
        retrieved_contexts=[responseDict["retrieved_docs"][0]["page_content"]],
        reference="23"
    )
    return sample
If a fixture is common to all tests, keep it in conftest.py so that it is available to every test:
import pytest
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
@pytest.fixture
def llm_wrapper():
    llm = ChatOpenAI(model="gpt-4", temperature=0)  # create a langchain chat model
    langchain_llm = LangchainLLMWrapper(llm)  # wrap it for the RAGAS framework
    return langchain_llm
15. Let's parameterize this
import os
import pytest
import requests
from ragas import SingleTurnSample
from ragas.metrics import LLMContextRecall
os.environ["OPENAI_API_KEY"] = "provide valid key"
@pytest.mark.asyncio
@pytest.mark.parametrize("getdata", [{
    "question": "give a question",
    "reference": "24"
}], indirect=True)
async def test_context_recall(llm_wrapper, getdata):
    context_recall = LLMContextRecall(llm=llm_wrapper)
    score = await context_recall.single_turn_ascore(getdata)
    print(score)
    assert score > 0.7
@pytest.fixture
def getdata(request):
    testdata = request.param
    responseDict = requests.post("https://endpoint_url",
                                 json={
                                     "question": testdata["question"],
                                     "chat_history": []
                                 }).json()
    sample = SingleTurnSample(
        user_input=testdata["question"],
        retrieved_contexts=[responseDict["retrieved_docs"][0]["page_content"]],
        reference=testdata["reference"]
    )
    return sample
Here we are parameterizing the getdata fixture.
Observe that the test data is supplied in the parametrize decorator and consumed inside the fixture via request.param.
indirect=True tells pytest to pass the parameters to the fixture first; the fixture's return value is then injected into the test. A stripped-down illustration of this mechanism follows.
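As a generic illustration of indirect parametrization (a minimal sketch unrelated to RAGAS), note how the raw parameter passes through the fixture before reaching the test:

import pytest

@pytest.fixture
def number(request):
    # request.param holds the value supplied by @pytest.mark.parametrize(..., indirect=True)
    return request.param * 10

@pytest.mark.parametrize("number", [1, 2, 3], indirect=True)
def test_number(number):
    # the fixture has already transformed the raw parameter before the test sees it
    assert number in (10, 20, 30)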
16. We can optimize further by driving the test data from a JSON file, moving helpers into a utils module, and using a list comprehension over all retrieved contexts. Observe how it's implemented:
import os
import pytest
from ragas import SingleTurnSample
from ragas.metrics import LLMContextRecall
from utils import get_llm_response, load_test_data
os.environ["OPENAI_API_KEY"] = "provide valid key"
@pytest.mark.asyncio
@pytest.mark.parametrize("getdata", load_test_data(), indirect=True)
async def test_context_recall(llm_wrapper, getdata):
    context_recall = LLMContextRecall(llm=llm_wrapper)
    score = await context_recall.single_turn_ascore(getdata)
    print(score)
    assert score > 0.7
@pytest.fixture
def getdata(request):
    testdata = request.param
    responseDict = get_llm_response(testdata)
    sample = SingleTurnSample(
        user_input=testdata["question"],
        retrieved_contexts=[doc["page_content"] for doc in responseDict.get("retrieved_docs")],
        reference=testdata["reference"]
    )
    return sample
utils.py:
import json
import os
import requests
os.environ["OPENAI_API_KEY"] = "provide valid key"
def load_test_data():
    test_data_path = "/Users/vraghumanda/PycharmProjects/PythonProject/dataseeder.json"
    with open(test_data_path) as f:
        return json.load(f)
def get_llm_response(testdata):
    responseDict = requests.post("https://endpoint_url",
                                 json={
                                     "question": testdata["question"],
                                     "chat_history": []
                                 }).json()
    return responseDict
dataseeder.json:
[{
  "question": "give a question",
  "reference": "24"
}]
17. Faithfulness score = number of claims in the generated answer that can be inferred from the given context / total number of claims in the generated answer
18. Response relevancy: how pertinent the generated answer is to the given prompt.
19. Factual correctness: compares the generated response against the ground truth (reference) answer. A sketch of scoring these three metrics individually follows.
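A minimal sketch of scoring these three metrics on a single sample, assuming the ragas 0.2-style API (metric class names and constructor arguments may differ slightly in your installed version, and OPENAI_API_KEY is assumed to be set in the environment):

import pytest
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import Faithfulness, ResponseRelevancy, FactualCorrectness

@pytest.mark.asyncio
async def test_generation_metrics():
    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4", temperature=0))
    evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

    # One sample carrying everything the three metrics need: retrieved_contexts for
    # faithfulness, response for all three, reference for factual correctness.
    sample = SingleTurnSample(
        user_input="how many env docs are there?",
        retrieved_contexts=["...retrieved document text..."],
        response="There are 23 env docs.",
        reference="23"
    )

    faithfulness = Faithfulness(llm=evaluator_llm)
    relevancy = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
    correctness = FactualCorrectness(llm=evaluator_llm)

    print(await faithfulness.single_turn_ascore(sample))
    print(await relevancy.single_turn_ascore(sample))
    print(await correctness.single_turn_ascore(sample))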
20. You can calculate multiple metrics in a single test by using an evaluation dataset (see the sketch below). The results can be uploaded to the RAGAS dashboard at https://apps.ragas.io/dashboard after creating an app token.
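A minimal sketch of multi-metric evaluation with an EvaluationDataset, again assuming the ragas 0.2-style API; the upload at the end assumes an app token from https://apps.ragas.io/dashboard exposed via the RAGAS_APP_TOKEN environment variable:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, ResponseRelevancy, FactualCorrectness

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4", temperature=0))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Build a dataset of one or more samples (in practice, load these from the JSON seed file).
dataset = EvaluationDataset(samples=[
    SingleTurnSample(
        user_input="how many env docs are there?",
        retrieved_contexts=["...retrieved document text..."],
        response="There are 23 env docs.",
        reference="23"
    )
])

# Compute several metrics in one call.
result = evaluate(
    dataset=dataset,
    metrics=[
        LLMContextRecall(llm=evaluator_llm),
        Faithfulness(llm=evaluator_llm),
        ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings),
        FactualCorrectness(llm=evaluator_llm),
    ],
)
print(result)

# Optionally push the run to the RAGAS dashboard (requires RAGAS_APP_TOKEN to be set).
result.upload()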