How to Set Up RAGAS and Run Your First LLM Evaluation Test

What is RAGAS?

RAGAS stands for Retrieval-Augmented Generation Assessment. It’s an open-source evaluation framework built specifically to measure the performance of RAG systems (like those that combine vector search with LLMs to answer questions).

In simple terms, RAGAS helps you test how good your LLM-based system is at:

  • Retrieving the right documents
  • Generating accurate and relevant answers
  • Sticking to the facts

In a world where LLMs can sound confident but still be wrong, evaluating their accuracy and reliability is critical, especially when they’re used in customer-facing, medical, legal, or internal tools.

What You’ll Learn in This Guide

In this guide, you’ll learn how to:

  • Set up a Python project for LLM evaluation
  • Install RAGAS and required dependencies
  • Run your first test case using real inputs
  • Understand what the evaluation scores mean

Step 1: Prerequisites (Before You Begin)

Before we jump into setting up RAGAS, make sure you have the following ready on your computer.

1. Python 3 Installed

RAGAS runs on Python, so you’ll need Python 3.8 or higher.

To check if it’s already installed, open your terminal or command prompt and type:

```shell
python3 --version
```

If you don’t have it, download and install it from: https://www.python.org/downloads/

2. A Code Editor (We Recommend PyCharm)

You’ll need a code editor to write and run your Python scripts.

We recommend using PyCharm Community Edition, especially if you’re new to Python. It’s beginner-friendly and has great support for virtual environments.

That’s it! Once these two are ready, you can proceed to the next step: setting up your project.

Step 2: Create Your Project in PyCharm

Now that Python and PyCharm are installed, let’s create your project and set things up inside PyCharm.

1. Open PyCharm and Create a New Project

  • Launch PyCharm
  • Click on “New Project”
  • Set the project name as: LlmEvaluation
  • Make sure New environment using Virtualenv is selected
  • Under Base interpreter, choose the same Python version you saw when you ran: python3 --version (this makes sure you’re using the right version of Python)
  • Click Create

You’ll now land in your empty project, ready to install the necessary libraries.

2. Install Required Libraries via PyCharm Settings

Go to:

PyCharm → Settings → Project: LlmEvaluation → Python Interpreter

Click the + (Add) button and search for each of these packages:

| Package | Why You Need It |
|---|---|
| ragas | The main framework for evaluating LLM responses |
| langchain-openai | Lets us connect OpenAI to RAGAS using LangChain’s LLM wrapper |
| pytest | Helps us write and run test cases easily |
| pytest-asyncio | Required to run async test cases (which RAGAS depends on) |
| requests | Used to send API requests to get responses and documents |

Once all are added, click Apply and then OK to finish the setup.

That’s it! You’ve now set up your test project and environment. Next, we’ll write your first test using RAGAS.
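If you prefer the terminal to the PyCharm UI, the same packages can be installed with pip. Run this inside the project’s virtual environment (the command assumes pip is on your PATH):

```shell
# Install all the packages the tests need, inside the LlmEvaluation virtualenv
pip install ragas langchain-openai pytest pytest-asyncio requests
```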

Step 3: Write Your First LLM Evaluation Test

Now that your environment is ready and packages are installed, let’s write a simple test to check how well an answer uses the retrieved documents, using a metric called Context Precision.

Create a new Python file

Right-click your LlmEvaluation project folder in PyCharm and select:

New → Python File, then name it: test_context_precision.py

Paste the following code into the file:

```python
import os

import pytest
from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextPrecisionWithoutReference


@pytest.mark.asyncio
async def test_context_precision():
    # Set the API key (in real projects, load it from a .env file instead)
    os.environ["OPENAI_API_KEY"] = "your-api-key-here"

    # temperature=0 keeps the model's judgments consistent between runs
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

    # Wrap the LLM so RAGAS can use it
    langchain_llm = LangchainLLMWrapper(llm)

    # Create the metric object for Context Precision (no reference answer needed)
    context_precision = LLMContextPrecisionWithoutReference(llm=langchain_llm)

    # Feed in the question, the LLM's answer, and the retrieved documents
    sample = SingleTurnSample(
        user_input="What is the cost of living for a single?",
        response="Around S$2,660/month",
        retrieved_contexts=[
            "EP holders living alone spend around S$2,660/month, while S Pass holders spend about S$1,603/month",
            "Monthly expenses range between S$4,800 and S$5,500 for couples without kids. Monthly expenses range between S$6,500 and S$7,500 for couples with kids."
        ]
    )

    # Compute the score (a value between 0 and 1) and assert the threshold
    score = await context_precision.single_turn_ascore(sample)
    print(score)
    assert score > 0.8
```

What this does:

  • os: Helps set environment variables
  • pytest: The testing framework you’ll use to run this test
  • ChatOpenAI: Lets us talk to OpenAI’s GPT model
  • SingleTurnSample: Represents a single question-answer pair + context
  • LangchainLLMWrapper: Converts the LLM into a format RAGAS understands
  • LLMContextPrecisionWithoutReference: The metric we are going to test
```python
@pytest.mark.asyncio
async def test_context_precision():
```

This defines your test function. Since RAGAS runs LLM calls asynchronously, we use async + @pytest.mark.asyncio to support that.

```python
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
```

Replace “your-api-key-here” with your real OpenAI API key (or load it from a .env file for better security). This lets the code connect to GPT-3.5.
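For example, instead of hardcoding the key you can read it from the environment. The sketch below is for illustration only; the fallback value is a placeholder, not a real key:

```python
import os

# Sketch: read the key from the environment instead of hardcoding it in the
# test file. Export OPENAI_API_KEY in your shell (or load it from a .env file,
# e.g. with the python-dotenv package) before running pytest.
api_key = os.environ.get("OPENAI_API_KEY", "sk-demo-placeholder")
os.environ["OPENAI_API_KEY"] = api_key
```

This keeps the secret out of your source code and out of version control.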

```python
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
```

This line sets up the GPT model.

  • We’re using “gpt-3.5-turbo”
  • temperature=0 keeps the answer consistent and focused

```python
langchain_llm = LangchainLLMWrapper(llm)
```

This wraps the LLM in a format that RAGAS understands internally.

What’s LangChain Doing Here?

LangChain is a Python library that makes it easier to work with large language models like GPT.
In this setup, we use LangChain’s `ChatOpenAI` class to talk to OpenAI’s GPT model, and the `LangchainLLMWrapper` to convert it into a format that RAGAS understands.

Think of LangChain as a bridge between:

  • You (the developer)
  • GPT (the model)
  • RAGAS (the evaluation tool)

Without this wrapper, RAGAS wouldn’t be able to communicate directly with the LLM. That’s why wrapping your model using LangChain is a required step.

```python
context_precision = LLMContextPrecisionWithoutReference(llm=langchain_llm)
```

Now we create the actual metric object:

Context Precision checks:

“Out of all the documents retrieved, how many were actually useful in answering the question?”
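To build intuition before running the real metric, here is a toy, non-RAGAS illustration of the same idea. The real metric uses an LLM to judge relevance; here we pass hand-written 0/1 relevance flags instead:

```python
# Toy illustration (not the actual RAGAS implementation): mark each retrieved
# chunk as relevant (1) or not (0) in retrieval order, take precision@k at each
# rank where a relevant chunk appears, then average those precisions.
def toy_context_precision(relevance):
    precisions = []
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            precisions.append(hits / k)
    return sum(precisions) / max(sum(relevance), 1)

print(toy_context_precision([1, 1]))  # both chunks relevant -> 1.0
print(toy_context_precision([0, 0]))  # nothing relevant -> 0.0
```

This is why the passing run later in this guide scores close to 1.0 and the failing run scores 0.0.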

```python
sample = SingleTurnSample(
    user_input="What is the cost of living for a single?",
    response="Around S$2,660/month",
    retrieved_contexts=[
        "EP holders living alone spend around S$2,660/month, while S Pass holders spend about S$1,603/month",
        "Monthly expenses range between S$4,800 and S$5,500 for couples without kids. Monthly expenses range between S$6,500 and S$7,500 for couples with kids."
    ]
)
```

Here we define a test input:

  • user_input: The actual question
  • response: What the LLM answered
  • retrieved_contexts: The documents that were retrieved from the database

RAGAS will evaluate how well the answer matches the context.

This is Mock Data for Learning

  • For simplicity, we’ve hardcoded the user_input, response, and retrieved_contexts in this test.
  • But in a real-world scenario, these values should come dynamically from your RAG API.
  • Here’s what that means:
    • The user_input would be the question a user asks.
    • The response should be the actual output from your LLM (after RAG).
    • The retrieved_contexts should be the documents returned by your vector database or retrieval pipeline (not manually typed).
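As a sketch of what that could look like, the snippet below assembles the same three fields from a hypothetical RAG endpoint. The URL and the JSON field names ("answer", "contexts") are assumptions for illustration; adapt them to your own API:

```python
import requests

# Hypothetical endpoint: replace with your own RAG service URL.
RAG_URL = "http://localhost:8000/ask"

def fetch_rag_sample(question, post=requests.post):
    """Ask the RAG API a question and return the fields SingleTurnSample needs."""
    resp = post(RAG_URL, json={"question": question}, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return {
        "user_input": question,
        "response": data["answer"],              # assumed field name
        "retrieved_contexts": data["contexts"],  # assumed field name
    }
```

You would then build the sample with `SingleTurnSample(**fetch_rag_sample("What is the cost of living for a single?"))` instead of hardcoding the values.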

```python
score = await context_precision.single_turn_ascore(sample)
print(score)
```

This runs the actual test and prints the precision score — a value between 0 and 1.

  • 1.0 means all retrieved context was relevant
  • 0.0 means none of it helped answer the question

```python
assert score > 0.8
```

✅ Finally, we’re saying: “We expect this test to pass only if the score is above 0.8.”
That means at least 80% of the retrieved context should be useful.

How to Run the Test

Open your terminal (inside PyCharm) and run the following. The -s flag tells pytest to show printed output, so you can see the score:

```shell
pytest -s test_context_precision.py
```

You’ll see output showing whether the test passed and the actual score.

```
============================= test session starts ==============================
collecting ... collected 1 item

test_context_precision.py::test_context_precision PASSED                 [100%]0.9999999999

======================== 1 passed, 7 warnings in 2.08s =========================
```

Try a Failed Case (Low Context Precision)

Let’s now simulate a failed test by giving irrelevant context.

Replace your previous sample = SingleTurnSample(…) block with this:

```python
sample = SingleTurnSample(
    user_input="What is the cost of living for a single in Singapore?",
    response="Couldn't find answer",
    retrieved_contexts=[
        "Singapore is known for its cultural diversity and vibrant nightlife.",
        "Tourist arrivals in Singapore dropped significantly in 2020 due to the pandemic."
    ]
)
```

Run pytest again and you’ll see the test fail, along with the actual score:

```
============================= test session starts ==============================
collecting ... collected 1 item

test_context_precision.py::test_context_precision FAILED                 [100%]0.0

test_context_precision.py:10 (test_context_precision)
0.0 != 0.8

Expected :0.8
Actual   :0.0
```

This shows how RAGAS catches bad retrieval.

That’s it! You’ve now successfully set up RAGAS, written your first test case, and learned how to evaluate the quality of your LLM’s answers using context precision. Even though we used hardcoded data for this guide, you now have the foundation to test real RAG systems. In the next article, we’ll explore how to run dynamic evaluations using real API responses. That means instead of hardcoding sample data, we’ll connect to an actual RAG system and evaluate its live answers. Stay tuned!

That’s it for today. Thank you for reading! I hope you found this article informative and useful.

If you think it could benefit others, please share it on your social media networks with friends and family who might also appreciate it.

If you find the article useful, please rate it and leave a comment. It will motivate me to devote more time to writing.

If you’d like to support the ongoing efforts to provide quality content, consider contributing via PayNow or Ko-fi. Your support helps keep this resource thriving and improving!
