Metric Analysis#
This vignette demonstrates a more complete use of FlexEval: constructing an Eval with both rubric and function metrics, running that eval via an EvalRun, and using FlexEval utility functions to retrieve and interpret the results.
Author: Zachary Levonian
Date: July 2025
Part 1: Running FlexEval to compute metrics#
We’ll create some test data, build an eval, and execute it.
import dotenv
assert dotenv.load_dotenv("../.env"), (
    "This vignette assumes access to API keys in a .env file."
)
Generating test data#
Let’s evaluate the quality of grade-appropriate explanations.
concepts = ["integer addition", "factoring polynomials", "logistic regression"]
grades = ["3rd", "5th", "7th", "9th"]
user_queries = []
for concept in concepts:
    for grade in grades:
        user_queries.append(
            f"Concisely summarize {concept} at the United States {grade}-grade level."
        )
len(user_queries)
12
We can imagine that our system under test involves a particular system prompt, or perhaps multiple candidate prompts.
In this case, we’ll imagine a single, simple system prompt.
system_prompt = """You are a friendly math tutor.
You attempt to summarize any mathematical topic the student is interested in, even if it's not appropriate for their grade level."""
# convert to JSONL
import json
from pathlib import Path
concept_queries_path = Path("concept_queries.jsonl")
with open(concept_queries_path, "w") as outfile:
    for user_query in user_queries:
        outfile.write(
            json.dumps(
                {
                    "input": [
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_query},
                    ]
                }
            )
            + "\n"
        )
Each line of concept_queries.jsonl will become a unique Thread to be processed.
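To make the format concrete, here’s a quick peek at the first Thread we just wrote (a small sketch that simply re-reads the file created above):
# read back and pretty-print the first line of the JSONL file
with open(concept_queries_path) as infile:
    first_thread = json.loads(infile.readline())
print(json.dumps(first_thread, indent=2)[:300])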
Now that we have test data, we can build a FlexEval configuration and execute it.
Defining an Eval#
An Eval describes the computations needed to produce the required metrics.
In this case, we’ll set a few details:
- We want to generate new LLM completions, rather than just using any existing assistant messages in our threads. To do that, we’ll set do_completion to true and define the function that actually generates those completions, chosen from those provided in flexeval.configuration.completion_functions. In this case, we’ll use litellm_completion(), which uses LiteLLM to provide access to many different model APIs.
- We’ll compute two FunctionItems: a Flesch reading ease score and is_role(). We need is_role because we can use its value to compute particular metrics only for assistant messages (like the new completions we’ll be generating).
- Finally, we’ll specify a custom RubricItem. We’ll write a prompt that describes the assessment we want to make. In this case, we try to determine whether the assistant response is grade-appropriate.
import flexeval
from flexeval.schema import (
    Eval,
    Rubric,
    GraderLlm,
    DependsOnItem,
    Metrics,
    FunctionItem,
    RubricItem,
    CompletionLlm,
)
# because we specify an OpenAI model name here, OPENAI_API_KEY must exist in our environment variables or in our .env file
completion_llm = CompletionLlm(
    function_name="litellm_completion",
    kwargs={"model": "gpt-4o-mini"},
)
A note about CompletionLlm: if you’re using LiteLLM, you can replace an API call with a pre-written response by passing mock_response as an additional keyword argument.
For example, this configuration will always return “I can’t help with that!”:
kwargs={"model": "gpt-4o-mini", "mock_response": "I can't help with that!"}
Now let’s define a rubric. We just need a prompt, a set of choice scores mapping from string responses to a numeric metric value, and information about what LLM to use. For simplicity, we’ll use the same model for evaluating the completions that we use to generate them – but this should probably be avoided in general due to LLMs’ preference for their own outputs.
bad_rubric_prompt = """Read the following input and output, assessing if the output is grade-appropriate.
[Input]: {context}
[Output]: {content}
On a new line after your explanation, print:
- YES if the Output is fully appropriate for the grade level
- NO if the Output would uses language or concepts that would be inappropriate for that grade level
Only print YES or NO on the final line.
"""
rubric = Rubric(
    prompt=bad_rubric_prompt,
    choice_scores={"YES": 1, "NO": 0},
)
rubrics = {
    "is_grade_appropriate": rubric,
}
grader_llm = GraderLlm(
    function_name="litellm_completion", kwargs={"model": "gpt-4o-mini"}
)
We’ll define the metrics that need to be computed using the rubric name defined above (is_grade_appropriate).
We’ll also define the computation of our two function metrics: is_role() and flesch_reading_ease().
We want to evaluate flesch_reading_ease and is_grade_appropriate only on assistant turns, so we need to declare a dependency.
Here’s how we declare a dependency:
- A DependsOnItem requires a name that matches a defined metric.
- Providing kwargs is optional, but if you use the same metric name with different kwargs, you need to provide them so that FlexEval knows which metric you’re depending on.
- Set metric_min_value, metric_max_value, or both. The is_role() documentation tells us that it returns 1 (true) if the turn has the role provided in the keyword args and 0 (false) otherwise, so we just need to set metric_min_value to 1.
- When we define a MetricItem that has dependencies, provide one or more DependsOnItems in the depends_on list.
is_assistant_dependency = DependsOnItem(
    name="is_role", kwargs={"role": "assistant"}, metric_min_value=1
)
metrics = Metrics(
    function=[
        FunctionItem(name="is_role", kwargs={"role": "assistant"}),
        FunctionItem(
            name="flesch_reading_ease",
            depends_on=[is_assistant_dependency],
        ),
    ],
    rubric=[
        RubricItem(name="is_grade_appropriate", depends_on=[is_assistant_dependency])
    ],
)
We’ll finish building our Eval by providing all the info we defined above.
An eval’s name is optional, but providing one can help you later if you’re running lots of different Evals against the same dataset.
I’ll call this eval grade_appropriateness, since that’s what we’re trying to evaluate.
eval = Eval(
    name="grade_appropriateness",
    metrics=metrics,
    grader_llm=grader_llm,
    do_completion=True,
    completion_llm=completion_llm,
)
eval
Eval(do_completion=True, name='grade_appropriateness', notes='', metrics=Metrics(function=[FunctionItem(name='is_role', depends_on=[], metric_level='Turn', kwargs={'role': 'assistant'}), FunctionItem(name='flesch_reading_ease', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})], rubric=[RubricItem(name='is_grade_appropriate', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})]), completion_llm=CompletionLlm(function_name='litellm_completion', include_system_prompt=True, kwargs={'model': 'gpt-4o-mini'}), grader_llm=GraderLlm(function_name='litellm_completion', kwargs={'model': 'gpt-4o-mini'}))
Building an EvalRun#
To execute the defined Eval, we need to build an EvalRun.
Here’s what we need (other than the Eval):
- data_sources should be a list of datasets. We already saved a JSONL file with our inputs, so we’ll wrap the path in a FileDataSource.
- database_path defines the location of the SQLite file that FlexEval produces.
- rubric_paths is where we provide the rubric prompts we defined above (in the form of a RubricsCollection).
- config is optional, but you can provide a Config there to override settings that define how the Eval will be executed. In this case, we set clear_tables to True in order to delete any existing outputs at the provided database_path.
from flexeval.schema import Config, EvalRun, FileDataSource, RubricsCollection
input_data_sources = [FileDataSource(path=concept_queries_path)]
database_path = Path("eval_results.db")
config = Config(clear_tables=True)
eval_run = EvalRun(
    data_sources=input_data_sources,
    database_path=database_path,
    eval=eval,
    config=config,
    rubric_paths=[RubricsCollection(rubrics=rubrics)],
)
eval_run
EvalRun(data_sources=[FileDataSource(name=None, notes=None, path=PosixPath('concept_queries.jsonl'), format='jsonl')], database_path=PosixPath('eval_results.db'), eval=Eval(do_completion=True, name='grade_appropriateness', notes='', metrics=Metrics(function=[FunctionItem(name='is_role', depends_on=[], metric_level='Turn', kwargs={'role': 'assistant'}), FunctionItem(name='flesch_reading_ease', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})], rubric=[RubricItem(name='is_grade_appropriate', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})]), completion_llm=CompletionLlm(function_name='litellm_completion', include_system_prompt=True, kwargs={'model': 'gpt-4o-mini'}), grader_llm=GraderLlm(function_name='litellm_completion', kwargs={'model': 'gpt-4o-mini'})), config=Config(logs_path=None, env_filepath=None, env={}, clear_tables=True, max_workers=1, random_seed_conversation_sampling=42, max_n_conversation_threads=50, nb_evaluations_per_thread=1, raise_on_completion_error=False, raise_on_metric_error=False), rubric_paths=[RubricsCollection(rubrics={'is_grade_appropriate': Rubric(prompt='Read the following input and output, assessing if the output is grade-appropriate.\n[Input]: {context}\n[Output]: {content}\n\nOn a new line after your explanation, print:\n- YES if the Output is fully appropriate for the grade level\n- NO if the Output would uses language or concepts that would be inappropriate for that grade level\n\nOnly print YES or NO on the final line.\n', choice_scores={'YES': 1, 'NO': 0}, name=None, notes=None)})], function_modules=[<module 'flexeval.configuration.function_metrics' from '/Users/zacharylevonian/repos/FlexEval/src/flexeval/configuration/function_metrics.py'>], add_default_functions=True)
Running the EvalRun#
Once we’ve built an EvalRun, running it is easy: we can just use run()!
_ = flexeval.run(eval_run)
Now that we’ve run our Eval, we can analyze our results!
Part 2: Analyzing our results#
We’ll analyze the data we created in Part 1.
flexeval.metrics.access#
A few utility functions are exposed in flexeval.metrics.access
for returning computed metrics.
from pathlib import Path
import pandas as pd
from IPython import display # for prettier printing
from flexeval import db_utils
from flexeval.metrics import access as metric_access
Note: We use Peewee’s database methods to access the output at database_path. Using run() will set the appropriate variables for you, but if you just want to access the data from an existing eval, you can use flexeval.db_utils.ensure_database() to ensure the Peewee connection is available.
database_path = Path("eval_results.db")
db_utils.ensure_database(database_path)
To start, let’s just get all of the metrics as a Pandas dataframe. get_all_metrics() returns a list of dictionaries.
df = pd.DataFrame(metric_access.get_all_metrics())
df.head(n=3)
|   | id | evalsetrun | dataset | thread | turn | message | toolcall | evaluation_name | evaluation_type | metric_name | ... | metric_value | kwargs | source | depends_on | rubric_prompt | rubric_completion | rubric_model | rubric_completion_tokens | rubric_prompt_tokens | rubric_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 1 | 1 | 1 | None | None | is_role | function | assistant | ... | 0.000000 | {'role': 'assistant'} | def is_role(object: Union[Turn, Message], role... | [] | None | None | None | NaN | NaN | None |
| 1 | 2 | 1 | 1 | 1 | 13 | None | None | is_role | function | assistant | ... | 1.000000 | {'role': 'assistant'} | def is_role(object: Union[Turn, Message], role... | [] | None | None | None | NaN | NaN | None |
| 2 | 3 | 1 | 1 | 1 | 13 | None | None | flesch_reading_ease | function | flesch_reading_ease | ... | 76.191959 | {} | def flesch_reading_ease(turn: str) -> float:\n... | [{"name": "is_role", "type": null, "kwargs": {... | None | None | None | NaN | NaN | None |

3 rows × 21 columns
We have access to all of the columns in Metric.
We can see that FlexEval stores one row for each metric it computes, along with information about what was run. Let’s look at the columns we have for our metrics.
display.Markdown("Columns returned by `get_all_metrics()`: " + ", ".join(df.columns))
Columns returned by get_all_metrics(): id, evalsetrun, dataset, thread, turn, message, toolcall, evaluation_name, evaluation_type, metric_name, metric_level, metric_value, kwargs, source, depends_on, rubric_prompt, rubric_completion, rubric_model, rubric_completion_tokens, rubric_prompt_tokens, rubric_score
Let’s start by looking at how many times each metric was computed:
pd.DataFrame(df.evaluation_name.value_counts())
| evaluation_name | count |
| --- | --- |
| is_role | 24 |
| flesch_reading_ease | 12 |
| is_grade_appropriate | 12 |
These counts look right:
- Our dataset contained 12 input queries.
- We generated 12 LLM responses.
- is_role was evaluated on all 24 turns...
- ...but flesch_reading_ease and is_grade_appropriate were computed only on the 12 LLM-generated (“assistant”) turns, which the short check below confirms.
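Here’s that check as a couple of assertions (a sketch that just restates the table above):
# 24 is_role evaluations in total, of which the 12 assistant turns score 1
is_role_values = df[df.evaluation_name == "is_role"].metric_value
assert len(is_role_values) == 24
assert is_role_values.sum() == 12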
Let’s start our analysis by looking at the numeric metric: Flesch reading ease.
Analyzing a function metric#
A quick aside: Flesch reading ease is widely used, but it should be used with care. As summarized by Crossley et al. (CLEAR 2022):
Traditional readability formulas lack construct and theoretical validity because they are based on weak proxies of word decoding (i.e., characters or syllables per word) and syntactic complexity (i.e., number or words per sentence) and ignore many text features that are important components of reading models including text cohesion and semantics. Additionally, many traditional readability formulas were normed using readers from specific age groups on small corpora of texts taken from specific domains.
Nevertheless, many people may expect to see a readability score, and it can be useful to assess the convergent validity for some rubric metrics.
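Since both metrics already live in df, one way to start probing that convergent validity is a thread-level comparison. Here’s a sketch; with only 12 threads (and, as we’ll see, a single NO verdict), treat the result as illustrative rather than conclusive:
# join each thread's readability score with its rubric verdict, then correlate them
readability = df[df.metric_name == "flesch_reading_ease"][["thread", "metric_value"]]
verdicts = df[df.metric_name == "is_grade_appropriate"][["thread", "metric_value"]]
merged = readability.merge(verdicts, on="thread", suffixes=("_flesch", "_rubric"))
print(merged["metric_value_flesch"].corr(merged["metric_value_rubric"]))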
Let’s start by filtering to only the flesch_reading_ease metric results and computing the mean.
flesch_reading_ease = df[df.metric_name == "flesch_reading_ease"]
print(f"Mean Flesch reading ease score: {flesch_reading_ease.metric_value.mean():.2f}")
Mean Flesch reading ease score: 59.89
A mean Flesch reading ease score just under 60 corresponds to the “10th to 12th grade” reading level.
Hmm, that’s a much higher reading level than any of the grades we asked for.
Plotting the distribution, we never seem to get above 80 (the range appropriate for 6th grade and below).
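For reference when reading the plots below, here’s a small helper with the commonly cited Flesch reading ease bands (the exact cutoffs vary between sources, so treat them as approximate):
# map a Flesch reading ease score to its commonly cited grade band (approximate)
def flesch_band(score: float) -> str:
    bands = [
        (90, "5th grade"),
        (80, "6th grade"),
        (70, "7th grade"),
        (60, "8th-9th grade"),
        (50, "10th-12th grade"),
        (30, "College"),
    ]
    for cutoff, label in bands:
        if score >= cutoff:
            return label
    return "College graduate"

flesch_band(flesch_reading_ease.metric_value.mean())  # '10th-12th grade'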
import numpy as np
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'svg'
fig, ax = plt.subplots(1, 1, figsize=(5, 3))
ax.hist(flesch_reading_ease.metric_value, bins=np.arange(30, 110, 10))
ax.set_xlabel("Flesch reading ease")
ax.set_ylabel("Concept explanation count")
fig.tight_layout()
plt.show()
Already, we’ve learned enough to consider going back to our prompt engineering to produce more age-appropriate messages.
But presuming we’re okay with the overall range, we’d prefer to see how things are looking at the original grade ranges we were targeting.
We can extract the original grade from the initial user message in each thread in our dataset.
To get those user messages, we can use another utility function in flexeval.metrics.access.
thread_ids = set(flesch_reading_ease.thread)
# get the first message in each thread
first_message_df = pd.DataFrame(
    metric_access.get_first_user_message_for_threads(thread_ids)
)
# join in the first message texts as "first_message_content"
flesch_reading_ease = flesch_reading_ease.merge(
    first_message_df[["thread", "content"]].rename(
        columns={"content": "first_message_content"}
    ),
    how="inner",
    on="thread",
)
# extract grade level from the messages
flesch_reading_ease["grade_level"] = (
    flesch_reading_ease.first_message_content.str.extract(
        r"(\d+)(?:st|nd|rd|th)-grade"
    ).astype(int)
)
pd.DataFrame(flesch_reading_ease.grade_level.value_counts())
| grade_level | count |
| --- | --- |
| 3 | 3 |
| 5 | 3 |
| 7 | 3 |
| 9 | 3 |
With grade level in hand, we can investigate reading ease at each prompted grade level.
fig, ax = plt.subplots(1, 1, figsize=(5, 2.5))
# no one has ever accused matplotlib of being concise
ax.scatter(
    flesch_reading_ease.grade_level,
    flesch_reading_ease.metric_value,
    color="black",
    s=6,
)
means = flesch_reading_ease.groupby("grade_level").metric_value.mean()
ax.scatter(
    means.index, means, color="darkgreen", s=50, marker="_", label="Grade-level mean"
)
for x, y in zip(means.index, means):
    ax.text(x, y, f" {y:.1f}", va="center", ha="left")
ax.set_xticks(sorted(means.index))
ax.set_xlim(means.index.min() - 1, means.index.max() + 1)
ax.legend()
ax.set_xlabel("Grade")
ax.set_ylabel("Flesch reading ease")
fig.tight_layout()
plt.show()
These group-level means suggest that readability does track the prompted grade: the lower the prompted grade, the more readable the completion.
However, the reading ease scores also suggest the responses are too complex overall.
Even the most readable explanations, those generated for the third-grade prompts, still sit in the 7th-grade band of the Flesch reading ease scale.
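If you prefer numbers to the plot, the same per-grade summary is one groupby away:
# per-grade summary statistics for the readability scores plotted above
flesch_reading_ease.groupby("grade_level").metric_value.agg(["mean", "min", "max"])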
Analyzing a rubric metric#
We might attempt to go deeper by investigating the LLM-as-a-judge rubric results.
One of the advantages of a rubric is that we can develop it iteratively. The investigation here is only an initial look; in reality, we would likely update the prompt after investigating the outputs and seeing where we might sharpen the validity of the metric we’re developing.
Let’s start by looking at the raw counts.
is_grade_appropriate = df[df.metric_name == "is_grade_appropriate"]
pd.DataFrame(is_grade_appropriate.metric_value.value_counts())
| metric_value | count |
| --- | --- |
| 1.0 | 11 |
| 0.0 | 1 |
Only one determination of grade inappropriateness! Let’s try to get some insight by looking at one of the YES determinations.
def quote_markdown(markdown_string: str) -> str:
    return "\n".join("> " + line for line in markdown_string.splitlines())


def format_rubric_summary(rubric_metric) -> str:
    return f"""Prompt:
{quote_markdown(rubric_metric.rubric_prompt)}

Completion:
{quote_markdown(rubric_metric.rubric_completion)}

Parsed response: {rubric_metric.metric_value} ({rubric_metric.rubric_score})"""


rubric_metric = (
    is_grade_appropriate[is_grade_appropriate.metric_value == 1].sample(n=1).iloc[0]
)
display.Markdown(format_rubric_summary(rubric_metric))
Prompt:
Read the following input and output, assessing if the output is grade-appropriate. [Input]: user: Concisely summarize integer addition at the United States 5th-grade level.
[Output]: assistant: Integer addition involves combining whole numbers to find a total. To add integers, start with the first number and count up by the second number. Here are some key points:
Positive Integers: When you add two positive numbers, you simply find their total (e.g., 3 + 4 = 7).
Negative Integers: When adding negative numbers, you find the total by moving left on the number line (e.g., -3 + -4 = -7).
Mix of Positive and Negative: If you add a positive and a negative number, subtract the smaller absolute value from the larger one, and keep the sign of the number with the larger absolute value (e.g., 5 + -3 = 2 and -5 + 3 = -2).
Practice adding integers using number lines or counters to see how it works!
On a new line after your explanation, print:
YES if the Output is fully appropriate for the grade level
NO if the Output would uses language or concepts that would be inappropriate for that grade level
Only print YES or NO on the final line.
Completion:
The input asks for a concise summary of integer addition suitable for a 5th-grade level.
The output begins by clearly defining integer addition in simple terms, “combining whole numbers to find a total,” which is age-appropriate.
The output includes basic concepts of addition - explaining how to add positive integers, which is straightforward and suitable for 5th graders.
The output then covers negative integers, explaining their addition in a manner that relates to the number line, which may be complex but is part of the 5th-grade curriculum and is explained clearly.
The section on mixing positive and negative integers is also relevant and explained through examples, making it easier for students to understand.
The suggestion to practice with number lines or counters reinforces learning, which is valuable for elementary students.
Overall, the language and concepts used in the output are appropriate for the 5th-grade level, offering explanations and examples that are clear and relatable.
Based on this reasoning, I conclude that the output is indeed grade-appropriate.
YES YES
Parsed response: 1.0 (YES)
Hmm, not quite convincing. I already have a few ideas for how we might improve our prompt.
Let’s finish up by looking at the one NO determination.
rubric_metric = (
    is_grade_appropriate[is_grade_appropriate.metric_value == 0].sample(n=1).iloc[0]
)
display.Markdown(format_rubric_summary(rubric_metric))
Prompt:
Read the following input and output, assessing if the output is grade-appropriate. [Input]: user: Concisely summarize factoring polynomials at the United States 3rd-grade level.
[Output]: assistant: Factoring polynomials means breaking them down into simpler parts that, when multiplied together, give you the original polynomial. Think of it like taking a number apart into its smaller number pieces. For example, if you have ( x^2 + 2x ), you can factor it by pulling out a common piece, which in this case is ( x ). So, it becomes ( x(x + 2) ). This makes it easier to work with!
On a new line after your explanation, print:
YES if the Output is fully appropriate for the grade level
NO if the Output would uses language or concepts that would be inappropriate for that grade level
Only print YES or NO on the final line.
Completion:
Understanding the Input: The user is asking for a summary of factoring polynomials suitable for a 3rd-grade level. This means the explanation needs to be simple, using language that is easily understandable by young children.
Analyzing the Output: The output explains that factoring polynomials involves breaking them down into simpler parts. It uses the example of ( x^2 + 2x ) and shows how to factor it into ( x(x + 2) ).
Evaluating Language and Concepts:
“Factoring polynomials” may be too advanced for a 3rd grader, who is typically learning basic operations with numbers and simple algebraic concepts.
The terminology could be challenging. Phrases like “breaking them down into simpler parts” and “when multiplied together” might be difficult for 3rd graders to grasp fully.
The example provided uses variables, which might also be unfamiliar to 3rd graders who may not yet have been introduced to algebraic expressions.
Conclusion: Overall, while the concept of breaking things down is appropriate, the specific language and the variable example make it likely outside the understanding of typical 3rd graders.
Therefore, the output is not fully appropriate for the grade level of a 3rd grader.
Based on this reasoning, I will provide the final answer.
NO
NO
Parsed response: 0.0 (NO)
This analysis looks better, but still quite vague. I see many specifics that we could include in the prompt as examples!
Next steps#
Based on this analysis, I would:
- Update the completion prompt to produce more grade-appropriate responses.
- Update the rubric prompt to make more accurate (and harsher!) determinations about the appropriateness of the language used; a rough sketch of one possible revision follows below.
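Here is that sketch. It’s untested and only illustrative; it keeps the {context}/{content} placeholders and YES/NO scoring from the original rubric, but asks the grader to list specific violations before deciding:
revised_rubric_prompt = """Read the following input and output.
The input names a target United States grade level; judge the output against what a
typical student at that grade level has already been taught.
[Input]: {context}
[Output]: {content}

First, list any specific words, notation, or concepts in the Output that a student at
the target grade level would likely not yet know. Then, on a new line, print:
- YES only if the Output is fully appropriate for the grade level
- NO if the Output uses any language or concepts inappropriate for that grade level
Only print YES or NO on the final line.
"""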