Metric Analysis#
This vignette demonstrates accessing the results of a completed Eval Run.
Author: Zachary Levonian
Date: July 2025
Part 1: Running FlexEval to compute some metrics#
We’ll create some test data, build an eval, and execute it.
import dotenv
assert dotenv.load_dotenv("../.env"), (
    "This vignette assumes access to API keys in a .env file."
)
Generating test data#
Let’s evaluate whether generated math explanations are grade-appropriate.
concepts = ["integer addition", "factoring polynomials", "logistic regression"]
grades = ["3rd", "5th", "7th", "9th"]
user_queries = []
for concept in concepts:
    for grade in grades:
        user_queries.append(
            f"Concisely summarize {concept} at the United States {grade}-grade level."
        )
len(user_queries)
12
We can imagine that our system under test involves a particular system prompt, or perhaps multiple candidate prompts.
In this case, we’ll imagine a single, simple system prompt.
system_prompt = """You are a friendly math tutor.
You attempt to summarize any mathematical topic the student is interested in, even if it's not appropriate for their grade level."""
# convert to JSONL
import json
from pathlib import Path
concept_queries_path = Path("concept_queries.jsonl")
with open(concept_queries_path, "w") as outfile:
    for user_query in user_queries:
        outfile.write(
            json.dumps(
                {
                    "input": [
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_query},
                    ]
                }
            )
            + "\n"
        )
Each line of concept_queries.jsonl will become a unique Thread to be processed.
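For example, we can peek at the first line of the file to see the Thread structure we just wrote:
# Inspect the first JSONL line: one Thread's "input" message list,
# consisting of the system prompt followed by a single user query.
with open(concept_queries_path) as infile:
    print(json.loads(infile.readline()))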
Now that we have test data, we can build a FlexEval configuration and execute it.
Defining an Eval#
An Eval describes the computations needed to produce the required metrics.
In this case, we’ll set a few details:
- We want to generate new LLM completions, rather than just using any existing assistant messages in our threads. To do that, we’ll set do_completion to true and define the function that actually generates those completions, chosen from those provided in flexeval.configuration.completion_functions. In this case, we’ll use litellm_completion(), which uses LiteLLM to provide access to many different model APIs.
- We’ll compute two FunctionItems: a Flesch reading ease score and is_role(). We need is_role because we can use its value to compute particular metrics only for assistant messages (like the new completions we’ll be generating).
- Finally, we’ll specify a custom RubricItem. We’ll write a prompt that describes the assessment we want to make; in this case, we try to determine whether the assistant response is grade appropriate.
import flexeval
from flexeval.schema import (
    Eval,
    Rubric,
    GraderLlm,
    DependsOnItem,
    Metrics,
    FunctionItem,
    RubricItem,
    CompletionLlm,
)
# by specifying an OpenAI model name here, we'll need OPENAI_API_KEY to exist in our environment variables or in our .env file
completion_llm = CompletionLlm(
    function_name="litellm_completion",
    kwargs={"model": "gpt-4o-mini", "mock_response": "I can't help with that!"},
)
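As an aside, LiteLLM’s mock_response keeps this vignette cheap and deterministic: the completion call returns the given string without contacting any model API. (The grader_llm defined below has no mock_response, so rubric grading still makes real API calls, which is why we need the .env keys.) A minimal illustration, assuming litellm is importable in this environment:
# mock_response makes litellm return a canned reply instead of calling the API.
import litellm

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    mock_response="I can't help with that!",
)
print(response.choices[0].message.content)  # prints: I can't help with that!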
bad_rubric_prompt = """Read the following input and output, assessing if the output is grade-appropriate.
[Input]: {context}
[Output]: {content}
On a new line after your explanation, print:
- YES if the Output is fully appropriate for the grade level
- SOMEWHAT if the Output uses some language or concepts that would be inappropriate for that grade level
- NO if the Output would be mostly incomprehensible to a student at that grade level
Only print YES, SOMEWHAT, or NO on the final line.
"""
rubric = Rubric(
    prompt=bad_rubric_prompt,
    choice_scores={"YES": 2, "SOMEWHAT": 1, "NO": 0},
)
grader_llm = GraderLlm(
    function_name="litellm_completion", kwargs={"model": "gpt-4o-mini"}
)
rubrics = {
    "is_grade_appropriate": rubric,
}
is_assistant_dependency = DependsOnItem(
    name="is_role", kwargs={"role": "assistant"}, metric_min_value=1
)
eval = Eval(
    name="grade_appropriateness",
    metrics=Metrics(
        function=[
            FunctionItem(name="is_role", kwargs={"role": "assistant"}),
            FunctionItem(
                name="flesch_reading_ease",
                depends_on=[is_assistant_dependency],
            ),
        ],
        rubric=[
            RubricItem(
                name="is_grade_appropriate", depends_on=[is_assistant_dependency]
            )
        ],
    ),
    grader_llm=grader_llm,
    do_completion=True,
    completion_llm=completion_llm,
)
eval
Eval(do_completion=True, name='grade_appropriateness', notes='', metrics=Metrics(function=[FunctionItem(name='is_role', depends_on=[], metric_level='Turn', kwargs={'role': 'assistant'}), FunctionItem(name='flesch_reading_ease', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})], rubric=[RubricItem(name='is_grade_appropriate', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})]), completion_llm=CompletionLlm(function_name='litellm_completion', include_system_prompt=True, kwargs={'model': 'gpt-4o-mini', 'mock_response': "I can't help with that!"}), grader_llm=GraderLlm(function_name='litellm_completion', kwargs={'model': 'gpt-4o-mini'}))
Building an EvalRun#
from flexeval.schema import Config, EvalRun, FileDataSource, RubricsCollection
input_data_sources = [FileDataSource(path=concept_queries_path)]
output_path = Path("eval_results.db")
config = Config(clear_tables=True)
eval_run = EvalRun(
    data_sources=input_data_sources,
    database_path=output_path,
    eval=eval,
    config=config,
    rubric_paths=[RubricsCollection(rubrics=rubrics)],
)
eval_run
EvalRun(data_sources=[FileDataSource(name=None, notes=None, path=PosixPath('concept_queries.jsonl'), format='jsonl')], database_path=PosixPath('eval_results.db'), eval=Eval(do_completion=True, name='grade_appropriateness', notes='', metrics=Metrics(function=[FunctionItem(name='is_role', depends_on=[], metric_level='Turn', kwargs={'role': 'assistant'}), FunctionItem(name='flesch_reading_ease', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})], rubric=[RubricItem(name='is_grade_appropriate', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})]), completion_llm=CompletionLlm(function_name='litellm_completion', include_system_prompt=True, kwargs={'model': 'gpt-4o-mini', 'mock_response': "I can't help with that!"}), grader_llm=GraderLlm(function_name='litellm_completion', kwargs={'model': 'gpt-4o-mini'})), config=Config(logs_path=None, env_filepath=None, env={}, clear_tables=True, max_workers=1, random_seed_conversation_sampling=42, max_n_conversation_threads=50, nb_evaluations_per_thread=1, raise_on_completion_error=False, raise_on_metric_error=False), rubric_paths=[RubricsCollection(rubrics={'is_grade_appropriate': Rubric(prompt='Read the following input and output, assessing if the output is grade-appropriate.\n[Input]: {context}\n[Output]: {content}\n\nOn a new line after your explanation, print:\n- YES if the Output is fully appropriate for the grade level\n- SOMEWHAT if the Output uses some language or concepts that would be inappropriate for that grade level\n- NO if the Output would be mostly incomprehensible to a student at that grade level\n\nOnly print YES, SOMEWHAT, or NO on the final line.\n', choice_scores={'YES': 2, 'SOMEWHAT': 1, 'NO': 0}, name=None, notes=None)})], function_modules=[<module 'flexeval.configuration.function_metrics' from '/Users/zacharylevonian/repos/FlexEval/src/flexeval/configuration/function_metrics.py'>], add_default_functions=True)
Running the EvalRun#
Once we’ve built an EvalRun, running it is easy: we can just use run()!
_ = flexeval.run(eval_run)
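Results are written to the SQLite database we configured on the EvalRun. A quick optional check that the file now exists:
# The EvalRun's database_path (eval_results.db) should now exist on disk.
assert output_path.exists(), "expected flexeval.run() to create eval_results.db"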
Part 2: Analyzing our results#
We’ll analyze the data we created in Part 1.
from flexeval.metrics import access as metric_access
for metric in metric_access.get_all_metrics():
    print(
        f"{metric['thread']} {metric['turn']} {metric['metric_name']} {metric['metric_value']}"
    )
1 1 assistant 0.0
1 13 assistant 1.0
1 13 flesch_reading_ease 117.16000000000003
1 13 is_grade_appropriate 0.0
2 2 assistant 0.0
2 14 assistant 1.0
2 14 flesch_reading_ease 117.16000000000003
2 14 is_grade_appropriate 0.0
3 3 assistant 0.0
3 15 assistant 1.0
3 15 flesch_reading_ease 117.16000000000003
3 15 is_grade_appropriate 0.0
4 4 assistant 0.0
4 16 assistant 1.0
4 16 flesch_reading_ease 117.16000000000003
4 16 is_grade_appropriate 0.0
5 5 assistant 0.0
5 17 assistant 1.0
5 17 flesch_reading_ease 117.16000000000003
5 17 is_grade_appropriate 0.0
6 6 assistant 0.0
6 18 assistant 1.0
6 18 flesch_reading_ease 117.16000000000003
6 18 is_grade_appropriate 0.0
7 7 assistant 0.0
7 19 assistant 1.0
7 19 flesch_reading_ease 117.16000000000003
7 19 is_grade_appropriate 0.0
8 8 assistant 0.0
8 20 assistant 1.0
8 20 flesch_reading_ease 117.16000000000003
8 20 is_grade_appropriate 0.0
9 9 assistant 0.0
9 21 assistant 1.0
9 21 flesch_reading_ease 117.16000000000003
9 21 is_grade_appropriate 0.0
10 10 assistant 0.0
10 22 assistant 1.0
10 22 flesch_reading_ease 117.16000000000003
10 22 is_grade_appropriate 0.0
11 11 assistant 0.0
11 23 assistant 1.0
11 23 flesch_reading_ease 117.16000000000003
11 23 is_grade_appropriate 0.0
12 12 assistant 0.0
12 24 assistant 1.0
12 24 flesch_reading_ease 117.16000000000003
12 24 is_grade_appropriate 0.0
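Because every completion was mocked as the literal string "I can't help with that!", the flesch_reading_ease and is_grade_appropriate values are identical across threads; with real completions these would vary by concept and grade level. From here, one possible next step is to aggregate the per-turn rows into per-metric summaries, for example with pandas (assuming it is installed; only the keys printed above are used):
# Load the per-turn metric rows into a DataFrame and summarize each metric.
import pandas as pd

rows = [
    {
        "thread": m["thread"],
        "turn": m["turn"],
        "metric_name": m["metric_name"],
        "metric_value": m["metric_value"],
    }
    for m in metric_access.get_all_metrics()
]
metrics_df = pd.DataFrame(rows)
print(metrics_df.groupby("metric_name")["metric_value"].agg(["count", "mean"]))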