Abstractions
FlexEval’s API is built around a few core abstractions. These abstractions are expressed as pydantic objects, and understanding them will make FlexEval much easier to use.
Key abstractions
FlexEval is a tool for executing evaluations. An evaluation is represented by `flexeval.schema.eval_schema.Eval` and contains a set of `MetricItem`s to apply to the test data:

- Functions: `FunctionItem`s apply a Python function to the test data, returning a numeric value.
- Rubrics: `RubricItem`s use a configured `GraderLlm` and the provided rubric template to generate a numeric score from an LLM’s output.
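For illustration, a minimal sketch of defining an `Eval` might look like the following. The `Eval` import path is from the docs above, but the `FunctionItem`/`RubricItem` import locations and every field name shown are assumptions, not the library’s actual schema.

```python
# Hypothetical sketch: field names and the FunctionItem/RubricItem import
# locations are assumptions, not the actual FlexEval schema.
from flexeval.schema.eval_schema import Eval, FunctionItem, RubricItem

eval_spec = Eval(
    name="helpfulness_eval",  # assumed field: a label for this evaluation
    metrics=[
        # A Python function applied to the test data, returning a numeric value
        FunctionItem(function_name="word_count"),  # assumed field name
        # A rubric template scored by a configured GraderLlm
        RubricItem(rubric="helpfulness_rubric"),   # assumed field name
    ],
)
```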
You execute an `Eval` by creating a `flexeval.schema.evalrun_schema.EvalRun`. An `EvalRun` contains:

- Data sources (conversations as inputs, an SQLite filepath as output)
- An `Eval` specification, containing the metrics to compute
- Sources for the metrics defined in the `Eval`, e.g. Python modules containing the functions referenced in `FunctionItem`s or YAML files containing the rubric templates
- A `Config` specification, describing how the evaluation should be executed. The `Config` includes details such as multi-threaded metric computation and logging.
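A sketch of how an `EvalRun` might tie these pieces together follows. The `EvalRun` import path is from the docs above; the `Config` import location and all field names below are illustrative assumptions.

```python
# Hypothetical sketch: the Config import location and every field name below
# are assumptions made for illustration, not the actual FlexEval schema.
from flexeval.schema.evalrun_schema import EvalRun
from flexeval.schema.config_schema import Config  # assumed module path

run = EvalRun(
    data_sources=["conversations.jsonl"],  # assumed field: conversation inputs
    database_path="results.sqlite",        # assumed field: SQLite output file
    eval=eval_spec,                        # the Eval defined earlier
    function_modules=["my_metrics.py"],    # assumed field: where FunctionItem functions live
    rubric_paths=["rubrics.yaml"],         # assumed field: YAML rubric templates
    config=Config(max_workers=4),          # assumed field: execution settings
)
```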
Data Hierarchy
Metrics can operate at any of four levels of granularity:

- `Thread`: Full conversation
- `Turn`: Adjacent set of messages from the same user or assistant
- `Message`: Individual message from user or assistant
- `ToolCall`: Function/tool invocation within a message

Metrics operate at the `Turn` level by default, but you can override a `MetricItem`'s `metric_level`.
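As a sketch, overriding the granularity for one metric might look like this; `metric_level` is named in the docs above, but the `"Message"` literal and the other field names are assumptions.

```python
# Hypothetical sketch: metric_level is documented above, but the accepted
# values and the other field names are assumptions for illustration.
from flexeval.schema.eval_schema import FunctionItem

message_length = FunctionItem(
    function_name="message_length",  # assumed field name
    metric_level="Message",          # compute this metric per message instead of per Turn
)
```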