Abstractions
FlexEval’s API uses several abstractions.
These abstractions are expressed as pydantic objects, and understanding them will make FlexEval much easier to use.
Key abstractions
FlexEval is a tool for executing evaluations.
An evaluation is represented by flexeval.schema.eval_schema.Eval, and contains a set of MetricItems to apply to the test data.
- Functions: FunctionItems apply a Python function to the test data, returning a numeric value.
- Rubrics: RubricItems use a configured GraderLlm function and the provided rubric template to generate a numeric score from an LLM's output.
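For concreteness, here is a rough sketch of what defining an Eval with both kinds of metric might look like. The import path for Eval is the one given above; the FunctionItem and RubricItem import location and all field names shown are illustrative assumptions, not confirmed API:

```python
# Hedged sketch -- field names below are illustrative assumptions;
# check the flexeval.schema reference for the real signatures.
from flexeval.schema.eval_schema import Eval  # path given in these docs

# FunctionItem / RubricItem import location is assumed:
from flexeval.schema.eval_schema import FunctionItem, RubricItem

my_eval = Eval(
    metrics=[  # hypothetical field holding the MetricItems
        # Apply a Python function to the test data -> numeric value.
        FunctionItem(function_name="response_word_count"),
        # Have the configured GraderLlm score the LLM's output
        # against a rubric template supplied later via the EvalRun.
        RubricItem(rubric_name="helpfulness"),
    ],
)
```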
You execute an Eval by creating a flexeval.schema.evalrun_schema.EvalRun.
EvalRun contains:
- Data sources (conversations as inputs, an SQLite filepath as output)
- An Eval specification, containing the metrics to compute
- Sources for the metrics defined in the Eval, e.g. Python modules containing the functions referenced in FunctionItems or YAML files containing the rubric templates
- A Config specification, describing how evaluation should be executed

The Config includes details such as multi-threaded metric computation, logging, and so on.
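Putting these pieces together, assembling an EvalRun might look roughly like the sketch below. Only the EvalRun and Eval import paths come from this page; the Config import location and every field name (data_path, database_path, eval, function_modules, rubric_paths, config) are hypothetical placeholders:

```python
# Hedged sketch -- field names are illustrative placeholders.
from flexeval.schema.evalrun_schema import EvalRun
from flexeval.schema.eval_schema import Eval
from flexeval.schema import Config  # import path assumed

run = EvalRun(
    # Data sources: conversations in, SQLite file out.
    data_path="conversations.jsonl",           # hypothetical field
    database_path="results.sqlite",            # hypothetical field
    # The Eval specification with its MetricItems.
    eval=my_eval,                              # hypothetical field
    # Sources for the metrics named in the Eval.
    function_modules=["my_metric_functions"],  # hypothetical field
    rubric_paths=["rubrics/helpfulness.yaml"], # hypothetical field
    # Execution settings (threading, logging, ...).
    config=Config(),                           # hypothetical defaults
)
```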
Data Hierarchy
Metrics can operate at any of four levels of granularity:
- Thread: Full conversation
- Turn: Adjacent set of messages from the same user or assistant
- Message: Individual message from user or assistant
- ToolCall: Function/tool invocation within a message
Metrics operate at the Turn level by default, but you can override a MetricItem's metric_level.
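For example, a metric that should run once per individual message rather than per turn might be declared as in the sketch below. The metric_level field name comes from this page; the accepted string values, the other field names, and the metric function signature are assumptions:

```python
# Hedged sketch: overriding the default Turn-level granularity.
from flexeval.schema.eval_schema import FunctionItem  # import path assumed

# The plain Python function the FunctionItem refers to; the exact
# signature FlexEval expects is an assumption here.
def word_count(message: str) -> int:
    """Numeric metric: whitespace-separated token count of one message."""
    return len(message.split())

per_message_metric = FunctionItem(
    function_name="word_count",  # hypothetical field name
    metric_level="Message",      # override the Turn default (value assumed)
)
```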