Abstractions#

FlexEval’s API is built around several abstractions. These abstractions are expressed as Pydantic models, and understanding them will make FlexEval considerably easier to use.

Key abstractions#

FlexEval is a tool for executing evaluations.

An evaluation is represented by flexeval.schema.eval_schema.Eval and contains a set of MetricItems to apply to the test data. There are two kinds of MetricItem, both combined in the sketch after this list:

  • Functions: FunctionItems apply a Python function to the test data, returning a numeric value.

  • Rubrics: RubricItems use a configured GraderLlm function and the provided rubric template to generate a numeric score from an LLM’s output.
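As a minimal sketch, an Eval mixing both metric types might look like the following. The import locations for FunctionItem and RubricItem and the field names shown (name, metrics) are illustrative assumptions, not the exact schema.

    # Sketch only: assumes FunctionItem and RubricItem live alongside Eval
    # in eval_schema, and that metrics are referenced by name.
    from flexeval.schema.eval_schema import Eval, FunctionItem, RubricItem

    eval_spec = Eval(
        name="helpfulness_eval",              # assumed field name
        metrics=[
            # Assumed: a FunctionItem points at a Python function by name.
            FunctionItem(name="response_length"),
            # Assumed: a RubricItem points at a rubric template by name.
            RubricItem(name="is_helpful"),
        ],
    )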

You execute an Eval by creating a flexeval.schema.evalrun_schema.EvalRun. An EvalRun contains:

  • Data sources (conversations as inputs, an SQLite filepath as output)

  • An Eval specification, containing the metrics to compute

  • Sources for the metrics defined in the Eval, e.g. Python modules containing the functions referenced in FunctionItems or YAML files containing the rubric templates.

  • A Config specification, describing how the evaluation should be executed.

The Config covers details such as multi-threaded metric computation and logging. The sketch below shows how these pieces fit together.
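Continuing the sketch, an EvalRun wires the Eval to its data sources, metric sources, and Config. The field names (data_sources, database_path, function_modules, rubric_paths, config) and the Config import path are assumptions for illustration.

    # Sketch only: import paths and field names below are assumptions.
    from flexeval.schema.evalrun_schema import EvalRun
    from flexeval.schema.config_schema import Config  # assumed module path

    run = EvalRun(
        data_sources=["conversations.jsonl"],      # assumed: conversation inputs
        database_path="results.sqlite",            # assumed: SQLite output file
        eval=eval_spec,                            # the Eval defined above
        function_modules=["my_metric_functions"],  # assumed: module with FunctionItem functions
        rubric_paths=["rubrics.yaml"],             # assumed: YAML file with rubric templates
        config=Config(num_workers=4),              # assumed: multi-threading / logging settings
    )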

Data Hierarchy#

Metrics can operate at any of four levels of granularity:

  • Thread: Full conversation

  • Turn: A contiguous set of adjacent messages from the same speaker (user or assistant)

  • Message: Individual message from user or assistant

  • ToolCall: Function/tool invocation within a message

Metrics operate at the Turn level by default, but you can override a MetricItem's metric_level.
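As a sketch, overriding metric_level on individual MetricItems might look like the following. The level spellings ("Thread", "Message") and the field names other than metric_level are assumptions about the schema.

    # Sketch only: level spellings and field names other than metric_level
    # are assumptions.
    from flexeval.schema.eval_schema import FunctionItem, RubricItem

    # Score the conversation as a whole rather than each Turn.
    turn_count = FunctionItem(name="turn_count", metric_level="Thread")

    # Score every individual message instead of each Turn.
    politeness = RubricItem(name="politeness", metric_level="Message")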