Abstractions#

FlexEval’s API is built around several abstractions. These abstractions are expressed as Pydantic models, and understanding them will make FlexEval considerably easier to use.

Key abstractions#

FlexEval is a tool for executing evaluations.

An evaluation is represented by flexeval.schema.eval_schema.Eval and contains a set of MetricItems to apply to the test data. There are two kinds of MetricItem, both combined in the sketch after this list:

  • Functions: FunctionItems apply a Python function to the test data, returning a numeric value.

  • Rubrics: RubricItems use a configured GraderLlm function and the provided rubric template to generate a numeric score from an LLM’s output.
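As a minimal sketch, an Eval mixing both metric types might look like the following. The import locations for FunctionItem and RubricItem and the field names shown (name, metrics) are illustrative assumptions, not the exact schema.

    # Sketch only: assumes FunctionItem and RubricItem live alongside Eval
    # in eval_schema, and that metrics are referenced by name.
    from flexeval.schema.eval_schema import Eval, FunctionItem, RubricItem

    eval_spec = Eval(
        name="helpfulness_eval",              # assumed field name
        metrics=[
            # Assumed: a FunctionItem points at a Python function by name.
            FunctionItem(name="response_length"),
            # Assumed: a RubricItem points at a rubric template by name.
            RubricItem(name="is_helpful"),
        ],
    )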

You execute an Eval by creating a flexeval.schema.evalrun_schema.EvalRun. An EvalRun contains:

  • Data sources (conversations as inputs, an SQLite filepath as output)

  • An Eval specification, containing the metrics to compute

  • Sources for the metrics defined in the Eval, e.g. Python modules containing the functions referenced in FunctionItems or YAML files containing the rubric templates.

  • A Config specification, describing how the evaluation should be executed.

The Config covers details such as multi-threaded metric computation and logging. The sketch below shows how these pieces fit together.
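Continuing the sketch, an EvalRun wires the Eval to its data sources, metric sources, and Config. The field names (data_sources, database_path, function_modules, rubric_paths, config) and the Config import path are assumptions for illustration.

    # Sketch only: import paths and field names below are assumptions.
    from flexeval.schema.evalrun_schema import EvalRun
    from flexeval.schema.config_schema import Config  # assumed module path

    run = EvalRun(
        data_sources=["conversations.jsonl"],      # assumed: conversation inputs
        database_path="results.sqlite",            # assumed: SQLite output file
        eval=eval_spec,                            # the Eval defined above
        function_modules=["my_metric_functions"],  # assumed: module with FunctionItem functions
        rubric_paths=["rubrics.yaml"],             # assumed: YAML file with rubric templates
        config=Config(num_workers=4),              # assumed: multi-threading / logging settings
    )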

Data Hierarchy#

Metrics can operate at any of four levels of granularity:

  • Thread: Full conversation

  • Turn: A contiguous set of adjacent messages from the same speaker (user or assistant)

  • Message: Individual message from user or assistant

  • ToolCall: Function/tool invocation within a message

Metrics operate at the Turn level by default, but you can override a MetricItem's metric_level.
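As a sketch, overriding metric_level on individual MetricItems might look like the following. The level spellings ("Thread", "Message") and the field names other than metric_level are assumptions about the schema.

    # Sketch only: level spellings and field names other than metric_level
    # are assumptions.
    from flexeval.schema.eval_schema import FunctionItem, RubricItem

    # Score the conversation as a whole rather than each Turn.
    turn_count = FunctionItem(name="turn_count", metric_level="Thread")

    # Score every individual message instead of each Turn.
    politeness = RubricItem(name="politeness", metric_level="Message")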