Abstractions#

FlexEval’s API uses several abstractions. These abstractions are expressed through pydantic objects, and understanding them will enable you to have a nicer time using FlexEval.

Key abstractions#

FlexEval is a tool for executing evaluations.

An evaluation is represented by flexeval.schema.eval_schema.Eval, and contains a set of MetricItems to apply to the test data.

  • Functions: FunctionItems apply a Python function to the test data, returning a numeric value.

  • Rubrics: RubricItems use a configured GraderLlm function and the provided rubric template to generate a numeric score from an LLM’s output.

You execute an Eval by creating an flexeval.schema.evalrun_schema.EvalRun. EvalRun contains:

  • Data sources (conversations as inputs, an SQLite filepath as output)

  • An Eval specification, containing the metrics to compute

  • Sources for the metrics defined in the Eval e.g. Python modules containing the functions referenced in FunctionItems or YAML files containing the rubric templates.

  • A Config specification, describing how evaluation should be executed.

The Config includes details about multi-threaded metric computation, about logging, etc.

Data Sources#

Data sources can be any of these types:

  • FileDataSource (type: file): Load from a JSONL or LangGraph SQLite file. This is the most common data source.

  • NamedDataSource (type: named): Reference a previously loaded dataset by name, enabling dataset reuse across eval runs.

  • IterableDataSource (type: iterable): Load from an in-memory Python iterable (programmatic use only).

In YAML configurations, specify the type field:

data_sources:
  - type: file
    path: conversations.jsonl

In Python, the type is set automatically when you construct the appropriate class:

data_sources = [FileDataSource(path="conversations.jsonl")]

Data Hierarchy#

Data is organized at several levels of granularity:

  • Dataset: A loaded collection of conversations. Datasets can be shared across multiple eval runs.

  • Thread: Full conversation

  • Turn: Adjacent set of messages from the same user or assistant

  • Message: Individual message from user or assistant

  • ToolCall: Function/tool invocation within a message

Metrics operate at the Turn level by default, but you can override a MetricItem's metric_level.