flexeval

FlexEval is a Python package for designing custom metrics, completion functions, and LLM-graded rubrics for evaluating the behavior of LLM-powered systems.

The top-level package exposes the run() function.

flexeval.run(eval_run: EvalRun) → EvalRunner

Runs the evaluations described by eval_run and returns the resulting EvalRunner.
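The call shape of this entry point can be sketched with stdlib stand-ins. Everything below except the signature run(eval_run: EvalRun) → EvalRunner is an assumption: the real EvalRun is a Pydantic model defined in flexeval.schema, and its fields are not shown on this page.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for flexeval.schema.EvalRun and the EvalRunner
# returned by flexeval.run(); field names here are illustrative assumptions.
@dataclass
class EvalRun:
    data_sources: list = field(default_factory=list)
    metrics: list = field(default_factory=list)

@dataclass
class EvalRunner:
    eval_run: EvalRun
    results: dict = field(default_factory=dict)

def run(eval_run: EvalRun) -> EvalRunner:
    # The real implementation loads data, resolves metric dependencies,
    # computes metrics, and persists results; this sketch only echoes
    # the configured metrics back as empty result slots.
    runner = EvalRunner(eval_run)
    runner.results = {m: [] for m in eval_run.metrics}
    return runner

runner = run(EvalRun(data_sources=["conversations.jsonl"],
                     metrics=["string_length"]))
print(sorted(runner.results))
```

The point is the shape of the API: a declarative EvalRun configuration goes in, and a single EvalRunner holding the run's state comes back.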

Modules

classes

Peewee classes used for saving the results of a FlexEval run.

cli

CLI commands.

completions

Completing conversations using LLMs.

compute_metrics

Utilities for determining which metric computations are needed and invoking them.

configuration

Built-in completion functions, function metrics, and rubric metrics.

data_loader

Dataset loading functions.

db_utils

Peewee database utilities.

dependency_graph

Determines how configured metrics depend on each other.

function_types

Inspection utilities that use type hints to determine the appropriate object to pass to a function metric.

helpers

Generic utility functions.

io

Input/output utilities, primarily for reading and writing the schema objects.

log_utils

Logging utilities.

metrics

Utility functions for accessing metrics.

rubric

Rubric metric I/O utilities.

run_utils

Utilities for the runner module.

runner

Convenience functions for running an EvalRun.

schema

Pydantic schema for the core configuration options used for FlexEval.
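The job of dependency_graph, ordering configured metrics so that each runs after the things it depends on, can be illustrated with the standard library's graphlib. The metric names and edges below are invented for illustration; how FlexEval actually builds its graph is internal to the package.

```python
from graphlib import TopologicalSorter

# Invented example dependencies: a rubric grade can only be computed after
# the completion it grades exists, and an aggregate score can only be
# computed after the per-metric values it summarizes.
deps = {
    "completion": set(),
    "string_length": {"completion"},
    "rubric_grade": {"completion"},
    "aggregate_score": {"string_length", "rubric_grade"},
}

# static_order() yields each node only after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Running the metrics in this order guarantees every input is available when a metric is invoked, which is exactly the property a metric scheduler needs.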
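The approach described for function_types, reading a function metric's type hints to decide which object to hand it, can be sketched with typing.get_type_hints. The Turn and Conversation classes and the two metrics below are illustrative assumptions, not FlexEval's actual types.

```python
from typing import get_type_hints

class Turn(str):
    """Stand-in for a single conversational turn (assumption)."""

class Conversation(list):
    """Stand-in for a full conversation, a list of Turns (assumption)."""

def string_length(turn: Turn) -> int:
    # Example function metric that expects a single turn.
    return len(turn)

def num_turns(conversation: Conversation) -> int:
    # Example function metric that expects the whole conversation.
    return len(conversation)

def invoke_metric(metric, conversation: Conversation) -> int:
    # Inspect the first parameter's annotation to choose the argument:
    # pass the whole conversation or just its latest turn.
    hints = get_type_hints(metric)
    first_param = next(name for name in hints if name != "return")
    if hints[first_param] is Conversation:
        return metric(conversation)
    return metric(conversation[-1])

convo = Conversation([Turn("hi"), Turn("hello there")])
print(invoke_metric(string_length, convo), invoke_metric(num_turns, convo))
```

The design lets metric authors declare what they need through an ordinary signature, with the framework, rather than the author, doing the plumbing.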