# FlexEval documentation
FlexEval is a tool for designing custom metrics, completion functions, and LLM-graded rubrics for evaluating the behavior of LLM-powered systems.
Read about the motivation and design of FlexEval in our paper at Educational Data Mining 2024.
Get started with FlexEval, go deeper with the User guide, or learn by example in the Vignettes.
## Basic Usage
Install using pip:
```bash
pip install python-flexeval
```
Create and run an evaluation:
```python
import flexeval
from flexeval.schema import Eval, EvalRun, FileDataSource, Metrics, FunctionItem, Config

# Input: a JSONL file containing one conversation per line.
data_sources = [FileDataSource(path="vignettes/conversations.jsonl")]

# Metric: Flesch reading ease, computed for every conversational turn.
evaluation = Eval(metrics=Metrics(function=[FunctionItem(name="flesch_reading_ease")]))

# clear_tables=True clears any existing result tables before the run.
config = Config(clear_tables=True)

eval_run = EvalRun(
    data_sources=data_sources,
    database_path="eval_results.db",  # where metric values are stored
    eval=evaluation,
    config=config,
)
flexeval.run(eval_run)
```
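The data source is a JSONL file with one conversation per line. FlexEval's exact record schema is documented in Getting started; as a rough sketch, assuming a chat-style message layout (the field names below are illustrative, not FlexEval's documented format), such a file could be written like this:

```python
import json
from pathlib import Path

# Hypothetical conversation records: the "input" message layout is an
# assumption for illustration, not FlexEval's documented schema.
conversations = [
    {"input": [
        {"role": "user", "content": "Hi!"},
        {"role": "assistant", "content": "Hello! How can I help?"},
    ]},
]

path = Path("vignettes/conversations.jsonl")
path.parent.mkdir(exist_ok=True)
with path.open("w") as f:
    for record in conversations:
        f.write(json.dumps(record) + "\n")
```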
This example computes Flesch reading ease for every turn in a list of conversations provided in JSONL format. The metric values are stored in an SQLite database called eval_results.db.
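Because the results are an ordinary SQLite file, they can be inspected with Python's built-in sqlite3 module. FlexEval's table layout isn't described here, so this sketch simply lists the tables the run produced:

```python
import sqlite3

# Open the results database written by flexeval.run and list its tables.
conn = sqlite3.connect("eval_results.db")
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(name)
conn.close()
```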
Read more in Getting started and see additional usage examples in the Vignettes.