flexeval.schema.evalrun_schema#
The top-level EvalRun schema and associated sub-schemas.
Functions

- get_default_function_metrics(): Utility function to retrieve the default function collection.
- get_default_rubrics(): Utility function to retrieve the default rubric collection.
- pydantic model flexeval.schema.evalrun_schema.DataSource[source]#
Bases: BaseModel
Show JSON schema
{ "title": "DataSource", "type": "object", "properties": { "name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "", "title": "Name" }, "notes": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "", "title": "Notes" } } }
- Fields:
name (str | None)
notes (str | None)
- pydantic model flexeval.schema.evalrun_schema.EvalRun[source]#
Bases: BaseModel
EvalRun defines the schema that FlexEval expects.
At a minimum, you need to provide a set of input data sources and an Eval.
You can evaluate an EvalRun using run().
Read more in the User guide. A minimal usage sketch follows the field list below.
Show JSON schema
{ "title": "EvalRun", "description": "EvalRun defines the schema that FlexEval expects.\n\nAt a minimum, you need to provide a set of input data sources and an :class:`~flexeval.schema.eval_schema.Eval`.\n\nYou can evaluate an EvalRun using :func:`~flexeval.runner.run`.\n\nRead more in the :ref:`user_guide`.", "type": "object", "properties": { "data_sources": { "description": "List of data sources.", "items": { "$ref": "#/$defs/FileDataSource" }, "minItems": 1, "title": "Data Sources", "type": "array" }, "database_path": { "default": "flexeval/results/results.db", "description": "Output database path.", "format": "path", "title": "Database Path", "type": "string" }, "eval": { "$ref": "#/$defs/Eval", "description": "The evaluation to apply to the data sources." }, "config": { "$ref": "#/$defs/Config", "description": "Configuration details." }, "rubric_paths": { "description": "Additional sources for rubrics. If a Path, should be a YAML file in the expected format.", "items": { "anyOf": [ { "format": "path", "type": "string" }, { "$ref": "#/$defs/RubricsCollection" } ] }, "title": "Rubric Paths", "type": "array" }, "function_modules": { "description": "Additional sources for functions.", "items": { "anyOf": [ { "format": "file-path", "type": "string" }, {} ] }, "title": "Function Modules", "type": "array" }, "add_default_functions": { "default": true, "description": "If the default functions at :mod:`flexeval.configuration.function_metrics` should be made available.", "title": "Add Default Functions", "type": "boolean" } }, "$defs": { "CompletionLlm": { "additionalProperties": false, "properties": { "function_name": { "description": "Completion function defined in `completion_functions.py` or available in the global namespace.", "title": "Function Name", "type": "string" }, "include_system_prompt": { "default": true, "title": "Include System Prompt", "type": "boolean" }, "kwargs": { "additionalProperties": true, "description": "Additional arguments that will be passed to the completion function. Must correspond to arguments in the named function.", "title": "Kwargs", "type": "object" } }, "required": [ "function_name" ], "title": "CompletionLlm", "type": "object" }, "Config": { "properties": { "logs_path": { "anyOf": [ { "format": "path", "type": "string" }, { "type": "null" } ], "default": null, "description": "Log directory path.", "title": "Logs Path" }, "env_filepath": { "anyOf": [ { "format": "path", "type": "string" }, { "type": "null" } ], "default": null, "description": "A .env file to be processed by python-dotenv before running evals with this config.", "title": "Env Filepath" }, "env": { "additionalProperties": true, "description": "Any additional environment variables.", "title": "Env", "type": "object" }, "clear_tables": { "default": false, "description": "Clear any existing tables, if the output SQLite database already exists.", "title": "Clear Tables", "type": "boolean" }, "max_workers": { "default": 1, "description": "Max worker count. Multiple threads will be used if set to > 1. 
This may have usage limit implications if you are calling APIs.", "title": "Max Workers", "type": "integer" }, "random_seed_conversation_sampling": { "default": 42, "title": "Random Seed Conversation Sampling", "type": "integer" }, "max_n_conversation_threads": { "default": 50, "title": "Max N Conversation Threads", "type": "integer" }, "nb_evaluations_per_thread": { "default": 1, "title": "Nb Evaluations Per Thread", "type": "integer" }, "raise_on_completion_error": { "default": false, "description": "If False (default), metrics will be run even if one or more completions fails.", "title": "Raise On Completion Error", "type": "boolean" }, "raise_on_metric_error": { "default": false, "description": "If False (default), no exception will be thrown if a metric function raises an exception.", "title": "Raise On Metric Error", "type": "boolean" } }, "title": "Config", "type": "object" }, "DependsOnItem": { "additionalProperties": false, "properties": { "name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Name of the dependency function or rubric.", "title": "Name" }, "type": { "anyOf": [ { "enum": [ "function", "rubric" ], "type": "string" }, { "type": "null" } ], "default": null, "description": "One of 'function' or 'rubric' indicating the type of the dependency.", "title": "Type" }, "kwargs": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "description": "The keyword arguments for the dependency. If provided, used to match which evaluation this dependency is for, so must match the keyword args given for some evaluation.", "title": "Kwargs" }, "metric_name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Name of the metric dependency. This may be different than function_name if the metric function returns a key/value pair - in which case, this will match the key.", "title": "Metric Name" }, "metric_level": { "anyOf": [ { "enum": [ "Message", "Turn", "Thread", "ToolCall" ], "type": "string" }, { "type": "null" } ], "default": null, "description": "The level of the metric to depend on, which must be equal to or 'greater' than the dependent metric's level. e.g. a Turn can depend on a Thread metric, but not the reverse.", "title": "Metric Level" }, "relative_object_position": { "default": 0, "description": "The position of the object within the Thread. If 0 (default), this is the metric value for the current object. If -1, this is the metric value for the most recent object before this one.", "maximum": 0, "title": "Relative Object Position", "type": "integer" }, "metric_min_value": { "anyOf": [ { "type": "number" }, { "type": "null" } ], "default": -1.7976931348623157e+308, "description": "Minimum value of the dependency to consider it as satisfied.", "title": "Metric Min Value" }, "metric_max_value": { "anyOf": [ { "type": "number" }, { "type": "null" } ], "default": 1.7976931348623157e+308, "description": "Maximum value of the dependency to consider it as satisfied.", "title": "Metric Max Value" } }, "title": "DependsOnItem", "type": "object" }, "Eval": { "additionalProperties": true, "description": "Defines the evaluation that should be executed.\n\nThe key fields are :attr:`metrics` and :attr:`grader_llm`.", "properties": { "do_completion": { "default": false, "description": "Flag to determine if completions should be done in each thread. Set to 'true' if you are testing a new API and want to evaluate the API responses. 
Set to 'false' (default) if you are evaluating past conversations and do not need to generate new completions.", "title": "Do Completion", "type": "boolean" }, "name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Name of the test suite. Used as metadata only. Does not need to match the key of the entry in the evals.yaml file.", "title": "Name" }, "notes": { "default": "", "description": "Additional notes regarding the configuration. Used as metadata only.", "title": "Notes", "type": "string" }, "metrics": { "$ref": "#/$defs/Metrics", "description": "Metrics to use in the evaluation." }, "completion_llm": { "anyOf": [ { "$ref": "#/$defs/CompletionLlm" }, { "type": "null" } ], "default": null, "description": "Specification of the LLM or API used to perform new completions. Must be defined if `do_completions: true` is set." }, "grader_llm": { "anyOf": [ { "$ref": "#/$defs/GraderLlm" }, { "type": "null" } ], "default": null, "description": "Specification of the LLM or API used to grade rubrics. Must be defined if any rubric_metrics are specified." } }, "title": "Eval", "type": "object" }, "FileDataSource": { "description": "File to be used as a data source.", "properties": { "name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "", "title": "Name" }, "notes": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "", "title": "Notes" }, "path": { "description": "Absolute or relative path to data file. Each file must be in jsonl format, with one conversation per line.", "format": "file-path", "title": "Path", "type": "string" }, "format": { "const": "jsonl", "default": "jsonl", "description": "Format of the data file.", "title": "Format", "type": "string" } }, "required": [ "path" ], "title": "FileDataSource", "type": "object" }, "FunctionItem": { "properties": { "name": { "description": "The function to call or name of rubric to use to compute this metric.", "title": "Name", "type": "string" }, "depends_on": { "anyOf": [ { "items": { "$ref": "#/$defs/DependsOnItem" }, "type": "array" }, { "type": "null" } ], "description": "List of dependencies that must be satisfied for this metric to be computed.", "title": "Depends On" }, "metric_level": { "anyOf": [ { "enum": [ "Message", "Turn", "Thread", "ToolCall" ], "type": "string" }, { "type": "null" } ], "default": "Turn", "description": "What level of granularity (ToolCall, Message, Turn, or Thread) this rubric should be applied to", "title": "Metric Level" }, "kwargs": { "additionalProperties": true, "description": "Keyword arguments for the function. Each key must correspond to an argument in the function. Extra keys will cause an error.", "title": "Kwargs", "type": "object" } }, "required": [ "name" ], "title": "FunctionItem", "type": "object" }, "GraderLlm": { "additionalProperties": false, "properties": { "function_name": { "description": "Function defined in `completion_functions.py`. We're not really completing a conversation, but we ARE asking an LLM to provide a response to an input - in this case, the rubric.", "title": "Function Name", "type": "string" }, "kwargs": { "additionalProperties": true, "description": "Additional arguments that will be passed to the completion function. 
Must correspond to arguments in tne named function.", "title": "Kwargs", "type": "object" } }, "required": [ "function_name" ], "title": "GraderLlm", "type": "object" }, "Metrics": { "description": "Defines the metrics to be evaluated.", "properties": { "function": { "anyOf": [ { "items": { "$ref": "#/$defs/FunctionItem" }, "type": "array" }, { "type": "null" } ], "default": null, "description": "List of function-based metrics to be evaluated.", "title": "Function" }, "rubric": { "anyOf": [ { "items": { "$ref": "#/$defs/RubricItem" }, "type": "array" }, { "type": "null" } ], "default": null, "description": "List of rubrics to be evaluated.", "title": "Rubric" } }, "title": "Metrics", "type": "object" }, "Rubric": { "properties": { "prompt": { "description": "Prompt for the rubric.", "title": "Prompt", "type": "string" }, "choice_scores": { "additionalProperties": { "anyOf": [ { "type": "integer" }, { "type": "number" } ] }, "description": "Choices.", "title": "Choice Scores", "type": "object" }, "name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Optional name of the rubric.", "title": "Name" }, "notes": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Optional notes.", "title": "Notes" } }, "required": [ "prompt" ], "title": "Rubric", "type": "object" }, "RubricItem": { "properties": { "name": { "description": "The function to call or name of rubric to use to compute this metric.", "title": "Name", "type": "string" }, "depends_on": { "anyOf": [ { "items": { "$ref": "#/$defs/DependsOnItem" }, "type": "array" }, { "type": "null" } ], "description": "List of dependencies that must be satisfied for this metric to be computed.", "title": "Depends On" }, "metric_level": { "anyOf": [ { "enum": [ "Message", "Turn", "Thread", "ToolCall" ], "type": "string" }, { "type": "null" } ], "default": "Turn", "description": "What level of granularity (ToolCall, Message, Turn, or Thread) this rubric should be applied to", "title": "Metric Level" }, "kwargs": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "description": "Keyword arguments for the rubric evaluation.", "title": "Kwargs" } }, "required": [ "name" ], "title": "RubricItem", "type": "object" }, "RubricsCollection": { "description": "Collection of rubrics that can be used as :class:`~flexeval.schema.eval_schema.RubricItem`\\s.", "properties": { "rubrics": { "additionalProperties": { "$ref": "#/$defs/Rubric" }, "description": "Mapping of rubric names to Rubrics. The rubric names are used for matching metrics to specific rubrics.", "title": "Rubrics", "type": "object" } }, "title": "RubricsCollection", "type": "object" } }, "required": [ "data_sources", "eval" ] }
- Fields:
add_default_functions (bool)
config (Config)
data_sources (list[FileDataSource])
database_path (Path)
eval (Eval)
function_modules (list[Path | FunctionsCollection | ModuleType])
rubric_paths (list[Path | RubricsCollection])
- field add_default_functions: bool = True#
If the default functions at flexeval.configuration.function_metrics should be made available.
- field data_sources: Annotated[list[FileDataSource], Len(min_length=1, max_length=None)] [Required]#
List of data sources.
- Constraints:
min_length = 1
- field function_modules: list[Path | FunctionsCollection | ModuleType] [Optional]#
Additional sources for functions.
- field rubric_paths: list[Path | RubricsCollection] [Optional]#
Additional sources for rubrics. If a Path, should be a YAML file in the expected format.
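A minimal sketch of constructing and evaluating an EvalRun, assuming a local conversations.jsonl file and a hypothetical "string_length" function metric; Eval, Metrics, and FunctionItem are assumed to live in flexeval.schema.eval_schema per the cross-references above, and run is the flexeval.runner.run entry point named in the class docstring.

```python
# Minimal sketch, not a verified invocation: the data file path and the
# metric name "string_length" are placeholders.
from flexeval.runner import run  # run() is referenced in the EvalRun docstring
from flexeval.schema.evalrun_schema import EvalRun, FileDataSource
from flexeval.schema.eval_schema import Eval, Metrics, FunctionItem  # assumed module per the cross-references

eval_run = EvalRun(
    data_sources=[FileDataSource(path="conversations.jsonl")],  # jsonl, one conversation per line
    eval=Eval(
        name="smoke-test",
        metrics=Metrics(
            function=[FunctionItem(name="string_length", metric_level="Turn")]  # hypothetical metric name
        ),
    ),
    database_path="results/results.db",  # defaults to flexeval/results/results.db
)

# run() is documented to evaluate an EvalRun; the exact signature is not shown on this page.
run(eval_run)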
- pydantic model flexeval.schema.evalrun_schema.FileDataSource[source]#
Bases: DataSource
File to be used as a data source.
Show JSON schema
{ "title": "FileDataSource", "description": "File to be used as a data source.", "type": "object", "properties": { "name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "", "title": "Name" }, "notes": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "", "title": "Notes" }, "path": { "description": "Absolute or relative path to data file. Each file must be in jsonl format, with one conversation per line.", "format": "file-path", "title": "Path", "type": "string" }, "format": { "const": "jsonl", "default": "jsonl", "description": "Format of the data file.", "title": "Format", "type": "string" } }, "required": [ "path" ] }
- Fields:
format (str)
name (str | None)
notes (str | None)
path (Path)
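A short sketch of declaring a file-backed data source; the path is a placeholder, and per the field descriptions the file must be jsonl with one conversation per line (format already defaults to "jsonl").

```python
from flexeval.schema.evalrun_schema import FileDataSource

# Placeholder path; the "path" field expects an existing jsonl file.
source = FileDataSource(
    name="support-chats",           # optional metadata
    notes="Exported March sample",  # optional metadata
    path="data/conversations.jsonl",
)
```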
- pydantic model flexeval.schema.evalrun_schema.FunctionsCollection[source]#
Bases: BaseModel
Collection of functions that can be used as FunctionItems.
Show JSON schema
{ "title": "FunctionsCollection", "type": "object", "properties": { "functions": { "default": null, "title": "Functions" } } }
- Fields:
functions ()
- pydantic model flexeval.schema.evalrun_schema.IterableDataSource[source]#
Bases: DataSource
Not yet implemented.
Show JSON schema
{ "title": "IterableDataSource", "description": "Not yet implemented.", "type": "object", "properties": { "name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "", "title": "Name" }, "notes": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "", "title": "Notes" }, "contents": { "description": "Iterable of data items, presumably in the jsonl format (for now).", "items": {}, "title": "Contents", "type": "array" } } }
- Fields:
name (str | None)
notes (str | None)
- flexeval.schema.evalrun_schema.get_default_function_metrics() list[Path | FunctionsCollection | ModuleType] [source]#
Utility function to retrieve the default function collection.
- flexeval.schema.evalrun_schema.get_default_rubrics() list[Path | RubricsCollection] [source]#
Utility function to retrieve the default rubric collection.
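A hedged sketch of forwarding the default collections into an EvalRun; the contents of the returned lists are not shown on this page, so the example only passes them to the fields they are documented to populate (function_modules and rubric_paths). The extra module path and rubric YAML file are hypothetical, and the import location of Eval is assumed from the cross-references above.

```python
from flexeval.schema.evalrun_schema import (
    EvalRun,
    FileDataSource,
    get_default_function_metrics,
    get_default_rubrics,
)
from flexeval.schema.eval_schema import Eval  # assumed module per the cross-references

# add_default_functions=True already makes flexeval.configuration.function_metrics
# available, so passing the defaults explicitly is mainly useful when appending
# your own modules or rubric YAML files to the same lists.
eval_run = EvalRun(
    data_sources=[FileDataSource(path="conversations.jsonl")],  # placeholder path
    eval=Eval(),  # defaults; in practice specify metrics as in the EvalRun sketch above
    function_modules=get_default_function_metrics() + ["my_metrics.py"],  # hypothetical extra module
    rubric_paths=get_default_rubrics() + ["my_rubrics.yaml"],             # hypothetical extra rubric file
)
```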