flexeval.schema.evalrun_schema#

The top-level EvalRun schema and its associated sub-schemas.

Functions

get_default_function_metrics()

Utility function to retrieve the default function collection.

get_default_rubrics()

Utility function to retrieve the default rubric collection.

pydantic model flexeval.schema.evalrun_schema.DataSource[source]#

Bases: BaseModel

Show JSON schema
{
   "title": "DataSource",
   "type": "object",
   "properties": {
      "name": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "",
         "title": "Name"
      },
      "notes": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "",
         "title": "Notes"
      }
   }
}

Fields:
field name: str | None = None#
field notes: str | None = None#
pydantic model flexeval.schema.evalrun_schema.EvalRun[source]#

Bases: BaseModel

EvalRun defines the schema that FlexEval expects.

At a minimum, you need to provide a set of input data sources and an Eval.

You can evaluate an EvalRun using run().

Read more in the User guide.

Show JSON schema
{
   "title": "EvalRun",
   "description": "EvalRun defines the schema that FlexEval expects.\n\nAt a minimum, you need to provide a set of input data sources and an :class:`~flexeval.schema.eval_schema.Eval`.\n\nYou can evaluate an EvalRun using :func:`~flexeval.runner.run`.\n\nRead more in the :ref:`user_guide`.",
   "type": "object",
   "properties": {
      "data_sources": {
         "description": "List of data sources.",
         "items": {
            "$ref": "#/$defs/FileDataSource"
         },
         "minItems": 1,
         "title": "Data Sources",
         "type": "array"
      },
      "database_path": {
         "default": "flexeval/results/results.db",
         "description": "Output database path.",
         "format": "path",
         "title": "Database Path",
         "type": "string"
      },
      "eval": {
         "$ref": "#/$defs/Eval",
         "description": "The evaluation to apply to the data sources."
      },
      "config": {
         "$ref": "#/$defs/Config",
         "description": "Configuration details."
      },
      "rubric_paths": {
         "description": "Additional sources for rubrics. If a Path, should be a YAML file in the expected format.",
         "items": {
            "anyOf": [
               {
                  "format": "path",
                  "type": "string"
               },
               {
                  "$ref": "#/$defs/RubricsCollection"
               }
            ]
         },
         "title": "Rubric Paths",
         "type": "array"
      },
      "function_modules": {
         "description": "Additional sources for functions.",
         "items": {
            "anyOf": [
               {
                  "format": "file-path",
                  "type": "string"
               },
               {}
            ]
         },
         "title": "Function Modules",
         "type": "array"
      },
      "add_default_functions": {
         "default": true,
         "description": "If the default functions at :mod:`flexeval.configuration.function_metrics` should be made available.",
         "title": "Add Default Functions",
         "type": "boolean"
      }
   },
   "$defs": {
      "CompletionLlm": {
         "additionalProperties": false,
         "properties": {
            "function_name": {
               "description": "Completion function defined in `completion_functions.py` or available in the global namespace.",
               "title": "Function Name",
               "type": "string"
            },
            "include_system_prompt": {
               "default": true,
               "title": "Include System Prompt",
               "type": "boolean"
            },
            "kwargs": {
               "additionalProperties": true,
               "description": "Additional arguments that will be passed to the completion function. Must correspond to arguments in the named function.",
               "title": "Kwargs",
               "type": "object"
            }
         },
         "required": [
            "function_name"
         ],
         "title": "CompletionLlm",
         "type": "object"
      },
      "Config": {
         "properties": {
            "logs_path": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Log directory path.",
               "title": "Logs Path"
            },
            "env_filepath": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "A .env file to be processed by python-dotenv before running evals with this config.",
               "title": "Env Filepath"
            },
            "env": {
               "additionalProperties": true,
               "description": "Any additional environment variables.",
               "title": "Env",
               "type": "object"
            },
            "clear_tables": {
               "default": false,
               "description": "Clear any existing tables, if the output SQLite database already exists.",
               "title": "Clear Tables",
               "type": "boolean"
            },
            "max_workers": {
               "default": 1,
               "description": "Max worker count. Multiple threads will be used if set to > 1. This may have usage limit implications if you are calling APIs.",
               "title": "Max Workers",
               "type": "integer"
            },
            "random_seed_conversation_sampling": {
               "default": 42,
               "title": "Random Seed Conversation Sampling",
               "type": "integer"
            },
            "max_n_conversation_threads": {
               "default": 50,
               "title": "Max N Conversation Threads",
               "type": "integer"
            },
            "nb_evaluations_per_thread": {
               "default": 1,
               "title": "Nb Evaluations Per Thread",
               "type": "integer"
            },
            "raise_on_completion_error": {
               "default": false,
               "description": "If False (default), metrics will be run even if one or more completions fails.",
               "title": "Raise On Completion Error",
               "type": "boolean"
            },
            "raise_on_metric_error": {
               "default": false,
               "description": "If False (default), no exception will be thrown if a metric function raises an exception.",
               "title": "Raise On Metric Error",
               "type": "boolean"
            }
         },
         "title": "Config",
         "type": "object"
      },
      "DependsOnItem": {
         "additionalProperties": false,
         "properties": {
            "name": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Name of the dependency function or rubric.",
               "title": "Name"
            },
            "type": {
               "anyOf": [
                  {
                     "enum": [
                        "function",
                        "rubric"
                     ],
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "One of 'function' or 'rubric' indicating the type of the dependency.",
               "title": "Type"
            },
            "kwargs": {
               "anyOf": [
                  {
                     "additionalProperties": true,
                     "type": "object"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "The keyword arguments for the dependency. If provided, used to match which evaluation this dependency is for, so must match the keyword args given for some evaluation.",
               "title": "Kwargs"
            },
            "metric_name": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Name of the metric dependency. This may be different than function_name if the metric function returns a key/value pair - in which case, this will match the key.",
               "title": "Metric Name"
            },
            "metric_level": {
               "anyOf": [
                  {
                     "enum": [
                        "Message",
                        "Turn",
                        "Thread",
                        "ToolCall"
                     ],
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "The level of the metric to depend on, which must be equal to or 'greater' than the dependent metric's level. e.g. a Turn can depend on a Thread metric, but not the reverse.",
               "title": "Metric Level"
            },
            "relative_object_position": {
               "default": 0,
               "description": "The position of the object within the Thread. If 0 (default), this is the metric value for the current object. If -1, this is the metric value for the most recent object before this one.",
               "maximum": 0,
               "title": "Relative Object Position",
               "type": "integer"
            },
            "metric_min_value": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": -1.7976931348623157e+308,
               "description": "Minimum value of the dependency to consider it as satisfied.",
               "title": "Metric Min Value"
            },
            "metric_max_value": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": 1.7976931348623157e+308,
               "description": "Maximum value of the dependency to consider it as satisfied.",
               "title": "Metric Max Value"
            }
         },
         "title": "DependsOnItem",
         "type": "object"
      },
      "Eval": {
         "additionalProperties": true,
         "description": "Defines the evaluation that should be executed.\n\nThe key fields are :attr:`metrics` and :attr:`grader_llm`.",
         "properties": {
            "do_completion": {
               "default": false,
               "description": "Flag to determine if completions should be done in each thread. Set to 'true' if you are testing a new API and want to evaluate the API responses. Set to 'false' (default) if you are evaluating past conversations and do not need to generate new completions.",
               "title": "Do Completion",
               "type": "boolean"
            },
            "name": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Name of the test suite. Used as metadata only. Does not need to match the key of the entry in the evals.yaml file.",
               "title": "Name"
            },
            "notes": {
               "default": "",
               "description": "Additional notes regarding the configuration. Used as metadata only.",
               "title": "Notes",
               "type": "string"
            },
            "metrics": {
               "$ref": "#/$defs/Metrics",
               "description": "Metrics to use in the evaluation."
            },
            "completion_llm": {
               "anyOf": [
                  {
                     "$ref": "#/$defs/CompletionLlm"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Specification of the LLM or API used to perform new completions. Must be defined if `do_completions: true` is set."
            },
            "grader_llm": {
               "anyOf": [
                  {
                     "$ref": "#/$defs/GraderLlm"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Specification of the LLM or API used to grade rubrics. Must be defined if any rubric_metrics are specified."
            }
         },
         "title": "Eval",
         "type": "object"
      },
      "FileDataSource": {
         "description": "File to be used as a data source.",
         "properties": {
            "name": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "",
               "title": "Name"
            },
            "notes": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "",
               "title": "Notes"
            },
            "path": {
               "description": "Absolute or relative path to data file. Each file must be in jsonl format, with one conversation per line.",
               "format": "file-path",
               "title": "Path",
               "type": "string"
            },
            "format": {
               "const": "jsonl",
               "default": "jsonl",
               "description": "Format of the data file.",
               "title": "Format",
               "type": "string"
            }
         },
         "required": [
            "path"
         ],
         "title": "FileDataSource",
         "type": "object"
      },
      "FunctionItem": {
         "properties": {
            "name": {
               "description": "The function to call or name of rubric to use to compute this metric.",
               "title": "Name",
               "type": "string"
            },
            "depends_on": {
               "anyOf": [
                  {
                     "items": {
                        "$ref": "#/$defs/DependsOnItem"
                     },
                     "type": "array"
                  },
                  {
                     "type": "null"
                  }
               ],
               "description": "List of dependencies that must be satisfied for this metric to be computed.",
               "title": "Depends On"
            },
            "metric_level": {
               "anyOf": [
                  {
                     "enum": [
                        "Message",
                        "Turn",
                        "Thread",
                        "ToolCall"
                     ],
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "Turn",
               "description": "What level of granularity (ToolCall, Message, Turn, or Thread) this rubric should be applied to",
               "title": "Metric Level"
            },
            "kwargs": {
               "additionalProperties": true,
               "description": "Keyword arguments for the function. Each key must correspond to an argument in the function. Extra keys will cause an error.",
               "title": "Kwargs",
               "type": "object"
            }
         },
         "required": [
            "name"
         ],
         "title": "FunctionItem",
         "type": "object"
      },
      "GraderLlm": {
         "additionalProperties": false,
         "properties": {
            "function_name": {
               "description": "Function defined in `completion_functions.py`. We're not really completing a conversation, but we ARE asking an LLM to provide a response to an input - in this case, the rubric.",
               "title": "Function Name",
               "type": "string"
            },
            "kwargs": {
               "additionalProperties": true,
               "description": "Additional arguments that will be passed to the completion function. Must correspond to arguments in tne named function.",
               "title": "Kwargs",
               "type": "object"
            }
         },
         "required": [
            "function_name"
         ],
         "title": "GraderLlm",
         "type": "object"
      },
      "Metrics": {
         "description": "Defines the metrics to be evaluated.",
         "properties": {
            "function": {
               "anyOf": [
                  {
                     "items": {
                        "$ref": "#/$defs/FunctionItem"
                     },
                     "type": "array"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "List of function-based metrics to be evaluated.",
               "title": "Function"
            },
            "rubric": {
               "anyOf": [
                  {
                     "items": {
                        "$ref": "#/$defs/RubricItem"
                     },
                     "type": "array"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "List of rubrics to be evaluated.",
               "title": "Rubric"
            }
         },
         "title": "Metrics",
         "type": "object"
      },
      "Rubric": {
         "properties": {
            "prompt": {
               "description": "Prompt for the rubric.",
               "title": "Prompt",
               "type": "string"
            },
            "choice_scores": {
               "additionalProperties": {
                  "anyOf": [
                     {
                        "type": "integer"
                     },
                     {
                        "type": "number"
                     }
                  ]
               },
               "description": "Choices.",
               "title": "Choice Scores",
               "type": "object"
            },
            "name": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Optional name of the rubric.",
               "title": "Name"
            },
            "notes": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Optional notes.",
               "title": "Notes"
            }
         },
         "required": [
            "prompt"
         ],
         "title": "Rubric",
         "type": "object"
      },
      "RubricItem": {
         "properties": {
            "name": {
               "description": "The function to call or name of rubric to use to compute this metric.",
               "title": "Name",
               "type": "string"
            },
            "depends_on": {
               "anyOf": [
                  {
                     "items": {
                        "$ref": "#/$defs/DependsOnItem"
                     },
                     "type": "array"
                  },
                  {
                     "type": "null"
                  }
               ],
               "description": "List of dependencies that must be satisfied for this metric to be computed.",
               "title": "Depends On"
            },
            "metric_level": {
               "anyOf": [
                  {
                     "enum": [
                        "Message",
                        "Turn",
                        "Thread",
                        "ToolCall"
                     ],
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "Turn",
               "description": "What level of granularity (ToolCall, Message, Turn, or Thread) this rubric should be applied to",
               "title": "Metric Level"
            },
            "kwargs": {
               "anyOf": [
                  {
                     "additionalProperties": true,
                     "type": "object"
                  },
                  {
                     "type": "null"
                  }
               ],
               "description": "Keyword arguments for the rubric evaluation.",
               "title": "Kwargs"
            }
         },
         "required": [
            "name"
         ],
         "title": "RubricItem",
         "type": "object"
      },
      "RubricsCollection": {
         "description": "Collection of rubrics that can be used as :class:`~flexeval.schema.eval_schema.RubricItem`\\s.",
         "properties": {
            "rubrics": {
               "additionalProperties": {
                  "$ref": "#/$defs/Rubric"
               },
               "description": "Mapping of rubric names to Rubrics. The rubric names are used for matching metrics to specific rubrics.",
               "title": "Rubrics",
               "type": "object"
            }
         },
         "title": "RubricsCollection",
         "type": "object"
      }
   },
   "required": [
      "data_sources",
      "eval"
   ]
}

Fields:
field add_default_functions: bool = True#

If the default functions at flexeval.configuration.function_metrics should be made available.

field config: Config [Optional]#

Configuration details.

field data_sources: Annotated[list[FileDataSource], Len(min_length=1, max_length=None)] [Required]#

List of data sources.

Constraints:
  • min_length = 1

field database_path: Path = PosixPath('flexeval/results/results.db')#

Output database path.

field eval: Eval [Required]#

The evaluation to apply to the data sources.

field function_modules: list[Annotated[Path, PathType(path_type=file)] | FunctionsCollection | ModuleType] [Optional]#

Additional sources for functions.

field rubric_paths: list[Path | RubricsCollection] [Optional]#

Additional sources for rubrics. If a Path, should be a YAML file in the expected format.
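
A minimal sketch of constructing and evaluating an EvalRun, assuming run() accepts the EvalRun instance directly. The metric name "string_length" and the import locations of Eval, Metrics, and FunctionItem are assumptions; check flexeval.schema.eval_schema in your installed version for the exact names.

from flexeval.runner import run
from flexeval.schema.eval_schema import Eval, FunctionItem, Metrics
from flexeval.schema.evalrun_schema import EvalRun, FileDataSource

eval_run = EvalRun(
    # "conversations.jsonl" is a placeholder; the file must already exist.
    data_sources=[FileDataSource(path="conversations.jsonl")],
    eval=Eval(
        name="example eval",
        metrics=Metrics(
            # "string_length" stands in for any available function metric.
            function=[FunctionItem(name="string_length", metric_level="Turn")]
        ),
    ),
    database_path="results/results.db",  # coerced to a Path by Pydantic
)

run(eval_run)  # writes metric values to the SQLite database at database_path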

pydantic model flexeval.schema.evalrun_schema.FileDataSource[source]#

Bases: DataSource

File to be used as a data source.

Show JSON schema
{
   "title": "FileDataSource",
   "description": "File to be used as a data source.",
   "type": "object",
   "properties": {
      "name": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "",
         "title": "Name"
      },
      "notes": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "",
         "title": "Notes"
      },
      "path": {
         "description": "Absolute or relative path to data file. Each file must be in jsonl format, with one conversation per line.",
         "format": "file-path",
         "title": "Path",
         "type": "string"
      },
      "format": {
         "const": "jsonl",
         "default": "jsonl",
         "description": "Format of the data file.",
         "title": "Format",
         "type": "string"
      }
   },
   "required": [
      "path"
   ]
}

Fields:
field format: Literal['jsonl'] = 'jsonl'#

Format of the data file.

field path: Annotated[Path, PathType(path_type=file)] [Required]#

Absolute or relative path to data file. Each file must be in jsonl format, with one conversation per line.

Constraints:
  • path_type = file
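
Because path is declared with path_type=file, Pydantic rejects paths that do not point to an existing file at validation time. A minimal sketch; "conversations.jsonl" is a placeholder name:

from pathlib import Path

from flexeval.schema.evalrun_schema import FileDataSource

Path("conversations.jsonl").touch()  # the file must exist before the model validates

source = FileDataSource(
    name="demo transcripts",     # optional metadata
    path="conversations.jsonl",  # jsonl, one conversation per line
)
print(source.format)  # "jsonl" is the only accepted value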

pydantic model flexeval.schema.evalrun_schema.FunctionsCollection[source]#

Bases: BaseModel

Collection of functions that can be used as FunctionItems.

Show JSON schema
{
   "title": "FunctionsCollection",
   "type": "object",
   "properties": {
      "functions": {
         "default": null,
         "title": "Functions"
      }
   }
}

Fields:
field functions: list[Callable] [Optional]#

Callables that can be used as functions for evaluation.
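
A FunctionsCollection lets you pass plain Python callables to an EvalRun through function_modules, alongside file paths and imported modules. A minimal sketch; the signature expected of a metric callable is an assumption here, so treat word_count as a placeholder and compare against the defaults in flexeval.configuration.function_metrics:

from flexeval.schema.evalrun_schema import FunctionsCollection

def word_count(content: str) -> int:
    # Placeholder metric: counts whitespace-separated tokens.
    return len(content.split())

extra_functions = FunctionsCollection(functions=[word_count])

# Passed alongside any other sources when building the run:
# EvalRun(..., function_modules=[extra_functions])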

pydantic model flexeval.schema.evalrun_schema.IterableDataSource[source]#

Bases: DataSource

Not yet implemented.

Show JSON schema
{
   "title": "IterableDataSource",
   "description": "Not yet implemented.",
   "type": "object",
   "properties": {
      "name": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "",
         "title": "Name"
      },
      "notes": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "",
         "title": "Notes"
      },
      "contents": {
         "description": "Iterable of data items, presumably in the jsonl format (for now).",
         "items": {},
         "title": "Contents",
         "type": "array"
      }
   }
}

Fields:
field contents: Iterable [Optional]#

Iterable of data items, presumably in the jsonl format (for now).

flexeval.schema.evalrun_schema.get_default_function_metrics() list[Path | FunctionsCollection | ModuleType][source]#

Utility function to retrieve the default function collection.

flexeval.schema.evalrun_schema.get_default_rubrics() list[Path | RubricsCollection][source]#

Utility function to retrieve the default rubric collection.
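
These helpers return the built-in sources in the same shapes accepted by EvalRun.rubric_paths and EvalRun.function_modules, so they can be combined with custom sources. A minimal sketch; "my_rubrics.yaml" is a placeholder path:

from flexeval.schema.evalrun_schema import (
    get_default_function_metrics,
    get_default_rubrics,
)

rubric_sources = get_default_rubrics() + ["my_rubrics.yaml"]  # strings are coerced to Paths
function_sources = get_default_function_metrics()

# Then pass both lists when building the run:
# EvalRun(
#     ...,
#     rubric_paths=rubric_sources,
#     function_modules=function_sources,
#     add_default_functions=False,  # the defaults are already listed explicitly
# )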