flexeval.configuration.function_metrics#

Built-in function metrics that can be used in any configuration.

See add_default_functions.

Functions

constant(object, **kwargs)

Returns a constant value.

count_emojis(turn)

Calculate the number of emojis in a given text string.

count_errors(object)

If a Thread, counts the errors of each type in the thread.

count_llm_models(thread)

Provides a count of messages in the thread produced by each LLM model.

count_messages(object)

Calculate the number of conversational messages in the given Thread or Turn.

count_messages_per_role(object[, ...])

Calculate the number of conversational messages for each role.

count_numeric_tool_call_params_by_name(toolcall)

Extracts the values of all numeric ToolCall parameter inputs, with metric_name being the name of the corresponding parameter.

count_of_parts_matching_regex(object, expression)

Determines the total number of messages in this object matching a user-specified regular expression.

count_tokens(object)

Counts how many prompt_tokens and completion_tokens are used.

count_tool_calls(object)

Provides a count of how many total tool calls there are in this Thread/Turn/Message.

count_tool_calls_by_name(object)

Counts how many times a ToolCall was used to call functions, with metric names equal to function names.

count_turns(object)

Calculate the number of conversational turns in a thread.

flesch_kincaid_grade(turn)

Calculate the Flesch-Kincaid Grade Level score for a given text string.

flesch_reading_ease(turn)

Calculate the Flesch Reading Ease score for a given text string.

identity(object, **kwargs)

Returns a string representation of the object.

index_in_thread(object)

is_langgraph_type(object, type)

Returns 1 if the langgraph type for this Message matches the passed-in type, and 0 otherwise.

is_last_turn_in_thread(turn)

Returns 1 if this turn is the final turn in its thread, and 0 otherwise.

is_role(object, role)

Returns 1 if the role for this Turn or Message matches the passed-in role, and 0 otherwise.

latency(object)

Returns the estimated time, in seconds, that it took for the Thread/Turn/Message to be generated.

message_matches_regex(message, expression)

Determines whether a message matches a regular expression specified by the user.

openai_moderation_api(turn, **kwargs)

Calls the OpenAI Moderation API to analyze the given conversational turn for content moderation.

process_conversation(conversation)

Process an entire conversation and return the desired output.

process_single_message(message)

Process a single conversational message and return the desired output.

string_length(object)

Calculate the length of the content.

tool_was_called(object)

Returns 1 if a tool was called, and 0 otherwise.

value_counts_by_tool_name(turn, json_key)

Counts the occurrences of particular values in the text content of tool calls in the conversation.

flexeval.configuration.function_metrics.constant(object: Thread | Turn | Message | ToolCall, **kwargs) int | float[source]#

Returns a constant value.

Parameters:
  • object (Union[Thread, Turn, Message, ToolCall]) – Accepts (and ignores) any type of object.

  • response (float | int) – If provided in the kwargs, return response. Otherwise, return 0.

Returns:

The specified response, or 0.

Return type:

int | float
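
A minimal sketch of the documented behavior (illustrative, not the library's actual source): the object is ignored and the response keyword, if supplied, is returned.

    def constant_sketch(object, **kwargs):
        # Ignore the object entirely; return the "response" kwarg, defaulting to 0.
        return kwargs.get("response", 0)

    assert constant_sketch("any object", response=2.5) == 2.5
    assert constant_sketch("any object") == 0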

flexeval.configuration.function_metrics.count_emojis(turn: str) int[source]#

Calculate the number of emojis in a given text string.

Parameters:

turn (str) – The input text string to be evaluated.

Returns:

The number of emojis in the input text.

Return type:

int
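
A plausible implementation sketch using the third-party emoji package (an assumption; the actual backing library is not documented here):

    import emoji  # third-party: pip install emoji

    def count_emojis_sketch(turn: str) -> int:
        # emoji_count returns the number of emoji occurrences in the string
        return emoji.emoji_count(turn)

    assert count_emojis_sketch("Great job! 🎉🎉") == 2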

flexeval.configuration.function_metrics.count_errors(object: Thread | Turn | Message | ToolCall) dict[source]#

Counts the errors of each type in the given Thread, Turn, Message, or ToolCall.

It does this by iterating through ToolCalls and identifying whether there are entries like "*_errors" in tool_call.additional_kwargs.

For a single ToolCall, each error type present is counted once. The result is a dictionary mapping error types to counts, e.g.:

{"python_errors": 3, "javascript_errors": 1}
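
A sketch of the counting logic described above, assuming a hypothetical iterable of tool calls whose items carry an additional_kwargs dict (attribute names are illustrative):

    from collections import Counter

    def count_errors_sketch(tool_calls) -> dict:
        counts = Counter()
        for tc in tool_calls:
            # Keys like "python_errors" in additional_kwargs mark error types.
            for key in tc.additional_kwargs:
                if key.endswith("_errors"):
                    counts[key] += 1
        return dict(counts)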

flexeval.configuration.function_metrics.count_llm_models(thread: Thread) dict[source]#

Provides a count of messages in the thread produced by each LLM model. Useful for quantifying which LLM generated the results, since agents can use more than one model.

flexeval.configuration.function_metrics.count_messages(object: Thread | Turn) int[source]#

Calculate the number of conversational messages in the given Thread or Turn. Excludes any system messages. A message is counted even if the content for that action was blank (e.g., a blank message associated with a tool call).

Parameters:

object (Thread or Turn)

Returns:

Count of messages.

Return type:

int

flexeval.configuration.function_metrics.count_messages_per_role(object: Thread | Turn, use_langgraph_roles=False) list[source]#

Calculate the number of conversational messages for each role. Excludes the system prompt. A message is counted even if the content for that action was blank (e.g., a blank message associated with a tool call).

Parameters:

object (Thread or Turn)

Returns:

A dictionary with roles as keys and message counts as values.

Return type:

dict
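
A minimal sketch of the tallying with collections.Counter, assuming messages expose a role attribute (an illustrative object model):

    from collections import Counter

    def count_messages_per_role_sketch(messages) -> dict:
        # Tally non-system messages by role; blank-content messages still count.
        return dict(Counter(m.role for m in messages if m.role != "system"))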

flexeval.configuration.function_metrics.count_numeric_tool_call_params_by_name(toolcall: ToolCall) list[dict][source]#

Extracts the values of all numeric ToolCall parameter inputs, with metric_name being the name of the corresponding parameter.

Parameters:

toolcall (ToolCall) – The tool call.

Returns:

List of key -> numeric value pairs in the tool call.

Return type:

list[dict]

flexeval.configuration.function_metrics.count_of_parts_matching_regex(object: Thread | Turn | Message, expression: str) int[source]#

Determines the total number of messages in this object matching a regular expression specified by the user. Ignores tool calls in the object.

Outputs the sum of the number of matches detected using Pattern.findall() across all entries in the object.
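
The summing behavior can be sketched with re.findall, assuming a way to iterate the text content of each part (the input shape here is hypothetical):

    import re

    def count_parts_matching_sketch(contents: list[str], expression: str) -> int:
        pattern = re.compile(expression)
        # Sum the match counts across all text parts.
        return sum(len(pattern.findall(text)) for text in contents)

    assert count_parts_matching_sketch(["ab ab", "ab"], r"ab") == 3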

flexeval.configuration.function_metrics.count_tokens(object: Thread | Turn | Message) dict[source]#

Counts how many prompt_tokens and completion_tokens are used.

These values are recorded at the Message level, so this function sums over messages if the input type is Thread or Turn.
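
A sketch of the summation, assuming each message records prompt_tokens and completion_tokens attributes (names taken from the description above, not confirmed against the library):

    def count_tokens_sketch(messages) -> dict:
        totals = {"prompt_tokens": 0, "completion_tokens": 0}
        for m in messages:
            # Token usage is recorded per Message; sum across the collection.
            totals["prompt_tokens"] += m.prompt_tokens
            totals["completion_tokens"] += m.completion_tokens
        return totals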

flexeval.configuration.function_metrics.count_tool_calls(object: Thread | Turn | Message) dict[source]#

Provides a count of how many total tool calls there are in this Thread/Turn/Message. Differs from count_tool_calls_by_name because it does not return the names of the tool calls.

flexeval.configuration.function_metrics.count_tool_calls_by_name(object: Thread | Turn | Message | ToolCall) dict[source]#

Counts how many times a ToolCall was used to call functions, with metric names equal to function names.

NOTE: This function provides an example of how to go from higher levels of granularity (e.g., Thread) to lower levels of granularity (e.g., ToolCall).
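
A sketch of that descent through the granularity levels (attribute names here are illustrative, not the library's actual API):

    from collections import Counter

    def count_tool_calls_by_name_sketch(thread) -> dict:
        # Walk Thread -> Turn -> Message -> ToolCall and tally by function name.
        counts = Counter()
        for turn in thread.turns:
            for message in turn.messages:
                for tool_call in message.tool_calls:
                    counts[tool_call.name] += 1
        return dict(counts)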

flexeval.configuration.function_metrics.count_turns(object: Thread) int[source]#

Calculate the number of conversational turns in a thread.

Parameters:

object (Thread)

Returns:

Count of turns.

Return type:

int

flexeval.configuration.function_metrics.flesch_kincaid_grade(turn: str) float[source]#

Calculate the Flesch-Kincaid Grade Level score for a given text string.

The Flesch-Kincaid Grade Level score is a readability test designed to indicate the U.S. school grade level of the text. Higher scores indicate material that is more difficult to read and understand, suitable for higher grade levels.

Parameters:

turn (str) – The input text string to be evaluated.

Returns:

The Flesch-Kincaid Grade Level score of the input text.

Return type:

float

flexeval.configuration.function_metrics.flesch_reading_ease(turn: str) float[source]#

Calculate the Flesch Reading Ease score for a given text string.

The Flesch Reading Ease score is a readability test designed to indicate how difficult a passage in English is to understand. Higher scores indicate material that is easier to read; lower scores indicate material that is more difficult to read.

Parameters:

turn (str) – The input text string to be evaluated.

Returns:

The Flesch Reading Ease score of the input text.

Return type:

float
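
Both readability scores can be computed with the third-party textstat package (an assumption about the backing implementation):

    import textstat  # third-party: pip install textstat

    text = "The quick brown fox jumps over the lazy dog."
    print(textstat.flesch_kincaid_grade(text))  # U.S. school grade level
    print(textstat.flesch_reading_ease(text))   # higher score = easier to read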

flexeval.configuration.function_metrics.identity(object: Thread | Turn | Message | ToolCall, **kwargs) dict[source]#

Returns a string representation of the object.

Parameters:

object (Union[Thread, Turn, Message, ToolCall]) – Accepts any type of object.

Returns:

A dict.

Return type:

dict

flexeval.configuration.function_metrics.index_in_thread(object: Turn | Message) int[source]#
flexeval.configuration.function_metrics.is_langgraph_type(object: Message, type: str) dict[source]#

Returns 1 if the langgraph type for this Message matches the passed-in type, and 0 otherwise.

Parameters:
  • object – the Message

  • type – a string with the desired type to check against

flexeval.configuration.function_metrics.is_last_turn_in_thread(turn: Turn) int[source]#

Returns 1 if this turn is the final turn in its thread, and 0 otherwise.

Parameters:

turn – turn to evaluate

Returns:

1 if this is the temporally last turn in the thread, 0 otherwise

Return type:

int

flexeval.configuration.function_metrics.is_role(object: Turn | Message, role: str) dict[source]#

Returns 1 if the role for this Turn or Message matches the passed-in role, and 0 otherwise.

Parameters:
  • object – the Turn or Message

  • role – a string with the desired role to check against

flexeval.configuration.function_metrics.latency(object: Thread | Turn | Message) float[source]#

Returns the estimated time, in seconds, that it took for the Thread/Turn/Message to be generated.

For Turns and Messages, this is done by comparing the timestamp of the Turn/Message, which indicates its output time, to the timestamp of the previous Turn/Message. For example, if a Message is generated at 1:27.3 and the previous message was generated at 1:23.1, the latency was 4.2 seconds.

For Threads, the latency is calculated as the time difference, again in seconds, between the first and last message.
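
The timestamp arithmetic can be sketched as follows, assuming datetime timestamps on adjacent Turns/Messages:

    from datetime import datetime

    def latency_sketch(prev_ts: datetime, curr_ts: datetime) -> float:
        # Latency is the gap between this item and the previous one.
        return (curr_ts - prev_ts).total_seconds()

    t0 = datetime.fromisoformat("2024-01-01 00:01:23.100")
    t1 = datetime.fromisoformat("2024-01-01 00:01:27.300")
    assert round(latency_sketch(t0, t1), 1) == 4.2  # matches the example above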

flexeval.configuration.function_metrics.message_matches_regex(message: Message, expression: str) dict[source]#

Determines whether a message matches a regular expression specified by the user.

Outputs the number of matches detected using Pattern.findall().

flexeval.configuration.function_metrics.openai_moderation_api(turn: str, **kwargs) dict[source]#

Calls the OpenAI Moderation API to analyze the given conversational turn for content moderation. Since the input is a string, all of the "content" fields are concatenated together and passed in.

Parameters:
  • turn (str) – The conversational turn to be analyzed.

  • **kwargs (Any) – Ignored for now

Returns:

A dictionary of category scores from the moderation API response.

Return type:

Dict[str, float]
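
A hedged sketch of the underlying call using the openai Python client (the exact client usage inside flexeval is an assumption):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.moderations.create(input="concatenated turn content here")
    # category_scores holds per-category floats (e.g., harassment, violence)
    print(response.results[0].category_scores)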

flexeval.configuration.function_metrics.process_conversation(conversation: list) int | float | dict[str, int | float] | list[dict[str, int | float]][source]#

Process an entire conversation and return the desired output.

Parameters:

conversation (list) – an entire conversation as a list

NOTE: Metrics that take a list as input are valid at the Thread and Turn levels.

Returns:

An integer (e.g., 2), a floating point number (e.g., 2.8), a dictionary of metric/value pairs (e.g., {'metric1': value1, 'metric2': value2}), or a list of dictionaries whose first key can be either 'role' or 'metric' (e.g., [{"role": role1, "value": value1}, {"role": role2, "value": value2}, ...]).
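
A minimal example of a custom metric satisfying this contract (purely illustrative):

    def process_conversation(conversation: list) -> int:
        # Example: return the count of non-empty entries as a single integer.
        return sum(1 for entry in conversation if entry)

    assert process_conversation(["hi", "", "hello"]) == 2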

flexeval.configuration.function_metrics.process_single_message(message: str) int | float | dict[str, int | float][source]#

Process a single conversational message and return the desired output.

Parameters:

message (str) – a single conversational message as a string

NOTE: Metrics that take a string as input are valid at the Turn and Message levels.

Returns:

An integer (e.g., 2), a floating point number (e.g., 2.8), or a dictionary of metric/value pairs (e.g., {'metric1': value1, 'metric2': value2}).

flexeval.configuration.function_metrics.string_length(object: Thread | Turn | Message) int[source]#

Calculate the length of the content.

Parameters:

object (Union[Thread, Turn, Message])

Returns:

The length of the content of the messages (summed for a Thread or Turn that may contain more than one message)

Return type:

int

flexeval.configuration.function_metrics.tool_was_called(object: Thread | Turn | Message) float[source]#

Returns 1 if a tool was called, and 0 otherwise.

flexeval.configuration.function_metrics.value_counts_by_tool_name(turn: list, json_key: str) dict[source]#

Counts the occurrences of particular values in the text content of tool calls in the conversation. Assumes the role will be tool, and that kwargs contains the argument json_key. Values associated with that json_key for a specific tool name are aggregated separately, with counts.

Parameters:
  • turn (list) – A list of dictionaries representing conversational turns. Each dictionary should have a 'role' key indicating the role of the participant.

  • json_key – string that represents the key to look for in the content of the tool call text

Returns:

Name/value counts for each parameter and function name combination.
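
A sketch of the aggregation, assuming (tool name, JSON text content) pairs for the tool-role messages (the input shape is illustrative):

    import json
    from collections import Counter, defaultdict

    def value_counts_sketch(tool_messages, json_key: str) -> dict:
        counts = defaultdict(Counter)
        for name, content in tool_messages:
            # Parse the JSON text content and tally the value under json_key.
            value = json.loads(content).get(json_key)
            if value is not None:
                counts[name][str(value)] += 1
        return {name: dict(c) for name, c in counts.items()}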