flexeval.configuration.function_metrics#
Built-in function metrics that can be used in any configuration.
Functions
- constant: Returns a constant value.
- count_emojis: Calculate the number of emojis in a given text string.
- count_errors: If a Thread, counts the errors of each type in the thread.
- count_llm_models: Provides a count of messages in the thread produced by each LLM model.
- count_messages: Calculate the number of conversational messages in the given Thread or Turn.
- count_messages_per_role: Calculate the number of conversational messages for each role.
- count_numeric_tool_call_params_by_name: Extracts the values of all numeric ToolCall parameter inputs, with metric_name being the name of the corresponding parameter.
- count_of_parts_matching_regex: Determines the total number of messages in this object matching a regular expression specified by the user.
- count_tokens: Counts how many prompt_tokens and completion_tokens are used.
- count_tool_calls: Provides a count of how many total tool calls there are in this Thread/Turn/Message.
- count_tool_calls_by_name: Counts how many times a ToolCall was used to call functions, with metric names equal to function names.
- count_turns: Calculate the number of conversational turns in a thread.
- flesch_kincaid_grade: Calculate the Flesch-Kincaid Grade Level score for a given text string.
- flesch_reading_ease: Calculate the Flesch Reading Ease score for a given text string.
- identity: Returns a string representation of the object.
- is_langgraph_type: Returns 1 if the langgraph type for this Message matches the passed-in type, and 0 otherwise.
- is_last_turn_in_thread: Returns 1 if this turn is the final turn in its thread, and 0 otherwise.
- is_role: Returns 1 if the role for this Turn or Message matches the passed-in role, and 0 otherwise.
- latency: Returns the estimated time, in seconds, that it took for the Thread/Turn/Message to be generated.
- message_matches_regex: Determines whether a message matches a regular expression specified by the user.
- openai_moderation_api: Calls the OpenAI Moderation API to analyze the given conversational turn for content moderation.
- process_conversation: Process an entire conversation and return the desired output.
- process_single_message: Process a single conversational message and return the desired output.
- string_length: Calculate the length of the content.
- tool_was_called: Returns 1 if a tool was called, and 0 otherwise.
- value_counts_by_tool_name: Counts the occurrences of particular values in the text content of tool calls in the conversation.
- flexeval.configuration.function_metrics.constant(object: Thread | Turn | Message | ToolCall, **kwargs) int | float [source]#
Returns a constant value.
- flexeval.configuration.function_metrics.count_emojis(turn: str) int [source]#
Calculate the number of emojis in a given text string.
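The actual implementation is not shown here, but one common approach uses the third-party emoji package (a sketch only; the metric may count differently)::

    import emoji

    # emoji_count returns the number of emoji characters in a string.
    print(emoji.emoji_count("Nice work! 🎉🎉"))  # 2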
- flexeval.configuration.function_metrics.count_errors(object: Thread | Turn | Message | ToolCall) dict [source]#
If a Thread, counts the errors of each type in the thread; likewise for a Turn, Message, or ToolCall. It does this by iterating through ToolCalls and identifying whether there are entries like "*_errors" in tool_call.additional_kwargs. If a ToolCall, returns 1 for each error type present. Example output: {"python_errors": 3, "javascript_errors": 1}.
- flexeval.configuration.function_metrics.count_llm_models(thread: Thread) dict [source]#
Provides a count of messages in the thread produced by each LLM model. Useful for quantifying which LLM generated the results; agents can use more than one model.
- flexeval.configuration.function_metrics.count_messages(object: Thread | Turn) int [source]#
Calculate the number of conversational messages in the given Thread or Turn. Excludes any system messages. A message is counted even if the content for that action was blank (e.g., a blank message associated with a tool call).
- Parameters:
object (Thread or Turn)
- Returns:
Count of messages.
- Return type:
int
- flexeval.configuration.function_metrics.count_messages_per_role(object: Thread | Turn, use_langgraph_roles=False) list [source]#
Calculate the number of conversational messages for each role. Excludes the system prompt. A message is counted even if the content for that action was blank (e.g., a blank message associated with a tool call).
- Parameters:
object (Thread or Turn)
- Returns:
A dictionary with roles as keys and message counts as values
- Return type:
list
- flexeval.configuration.function_metrics.count_numeric_tool_call_params_by_name(toolcall: ToolCall) list[dict] [source]#
Extracts the values of all numeric ToolCall parameter inputs, with metric_name being the name of the corresponding parameter.
- flexeval.configuration.function_metrics.count_of_parts_matching_regex(object: Thread | Turn | Message, expression: str) int [source]#
Determines the total number of messages in this object matching a regular expression specified by the user. Ignores tool calls in the object. Outputs the sum of the number of matches detected using Pattern.findall() across all entries in the object.
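The counting logic can be pictured with plain re; this sketch sums Pattern.findall() matches over a hypothetical list of message contents::

    import re

    def total_regex_matches(contents: list[str], expression: str) -> int:
        pattern = re.compile(expression)
        # Sum the number of matches across every message's text content.
        return sum(len(pattern.findall(text)) for text in contents)

    # Three word-initial capitals across two messages.
    assert total_regex_matches(["Hello World", "Goodbye"], r"\b[A-Z]") == 3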
- flexeval.configuration.function_metrics.count_tokens(object: Thread | Turn | Message) dict [source]#
Counts how many prompt_tokens and completion_tokens are used.
These values are recorded at the Message level, so this function sums over messages if the input type is Thread or Turn.
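As a sketch of that summation (message dicts carrying usage counts are an assumption for illustration)::

    def count_tokens_sketch(messages: list[dict]) -> dict:
        totals = {"prompt_tokens": 0, "completion_tokens": 0}
        for msg in messages:
            for key in totals:
                totals[key] += msg.get(key, 0)
        return totals

    print(count_tokens_sketch([
        {"prompt_tokens": 12, "completion_tokens": 30},
        {"prompt_tokens": 40, "completion_tokens": 8},
    ]))  # {'prompt_tokens': 52, 'completion_tokens': 38}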
- flexeval.configuration.function_metrics.count_tool_calls(object: Thread | Turn | Message) dict [source]#
Provides a count of how many total tool calls there are in this Thread/Turn/Message. Differs from count_tool_calls_by_name because it does not return the names of the tool calls.
- flexeval.configuration.function_metrics.count_tool_calls_by_name(object: Thread | Turn | Message | ToolCall) dict [source]#
Counts how many times a ToolCall was used to call functions, with metric names equal to function names.
NOTE: This function provides an example of how to go from higher levels of granularity (e.g., Thread) to lower levels of granularity (e.g., ToolCall).
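That drill-down pattern can be sketched with a Counter; the nested-list shape standing in for Thread/Turn/ToolCall granularity is an assumption for illustration::

    from collections import Counter

    def tool_call_counts(thread: list[list[str]]) -> dict:
        # Flatten turns into individual tool-call names, then count per name.
        return dict(Counter(name for turn in thread for name in turn))

    print(tool_call_counts([["search", "search"], ["calculator"]]))
    # {'search': 2, 'calculator': 1}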
- flexeval.configuration.function_metrics.count_turns(object: Thread) int [source]#
Calculate the number of conversational turns in a thread.
- Parameters:
object (Thread)
- Returns:
Count of turns.
- Return type:
int
- flexeval.configuration.function_metrics.flesch_kincaid_grade(turn: str) float [source]#
Calculate the Flesch-Kincaid Grade Level score for a given text string.
The Flesch-Kincaid Grade Level score is a readability test designed to indicate the U.S. school grade level of the text. Higher scores indicate material that is more difficult to read and understand, suitable for higher grade levels.
- flexeval.configuration.function_metrics.flesch_reading_ease(turn: str) float [source]#
Calculate the Flesch Reading Ease score for a given text string.
The Flesch Reading Ease score is a readability test designed to indicate how difficult a passage in English is to understand. Higher scores indicate material that is easier to read; lower scores indicate material that is more difficult to read.
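Both scores rest on the classic Flesch formulas. The sketch below uses a naive vowel-group syllable heuristic; the library may instead rely on a dedicated readability package::

    import re

    def _syllables(word: str) -> int:
        # Rough heuristic: count groups of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_scores(text: str) -> tuple[float, float]:
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(_syllables(w) for w in words)
        wps = len(words) / sentences   # words per sentence
        spw = syllables / len(words)   # syllables per word
        reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
        grade_level = 0.39 * wps + 11.8 * spw - 15.59
        return reading_ease, grade_level

    print(flesch_scores("The cat sat on the mat. It purred."))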
- flexeval.configuration.function_metrics.identity(object: Thread | Turn | Message | ToolCall, **kwargs) dict [source]#
Returns a string representation of the object.
- flexeval.configuration.function_metrics.is_langgraph_type(object: Message, type: str) dict [source]#
Returns 1 if the langgraph type for this Message matches the passed-in type, and 0 otherwise.
Args:
object: the Message
type: a string with the desired type to check against
- flexeval.configuration.function_metrics.is_last_turn_in_thread(turn: Turn) int [source]#
Returns 1 if this turn is the final turn in its thread, and 0 otherwise.
- Parameters:
turn – turn to evaluate
- Returns:
1 for this being the temporally last turn in the thread, 0 otherwise
- Return type:
int
- flexeval.configuration.function_metrics.is_role(object: Turn | Message, role: str) dict [source]#
Returns 1 if the role for this Turn or Message matches the passed-in role, and 0 otherwise.
Args:
object: the Turn or Message
role: a string with the desired role to check against
- flexeval.configuration.function_metrics.latency(object: Thread | Turn | Message) float [source]#
Returns the estimated time, in seconds, that it took for the Thread/Turn/Message to be generated.
For Turns and Messages, this is done by comparing the timestamp of the Turn/Message, which indicates when that Turn/Message was produced, to the timestamp of the previous Turn/Message. For example, if a Message is generated at 1:27.3 and the previous message was generated at 1:23.1, the latency is 4.2 seconds.
For Threads, the latency is calculated as the time difference, again in seconds, between the first and last message.
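A worked example of the timestamp arithmetic::

    from datetime import datetime

    # Message generated at 1:27.3; the previous one at 1:23.1.
    previous = datetime(2024, 1, 1, 0, 1, 23, 100_000)
    current = datetime(2024, 1, 1, 0, 1, 27, 300_000)
    print((current - previous).total_seconds())  # 4.2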
- flexeval.configuration.function_metrics.message_matches_regex(message: Message, expression: str) dict [source]#
Determines whether a message matches a regular expression specified by the user.
Outputs the number of matches detected using Pattern.findall().
- flexeval.configuration.function_metrics.openai_moderation_api(turn: str, **kwargs) dict [source]#
Calls the OpenAI Moderation API to analyze the given conversational turn for content moderation. Since the input is a string, all of the message "content" is concatenated together and passed in.
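A minimal sketch of the underlying call, assuming the current openai Python client (the metric itself handles the concatenation)::

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.moderations.create(input="concatenated turn content")
    result = response.results[0]
    print(result.flagged)          # overall True/False flag
    print(result.category_scores)  # per-category scores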
- flexeval.configuration.function_metrics.process_conversation(conversation: list) int | float | dict[str, int | float] | list[dict[str, int | float]] [source]#
Process an entire conversation and return the desired output.
Args:
conversation (list): an entire conversation as a list
NOTE: Metrics that take a list as input are valid at the Thread and Turn levels.
Returns:
An integer (e.g., 2), a floating point number (e.g., 2.8), a dictionary of metric/value pairs (e.g., {'metric1': value1, 'metric2': value2}), or a list of dictionaries whose keys can be either 'role' or 'metric' (e.g., [{"role": role1, "value": value1}, {"role": role2, "value": value2}, ...]).
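A minimal custom metric following this signature; message dicts with 'role' keys are an assumption for illustration::

    def assistant_message_share(conversation: list) -> list[dict]:
        # Fraction of messages produced by the assistant role.
        total = len(conversation) or 1
        assistant = sum(1 for m in conversation if m.get("role") == "assistant")
        return [{"role": "assistant", "value": assistant / total}]

    convo = [{"role": "user", "content": "hi"},
             {"role": "assistant", "content": "hello!"}]
    print(assistant_message_share(convo))  # [{'role': 'assistant', 'value': 0.5}]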
- flexeval.configuration.function_metrics.process_single_message(message: str) int | float | dict[str, int | float] [source]#
Process a single conversational message and return the desired output.
Args:
message (str): a single conversational message as a string
NOTE: Metrics that take a string as input are valid at the Turn and Message levels.
Returns:
An integer (e.g., 2), a floating point number (e.g., 2.8), or a dictionary of metric/value pairs (e.g., {'metric1': value1, 'metric2': value2}).
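A minimal custom metric following this signature::

    def word_count(message: str) -> dict:
        # One metric/value pair computed from the message text.
        return {"word_count": len(message.split())}

    print(word_count("How many words is this?"))  # {'word_count': 5}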
- flexeval.configuration.function_metrics.string_length(object: Thread | Turn | Message) int [source]#
Calculate the length of the content.
- flexeval.configuration.function_metrics.tool_was_called(object: Thread | Turn | Message) float [source]#
Returns 1 if a tool was called, and 0 otherwise.
- flexeval.configuration.function_metrics.value_counts_by_tool_name(turn: list, json_key: str) dict [source]#
Counts the occurrences of particular values in the text content of tool calls in the conversation. Assumes the role will be tool, and that kwargs contains the argument json_key. Values associated with that json_key for a specific tool name are aggregated separately, with counts.
- Parameters:
turn (List[Dict[str, Any]]) – A list of dictionaries representing conversational turns. Each dictionary should have a 'role' key indicating the role of the participant.
json_key – string that represents the key to look for in the content of the tool call text
- Returns:
A list of name/value pairs for each parameter and function-name combination
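An illustrative sketch under stated assumptions: tool messages are dicts with a 'name' and JSON text 'content', and the metric tallies, per tool, how often each value appears under json_key::

    import json
    from collections import Counter, defaultdict

    def value_counts_sketch(turn: list[dict], json_key: str) -> dict:
        counts = defaultdict(Counter)
        for msg in turn:
            if msg.get("role") != "tool":
                continue
            payload = json.loads(msg["content"])
            if json_key in payload:
                counts[msg["name"]][payload[json_key]] += 1
        return {tool: dict(c) for tool, c in counts.items()}

    turn = [
        {"role": "tool", "name": "search", "content": '{"status": "ok"}'},
        {"role": "tool", "name": "search", "content": '{"status": "ok"}'},
        {"role": "tool", "name": "fetch", "content": '{"status": "error"}'},
    ]
    print(value_counts_sketch(turn, "status"))
    # {'search': {'ok': 2}, 'fetch': {'error': 1}}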