# Rubric Guide
Note: this guide has not yet been updated for the current version of FlexEval. Its advice may still be useful.
FlexEval allows users to write their own rubrics to guide the grader LLM in grading conversational turns or entire conversations to approximate human judgment. To use this functionality, users need to:

1. Write rubrics in a YAML file (examples can be viewed in `rubric_metrics.yaml` in `flexeval.configuration`), or alternatively create a `RubricsCollection`.
2. Add that path or collection to `EvalRun`'s `rubric_paths`.
Here, we offer guidelines on writing and using rubrics in FlexEval.
## A General Template for FlexEval Rubrics
A rubric in FlexEval typically comprises two main sections: "prompt" and "choice_scores". In "prompt", users describe the task for the grader LLM, specify what data should be included, define the output format, and provide additional instructions, guidelines, or notes as needed. In "choice_scores", users provide a mapping of output choices (e.g., "YES", "NO") to their scores (e.g., 1, 0). These scores are logged as metrics in the evaluation results. Below is a typical rubric template used in FlexEval:
```yaml
The name of your rubric:
  prompt: |-
    Your Role:
    Describe the role for your grader LLM (e.g., "You are a helpful assistant with solid knowledge of K-12 math instruction.").

    Your Task:
    Define the grader LLM's task (e.g., "Check the student message and decide whether the student is asking for a plot or not.").

    Data:
    Give some background information about the data first (e.g., "The data includes a student message extracted from a conversation between the student and a tutor..."). Then specify what you want to include as data using key words (wrapped in curly braces, e.g., {turn}) in the data block marked by [BEGIN DATA] and [END DATA]. For more details about these key words, see "Key Words in Data Block" below.

    [BEGIN DATA]
    ***
    {turn}
    ***
    [END DATA]

    __start rubric__
    Provide the details of the rubric for the classification task here.
    __end rubric__

    Output:
    Specify what to output and what the output should look like. For example, you can follow the approach of "chain-of-thought then classify" (*cot classify* for short, e.g., "First, report your reasoning for your decision. Second, print your decision.").
  choice_scores:
    # Provide a mapping of output choices to their corresponding scores, such as:
    "YES": 1
    "NO": 0
```
## Key Words in Data Block
Key words in the data block, as demonstrated in the template above, are used to populate rubrics with the corresponding data. They are wrapped in curly braces (e.g., {conversation}, {turn}) to denote what should be included in the data, and they are filled in automatically when the evals are run. There are four key words you can use (see the sketch after this list):

- `{conversation}`: the whole conversation, which may contain multiple conversational turns, including the previous and current entries
- `{context}`: the previous entries, which serve as contextual information for the current entry
- `{turn}`: the current entry only
- `{completion}`: a new completion generated by an LLM
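For instance, a rubric that needs the surrounding conversation as background could combine {context} and {turn} in its data block. The labels inside the block below are illustrative, not required by FlexEval; only the curly-brace key words themselves are meaningful:

```yaml
# Hypothetical excerpt of a rubric prompt combining {context} and {turn}
prompt: |-
  ...
  [BEGIN DATA]
  ***
  Previous entries: {context}
  Current entry: {turn}
  ***
  [END DATA]
  ...
```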
Notes:

- {conversation} and {context} do not appear in the same data block, because {conversation} already includes {context}.
- {turn} and {completion} do not appear in the same data block, because you either evaluate the current entry from the input data or evaluate the completion generated by an LLM based on the input data.
- When {completion} is used, the `do_completion` parameter should be "True" in the test specification for the rubric metric in `evals.yaml`.
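As a sketch, a test specification for a {completion}-based rubric metric might look like the following in configuration/evals.yaml. The metric name is invented, and the `type: rubric` line and the exact placement of `do_completion` are assumptions extrapolated from the examples in this guide, so check them against your FlexEval version:

```yaml
# Hypothetical test specification for a rubric metric that grades {completion}
- name: is_helpful_completion   # invented rubric whose prompt contains {completion}
  type: rubric                  # assumed by analogy with "type: function" below
  do_completion: True           # required because the rubric grades an LLM-generated completion
```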
## Parameters for Rubric Metric in Test Specification
In configuration/evals.yaml, you can use two parameters to specify how a rubric metric should be run: "metric_level" and "depends_on".

- `metric_level`: determines the level (e.g., turn, message, toolcall) at which the rubric metric is run.
- `depends_on`: describes a condition that must hold for the rubric metric to be run. In the following example, the depends_on parameter specifies that the rubric metric "is_pedagogically_appropriate_plot" is only run when the result of the function metric "is_role" is "assistant".
```yaml
- name: is_pedagogically_appropriate_plot
  depends_on:
    - name: is_role
      type: function
      kwargs:
        role: assistant
      metric_name: assistant
      metric_min_value: 1
```
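And a metric_level sketch in the same style; "message" is one of the levels mentioned above, and which level fits your rubric depends on your data:

```yaml
# Hypothetical rubric metric graded per message rather than per turn
- name: is_pedagogically_appropriate_plot
  metric_level: message   # other levels mentioned above: turn, toolcall
```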