# Rubric Guide
Note: this guide has not yet been updated for the current version of FlexEval. Its advice may still be useful.
FlexEval allows users to write their own rubrics to guide the grader LLM in grading conversational turns or entire conversations to approximate human judgment. To use this functionality, users need to:

1. Write rubrics in a YAML file (examples can be viewed in `rubric_metrics.yaml` in `flexeval.configuration`), or alternatively create a `RubricsCollection`.
2. Add that path or collection to `EvalRun`'s `rubric_paths`.
Here, we offer guidelines on writing and using rubrics in FlexEval.
## A General Template for FlexEval Rubrics
A rubric in FlexEval typically comprises two main sections: "prompt" and "choice_scores". In "prompt", users describe the task for the grader LLM, specify what data should be included, define the output format, and provide additional instructions, guidelines, or notes as needed. In "choice_scores", users provide a mapping of output choices (e.g., "YES", "NO") to their scores (e.g., 1, 0). These scores are logged as metrics in the evaluation results. Below is a typical rubric template used in FlexEval:
```yaml
The name of your rubric:
  prompt: |-
    Your Role:
    Describe the role for your grader LLM (e.g., "You are a helpful assistant with solid knowledge of K-12 math instruction.").

    Your Task:
    Define the grader LLM's task (e.g., "Check the student message and decide whether the student is asking for a plot or not.").

    Data:
    Give some background information about the data first (e.g., "The data includes a student message extracted from a conversation between the student and a tutor..."). Then specify what you want to include as data using key words (wrapped in curly braces, e.g., {turn}) in the data block marked by [BEGIN DATA] and [END DATA]. For more details about these key words, see "Key Words in Data Block" below.

    [BEGIN DATA]
    ***
    {turn}
    ***
    [END DATA]

    __start rubric__
    Provide the details of the rubric for the classification task here.
    __end rubric__

    Output:
    Specify what to output and what the output should look like. For example, you can follow the approach of "chain-of-thought then classify" (*cot classify* for short, e.g., "First, report your reasoning for your decision. Second, print your decision.").
  choice_scores:
    # Provide a mapping of output choices to their corresponding scores, such as:
    "YES": 1
    "NO": 0
```
## Key Words in Data Block
Key words in the data block, as demonstrated in the template above, are used to populate rubrics with the corresponding data. They are wrapped in curly braces (e.g., {conversation}, {turn}) to denote what should be included in the data, and they are filled in automatically when the evals are run. There are four key words you can use (see the sketch after this list):

- `{conversation}`: the whole conversation, which may contain multiple conversational turns, including the previous and current entries
- `{context}`: the previous entries, which serve as contextual information for the current entry
- `{turn}`: the current entry only
- `{completion}`: a new completion generated by an LLM
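For instance, a rubric that needs the surrounding conversation as background could combine {context} and {turn} in its data block. The labels inside the block below are illustrative, not required by FlexEval; only the curly-brace key words themselves are meaningful:

```yaml
# Hypothetical excerpt of a rubric prompt combining {context} and {turn}
prompt: |-
  ...
  [BEGIN DATA]
  ***
  Previous entries: {context}
  Current entry: {turn}
  ***
  [END DATA]
  ...
```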
Notes:

- {conversation} and {context} do not appear in the same data block, because {conversation} already includes {context}.
- {turn} and {completion} do not appear in the same data block, because you either evaluate the current entry from the input data or evaluate the completion generated by an LLM based on the input data.
- When {completion} is used, the `do_completion` parameter should be "True" in the test specification for the rubric metric in `evals.yaml`.
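As a sketch, a test specification for a {completion}-based rubric metric might look like the following in configuration/evals.yaml. The metric name is invented, and the `type: rubric` line and the exact placement of `do_completion` are assumptions extrapolated from the examples in this guide, so check them against your FlexEval version:

```yaml
# Hypothetical test specification for a rubric metric that grades {completion}
- name: is_helpful_completion   # invented rubric whose prompt contains {completion}
  type: rubric                  # assumed by analogy with "type: function" below
  do_completion: True           # required because the rubric grades an LLM-generated completion
```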
## Parameters for Rubric Metric in Test Specification
In configuration/evals.yaml, you can use two parameters to specify how a rubric metric should be run: "metric_level" and "depends_on".

- `metric_level`: determines the level (e.g., turn, message, toolcall) at which the rubric metric is run.
- `depends_on`: describes a condition that must hold for the rubric metric to be run. In the following example, the depends_on parameter specifies that the rubric metric "is_pedagogically_appropriate_plot" is only run when the result of the function metric "is_role" is "assistant".
```yaml
- name: is_pedagogically_appropriate_plot
  depends_on:
    - name: is_role
      type: function
      kwargs:
        role: assistant
      metric_name: assistant
      metric_min_value: 1
```
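And a metric_level sketch in the same style; "message" is one of the levels mentioned above, and which level fits your rubric depends on your data:

```yaml
# Hypothetical rubric metric graded per message rather than per turn
- name: is_pedagogically_appropriate_plot
  metric_level: message   # other levels mentioned above: turn, toolcall
```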