{ "cells": [ { "cell_type": "markdown", "id": "512a79f9-25ac-469e-9322-0b199e08436e", "metadata": {}, "source": [ "(metric_analysis)=\n", "# Metric Analysis\n", "\n", "This vignette demonstrates a more complete use of FlexEval: constructing an {class}`~flexeval.schema.eval_schema.Eval` with both rubric and function metrics, running that eval via an {class}`~flexeval.schema.evalrun_schema.EvalRun`, and using FlexEval utility functions to retrieve and interpret the results.\n", "\n", "Author: [Zachary Levonian](https://levon003.github.io) \\\n", "Date: July 2025" ] }, { "cell_type": "markdown", "id": "a177fe23-afd0-4548-8fb9-6073308a21cd", "metadata": {}, "source": [ "## Part 1: Running FlexEval to compute metrics\n", "\n", "We'll create some test data, build an eval, and execute it." ] }, { "cell_type": "code", "execution_count": 1, "id": "5efa0406-0987-42f7-b54a-6833244babbf", "metadata": {}, "outputs": [], "source": [ "import dotenv\n", "\n", "assert dotenv.load_dotenv(\"../.env\"), (\n", " \"This vignette assumes access to API keys in a .env file.\"\n", ")" ] }, { "cell_type": "markdown", "id": "a2425711-4fee-4d49-bb96-7592a24da32a", "metadata": {}, "source": [ "### Generating test data\n", "\n", "Let's evaluate the quality of grade-appropriate explanations." ] }, { "cell_type": "code", "execution_count": 2, "id": "049336f0-7ede-489b-a45b-5dbb4374912a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "12" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "concepts = [\"integer addition\", \"factoring polynomials\", \"logistic regression\"]\n", "grades = [\"3rd\", \"5th\", \"7th\", \"9th\"]\n", "\n", "user_queries = []\n", "for concept in concepts:\n", " for grade in grades:\n", " user_queries.append(\n", " f\"Concisely summarize {concept} at the United States {grade}-grade level.\"\n", " )\n", "len(user_queries)" ] }, { "cell_type": "markdown", "id": "71700ecf-1a86-4363-80f8-4919c3c5bbd1", "metadata": {}, "source": [ "We can imagine that our system under test involves a particular system prompt, or perhaps multiple candidate prompts.\n", "\n", "In this case, we'll imagine a single, simple system prompt." 
] }, { "cell_type": "code", "execution_count": 3, "id": "70fdf148-885f-4fb4-bba4-f6f75e915d7b", "metadata": {}, "outputs": [], "source": [ "system_prompt = \"\"\"You are a friendly math tutor.\n", "\n", "You attempt to summarize any mathematical topic the student is interested in, even if it's not appropriate for their grade level.\"\"\"" ] }, { "cell_type": "code", "execution_count": 4, "id": "22992eee-f95e-4254-97b4-a5e7a326e625", "metadata": {}, "outputs": [], "source": [ "# convert to JSONL\n", "import json\n", "from pathlib import Path\n", "\n", "concept_queries_path = Path(\"concept_queries.jsonl\")\n", "with open(concept_queries_path, \"w\") as outfile:\n", " for user_query in user_queries:\n", " outfile.write(\n", " json.dumps(\n", " {\n", " \"input\": [\n", " {\"role\": \"system\", \"content\": system_prompt},\n", " {\"role\": \"user\", \"content\": user_query},\n", " ]\n", " }\n", " )\n", " + \"\\n\"\n", " )" ] }, { "cell_type": "markdown", "id": "68163ea3-f92c-485a-b5da-6f966c4e4e40", "metadata": {}, "source": [ "Each line of `concept_queries.jsonl` will become a unique {class}`~flexeval.classes.thread.Thread` to be processed.\n", "\n", "Now that we have test data, we can build a FlexEval configuration and execute it.\n", "\n", "### Defining an Eval\n", "\n", "An {class}`~flexeval.schema.eval_schema.Eval` describes the computations that need to happen to compute the required metrics.\n", "\n", "In this case, we'll set a few details:\n", " - We want to generate new LLM completions, rather than just using any existing assistant messages in our threads. To do that, we'll set {attr}`~flexeval.schema.eval_schema.Eval.do_completion` to true, and define the function to actually generate those completions from those provided in {mod}`flexeval.configuration.completion_functions`. In this case, we'll use {func}`~flexeval.configuration.completion_functions.litellm_completion`, which uses [LiteLLM](https://docs.litellm.ai) to provide access to many different model APIs.\n", " - We'll compute two {class}`~flexeval.schema.eval_schema.FunctionItem`s, a [Flesch reading ease](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests) score and {meth}`~flexeval.configuration.function_metrics.is_role`. We need `is_role` because we can use its value to compute particular metrics only for assistant messages (like the new completions we'll be generating).\n", " - Finally, we can specify a custom {class}`~flexeval.schema.eval_schema.RubricItem`s. We'll write a prompt that describes the assessment we want to make. In this case, we try to determine if the assistant response is grade appropriate." 
] }, { "cell_type": "code", "execution_count": 5, "id": "6b72fa23-f077-434a-9916-e6389115560d", "metadata": {}, "outputs": [], "source": [ "import flexeval\n", "from flexeval.schema import (\n", " Eval,\n", " Rubric,\n", " GraderLlm,\n", " DependsOnItem,\n", " Metrics,\n", " FunctionItem,\n", " RubricItem,\n", " CompletionLlm,\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "id": "56a3afed-73e7-484d-9a6d-da6758e1716e", "metadata": {}, "outputs": [], "source": [ "# by specifying an OpenAI model name here, we'll need OPENAI_API_KEY to exist in our environment variables or in our .env file\n", "completion_llm = CompletionLlm(\n", " function_name=\"litellm_completion\",\n", " kwargs={\"model\": \"gpt-4o-mini\"},\n", ")" ] }, { "cell_type": "markdown", "id": "46b010f4-ab11-4eb8-aae1-ce0f31d37c26", "metadata": {}, "source": [ "A note about {class}`~flexeval.schema.eval_schema.CompletionLlm`: if you're using LiteLLM, you can replace an API call with a pre-written response by passing `mock_response` as an additional keyword argument.\n", "\n", "For example, this configuration will always return \"I can't help with that!\":\n", "\n", "```python\n", "kwargs={\"model\": \"gpt-4o-mini\", \"mock_response\": \"I can't help with that!\"}\n", "```\n", "\n", "Now let's define a rubric. We just need a prompt, a set of choice scores mapping from string responses to a numeric metric value, and information about what LLM to use. For simplicity, we'll use the same model for evaluating the completions that we use to generate them – but this should probably be avoided in general due to LLMs' [preference for their own outputs](https://arxiv.org/abs/2404.13076)." ] }, { "cell_type": "code", "execution_count": 7, "id": "1ccc2343-0ed4-46fa-96ed-f13d09afd943", "metadata": {}, "outputs": [], "source": [ "bad_rubric_prompt = \"\"\"Read the following input and output, assessing if the output is grade-appropriate.\n", "[Input]: {context}\n", "[Output]: {content}\n", "\n", "On a new line after your explanation, print:\n", "- YES if the Output is fully appropriate for the grade level\n", "- NO if the Output would uses language or concepts that would be inappropriate for that grade level\n", "\n", "Only print YES or NO on the final line.\n", "\"\"\"\n", "rubric = Rubric(\n", " prompt=bad_rubric_prompt,\n", " choice_scores={\"YES\": 1, \"NO\": 0},\n", ")\n", "rubrics = {\n", " \"is_grade_appropriate\": rubric,\n", "}\n", "grader_llm = GraderLlm(\n", " function_name=\"litellm_completion\", kwargs={\"model\": \"gpt-4o-mini\"}\n", ")" ] }, { "cell_type": "markdown", "id": "192e2fe7-128e-4bd4-8a0c-fd1fc0156813", "metadata": {}, "source": [ "We'll define the metrics that need to be computed using the rubric name defined above (`is_grade_appropriate`).\n", "\n", "We'll also define the computation of our two function metrics: {meth}`~flexeval.configuration.function_metrics.is_role` and {meth}`~flexeval.configuration.function_metrics.flesch_reading_ease`.\n", "\n", "We want to evaluate `flesch_reading_ease` and `is_grade_appropriate` only on assistant turns, so we need to declare a dependency.\n", "\n", "Here's how we declare a dependency:\n", "\n", " - A {class}`~flexeval.schema.eval_schema.DependsOnItem` requires a `name` that matches a defined metric.\n", " - Providng `kwargs` are optional, but if you use the same name with different kwargs you need to provide them so that we know which metric you're depending on.\n", " - Set `metric_min_value` or `metric_max_value` or both. 
The {meth}`~flexeval.configuration.function_metrics.is_role` documentation tells us that it returns 1 (true) if the turn has the role provided in the keyword args and 0 (false) otherwise. So we just need to set `metric_min_value` to 1.\n", " - When we define a {class}`~flexeval.schema.eval_schema.MetricItem` that has dependencies, provide one or more {class}`~flexeval.schema.eval_schema.DependsOnItem`s in the {attr}`~flexeval.schema.eval_schema.MetricItem.depends_on` list." ] }, { "cell_type": "code", "execution_count": 8, "id": "11632831-8cd1-4861-8721-4399d70694db", "metadata": {}, "outputs": [], "source": [ "is_assistant_dependency = DependsOnItem(\n", "    name=\"is_role\", kwargs={\"role\": \"assistant\"}, metric_min_value=1\n", ")\n", "metrics = Metrics(\n", "    function=[\n", "        FunctionItem(name=\"is_role\", kwargs={\"role\": \"assistant\"}),\n", "        FunctionItem(\n", "            name=\"flesch_reading_ease\",\n", "            depends_on=[is_assistant_dependency],\n", "        ),\n", "    ],\n", "    rubric=[\n", "        RubricItem(name=\"is_grade_appropriate\", depends_on=[is_assistant_dependency])\n", "    ],\n", ")" ] }, { "cell_type": "markdown", "id": "f17e0ab8-7aa2-4866-a19c-2519fd5a4067", "metadata": {}, "source": [ "We'll finish building our {class}`~flexeval.schema.eval_schema.Eval` by providing all the info we defined above.\n", "\n", "An eval's `name` is optional, but providing one can help you later if you're running lots of different Evals against the same dataset.\n", "\n", "I'll call this eval `grade_appropriateness`, since that's what we're trying to evaluate." ] }, { "cell_type": "code", "execution_count": 9, "id": "bf4646f4-235a-4261-af00-4830c845d30e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Eval(do_completion=True, name='grade_appropriateness', notes='', metrics=Metrics(function=[FunctionItem(name='is_role', depends_on=[], metric_level='Turn', kwargs={'role': 'assistant'}), FunctionItem(name='flesch_reading_ease', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})], rubric=[RubricItem(name='is_grade_appropriate', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})]), completion_llm=CompletionLlm(function_name='litellm_completion', include_system_prompt=True, kwargs={'model': 'gpt-4o-mini'}), grader_llm=GraderLlm(function_name='litellm_completion', kwargs={'model': 'gpt-4o-mini'}))" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eval = Eval(\n", "    name=\"grade_appropriateness\",\n", "    metrics=metrics,\n", "    grader_llm=grader_llm,\n", "    do_completion=True,\n", "    completion_llm=completion_llm,\n", ")\n", "eval" ] }, { "cell_type": "markdown", "id": "c2085f03-a760-4bc3-a883-ff40ed65f722", "metadata": {}, "source": [ "### Building an EvalRun\n", "\n", "To execute the defined {class}`~flexeval.schema.eval_schema.Eval`, we need to build an {class}`~flexeval.schema.evalrun_schema.EvalRun`.\n", "\n", "Here's what we need (other than the {class}`~flexeval.schema.eval_schema.Eval`):\n", " - {attr}`~flexeval.schema.evalrun_schema.EvalRun.data_sources` should be a list of datasets. 
We already saved a jsonl with our inputs, so we'll wrap the path in a {class}`~flexeval.schema.evalrun_schema.FileDataSource`.\n", " - {attr}`~flexeval.schema.evalrun_schema.EvalRun.database_path` defines the location of the SQLite file that FlexEval produces.\n", " - {attr}`~flexeval.schema.evalrun_schema.EvalRun.rubric_paths` is where we provide the rubric prompts we defined above (in the form of a {class}`~flexeval.schema.rubric_schema.RubricsCollection`).\n", " - {attr}`~flexeval.schema.evalrun_schema.EvalRun.config` is optional, but you can provide a {class}`~flexeval.schema.config_schema.Config` there to override settings that define how the Eval will be executed. In this case, we set `clear_tables` to True in order to delete any outputs in the provided `database_path`." ] }, { "cell_type": "code", "execution_count": 10, "id": "b666ab22-5a44-4fd2-ba10-82bce189856d", "metadata": {}, "outputs": [], "source": [ "from flexeval.schema import Config, EvalRun, FileDataSource, RubricsCollection" ] }, { "cell_type": "code", "execution_count": 11, "id": "ba16f1b0-60a7-4ca4-976f-6e55397958f7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "EvalRun(data_sources=[FileDataSource(name=None, notes=None, path=PosixPath('concept_queries.jsonl'), format='jsonl')], database_path=PosixPath('eval_results.db'), eval=Eval(do_completion=True, name='grade_appropriateness', notes='', metrics=Metrics(function=[FunctionItem(name='is_role', depends_on=[], metric_level='Turn', kwargs={'role': 'assistant'}), FunctionItem(name='flesch_reading_ease', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})], rubric=[RubricItem(name='is_grade_appropriate', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})]), completion_llm=CompletionLlm(function_name='litellm_completion', include_system_prompt=True, kwargs={'model': 'gpt-4o-mini'}), grader_llm=GraderLlm(function_name='litellm_completion', kwargs={'model': 'gpt-4o-mini'})), config=Config(logs_path=None, env_filepath=None, env={}, clear_tables=True, max_workers=1, random_seed_conversation_sampling=42, max_n_conversation_threads=50, nb_evaluations_per_thread=1, raise_on_completion_error=False, raise_on_metric_error=False), rubric_paths=[RubricsCollection(rubrics={'is_grade_appropriate': Rubric(prompt='Read the following input and output, assessing if the output is grade-appropriate.\\n[Input]: {context}\\n[Output]: {content}\\n\\nOn a new line after your explanation, print:\\n- YES if the Output is fully appropriate for the grade level\\n- NO if the Output would uses language or concepts that would be inappropriate for that grade level\\n\\nOnly print YES or NO on the final line.\\n', choice_scores={'YES': 1, 'NO': 0}, name=None, notes=None)})], function_modules=[], add_default_functions=True)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "input_data_sources = [FileDataSource(path=concept_queries_path)]\n", "database_path = Path(\"eval_results.db\")\n", "config = Config(clear_tables=True)\n", "eval_run = EvalRun(\n", " data_sources=input_data_sources,\n", " database_path=database_path,\n", " eval=eval,\n", " config=config,\n", " 
rubric_paths=[RubricsCollection(rubrics=rubrics)],\n", ")\n", "eval_run" ] }, { "cell_type": "markdown", "id": "3e897396-c34a-47a1-b290-4db2cb22078c", "metadata": {}, "source": [ "### Running the EvalRun\n", "\n", "Once we've built an {class}`~flexeval.schema.evalrun_schema.EvalRun`, running it is easy: we can just use {func}`~flexeval.runner.run`!" ] }, { "cell_type": "code", "execution_count": 11, "id": "97d51e3d-5b3c-470f-bb94-29e3dff4f0c7", "metadata": {}, "outputs": [], "source": [ "_ = flexeval.run(eval_run)" ] }, { "cell_type": "markdown", "id": "bac15687-da31-49a3-9af0-e5a7514109a9", "metadata": {}, "source": [ "Now that we've run our Eval, we can analyze our results!" ] }, { "cell_type": "markdown", "id": "914afcd2-a9f4-49bb-9a2c-c0d8c489e353", "metadata": {}, "source": [ "## Part 2: Analyzing our results\n", "\n", "We'll analyze the data we created in Part 1.\n", "\n", "### flexeval.metrics.access\n", "\n", "A few utility functions are exposed in {mod}`flexeval.metrics.access` for returning computed metrics." ] }, { "cell_type": "code", "execution_count": 13, "id": "3710a57b-a4f4-42c7-89c3-46d784568667", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import pandas as pd\n", "from IPython import display # for prettier printing\n", "\n", "from flexeval import db_utils\n", "from flexeval.metrics import access as metric_access" ] }, { "cell_type": "markdown", "id": "9afcae51-d28b-4948-8a46-60dc60cf8aae", "metadata": {}, "source": [ "Note: We use Peewee's database methods to access the output at {attr}`~flexeval.schema.evalrun_schema.EvalRun.database_path`. Using {func}`~flexeval.runner.run` will set the appropriate variables for you, but if you just want to access the data from an eval you can use {func}`flexeval.db_utils.ensure_database` to ensure the Peewee connection is available." ] }, { "cell_type": "code", "execution_count": 14, "id": "278c209b-b674-4290-b370-9b2be55472b4", "metadata": {}, "outputs": [], "source": [ "database_path = Path(\"eval_results.db\")\n", "db_utils.ensure_database(database_path)" ] }, { "cell_type": "markdown", "id": "74c5fbf7-81e1-4024-8415-f0f01d3becca", "metadata": {}, "source": [ "To start, let's just get all of the metrics as a Pandas dataframe. \n", "\n", "{func}`~flexeval.metrics.access.get_all_metrics` returns a list of dictionaries." ] }, { "cell_type": "code", "execution_count": 15, "id": "b592c271-d138-4db5-9617-df98e2c2ac96", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idevalsetrundatasetthreadturnmessagetoolcallevaluation_nameevaluation_typemetric_name...metric_valuekwargssourcedepends_onrubric_promptrubric_completionrubric_modelrubric_completion_tokensrubric_prompt_tokensrubric_score
011111NoneNoneis_rolefunctionassistant...0.000000{'role': 'assistant'}def is_role(object: Union[Turn, Message], role...[]NoneNoneNoneNaNNaNNone
1211113NoneNoneis_rolefunctionassistant...1.000000{'role': 'assistant'}def is_role(object: Union[Turn, Message], role...[]NoneNoneNoneNaNNaNNone
2311113NoneNoneflesch_reading_easefunctionflesch_reading_ease...76.191959{}def flesch_reading_ease(turn: str) -> float:\\n...[{\"name\": \"is_role\", \"type\": null, \"kwargs\": {...NoneNoneNoneNaNNaNNone
\n", "

3 rows × 21 columns

\n", "
" ], "text/plain": [ " id evalsetrun dataset thread turn message toolcall \\\n", "0 1 1 1 1 1 None None \n", "1 2 1 1 1 13 None None \n", "2 3 1 1 1 13 None None \n", "\n", " evaluation_name evaluation_type metric_name ... metric_value \\\n", "0 is_role function assistant ... 0.000000 \n", "1 is_role function assistant ... 1.000000 \n", "2 flesch_reading_ease function flesch_reading_ease ... 76.191959 \n", "\n", " kwargs source \\\n", "0 {'role': 'assistant'} def is_role(object: Union[Turn, Message], role... \n", "1 {'role': 'assistant'} def is_role(object: Union[Turn, Message], role... \n", "2 {} def flesch_reading_ease(turn: str) -> float:\\n... \n", "\n", " depends_on rubric_prompt \\\n", "0 [] None \n", "1 [] None \n", "2 [{\"name\": \"is_role\", \"type\": null, \"kwargs\": {... None \n", "\n", " rubric_completion rubric_model rubric_completion_tokens \\\n", "0 None None NaN \n", "1 None None NaN \n", "2 None None NaN \n", "\n", " rubric_prompt_tokens rubric_score \n", "0 NaN None \n", "1 NaN None \n", "2 NaN None \n", "\n", "[3 rows x 21 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(metric_access.get_all_metrics())\n", "df.head(n=3)" ] }, { "cell_type": "markdown", "id": "bbf2dce6-cb4c-4a91-906b-2ce49ea25065", "metadata": {}, "source": [ "We have access to all of the columns in {class}`~flexeval.classes.metric.Metric`.\n", "\n", "We can see that FlexEval generates one metric result each time it runs, along with information about what was run. Let's look at the columns we have for our metrics." ] }, { "cell_type": "code", "execution_count": 16, "id": "07ae8075-4ff2-4f9f-abf4-1189d036965c", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Columns returned by `get_all_metrics()`: id, evalsetrun, dataset, thread, turn, message, toolcall, evaluation_name, evaluation_type, metric_name, metric_level, metric_value, kwargs, source, depends_on, rubric_prompt, rubric_completion, rubric_model, rubric_completion_tokens, rubric_prompt_tokens, rubric_score" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display.Markdown(\"Columns returned by `get_all_metrics()`: \" + \", \".join(df.columns))" ] }, { "cell_type": "markdown", "id": "fc093470-96cd-4980-9d3f-c3f46f53a23f", "metadata": {}, "source": [ "Let's start by looking at how many times each metric was computed:" ] }, { "cell_type": "code", "execution_count": 17, "id": "6791f442-9809-4642-8f33-5f860b068f49", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count
evaluation_name
is_role24
flesch_reading_ease12
is_grade_appropriate12
\n", "
" ], "text/plain": [ " count\n", "evaluation_name \n", "is_role 24\n", "flesch_reading_ease 12\n", "is_grade_appropriate 12" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(df.evaluation_name.value_counts())" ] }, { "cell_type": "markdown", "id": "4d1a1ffe-9920-44d2-a719-904418544657", "metadata": {}, "source": [ "These counts look right:\n", " - Our dataset contained 12 input sentences.\n", " - We generated 12 LLM responses.\n", " - `is_role` was evaluated on all 24 turns...\n", " - ...but `flesch_reading_ease` and `is_grade_appropriate` were computed only on the 12 LLM-generated (\"assistant\") turns.\n", "\n", "Let's start our analysis by looking at the numeric metric: Flesch reading ease." ] }, { "cell_type": "markdown", "id": "cd83c648-c049-4a3a-aebd-1d9ca5419bda", "metadata": {}, "source": [ "### Analyzing a function metric\n", "\n", "A quick aside: Flesch reading ease is widely used, but it should be used with care. As summarized by Crossley et al. ([CLEAR 2022](https://link.springer.com/article/10.3758/s13428-022-01802-x)): \n", "\n", ">Traditional readability formulas lack construct and theoretical validity because they are based on weak proxies of word decoding (i.e., characters or syllables per word) and syntactic complexity (i.e., number or words per sentence) and ignore many text features that are important components of reading models including text cohesion and semantics. Additionally, many traditional readability formulas were normed using readers from specific age groups on small corpora of texts taken from specific domains.\n", "\n", "Nevertheless, many people may expect to see a readability score, and it can be useful to assess the [convergent validity](https://en.wikipedia.org/wiki/Convergent_validity) for some rubric metrics.\n", "\n", "Let's start by filtering to only the `flesch_reading_ease` metric results and computing the mean." ] }, { "cell_type": "code", "execution_count": 18, "id": "d23e8aa5-b4bc-4217-ab9c-50c0b3c893ce", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Flesch reading ease score: 59.89\n" ] } ], "source": [ "flesch_reading_ease = df[df.metric_name == \"flesch_reading_ease\"]\n", "print(f\"Mean Flesch reading ease score: {flesch_reading_ease.metric_value.mean():.2f}\")" ] }, { "cell_type": "markdown", "id": "99dbd569-a801-4c92-b828-d0a0b3957f9c", "metadata": {}, "source": [ "A Flesch reading ease of 60 is at the [\"10th to 12th grade\" level](https://pages.stern.nyu.edu/wstarbuc/Writing/Flesch.htm).\n", "\n", "Hmm, seems a bit high. \n", "\n", "Plotting the distribution, we never seem to get above 80 (appropriate for 6th grade and lower)." 
] }, { "cell_type": "code", "execution_count": 20, "id": "a63cb994-812f-40ff-838f-82aae44bc816", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " 2025-09-08T15:26:39.430586\n", " image/svg+xml\n", " \n", " \n", " Matplotlib v3.10.5, https://matplotlib.org/\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "%config InlineBackend.figure_format = 'svg'\n", "\n", "fig, ax = plt.subplots(1, 1, figsize=(5, 3))\n", "\n", "ax.hist(flesch_reading_ease.metric_value, bins=np.arange(30, 110, 10))\n", "ax.set_xlabel(\"Flesch reading ease\")\n", "ax.set_ylabel(\"Concept explanation count\")\n", "\n", "fig.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "c7a5b2c5-d8b8-4f1a-a6a4-a0ce47b08536", "metadata": {}, "source": [ "Already, we've learned enough to consider going back to our prompt engineering to produce more age-appropriate messages.\n", "\n", "But presuming we're okay with the overall range, we'd prefer to see how things are looking at the original grade ranges we were targeting.\n", "\n", "We can extract the original grade from the initial user message in each thread in our dataset. \n", "To get those user messages, we can use another utility function in {mod}`flexeval.metrics.access`." ] }, { "cell_type": "code", "execution_count": 21, "id": "df273c0e-2a5d-4c67-9a71-824a35483f93", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count
grade_level
33
53
73
93
\n", "
" ], "text/plain": [ " count\n", "grade_level \n", "3 3\n", "5 3\n", "7 3\n", "9 3" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thread_ids = set(flesch_reading_ease.thread)\n", "# get the first message in each thread\n", "first_message_df = pd.DataFrame(\n", " metric_access.get_first_user_message_for_threads(thread_ids)\n", ")\n", "# join in the first message texts as \"first_message_content\"\n", "flesch_reading_ease = flesch_reading_ease.merge(\n", " first_message_df[[\"thread\", \"content\"]].rename(\n", " columns={\"content\": \"first_message_content\"}\n", " ),\n", " how=\"inner\",\n", " on=\"thread\",\n", ")\n", "# extract grade level from the messages\n", "flesch_reading_ease[\"grade_level\"] = (\n", " flesch_reading_ease.first_message_content.str.extract(\n", " r\"(\\d+)(?:st|nd|rd|th)-grade\"\n", " ).astype(int)\n", ")\n", "pd.DataFrame(flesch_reading_ease.grade_level.value_counts())" ] }, { "cell_type": "markdown", "id": "56c2e28a-f3e8-4868-91b4-e8b440ca7dfb", "metadata": {}, "source": [ "With grade level in hand, we can investigate reading ease at each prompted grade level." ] }, { "cell_type": "code", "execution_count": 22, "id": "3580c288-0682-48da-93a8-478204f3b34e", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " 2025-09-08T15:26:42.978548\n", " image/svg+xml\n", " \n", " \n", " Matplotlib v3.10.5, https://matplotlib.org/\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(5, 2.5))\n", "\n", "# no has ever accused matplotlib of being concise\n", "ax.scatter(\n", " flesch_reading_ease.grade_level,\n", " flesch_reading_ease.metric_value,\n", " color=\"black\",\n", " s=6,\n", ")\n", "means = flesch_reading_ease.groupby(\"grade_level\").metric_value.mean()\n", "ax.scatter(\n", " means.index, means, color=\"darkgreen\", s=50, marker=\"_\", label=\"Grade-level mean\"\n", ")\n", "for x, y in zip(means.index, means):\n", " ax.text(x, y, f\" {y:.1f}\", va=\"center\", ha=\"left\")\n", "ax.set_xticks(sorted(means.index))\n", "ax.set_xlim(means.index.min() - 1, means.index.max() + 1)\n", "ax.legend()\n", "ax.set_xlabel(\"Grade\")\n", "ax.set_ylabel(\"Flesch reading ease\")\n", "\n", "fig.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "25841739-aafb-491e-9e70-f30494e2b6f6", "metadata": {}, "source": [ "These group-level means suggest that the completions _are_ more grade-appropriate depending on the prompted grade.\n", "\n", "However, the reading ease scores also suggest the responses are potentially too complex.\n", "\n", "Even the most readable explanations – generated for the third-grade prompt – are at the 7th-grade level by the Flesch-Kincaid at the 7th-grade level." ] }, { "cell_type": "markdown", "id": "cb09f2e7-49c6-4c38-a336-cd879c552a96", "metadata": {}, "source": [ "### Analyzing a rubric metric\n", "\n", "We might attempt to go deeper by investigating the LLM-as-a-judge rubric results.\n", "\n", "One of the advantages of a rubric is that we can develop it iteratively.\n", "The investigation here is only an initial look; in reality, we would likely update the prompt after investigating the outputs and seeing where we might sharpen the validity of the metric we're developing.\n", "\n", "Let's start by looking at the raw counts." ] }, { "cell_type": "code", "execution_count": 23, "id": "1c20e8a8-1da1-49e5-a0e8-af8b498a962e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count
metric_value
1.011
0.01
\n", "
" ], "text/plain": [ " count\n", "metric_value \n", "1.0 11\n", "0.0 1" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "is_grade_appropriate = df[df.metric_name == \"is_grade_appropriate\"]\n", "pd.DataFrame(is_grade_appropriate.metric_value.value_counts())" ] }, { "cell_type": "markdown", "id": "cc1fc0d3-d55b-4f4b-931a-3e681e430fcd", "metadata": {}, "source": [ "Only one determination of grade inappriateness! Let's try to get some insight by looking at one of the YES determinations." ] }, { "cell_type": "code", "execution_count": 24, "id": "f5d89f35-e8b9-49a7-86f4-1ccf7ffc0357", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Prompt:\n", "> Read the following input and output, assessing if the output is grade-appropriate.\n", "> [Input]: user: Concisely summarize integer addition at the United States 5th-grade level.\n", "> \n", "> [Output]: assistant: Integer addition involves combining whole numbers to find a total. To add integers, start with the first number and count up by the second number. Here are some key points:\n", "> \n", "> 1. **Positive Integers**: When you add two positive numbers, you simply find their total (e.g., 3 + 4 = 7).\n", "> 2. **Negative Integers**: When adding negative numbers, you find the total by moving left on the number line (e.g., -3 + -4 = -7).\n", "> 3. **Mix of Positive and Negative**: If you add a positive and a negative number, subtract the smaller absolute value from the larger one, and keep the sign of the number with the larger absolute value (e.g., 5 + -3 = 2 and -5 + 3 = -2).\n", "> \n", "> Practice adding integers using number lines or counters to see how it works!\n", "> \n", "> \n", "> On a new line after your explanation, print:\n", "> - YES if the Output is fully appropriate for the grade level\n", "> - NO if the Output would uses language or concepts that would be inappropriate for that grade level\n", "> \n", "> Only print YES or NO on the final line.\n", "\n", "Completion:\n", "> 1. The input asks for a concise summary of integer addition suitable for a 5th-grade level.\n", "> 2. The output begins by clearly defining integer addition in simple terms, \"combining whole numbers to find a total,\" which is age-appropriate.\n", "> 3. The output includes basic concepts of addition - explaining how to add positive integers, which is straightforward and suitable for 5th graders.\n", "> 4. The output then covers negative integers, explaining their addition in a manner that relates to the number line, which may be complex but is part of the 5th-grade curriculum and is explained clearly.\n", "> 5. The section on mixing positive and negative integers is also relevant and explained through examples, making it easier for students to understand.\n", "> 6. 
The suggestion to practice with number lines or counters reinforces learning, which is valuable for elementary students.\n", "> \n", "> Overall, the language and concepts used in the output are appropriate for the 5th-grade level, offering explanations and examples that are clear and relatable.\n", "> \n", "> Based on this reasoning, I conclude that the output is indeed grade-appropriate.\n", "> \n", "> YES\n", "> YES\n", "\n", "Parsed response: 1.0 (YES)" ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def quote_markdown(markdown_string: str) -> str:\n", " return \"\\n\".join(\"> \" + line for line in markdown_string.splitlines())\n", "\n", "\n", "def format_rubric_summary(rubric_metric) -> str:\n", " return f\"\"\"Prompt:\n", "{quote_markdown(rubric_metric.rubric_prompt)}\n", "\n", "Completion:\n", "{quote_markdown(rubric_metric.rubric_completion)}\n", "\n", "Parsed response: {rubric_metric.metric_value} ({rubric_metric.rubric_score})\"\"\"\n", "\n", "\n", "rubric_metric = (\n", " is_grade_appropriate[is_grade_appropriate.metric_value == 1].sample(n=1).iloc[0]\n", ")\n", "display.Markdown(format_rubric_summary(rubric_metric))" ] }, { "cell_type": "markdown", "id": "e7226ada-317f-49d2-8216-9c939c022a41", "metadata": {}, "source": [ "Hmm, not quite convincing. I already have a few ideas for how we might improve our prompt.\n", "\n", "Let's finish up by looking at the one No determination." ] }, { "cell_type": "code", "execution_count": 25, "id": "0768ddd3-04e8-4f05-b6e2-8e70db2f1adc", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Prompt:\n", "> Read the following input and output, assessing if the output is grade-appropriate.\n", "> [Input]: user: Concisely summarize factoring polynomials at the United States 3rd-grade level.\n", "> \n", "> [Output]: assistant: Factoring polynomials means breaking them down into simpler parts that, when multiplied together, give you the original polynomial. Think of it like taking a number apart into its smaller number pieces. For example, if you have \\( x^2 + 2x \\), you can factor it by pulling out a common piece, which in this case is \\( x \\). So, it becomes \\( x(x + 2) \\). This makes it easier to work with!\n", "> \n", "> \n", "> On a new line after your explanation, print:\n", "> - YES if the Output is fully appropriate for the grade level\n", "> - NO if the Output would uses language or concepts that would be inappropriate for that grade level\n", "> \n", "> Only print YES or NO on the final line.\n", "\n", "Completion:\n", "> 1. **Understanding the Input**: The user is asking for a summary of factoring polynomials suitable for a 3rd-grade level. This means the explanation needs to be simple, using language that is easily understandable by young children.\n", "> \n", "> 2. **Analyzing the Output**: The output explains that factoring polynomials involves breaking them down into simpler parts. It uses the example of \\( x^2 + 2x \\) and shows how to factor it into \\( x(x + 2) \\).\n", "> \n", "> 3. **Evaluating Language and Concepts**:\n", "> - \"Factoring polynomials\" may be too advanced for a 3rd grader, who is typically learning basic operations with numbers and simple algebraic concepts.\n", "> - The terminology could be challenging. 
Phrases like \"breaking them down into simpler parts\" and \"when multiplied together\" might be difficult for 3rd graders to grasp fully.\n", "> - The example provided uses variables, which might also be unfamiliar to 3rd graders who may not yet have been introduced to algebraic expressions.\n", "> \n", "> 4. **Conclusion**: Overall, while the concept of breaking things down is appropriate, the specific language and the variable example make it likely outside the understanding of typical 3rd graders. \n", "> \n", "> Therefore, the output is not fully appropriate for the grade level of a 3rd grader.\n", "> \n", "> Based on this reasoning, I will provide the final answer.\n", "> \n", "> NO\n", "> \n", "> NO\n", "\n", "Parsed response: 0.0 (NO)" ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rubric_metric = (\n", " is_grade_appropriate[is_grade_appropriate.metric_value == 0].sample(n=1).iloc[0]\n", ")\n", "display.Markdown(format_rubric_summary(rubric_metric))" ] }, { "cell_type": "markdown", "id": "c1ee630e-8b77-43b1-a818-cf0e1b874e33", "metadata": {}, "source": [ "This analysis looks better, but still quite vague. I see many specifics that we could include in the prompt as examples!\n", "\n", "### Next steps\n", "\n", "Based on this analysis, I would:\n", " - Update the completion prompt to produce more appropriate responses.\n", " - Update the rubric prompt to make more accurate – and harsher! – determinations about the appropriateness of the langauge used." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" }, "mystnb": { "execution_mode": "off" } }, "nbformat": 4, "nbformat_minor": 5 }