{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AVBtOVlNJ51C"
      },
      "source": [
        "# Tutorial: Generating Structured Output with OpenAI\n",
        "\n",
        "- **Level**: Beginner\n",
        "- **Time to complete**: 15 minutes\n",
        "- **Prerequisites**: You need an API key from an active OpenAI account, as this tutorial uses an OpenAI GPT model.\n",
        "- **Components Used**: [`OpenAIChatGenerator`](https://docs.haystack.deepset.ai/docs/openaichatgenerator), [`OpenAIResponsesChatGenerator`](https://docs.haystack.deepset.ai/docs/openairesponseschatgenerator)\n",
        "- **Goal**: Learn how to generate structured outputs with `OpenAIChatGenerator` or `OpenAIResponsesChatGenerator` using a Pydantic model or a JSON schema.\n",
        "\n",
        "## Overview\n",
        "This tutorial shows how to produce structured outputs by providing either a [Pydantic](https://github.com/pydantic/pydantic) model or a JSON schema to `OpenAIChatGenerator`.\n",
        "\n",
        "Note: Structured outputs are supported only by newer OpenAI models, starting with `gpt-4o-mini` and later.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ljbWiyJkKiPw"
      },
      "source": [
        "## Installing Dependencies\n",
        "Install Haystack with pip:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "i-HzGWjz8LCM"
      },
      "outputs": [],
      "source": [
        "%%bash\n",
        "\n",
        "pip install -q \"haystack-ai>=2.20.0\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Cmjfa8CiCeFl"
      },
      "source": [
        "## Structured Outputs with `OpenAIChatGenerator`\n",
        "\n",
        "### Using Pydantic Models\n",
        "First, we'll see how to pass a Pydantic model to `OpenAIChatGenerator`. For this purpose, we define two [Pydantic models](https://docs.pydantic.dev/1.10/usage/models/), `City` and `CitiesData`. These models specify the fields and types that represent the data structure we want."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "xwKrDOOGdaAz"
      },
      "outputs": [],
      "source": [
        "from typing import List\n",
        "from pydantic import BaseModel\n",
        "\n",
        "\n",
        "class City(BaseModel):\n",
        "    name: str\n",
        "    country: str\n",
        "    population: int\n",
        "\n",
        "\n",
        "class CitiesData(BaseModel):\n",
        "    cities: List[City]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zv-6-l_PCeFl"
      },
      "source": [
        "> You can change these models according to the format you wish to extract from the text."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KM9-Zq2FL7Nn"
      },
      "source": [
        "\n",
        "[OpenAIChatGenerator](https://docs.haystack.deepset.ai/docs/openaichatgenerator) generates\n",
        "text using an OpenAI GPT model by default. We pass our Pydantic model to the `response_format` parameter in `generation_kwargs`.\n",
        "\n",
        "We also need to set the `OPENAI_API_KEY` environment variable.\n",
        "\n",
        "Note: You can also set `response_format` in the `generation_kwargs` param of the chat generator's `run` method."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "Z4cQteIgunUR",
        "outputId": "a15deff5-61e5-4600-b9c4-8dcc00808f7b",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Enter OpenAI API key:··········\n"
          ]
        }
      ],
      "source": [
        "import os\n",
        "from getpass import getpass\n",
        "\n",
        "from haystack.components.generators.chat import OpenAIChatGenerator\n",
        "\n",
        "if \"OPENAI_API_KEY\" not in os.environ:\n",
        "    os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter OpenAI API key:\")\n",
        "chat_generator = OpenAIChatGenerator(generation_kwargs={\"response_format\": CitiesData})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kV_kexTjImpo"
      },
      "source": [
        "### Running the Component\n",
        "\n",
        "Run the component with an example passage that you want to convert into JSON. For the given example passage, the generated JSON object should look like this:\n",
        "```json\n",
        "{\n",
        "  \"cities\": [\n",
        "    {\n",
        "      \"name\": \"Berlin\",\n",
        "      \"country\": \"Germany\",\n",
        "      \"population\": 3850809\n",
        "    },\n",
        "    {\n",
        "      \"name\": \"Paris\",\n",
        "      \"country\": \"France\",\n",
        "      \"population\": 2161000\n",
        "    },\n",
        "    {\n",
        "      \"name\": \"Lisbon\",\n",
        "      \"country\": \"Portugal\",\n",
        "      \"population\": 504718\n",
        "    }\n",
        "  ]\n",
        "}\n",
        "```\n",
        "The LLM's output should comply with the schema derived from `CitiesData`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "yIoMedb6eKia"
      },
      "outputs": [],
      "source": [
        "from haystack.dataclasses import ChatMessage\n",
        "\n",
        "text = \"Berlin is the capital of Germany. It has a population of 3,850,809. Paris, France's capital, has 2.161 million residents. Lisbon is the capital and the largest city of Portugal with the population of 504,718.\"\n",
        "result = chat_generator.run(messages=[ChatMessage.from_user(text)])"
      ]
    },
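    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The format can also be set per call instead of at construction time, since `run` accepts its own `generation_kwargs`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Override the response format for a single call\n",
        "per_run_result = chat_generator.run(\n",
        "    messages=[ChatMessage.from_user(text)],\n",
        "    generation_kwargs={\"response_format\": CitiesData},\n",
        ")"
      ]
    },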
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eWPawSjgSJAM"
      },
      "source": [
        "### Printing the JSON\n",
        "If you didn't get any errors, you can now parse and print the valid JSON."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "BVO47gXQQnDC",
        "outputId": "3c637c7e-bee5-4f42-9d23-340a72aa4bc2"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{'cities': [{'name': 'Berlin', 'country': 'Germany', 'population': 3850809}, {'name': 'Paris', 'country': 'France', 'population': 2161000}, {'name': 'Lisbon', 'country': 'Portugal', 'population': 504718}]}\n"
          ]
        }
      ],
      "source": [
        "import json\n",
        "\n",
        "valid_reply = result[\"replies\"][0].text\n",
        "valid_json = json.loads(valid_reply)\n",
        "print(valid_json)"
      ]
    },
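    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Because we passed a Pydantic model, you can also validate the reply back into `CitiesData` using `model_validate_json` (a Pydantic v2 method). This raises a `ValidationError` if the reply ever drifts from the expected structure:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Parse and validate the raw reply against the Pydantic model\n",
        "cities_data = CitiesData.model_validate_json(valid_reply)\n",
        "print(cities_data.cities[0].name)"
      ]
    },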
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vXUnDfIH8LCN"
      },
      "source": [
        "### Using JSON schema\n",
        "\n",
        "Now, we’ll create a JSON schema for the `CitiesData` model and pass it to `OpenAIChatGenerator`. OpenAI expects schemas in a specific format, so the schema generated with `model_json_schema()` cannot be used directly.\n",
        "\n",
        "For details on how to create schemas for OpenAI, see the [OpenAI Structured Outputs guide](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "id": "cs3ETz3O8LCN"
      },
      "outputs": [],
      "source": [
        "cities_data_schema = {\n",
        "    \"type\": \"json_schema\",\n",
        "    \"json_schema\": {\n",
        "        \"name\": \"CitiesData\",\n",
        "        \"schema\": {\n",
        "            \"type\": \"object\",\n",
        "            \"properties\": {\n",
        "                \"cities\": {\n",
        "                    \"type\": \"array\",\n",
        "                    \"items\": {\n",
        "                        \"type\": \"object\",\n",
        "                        \"properties\": {\n",
        "                            \"name\": {\"type\": \"string\"},\n",
        "                            \"country\": {\"type\": \"string\"},\n",
        "                            \"population\": {\"type\": \"integer\"},\n",
        "                        },\n",
        "                        \"required\": [\"name\", \"country\", \"population\"],\n",
        "                        \"additionalProperties\": False,\n",
        "                    },\n",
        "                },\n",
        "            },\n",
        "            \"required\": [\"cities\"],\n",
        "            \"additionalProperties\": False,\n",
        "        },\n",
        "        \"strict\": True,\n",
        "    },\n",
        "}"
      ]
    },
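    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you'd rather not write the schema by hand, a minimal sketch (assuming Pydantic v2) is to start from `CitiesData.model_json_schema()` and recursively add the `additionalProperties: false` constraints that OpenAI's strict mode requires:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def make_strict(schema: dict) -> dict:\n",
        "    # Recursively forbid extra properties on every object, as strict mode requires\n",
        "    if schema.get(\"type\") == \"object\":\n",
        "        schema[\"additionalProperties\"] = False\n",
        "    for value in schema.values():\n",
        "        if isinstance(value, dict):\n",
        "            make_strict(value)\n",
        "        elif isinstance(value, list):\n",
        "            for item in value:\n",
        "                if isinstance(item, dict):\n",
        "                    make_strict(item)\n",
        "    return schema\n",
        "\n",
        "\n",
        "generated_schema = {\n",
        "    \"type\": \"json_schema\",\n",
        "    \"json_schema\": {\n",
        "        \"name\": \"CitiesData\",\n",
        "        \"schema\": make_strict(CitiesData.model_json_schema()),\n",
        "        \"strict\": True,\n",
        "    },\n",
        "}"
      ]
    },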
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "J5vpI1A48LCN"
      },
      "source": [
        "Pass this JSON schema to the `response_format` parameter of the chat generator. We run the generator on its own to see the output."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "id": "luHoDVd48LCN",
        "outputId": "5142a3cf-0353-48ba-c2bc-ff9173bfa5a6",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{\"cities\":[{\"name\":\"Berlin\",\"country\":\"Germany\",\"population\":3850809},{\"name\":\"Paris\",\"country\":\"France\",\"population\":2161000},{\"name\":\"Lisbon\",\"country\":\"Portugal\",\"population\":504718}]}\n"
          ]
        }
      ],
      "source": [
        "chat_generator = OpenAIChatGenerator(generation_kwargs={\"response_format\": cities_data_schema})\n",
        "\n",
        "text = \"Berlin is the capital of Germany. It has a population of 3,850,809. Paris, France's capital, has 2.161 million residents. Lisbon is the capital and the largest city of Portugal with the population of 504,718.\"\n",
        "result = chat_generator.run(messages=[ChatMessage.from_user(text)])\n",
        "\n",
        "print(result[\"replies\"][0].text)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dGugTWbz8LCN"
      },
      "source": [
        "## Structured Outputs with `OpenAIResponsesChatGenerator`\n",
        "\n",
        "### Using Pydantic Models\n",
        "We'll use the models `City` and `CitiesData` defined above.\n",
        "[OpenAIResponsesChatGenerator](https://docs.haystack.deepset.ai/docs/openairesponseschatgenerator) generates\n",
        "text using OpenAI's `gpt-5-mini` model by default. We pass our Pydantic model to the `text_format` parameter in `generation_kwargs`.\n",
        "\n",
        "Note: You can set `text_format` for the generator by passing it in `generation_kwargs` in either the `init` or the `run` method."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "id": "xRyW9NzS8LCN"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from getpass import getpass\n",
        "\n",
        "from haystack.components.generators.chat import OpenAIResponsesChatGenerator\n",
        "\n",
        "if \"OPENAI_API_KEY\" not in os.environ:\n",
        "    os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter OpenAI API key:\")\n",
        "responses_generator = OpenAIResponsesChatGenerator(generation_kwargs={\"text_format\": CitiesData})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_OEWZ14d8LCN"
      },
      "source": [
        "Let's check the structured output with a simple user message."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "bHwuFRwu8LCN",
        "outputId": "b677b65e-6882-429d-8b8f-1a4318956973",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[ReasoningContent(reasoning_text='', extra={'id': 'rs_049246de4a3c0e6b0069526cbc22a881a0b20b4a48f826829d', 'type': 'reasoning'}), TextContent(text='{\"cities\":[{\"name\":\"Berlin\",\"country\":\"Germany\",\"population\":3850809},{\"name\":\"Paris\",\"country\":\"France\",\"population\":2161000}]}')], _name=None, _meta={'id': 'resp_049246de4a3c0e6b0069526cbb9f9081a08d3a8ac3bee64438', 'created_at': 1767009467.0, 'error': None, 'incomplete_details': None, 'instructions': None, 'metadata': {}, 'model': 'gpt-5-mini-2025-08-07', 'object': 'response', 'parallel_tool_calls': True, 'temperature': 1.0, 'tool_choice': 'auto', 'tools': [], 'top_p': 1.0, 'background': False, 'max_output_tokens': None, 'max_tool_calls': None, 'previous_response_id': None, 'prompt_cache_key': None, 'prompt_cache_retention': None, 'reasoning': {'effort': 'medium', 'summary': None}, 'safety_identifier': None, 'service_tier': 'default', 'status': 'completed', 'text': {'format': {'name': 'CitiesData', 'schema': {'$defs': {'City': {'properties': {'name': {'title': 'Name', 'type': 'string'}, 'country': {'title': 'Country', 'type': 'string'}, 'population': {'title': 'Population', 'type': 'integer'}}, 'required': ['name', 'country', 'population'], 'title': 'City', 'type': 'object', 'additionalProperties': False}}, 'properties': {'cities': {'items': {'$ref': '#/$defs/City'}, 'title': 'Cities', 'type': 'array'}}, 'required': ['cities'], 'title': 'CitiesData', 'type': 'object', 'additionalProperties': False}, 'type': 'json_schema', 'description': None, 'strict': True}, 'verbosity': 'medium'}, 'top_logprobs': 0, 'truncation': 'disabled', 'usage': {'input_tokens': 131, 'input_tokens_details': {'cached_tokens': 0}, 'output_tokens': 299, 'output_tokens_details': {'reasoning_tokens': 256}, 'total_tokens': 430}, 'user': None, 'billing': {'payer': 'developer'}, 'completed_at': 1767009473, 'store': True, 'logprobs': [[]]})]}"
            ]
          },
          "metadata": {},
          "execution_count": 12
        }
      ],
      "source": [
        "responses_generator.run(messages=[ChatMessage.from_user(\"Berlin is the capital of Germany. It has a population of 3,850,809. Paris, France's capital, has 2.161 million residents.\")])"
      ]
    },
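    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The full reply above carries a lot of metadata. To get just the structured payload, read `.text` from the first reply and load it back into the `CitiesData` model:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "result = responses_generator.run(\n",
        "    messages=[ChatMessage.from_user(\"Berlin is the capital of Germany. It has a population of 3,850,809.\")]\n",
        ")\n",
        "# Validate the structured payload with the Pydantic model\n",
        "cities_data = CitiesData.model_validate_json(result[\"replies\"][0].text)\n",
        "print(cities_data)"
      ]
    },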
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4SYliVWz8LCN"
      },
      "source": [
        "### Using JSON Schema\n",
        "Now, we’ll create a JSON schema for the `CitiesData` model and pass it to `OpenAIResponsesChatGenerator`. We cannot reuse the schema we defined for `OpenAIChatGenerator`, because the OpenAI Responses API expects a different schema format.\n",
        "For further details, see the [documentation](https://platform.openai.com/docs/guides/migrate-to-responses#6-update-structured-outputs-definition).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "id": "7MMbREMY8LCN"
      },
      "outputs": [],
      "source": [
        "cities_data_schema_responses = {\n",
        "    \"format\": {\n",
        "        \"type\": \"json_schema\",\n",
        "        \"name\": \"CitiesData\",\n",
        "        \"schema\": {\n",
        "            \"type\": \"object\",\n",
        "            \"properties\": {\n",
        "                \"cities\": {\n",
        "                    \"type\": \"array\",\n",
        "                    \"items\": {\n",
        "                        \"type\": \"object\",\n",
        "                        \"properties\": {\n",
        "                            \"name\": {\"type\": \"string\"},\n",
        "                            \"country\": {\"type\": \"string\"},\n",
        "                            \"population\": {\"type\": \"integer\"},\n",
        "                        },\n",
        "                        \"required\": [\"name\", \"country\", \"population\"],\n",
        "                        \"additionalProperties\": False,\n",
        "                    },\n",
        "                },\n",
        "            },\n",
        "            \"required\": [\"cities\"],\n",
        "            \"additionalProperties\": False,\n",
        "        },\n",
        "        \"strict\": True,\n",
        "    },\n",
        "}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F2HlUNS08LCO"
      },
      "source": [
        "We pass our JSON schema to the `text` parameter in `generation_kwargs`.\n",
        "\n",
        "Note: You can also set `text` in the `generation_kwargs` param of the chat generator's `run` method."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {
        "id": "nKbaJt6_8LCO",
        "outputId": "c1d12cf5-bf07-4a3c-e424-befc2b0b006b",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{'cities': [{'name': 'Berlin', 'country': 'Germany', 'population': 3850809}, {'name': 'Paris', 'country': 'France', 'population': 2161000}]}\n"
          ]
        }
      ],
      "source": [
        "chat_generator = OpenAIResponsesChatGenerator(generation_kwargs={\"text\": cities_data_schema_responses})\n",
        "\n",
        "result = chat_generator.run(messages=[ChatMessage.from_user(\"Berlin is the capital of Germany. It has a population of 3,850,809. Paris, France's capital, has 2.161 million residents.\")])\n",
        "parsed = json.loads(result[\"replies\"][0].text)\n",
        "\n",
        "print(parsed)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NJAmZU0P8LCO"
      },
      "source": [
        "## What's next\n",
        "\n",
        "🎉 Congratulations! You've learned how to easily produce structured outputs with `OpenAIChatGenerator` and `OpenAIResponsesChatGenerator` using Pydantic models and JSON schemas.\n",
        "\n",
        "Other chat generators that also support structured outputs: [`MistralChatGenerator`](https://docs.haystack.deepset.ai/docs/mistralchatgenerator), [`OpenRouterChatGenerator`](https://docs.haystack.deepset.ai/docs/openrouterchatgenerator), [`NvidiaChatGenerator`](https://docs.haystack.deepset.ai/docs/nvidiachatgenerator), [`MetaLlamaChatGenerator`](https://docs.haystack.deepset.ai/docs/metallamachatgenerator), [`TogetherAIChatGenerator`](https://docs.haystack.deepset.ai/docs/togetheraichatgenerator),\n",
        "[`LlamaStackChatGenerator`](https://docs.haystack.deepset.ai/docs/llamastackchatgenerator) and [`STACKITChatGenerator`](https://docs.haystack.deepset.ai/docs/stackitchatgenerator).\n",
        "\n",
        "To stay up to date on the latest Haystack developments, you can [subscribe to our newsletter](https://landing.deepset.ai/haystack-community-updates) and [join the Haystack Discord community](https://discord.gg/haystack).\n",
        "\n",
        "Thanks for reading!"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "base",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}