{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Keyword Extraction with LLM Chat Generator\n",
    "This notebook demonstrates how to extract keywords and key phrases from text using Haystack’s `ChatPromptBuilder` together with an LLM via `OpenAIChatGenerator`. We will:\n",
    "\n",
    "- Define a prompt that instructs the model to identify single- and multi-word keywords.\n",
    "\n",
    "- Capture each keyword’s character offsets.\n",
    "\n",
    "- Assign a relevance score (0–1).\n",
    "\n",
    "- Parse and display the results as JSON.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install Packages and Set Up the OpenAI API Key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install haystack-ai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "if \"OPENAI_API_KEY\" not in os.environ:\n",
    "    os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter OpenAI API key:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import Required Libraries\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "\n",
    "from haystack.dataclasses import ChatMessage\n",
    "from haystack.components.builders import ChatPromptBuilder\n",
    "from haystack.components.generators.chat import OpenAIChatGenerator\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare Text\n",
    "Define the text you want to analyze."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_to_analyze = \"Artificial intelligence models like large language models are increasingly integrated into various sectors including healthcare, finance, education, and customer service. They can process natural language, generate text, translate languages, and extract meaningful insights from unstructured data. When performing key word extraction, these systems identify the most significant terms, phrases, or concepts that represent the core meaning of a document. Effective extraction must balance between technical terminology, domain-specific jargon, named entities, action verbs, and contextual relevance. The process typically involves tokenization, stopword removal, part-of-speech tagging, frequency analysis, and semantic relationship mapping to prioritize terms that most accurately capture the document's essential information and main topics.\"\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build the Prompt\n",
    "We construct a single-message template that instructs the model to extract keywords together with their positions and relevance scores, and to return the output as a JSON object.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "messages = [\n",
    "    ChatMessage.from_user(\n",
    "        '''\n",
    "You are a keyword extractor. Extract the most relevant keywords and phrases from the following text. For each keyword:\n",
    "1. Find single and multi-word keywords that capture important concepts\n",
    "2. Include the starting position (index) where each keyword appears in the text\n",
    "3. Assign a relevance score between 0 and 1 for each keyword\n",
    "4. Focus on nouns, noun phrases, and important terms\n",
    "\n",
    "Text to analyze: {{text}}\n",
    "\n",
    "Return the results as a JSON object in this exact format:\n",
    "{\n",
    "  \"keywords\": [\n",
    "    {\n",
    "      \"keyword\": \"example term\",\n",
    "      \"positions\": [5],\n",
    "      \"score\": 0.95\n",
    "    },\n",
    "    {\n",
    "      \"keyword\": \"another keyword\",\n",
    "      \"positions\": [20],\n",
    "      \"score\": 0.85\n",
    "    }\n",
    "  ]\n",
    "}\n",
    "\n",
    "Important:\n",
    "- Each keyword must have its EXACT character position in the text (counting from 0)\n",
    "- Scores should reflect the relevance (0–1)\n",
    "- Include both single words and meaningful phrases\n",
    "- List results from highest to lowest score\n",
    "'''\n",
    "    )\n",
    "]\n",
    "\n",
    "builder = ChatPromptBuilder(template=messages, required_variables='*')\n",
    "prompt = builder.run(text=text_to_analyze)\n"
   ]
  },
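  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before sending the prompt, it can help to check what the builder produced. The cell below is a minimal sketch, assuming `builder.run` returns a dictionary whose `prompt` key holds the rendered list of `ChatMessage` objects (the same value this notebook passes to the generator). The name `rendered_message` is only illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The template contained one user message, so one rendered message is expected\n",
    "rendered_message = prompt[\"prompt\"][0]\n",
    "\n",
    "# Check whether the {{text}} placeholder was filled with the text to analyze\n",
    "print(\"Placeholder filled:\", text_to_analyze in rendered_message.text)\n",
    "\n",
    "# Preview the start of the rendered prompt\n",
    "print(rendered_message.text[:300])"
   ]
  },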
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialize the Generator and Extract Keywords\n",
    "We use `OpenAIChatGenerator` with the `gpt-4o-mini` model to send the prompt, and set `response_format` to `json_object` to request a JSON-formatted reply."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the chat-based generator\n",
    "extractor = OpenAIChatGenerator(model=\"gpt-4o-mini\")\n",
    "\n",
    "# Run the generator with our formatted prompt\n",
    "results = extractor.run(\n",
    "    messages=prompt[\"prompt\"],\n",
    "    generation_kwargs={\"response_format\": {\"type\": \"json_object\"}}\n",
    ")\n",
    "\n",
    "# Extract the raw text reply\n",
    "output_str = results[\"replies\"][0].text"
   ]
  },
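  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A reply that is cut off by the token limit would not be complete JSON. As a hedged check, the sketch below reads the reply's `meta` dictionary. The `finish_reason` and `usage` keys are an assumption about what the generator records there, so they are read with `.get` and may come back as `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Metadata attached to the first reply (keys below are assumed, hence .get)\n",
    "reply_meta = results[\"replies\"][0].meta\n",
    "finish_reason = reply_meta.get(\"finish_reason\")\n",
    "\n",
    "print(\"Finish reason:\", finish_reason)\n",
    "print(\"Token usage:\", reply_meta.get(\"usage\"))\n",
    "\n",
    "# A finish reason of \"length\" means the reply stopped at the token limit\n",
    "if finish_reason == \"length\":\n",
    "    print(\"Warning: the reply was truncated, so the JSON may be incomplete.\")"
   ]
  },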
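  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parse and Display Results\n",
    "Finally, convert the returned JSON string into a Python object and iterate over the extracted keywords. The positions are generated by the model rather than computed from the text, so treat them as approximate until verified."
   ]
  },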
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Keyword: artificial intelligence\n",
      " Positions: [0]\n",
      " Score: 1.0\n",
      "\n",
      "Keyword: large language models\n",
      " Positions: [18]\n",
      " Score: 0.95\n",
      "\n",
      "Keyword: healthcare\n",
      " Positions: [63]\n",
      " Score: 0.9\n",
      "\n",
      "Keyword: finance\n",
      " Positions: [72]\n",
      " Score: 0.9\n",
      "\n",
      "Keyword: education\n",
      " Positions: [81]\n",
      " Score: 0.9\n",
      "\n",
      "Keyword: customer service\n",
      " Positions: [91]\n",
      " Score: 0.9\n",
      "\n",
      "Keyword: natural language\n",
      " Positions: [108]\n",
      " Score: 0.85\n",
      "\n",
      "Keyword: unstructured data\n",
      " Positions: [162]\n",
      " Score: 0.85\n",
      "\n",
      "Keyword: key word extraction\n",
      " Positions: [193]\n",
      " Score: 0.8\n",
      "\n",
      "Keyword: significant terms\n",
      " Positions: [215]\n",
      " Score: 0.8\n",
      "\n",
      "Keyword: technical terminology\n",
      " Positions: [290]\n",
      " Score: 0.75\n",
      "\n",
      "Keyword: domain-specific jargon\n",
      " Positions: [311]\n",
      " Score: 0.75\n",
      "\n",
      "Keyword: named entities\n",
      " Positions: [334]\n",
      " Score: 0.7\n",
      "\n",
      "Keyword: action verbs\n",
      " Positions: [352]\n",
      " Score: 0.7\n",
      "\n",
      "Keyword: contextual relevance\n",
      " Positions: [367]\n",
      " Score: 0.7\n",
      "\n",
      "Keyword: tokenization\n",
      " Positions: [406]\n",
      " Score: 0.65\n",
      "\n",
      "Keyword: stopword removal\n",
      " Positions: [420]\n",
      " Score: 0.65\n",
      "\n",
      "Keyword: part-of-speech tagging\n",
      " Positions: [437]\n",
      " Score: 0.65\n",
      "\n",
      "Keyword: frequency analysis\n",
      " Positions: [457]\n",
      " Score: 0.65\n",
      "\n",
      "Keyword: semantic relationship mapping\n",
      " Positions: [476]\n",
      " Score: 0.65\n",
      "\n",
      "Keyword: essential information\n",
      " Positions: [508]\n",
      " Score: 0.6\n",
      "\n",
      "Keyword: main topics\n",
      " Positions: [529]\n",
      " Score: 0.6\n",
      "\n"
     ]
    }
    ],
   "source": [
    "try:\n",
    "    data = json.loads(output_str)\n",
    "except json.JSONDecodeError:\n",
    "    print(\"Failed to parse the output as JSON. Raw output:\", output_str)\n",
    "else:\n",
    "    # Print each keyword with its model-reported positions and relevance score\n",
    "    for kw in data.get(\"keywords\", []):\n",
    "        print(f'Keyword: {kw[\"keyword\"]}')\n",
    "        print(f' Positions: {kw[\"positions\"]}')\n",
    "        print(f' Score: {kw[\"score\"]}\\n')\n"
   ]
  }
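  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Verify the Positions\n",
    "Character offsets reported by a language model are not guaranteed to be exact. As a minimal sketch, the cell below recomputes each keyword's offsets with a case-insensitive search and compares them with the model's values. The helper `find_positions` is hypothetical and not part of Haystack. It assumes that `output_str` is valid JSON and that lowercasing does not change the length of the text, which holds for this ASCII sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_positions(text, keyword):\n",
    "    # Hypothetical helper: every start offset of keyword in text, ignoring case\n",
    "    lowered_text, needle = text.lower(), keyword.lower()\n",
    "    if not needle:\n",
    "        return []\n",
    "    positions = []\n",
    "    start = lowered_text.find(needle)\n",
    "    while start != -1:\n",
    "        positions.append(start)\n",
    "        start = lowered_text.find(needle, start + 1)\n",
    "    return positions\n",
    "\n",
    "\n",
    "# Compare the model-reported offsets with the offsets found in the text\n",
    "for kw in json.loads(output_str).get(\"keywords\", []):\n",
    "    actual = find_positions(text_to_analyze, kw[\"keyword\"])\n",
    "    status = \"ok\" if kw[\"positions\"] == actual else \"mismatch\"\n",
    "    print(f'{kw[\"keyword\"]}: model={kw[\"positions\"]}, actual={actual} ({status})')"
   ]
  }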
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
