{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b79cad40-2c9c-4598-8195-0d6cf525ff87",
   "metadata": {},
   "source": [
    "# Tutorial: Retrieving a Context Window Around a Sentence\n",
    "\n",
    "- **Level**: Beginner\n",
    "- **Time to complete**: 10 minutes\n",
    "- **Components Used**: [`SentenceWindowRetriever`](https://docs.haystack.deepset.ai/docs/sentencewindowretrieval),\n",
    "[`DocumentSplitter`](https://docs.haystack.deepset.ai/docs/documentsplitter), [`InMemoryDocumentStore`](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore), [`InMemoryBM25Retriever`](https://docs.haystack.deepset.ai/docs/inmemorybm25retriever)\n",
    "- **Goal**: After completing this tutorial, you will have learned about Sentence-Window Retrieval and how to use it for document retrieval.\n",
    "\n",
    "## Overview\n",
    "\n",
    "The Sentence-Window retrieval technique is a simple and effective way to retrieve more context around a document that matched a user query. It is based on the idea that the sentences most relevant to a match are likely to be close to it in the source document. Instead of returning only the matching sentence, the technique returns a window of sentences around it. This is particularly useful when the user query is a question or a phrase that requires more context to be understood.\n",
    "\n",
    "The [`SentenceWindowRetriever`](https://docs.haystack.deepset.ai/docs/sentencewindowretrieval) can be used in a Pipeline to implement the Sentence-Window retrieval technique.\n",
    "\n",
    "The component takes a `document_store` and a `window_size` as input. The `document_store` contains the documents we want to query, and the `window_size` determines how many sentences to return on each side of the matching sentence. The number of sentences returned is therefore `2 * window_size + 1`, or fewer when the match is close to the start or end of its source document. Although we use the term \"sentence\" because it is inherently attached to this technique, the `SentenceWindowRetriever` actually works with any splitting unit supported by the `DocumentSplitter` class, for instance: `word`, `sentence`, `page`.\n",
    "\n",
    "`SentenceWindowRetriever(document_store=doc_store, window_size=2)`"
   ]
  },
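  {
   "cell_type": "markdown",
   "id": "sketch-window-slice",
   "metadata": {},
   "source": [
    "As a rough illustration of the `2 * window_size + 1` arithmetic, here is a minimal plain-Python sketch that does not use Haystack. The helper name `sentence_window` and the toy sentence list are assumptions made for this example only; the real component works on `Document` objects in a document store.\n",
    "\n",
    "```python\n",
    "def sentence_window(sentences, match_index, window_size):\n",
    "    # Illustrative helper (hypothetical name), not part of Haystack.\n",
    "    # Clamp the left edge at 0; slicing already clamps the right edge.\n",
    "    start = max(0, match_index - window_size)\n",
    "    end = match_index + window_size + 1\n",
    "    return sentences[start:end]\n",
    "\n",
    "\n",
    "sentences = ['s0', 's1', 's2', 's3', 's4', 's5', 's6']\n",
    "\n",
    "# A match in the middle yields 2 * window_size + 1 sentences.\n",
    "print(sentence_window(sentences, match_index=3, window_size=2))\n",
    "# A match at the start has no left neighbours, so the window is shorter.\n",
    "print(sentence_window(sentences, match_index=0, window_size=2))\n",
    "```\n",
    "\n",
    "The first call prints five sentences (`s1` to `s5`); the second prints only three, because there is nothing to the left of the first sentence."
   ]
  },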
  {
   "cell_type": "markdown",
   "id": "98c2f9d3",
   "metadata": {},
   "source": [
    "## Installing Haystack\n",
    "\n",
    "To start, install the latest release of Haystack with `pip`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fae70eb2",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install --upgrade pip\n",
    "pip install haystack-ai nltk"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee689359-94bf-45b6-b69b-f17266382ff8",
   "metadata": {},
   "source": [
    "## Getting started with Sentence-Window Retrieval\n",
    "\n",
    "Let's see a simple example of how to use the `SentenceWindowRetriever` in isolation, and later we can see how to use it within a pipeline. We start by creating a document and splitting it into sentences using the `DocumentSplitter` class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "fb04d1f9-5329-499c-9479-7e3b4b4fa126",
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack import Document\n",
    "from haystack.components.preprocessors import DocumentSplitter\n",
    "\n",
    "splitter = DocumentSplitter(split_length=1, split_overlap=0, split_by=\"period\")\n",
    "\n",
    "text = (\n",
    "    \"Paul fell asleep to dream of an Arrakeen cavern, silent people all around  him moving in the dim light \"\n",
    "    \"of glowglobes. It was solemn there and like a cathedral as he listened to a faint sound—the \"\n",
    "    \"drip-drip-drip of water. Even while he remained in the dream, Paul knew he would remember it upon \"\n",
    "    \"awakening. He always remembered the dreams that were predictions. The dream faded. Paul awoke to feel \"\n",
    "    \"himself in the warmth of his bed—thinking thinking. This world of Castle Caladan, without play or \"\n",
    "    \"companions his own age,  perhaps did not deserve sadness in farewell. Dr Yueh, his teacher, had \"\n",
    "    \"hinted  that the faufreluches class system was not rigidly guarded on Arrakis. The planet sheltered \"\n",
    "    \"people who lived at the desert edge without caid or bashar to command them: will-o’-the-sand people \"\n",
    "    \"called Fremen, marked down on no  census of the Imperial Regate.\"\n",
    ")\n",
    "\n",
    "doc = Document(content=text)\n",
    "docs = splitter.run([doc])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61d94c11-4c32-4022-979d-8316f9069ac8",
   "metadata": {},
   "source": [
    "This results in 9 sentences, each represented as a Haystack `Document` object. We can then write these documents to a `DocumentStore` and use the `SentenceWindowRetriever` to retrieve a window of sentences around a matching sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "24a7fd39-19df-486e-9914-166dc3e77cc4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
    "from haystack.document_stores.types import DuplicatePolicy\n",
    "\n",
    "doc_store = InMemoryDocumentStore()\n",
    "doc_store.write_documents(docs[\"documents\"], policy=DuplicatePolicy.OVERWRITE)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1253dd5-e71e-4a26-814d-e8d256750aff",
   "metadata": {},
   "source": [
    "Now we use the `SentenceWindowRetriever` to retrieve a window of sentences around a certain sentence. Note that at run time the `SentenceWindowRetriever` receives a `Document` that is present in the document store, and it relies on the document's metadata to retrieve the window of sentences around the matching sentence. This means the `SentenceWindowRetriever` needs to be used in conjunction with another `Retriever`, such as the `InMemoryBM25Retriever`, that handles the initial user query and returns the matching documents.\n",
    "\n",
    "Let's pass the `Document` containing the sentence `The dream faded.` to the `SentenceWindowRetriever` and retrieve a window of 2 sentences on each side of it. Note that we need to wrap it in a list, as the `run` method expects a list of documents."
   ]
  },
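  {
   "cell_type": "markdown",
   "id": "sketch-metadata-neighbours",
   "metadata": {},
   "source": [
    "To make the role of the metadata more concrete, the sketch below selects neighbours using only the `source_id` and `split_id` fields that the splitter attaches to each chunk. It is a plain-Python approximation under stated assumptions: the `neighbours` helper and the dictionary-based chunks are hypothetical stand-ins, and the actual component queries the document store rather than a Python list.\n",
    "\n",
    "```python\n",
    "def neighbours(chunks, hit, window_size):\n",
    "    # Illustrative stand-in for a metadata filter; not the Haystack API.\n",
    "    low = hit['split_id'] - window_size\n",
    "    high = hit['split_id'] + window_size\n",
    "    # Keep chunks from the same source document whose split_id is in range.\n",
    "    return [\n",
    "        c for c in chunks\n",
    "        if c['source_id'] == hit['source_id'] and low <= c['split_id'] <= high\n",
    "    ]\n",
    "\n",
    "\n",
    "# Hypothetical chunks from two source documents, 'a' and 'b'.\n",
    "chunks = [{'source_id': s, 'split_id': i} for s in ('a', 'b') for i in range(4)]\n",
    "hit = {'source_id': 'b', 'split_id': 1}\n",
    "\n",
    "print(neighbours(chunks, hit, window_size=1))\n",
    "```\n",
    "\n",
    "Only chunks 0, 1 and 2 of document `b` are selected; chunks of document `a` with the same `split_id` values are ignored because their `source_id` differs."
   ]
  },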
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "c028b53b-e68c-4b02-bd83-f9096aa54079",
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.components.retrievers import SentenceWindowRetriever\n",
    "\n",
    "retriever = SentenceWindowRetriever(document_store=doc_store, window_size=2)\n",
    "result = retriever.run(retrieved_documents=[docs[\"documents\"][4]])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0992d153-6770-4519-a930-6b4c85115611",
   "metadata": {},
   "source": [
    "The result is a dictionary with two keys:\n",
    "\n",
    "- `context_windows`: a list of strings containing the context windows around the matching sentence.\n",
    "- `context_documents`: a list of `Document` objects containing the retrieved documents plus the context documents surrounding them. The documents are sorted by the `split_idx_start` meta field."
   ]
  },
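  {
   "cell_type": "markdown",
   "id": "sketch-merge-window",
   "metadata": {},
   "source": [
    "The relationship between the two keys can be sketched as follows: ordering the chunks by `split_idx_start` and concatenating their text gives the window string. This is a simplified sketch that assumes `split_overlap=0`, so no overlapping text has to be removed; `merge_window` is a hypothetical helper, and the two chunks are taken from the example text in this tutorial.\n",
    "\n",
    "```python\n",
    "def merge_window(chunks):\n",
    "    # Assumes split_overlap=0, so chunks can simply be concatenated.\n",
    "    ordered = sorted(chunks, key=lambda c: c['split_idx_start'])\n",
    "    return ''.join(c['content'] for c in ordered)\n",
    "\n",
    "\n",
    "# Two chunks from the example text, deliberately listed out of order.\n",
    "window_chunks = [\n",
    "    {'content': ' The dream faded.', 'split_idx_start': 358},\n",
    "    {'content': ' He always remembered the dreams that were predictions.', 'split_idx_start': 303},\n",
    "]\n",
    "\n",
    "print(merge_window(window_chunks))\n",
    "```\n",
    "\n",
    "The chunk starting at character 303 is placed before the one starting at 358, restoring the original reading order."
   ]
  },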
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b92920a5-5937-4f6b-87fb-a68db4c79401",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[' Even while he remained in the dream, Paul knew he would remember it upon awakening. He always remembered the dreams that were predictions. The dream faded. Paul awoke to feel himself in the warmth of his bed—thinking thinking. This world of Castle Caladan, without play or companions his own age,  perhaps did not deserve sadness in farewell.']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result[\"context_windows\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "faded9fe-725a-4b50-8855-7356ec0749e7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(id=5d093b6ec1a4bdc7e75f033ae0b570e237053213a09b42a56ad815b4d118943d, content: ' Even while he remained in the dream, Paul knew he would remember it upon awakening.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 2, 'split_idx_start': 219}),\n",
       " Document(id=4ed71ff61df531053cc7d5f80e8a0bd1e702f3a396f3f3983ceeffe89878a684, content: ' He always remembered the dreams that were predictions.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 3, 'split_idx_start': 303}),\n",
       " Document(id=f485258001abdf2deab98249c7f0826b4f6b1bef7c37763d14318e7b595f434f, content: ' The dream faded.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 4, 'split_idx_start': 358}),\n",
       " Document(id=f39c29c3a3122affc5909dc7b98f5880d9bd984731380420134c440da6fee363, content: ' Paul awoke to feel himself in the warmth of his bed—thinking thinking.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 5, 'split_idx_start': 375}),\n",
       " Document(id=15401623a2a4fed533db7c1bbe8df157f79a9395cf8d3d6e92dc5ae553d0dded, content: ' This world of Castle Caladan, without play or companions his own age,  perhaps did not deserve sadn...', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 6, 'split_idx_start': 446})]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result[\"context_documents\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "289e207c-7f8f-45da-bdfd-61ed4955942d",
   "metadata": {},
   "source": [
    "## Create a Keyword Retrieval Pipeline with Sentence-Window Retrieval"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6bc10c96-e453-4e43-9b64-d05fae6de040",
   "metadata": {},
   "source": [
    "Let's see this component in action. We will use the BBC news dataset to show how the `SentenceWindowRetriever` works with a dataset containing multiple news articles.\n",
    "\n",
    "### Reading the dataset\n",
    "\n",
    "The original dataset is available at http://mlg.ucd.ie/datasets/bbc.html, but it has already been preprocessed and stored in\n",
    "a single CSV file available here: https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "82565cae-a730-4cd7-85f3-40be0e77b94d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "import csv\n",
    "from haystack import Document\n",
    "\n",
    "\n",
    "def read_documents(file: str) -> List[Document]:\n",
    "    # Use a separate handle name so the `file` argument is not shadowed.\n",
    "    with open(file, \"r\") as f:\n",
    "        reader = csv.reader(f, delimiter=\"\\t\")\n",
    "        next(reader, None)  # skip the headers\n",
    "        documents = []\n",
    "        for row in reader:\n",
    "            category = row[0].strip()\n",
    "            title = row[2].strip()\n",
    "            text = row[3].strip()\n",
    "            documents.append(Document(content=text, meta={\"category\": category, \"title\": title}))\n",
    "\n",
    "    return documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4f581e8b-0693-4b09-b82e-71e78cb83f1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import requests\n",
    "\n",
    "# Download the preprocessed CSV and fail early if the request was unsuccessful.\n",
    "response = requests.get(\"https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv\")\n",
    "response.raise_for_status()\n",
    "\n",
    "datafolder = Path(\"data\")\n",
    "datafolder.mkdir(exist_ok=True)\n",
    "with open(datafolder / \"bbc-news-data.csv\", \"wb\") as f:\n",
    "    for chunk in response.iter_content(512):\n",
    "        f.write(chunk)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "1ab23051-7df1-49e6-a009-ba187855aab3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2225"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = read_documents(\"data/bbc-news-data.csv\")\n",
    "len(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a003472-19c1-4bc0-b6df-995bc66e8904",
   "metadata": {},
   "source": [
    "### Indexing the documents\n",
    "\n",
    "We will now apply the `DocumentSplitter` to split the documents into sentences and write them to an `InMemoryDocumentStore`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "eb3203f3-2f75-4a60-9d2a-f530a09113a0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'writer': {'documents_written': 44186}}"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from haystack import Document, Pipeline\n",
    "from haystack.components.preprocessors import DocumentSplitter\n",
    "from haystack.components.writers import DocumentWriter\n",
    "from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
    "from haystack.document_stores.types import DuplicatePolicy\n",
    "\n",
    "doc_store = InMemoryDocumentStore()\n",
    "\n",
    "indexing_pipeline = Pipeline()\n",
    "indexing_pipeline.add_component(\"splitter\", DocumentSplitter(split_length=1, split_overlap=0, split_by=\"sentence\"))\n",
    "indexing_pipeline.add_component(\"writer\", DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.OVERWRITE))\n",
    "\n",
    "indexing_pipeline.connect(\"splitter\", \"writer\")\n",
    "\n",
    "indexing_pipeline.run({\"documents\": docs})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0b6030d-ace7-471e-ae5f-b7dfc0ec1064",
   "metadata": {},
   "source": [
    "### Build a Sentence-Window Retrieval Pipeline\n",
    "\n",
    "Let's now build a pipeline that retrieves documents using the `InMemoryBM25Retriever` (keyword retrieval) followed by the `SentenceWindowRetriever`. Here, we set up the `SentenceWindowRetriever` with a `window_size` of 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "7048bc9c-8c6a-4df0-92c4-20b6162cfdb4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<haystack.core.pipeline.pipeline.Pipeline object at 0x13abfb200>\n",
       "🚅 Components\n",
       "  - bm25_retriever: InMemoryBM25Retriever\n",
       "  - sentence_window__retriever: SentenceWindowRetriever\n",
       "🛤️ Connections\n",
       "  - bm25_retriever.documents -> sentence_window__retriever.retrieved_documents (List[Document])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from haystack.components.retrievers.in_memory import InMemoryBM25Retriever\n",
    "from haystack.components.retrievers import SentenceWindowRetriever\n",
    "\n",
    "sentence_window_pipeline = Pipeline()\n",
    "\n",
    "sentence_window_pipeline.add_component(\"bm25_retriever\", InMemoryBM25Retriever(document_store=doc_store))\n",
    "sentence_window_pipeline.add_component(\"sentence_window__retriever\", SentenceWindowRetriever(doc_store, window_size=2))\n",
    "\n",
    "sentence_window_pipeline.connect(\"bm25_retriever.documents\", \"sentence_window__retriever.retrieved_documents\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e67abab4-10c1-4d95-b8fe-1ff24bb93161",
   "metadata": {},
   "source": [
    "### Putting it all together\n",
    "\n",
    "Let's see what happens when we retrieve documents relevant to \"phishing attacks\", returning only the highest-scoring document.\n",
    "We will also include the outputs from the `InMemoryBM25Retriever` so that we can compare the results with and without the `SentenceWindowRetriever`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "fdadd81e-3a4c-474d-9a92-85fa2f6a1264",
   "metadata": {},
   "outputs": [],
   "source": [
    "result = sentence_window_pipeline.run(\n",
    "    data={\"bm25_retriever\": {\"query\": \"phishing attacks\", \"top_k\": 1}}, include_outputs_from={\"bm25_retriever\"}\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbda91dc-238e-44a8-a241-9c35115efe88",
   "metadata": {},
   "source": [
    "Let's now inspect the results from the `InMemoryBM25Retriever` and the `SentenceWindowRetriever`. Since we split the documents by sentence, the `InMemoryBM25Retriever` returns only the sentence that matches the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "ff53bf0b-ec2f-49aa-a5ff-82e0686ac81d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(id=57766497f35c7ebef5c49e754b8df41a8df3d5df3e46bc595807d7420d7cef8e, content: ' The Anti-Phishing Working group reported that the number of phishing attacks against new targets wa...', meta: {'category': 'tech', 'title': 'Cyber crime booms in 2004', 'source_id': '5c81f8cbd6c9c07819bf60e484489fe0af9e6626591ec77066701cb856fb3b33', 'page_number': 1, 'split_id': 12, 'split_idx_start': 1520}, score: 17.74585935028894)]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result[\"bm25_retriever\"][\"documents\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aca630e2-f0a9-4a07-bdcb-5f7e10415802",
   "metadata": {},
   "source": [
    "The `SentenceWindowRetriever`, on the other hand, returns a window of sentences around the matching sentence, giving us more context to understand the sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "b3453305-b536-460f-be3f-8fc6d7169673",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['\"  In particular, phishing attacks, which typically use fake versions of bank websites to grab login details of customers, boomed during 2004. Web portal Lycos Europe reported a 500% increase in the number of phishing e-mail messages it was catching. The Anti-Phishing Working group reported that the number of phishing attacks against new targets was growing at a rate of 30% or more per month. Those who fall victim to these attacks can find that their bank account has been cleaned out or that their good name has been ruined by someone stealing their identity. This change in the ranks of virus writers could mean the end of the mass-mailing virus which attempts to spread by tricking people into opening infected attachments on e-mail messages.']"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result[\"sentence_window__retriever\"][\"context_windows\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbbf6f9f",
   "metadata": {},
   "source": [
    "We can also access the context window as a list of `Document` objects:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "6b5fd252",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(id=905582f4a147cae72b90223e433db5986c4ff46d8c8a325fe56ea3cfbecff742, content: '\"  In particular, phishing attacks, which typically use fake versions of bank websites to grab login...', meta: {'category': 'tech', 'title': 'Cyber crime booms in 2004', 'source_id': '5c81f8cbd6c9c07819bf60e484489fe0af9e6626591ec77066701cb856fb3b33', 'page_number': 1, 'split_id': 10, 'split_idx_start': 1270}),\n",
       " Document(id=91f6969683e714cddf3ef4816616176d7e467bb7756eb4051f0aa5f15e7bcabd, content: ' Web portal Lycos Europe reported a 500% increase in the number of phishing e-mail messages it was c...', meta: {'category': 'tech', 'title': 'Cyber crime booms in 2004', 'source_id': '5c81f8cbd6c9c07819bf60e484489fe0af9e6626591ec77066701cb856fb3b33', 'page_number': 1, 'split_id': 11, 'split_idx_start': 1412}),\n",
       " Document(id=57766497f35c7ebef5c49e754b8df41a8df3d5df3e46bc595807d7420d7cef8e, content: ' The Anti-Phishing Working group reported that the number of phishing attacks against new targets wa...', meta: {'category': 'tech', 'title': 'Cyber crime booms in 2004', 'source_id': '5c81f8cbd6c9c07819bf60e484489fe0af9e6626591ec77066701cb856fb3b33', 'page_number': 1, 'split_id': 12, 'split_idx_start': 1520}),\n",
       " Document(id=5ed9c84a161ee527527a3bb0c7b90dddee368a840860c672623408d90d399de0, content: ' Those who fall victim to these attacks can find that their bank account has been cleaned out or tha...', meta: {'category': 'tech', 'title': 'Cyber crime booms in 2004', 'source_id': '5c81f8cbd6c9c07819bf60e484489fe0af9e6626591ec77066701cb856fb3b33', 'page_number': 1, 'split_id': 13, 'split_idx_start': 1665}),\n",
       " Document(id=08d1d6c1e05b68d626a37bf0f863affca8eda54aad886b27583b4c783d1bd308, content: ' This change in the ranks of virus writers could mean the end of the mass-mailing virus which attemp...', meta: {'category': 'tech', 'title': 'Cyber crime booms in 2004', 'source_id': '5c81f8cbd6c9c07819bf60e484489fe0af9e6626591ec77066701cb856fb3b33', 'page_number': 1, 'split_id': 14, 'split_idx_start': 1834})]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result[\"sentence_window__retriever\"][\"context_documents\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06614ff0-7b82-46dc-8416-a78633704583",
   "metadata": {},
   "source": [
    "## Wrapping Up\n",
    "\n",
    "We saw how the `SentenceWindowRetriever` works and how it can be used to retrieve a window of sentences around a matching document, giving us more context to understand the document. One important aspect to notice is that the `SentenceWindowRetriever` doesn't handle queries directly but relies on the output of another `Retriever` that handles the initial user query. This allows the `SentenceWindowRetriever` to be used in conjunction with any other retriever in the pipeline, such as the `InMemoryBM25Retriever`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
