{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial: Embedding Metadata for Improved Retrieval\n",
    "\n",
    "\n",
    "- **Level**: Intermediate\n",
    "- **Time to complete**: 10 minutes\n",
    "- **Components Used**: [`InMemoryDocumentStore`](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore), [`InMemoryEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever), [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder), [`SentenceTransformersTextEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder)\n",
    "- **Goal**: After completing this tutorial, you'll have learned how to embed metadata information while indexing documents, to improve retrieval.\n",
    "\n",
    "> ⚠️ Note of caution: The method showcased in this tutorial is not always the right approach for all types of metadata. This method works best when the embedded metadata is meaningful. For example, here we're showcasing embedding the \"title\" meta field, which can also provide good context for the embedding model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "While indexing documents into a document store, we have 2 options: embed the text for that document or embed the text alongside some meaningful metadata. In some cases, embedding meaningful metadata alongside the contents of a document may improve retrieval down the line. \n",
    "\n",
    "In this tutorial, we will see how we can embed metadata as well as the text of a document. We will fetch various pages from Wikipedia and index them into an `InMemoryDocumentStore` with metadata information that includes their title, and URL. Next, we will see how retrieval with and without this metadata."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "### Prepare the Colab Environment\n",
    "\n",
    "- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration)\n",
    "- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/setting-the-log-level)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install Haystack\n",
    "\n",
    "Install Haystack and other required packages with `pip`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install haystack-ai wikipedia sentence-transformers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Indexing Documents with Metadata\n",
    "\n",
    "Create a pipeline to store the small example dataset in the [InMemoryDocumentStore](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore) with their embeddings. We will use [SentenceTransformersDocumentEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder) to generate embeddings for your Documents and write them to the document store with the [DocumentWriter](https://docs.haystack.deepset.ai/docs/documentwriter).\n",
    "\n",
    "After adding these components to your pipeline, connect them and run the pipeline.\n",
    "\n",
    "> 💡 The `InMemoryDocumentStore` is the simplest document store to run tutorials with and comes with no additional requirements. This can be changed to any of the other available document stores such as **Weaviate, AstraDB, Qdrant, Pinecone and more**. Check out the [full list of document stores](https://haystack.deepset.ai/integrations?type=Document+Store) with instructions on how to run them. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we'll create a helper function that can create indexing pipelines. We will optionally provide this function with `meta_fields_to_embed`. If provided, the `SentenceTransformersDocumentEmbedder` will be initialized with metadata to embed alongside the content of the document.\n",
    "\n",
    "For example, the embedder below will be embedding the \"url\" field as well as the contents of documents:\n",
    "\n",
    "```python\n",
    "from haystack.components.embedders import SentenceTransformersDocumentEmbedder\n",
    "\n",
    "embedder = SentenceTransformersDocumentEmbedder(meta_fields_to_embed=[\"url\"])\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack import Pipeline\n",
    "from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter\n",
    "from haystack.components.embedders import SentenceTransformersDocumentEmbedder\n",
    "from haystack.components.writers import DocumentWriter\n",
    "from haystack.document_stores.types import DuplicatePolicy\n",
    "from haystack.utils import ComponentDevice\n",
    "\n",
    "\n",
    "def create_indexing_pipeline(document_store, metadata_fields_to_embed=None):\n",
    "    document_cleaner = DocumentCleaner()\n",
    "    document_splitter = DocumentSplitter(split_by=\"period\", split_length=2)\n",
    "    document_embedder = SentenceTransformersDocumentEmbedder(\n",
    "        model=\"thenlper/gte-large\", meta_fields_to_embed=metadata_fields_to_embed\n",
    "    )\n",
    "    document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)\n",
    "\n",
    "    indexing_pipeline = Pipeline()\n",
    "    indexing_pipeline.add_component(\"cleaner\", document_cleaner)\n",
    "    indexing_pipeline.add_component(\"splitter\", document_splitter)\n",
    "    indexing_pipeline.add_component(\"embedder\", document_embedder)\n",
    "    indexing_pipeline.add_component(\"writer\", document_writer)\n",
    "\n",
    "    indexing_pipeline.connect(\"cleaner\", \"splitter\")\n",
    "    indexing_pipeline.connect(\"splitter\", \"embedder\")\n",
    "    indexing_pipeline.connect(\"embedder\", \"writer\")\n",
    "\n",
    "    return indexing_pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we can index our documents from various wikipedia articles. We will create 2 indexing pipelines:\n",
    "\n",
    "- The `indexing_pipeline`: which indexes only the contents of the documents. We will index these documents into `document_store`.\n",
    "- The `indexing_with_metadata_pipeline`: which indexes meta fields alongside the contents of the documents. We will index these documents into `document_store_with_embedded_metadata`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "modules.json: 100%|██████████| 349/349 [00:00<00:00, 561kB/s]\n",
      "config_sentence_transformers.json: 100%|██████████| 124/124 [00:00<00:00, 360kB/s]\n",
      "README.md: 100%|██████████| 93.0k/93.0k [00:00<00:00, 970kB/s]\n",
      "sentence_bert_config.json: 100%|██████████| 52.0/52.0 [00:00<00:00, 156kB/s]\n",
      "config.json: 100%|██████████| 743/743 [00:00<00:00, 2.25MB/s]\n",
      "model.safetensors: 100%|██████████| 133M/133M [02:41<00:00, 826kB/s]  \n",
      "tokenizer_config.json: 100%|██████████| 366/366 [00:00<00:00, 1.92MB/s]\n",
      "vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.09MB/s]\n",
      "tokenizer.json: 100%|██████████| 711k/711k [00:00<00:00, 1.11MB/s]\n",
      "special_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 374kB/s]\n",
      "1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 596kB/s]\n",
      "Batches: 100%|██████████| 2/2 [00:40<00:00, 20.03s/it]\n",
      "Batches: 100%|██████████| 2/2 [00:38<00:00, 19.09s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'writer': {'documents_written': 47}}"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import wikipedia\n",
    "from haystack import Document\n",
    "from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
    "\n",
    "some_bands = \"\"\"The Beatles,The Cure\"\"\".split(\",\")\n",
    "\n",
    "raw_docs = []\n",
    "\n",
    "for title in some_bands:\n",
    "    page = wikipedia.page(title=title, auto_suggest=False)\n",
    "    doc = Document(content=page.content, meta={\"title\": page.title, \"url\": page.url})\n",
    "    raw_docs.append(doc)\n",
    "\n",
    "document_store = InMemoryDocumentStore(embedding_similarity_function=\"cosine\")\n",
    "document_store_with_embedded_metadata = InMemoryDocumentStore(embedding_similarity_function=\"cosine\")\n",
    "\n",
    "indexing_pipeline = create_indexing_pipeline(document_store=document_store)\n",
    "indexing_with_metadata_pipeline = create_indexing_pipeline(\n",
    "    document_store=document_store_with_embedded_metadata, metadata_fields_to_embed=[\"title\"]\n",
    ")\n",
    "\n",
    "indexing_pipeline.run({\"cleaner\": {\"documents\": raw_docs}})\n",
    "indexing_with_metadata_pipeline.run({\"cleaner\": {\"documents\": raw_docs}})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparing Retrieval With and Without Embedded Metadata\n",
    "\n",
    "As a final step, we will be creating a retrieval pipeline that will have 2 retrievers:\n",
    "- First: retrieving from the `document_store`, where we have not embedded metadata.\n",
    "- Second: retrieving from the `document_store_with_embedded_metadata`, where we have embedded metadata.\n",
    "\n",
    "We will then be able to compare the results and see if embedding metadata has helped with retrieval in this case.\n",
    "\n",
    "> 💡 Here, we are using the `InMemoryEmbeddingRetriever` because we used the `InMemoryDocumentStore` above. If you're using another document store, change this to use the accompanying embedding retriever for the document store you are using. Check out the [Embedders Documentation](https://docs.haystack.deepset.ai/docs/embedders) for a full list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<haystack.pipeline.Pipeline at 0x19e0933b0>"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from haystack.components.embedders import SentenceTransformersTextEmbedder\n",
    "from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever\n",
    "\n",
    "retrieval_pipeline = Pipeline()\n",
    "retrieval_pipeline.add_component(\"text_embedder\", SentenceTransformersTextEmbedder(model=\"thenlper/gte-large\"))\n",
    "retrieval_pipeline.add_component(\n",
    "    \"retriever\", InMemoryEmbeddingRetriever(document_store=document_store, scale_score=False, top_k=3)\n",
    ")\n",
    "retrieval_pipeline.add_component(\n",
    "    \"retriever_with_embeddings\",\n",
    "    InMemoryEmbeddingRetriever(document_store=document_store_with_embedded_metadata, scale_score=False, top_k=3),\n",
    ")\n",
    "\n",
    "retrieval_pipeline.connect(\"text_embedder\", \"retriever\")\n",
    "retrieval_pipeline.connect(\"text_embedder\", \"retriever_with_embeddings\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's run the pipeline and compare the results from `retriever` and `retirever_with_embeddings`. Below you'll see 3 documents returned by each retriever, ranked by relevance.\n",
    "\n",
    "Notice that with the question \"Have the Beatles ever been to Bangor?\", the first pipeline is not returning relevant documents, but the second one is. Here, the `meta` field \"title\" is helpful, because as it turns out, the document that contains the information about The Beatles visiting Bangor does not contain a reference to \"The Beatles\". But, by embedding metadata, the embedding model is able to retrieve the right document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = retrieval_pipeline.run({\"text_embedder\": {\"text\": \"Have the Beatles ever been to Bangor?\"}})\n",
    "\n",
    "print(\"Retriever Results:\\n\")\n",
    "for doc in result[\"retriever\"][\"documents\"]:\n",
    "    print(doc)\n",
    "\n",
    "print(\"Retriever with Embeddings Results:\\n\")\n",
    "for doc in result[\"retriever_with_embeddings\"][\"documents\"]:\n",
    "    print(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's next\n",
    "\n",
    "🎉 Congratulations! You've embedded metadata while indexing, to improve the results of retrieval!\n",
    "\n",
    "If you liked this tutorial, there's more to learn about Haystack:\n",
    "- [Creating a Hybrid Retrieval Pipeline](https://haystack.deepset.ai/tutorials/33_hybrid_retrieval)\n",
    "- [Building Fallbacks to Websearch with Conditional Routing](https://haystack.deepset.ai/tutorials/36_building_fallbacks_with_conditional_routing)\n",
    "- [Evaluating RAG Pipelines](https://haystack.deepset.ai/tutorials/35_evaluating_rag_pipelines)\n",
    "\n",
    "To stay up to date on the latest Haystack developments, you can [sign up for our newsletter](https://landing.deepset.ai/haystack-community-updates) or [join Haystack discord community](https://discord.gg/haystack).\n",
    "\n",
    "Thanks for reading!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "mistral",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
