{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YxZBCJn21Ygd"
   },
   "source": [
    "# Question Answering with Amazon Sagemaker, Chroma and Haystack\n",
    "\n",
    "*Notebook by [Sara Zanzottera](https://www.zansara.dev/) and [Bilge Yucel](https://www.linkedin.com/in/bilge-yucel/)*\n",
    "\n",
    "[Amazon Sagemaker](https://docs.aws.amazon.com/sagemaker/) is a comprehensive, fully managed machine learning service\n",
    "that allows data scientists and developers to build, train, and deploy ML models efficiently. You can choose from various foundation models to find the one best suited for your use case.\n",
    "\n",
    "In this notebook, we'll go through the process of **creating a generative question answering application** using the newly added [Amazon Sagemaker integration](https://haystack.deepset.ai/integrations/amazon-sagemaker) with [Haystack](https://github.com/deepset-ai/haystack) and [Chroma](https://haystack.deepset.ai/integrations/chroma-documentstore) to store our documents efficiently. The demo will illustrate the step-by-step development of a QA application using some Wikipedia pages about NASA's Mars missions 🚀\n",
    "\n",
    "## Set Up the Development Environment\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "v5dzhxUV1QwR"
   },
   "source": [
    "### Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "EX5oCws-etEH",
    "is_executing": true,
    "outputId": "4d46055f-4d58-4d67-b895-ad701c2eb306"
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install chroma-haystack amazon-sagemaker-haystack wikipedia typing_extensions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Eg9lSuAJM6MJ"
   },
   "source": [
    "## Deploy a model on Sagemaker\n",
    "\n",
    "To use Amazon Sagemaker's models, you first need to deploy them. In this example we'll be using Falcon 7B Instruct BF16, so make sure to deploy this model to your account before proceeding.\n",
    "\n",
    "For help you can check out:\n",
    "- Amazon Sagemaker Jumpstart [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-use.html).\n",
    "- [This notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/text-generation-falcon.ipynb) on how to deploy Falcon models programmatically with a notebook\n",
    "- [This blogpost](https://aws.amazon.com/blogs/machine-learning/build-production-ready-generative-ai-applications-for-enterprise-search-using-haystack-pipelines-and-amazon-sagemaker-jumpstart-with-llms/) about deploying models on Sagemaker for Haystack 1.x\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pSBYYgYq1Ij3"
   },
   "source": [
    "### API Keys\n",
    "\n",
    "To use Amazon Sagemaker, you need to set the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, and usually `AWS_REGION` as well to indicate the region. Once logged into your account, you can find these keys under the IAM user's \"Security Credentials\" section. For detailed guidance, refer to the documentation on [Managing access keys for IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "tZTz7cHwhZ-9",
    "is_executing": true,
    "outputId": "72a5b7af-5d81-4c2f-e922-f35ee1dda94e"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "os.environ[\"AWS_ACCESS_KEY_ID\"] = getpass(\"aws_access_key_id: \")\n",
    "os.environ[\"AWS_SECRET_ACCESS_KEY\"] = getpass(\"aws_secret_access_key: \")\n",
    "os.environ[\"AWS_REGION\"] = input(\"aws_region_name: \")"
   ]
  },
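  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before moving on, it can help to confirm that the variables were actually set. The cell below is a minimal sketch that uses only the standard library. `missing_aws_env_vars` is a hypothetical helper, not part of Haystack or the AWS SDK, and it only checks that the variables are non-empty, not that the credentials are valid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Variable names taken from the cell above.\n",
    "REQUIRED_AWS_VARS = (\"AWS_ACCESS_KEY_ID\", \"AWS_SECRET_ACCESS_KEY\", \"AWS_REGION\")\n",
    "\n",
    "\n",
    "def missing_aws_env_vars(env=None):\n",
    "    \"\"\"Return the required AWS variables that are unset or blank (hypothetical helper).\"\"\"\n",
    "    env = os.environ if env is None else env\n",
    "    return [name for name in REQUIRED_AWS_VARS if not env.get(name, \"\").strip()]\n",
    "\n",
    "\n",
    "missing = missing_aws_env_vars()\n",
    "if missing:\n",
    "    print(\"Missing or empty:\", \", \".join(missing))\n",
    "else:\n",
    "    print(\"All AWS environment variables are set.\")"
   ]
  },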
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "k-CF7LUSy2T7"
   },
   "source": [
    "## Load data from Wikipedia\n",
    "\n",
    "We are going to download the Wikipedia pages related to NASA's Martian rovers using the Python library `wikipedia`.\n",
    "\n",
    "These pages are converted into Haystack Documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-Nz9MRVgxcfW"
   },
   "outputs": [],
   "source": [
    "import wikipedia\n",
    "from haystack.dataclasses import Document\n",
    "\n",
    "wiki_pages = [\n",
    "    \"Ingenuity_(helicopter)\",\n",
    "    \"Perseverance_(rover)\",\n",
    "    \"Curiosity_(rover)\",\n",
    "    \"Opportunity_(rover)\",\n",
    "    \"Spirit_(rover)\",\n",
    "    \"Sojourner_(rover)\"\n",
    "]\n",
    "\n",
    "raw_docs=[]\n",
    "for title in wiki_pages:\n",
    "    page = wikipedia.page(title=title, auto_suggest=False)\n",
    "    doc = Document(content=page.content, meta={\"title\": page.title, \"url\":page.url})\n",
    "    raw_docs.append(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oa6aH6fB08d_"
   },
   "source": [
    "## Building the Indexing Pipeline\n",
    "\n",
    "Our indexing pipeline will preprocess the provided Wikipedia pages by cleaning them and splitting them into chunks before storing them in [ChromaDocumentStore](https://docs.haystack.deepset.ai/docs/chroma-document-store).\n",
    "\n",
    "Let’s build the pipeline that will index our documents into the document store:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "SrBctAl5e_Kf"
   },
   "outputs": [],
   "source": [
    "from haystack import Pipeline\n",
    "from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter\n",
    "from haystack.components.writers import DocumentWriter\n",
    "from haystack.document_stores.types import DuplicatePolicy\n",
    "from haystack_integrations.document_stores.chroma import ChromaDocumentStore\n",
    "\n",
    "## Initialize ChromaDocumentStore\n",
    "document_store = ChromaDocumentStore()\n",
    "\n",
    "## Create pipeline components\n",
    "cleaner = DocumentCleaner()\n",
    "splitter = DocumentSplitter(split_by=\"sentence\", split_length=10, split_overlap=2)\n",
    "writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)\n",
    "\n",
    "## Add components to the pipeline\n",
    "indexing_pipeline = Pipeline()\n",
    "indexing_pipeline.add_component(\"cleaner\", cleaner)\n",
    "indexing_pipeline.add_component(\"splitter\", splitter)\n",
    "indexing_pipeline.add_component(\"writer\", writer)\n",
    "\n",
    "## Connect the components to each other\n",
    "indexing_pipeline.connect(\"cleaner\", \"splitter\")\n",
    "indexing_pipeline.connect(\"splitter\", \"writer\")"
   ]
  },
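  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build some intuition for `split_length=10` and `split_overlap=2`: each chunk holds up to ten sentences, and consecutive chunks share two of them. The cell below sketches that sliding window in plain Python. It is not `DocumentSplitter`'s actual implementation; `window_chunks` is a hypothetical helper, and the handling of the final, shorter chunk is an assumption."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def window_chunks(units, length, overlap):\n",
    "    \"\"\"Group units into windows of `length` items; neighbours share `overlap` items.\"\"\"\n",
    "    if not 0 <= overlap < length:\n",
    "        raise ValueError(\"overlap must be >= 0 and smaller than length\")\n",
    "    step = length - overlap  # how far each new window advances\n",
    "    chunks = []\n",
    "    for start in range(0, len(units), step):\n",
    "        chunks.append(units[start:start + length])\n",
    "        if start + length >= len(units):\n",
    "            break  # this window already reaches the end\n",
    "    return chunks\n",
    "\n",
    "\n",
    "# 24 placeholder sentences stand in for a real document.\n",
    "sentences = [f\"s{i}\" for i in range(1, 25)]\n",
    "for chunk in window_chunks(sentences, length=10, overlap=2):\n",
    "    print(chunk[0], \"...\", chunk[-1], f\"({len(chunk)} sentences)\")"
   ]
  },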
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oJLXM8nM02AB"
   },
   "source": [
    "Run the pipeline with the documents you want to index (note that this step may take some time):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "X7HrON1PFHos",
    "is_executing": true,
    "outputId": "22a57096-f22a-4333-cd85-9ea9ea4b52e0"
   },
   "outputs": [],
   "source": [
    "indexing_pipeline.run({\"cleaner\":{\"documents\":raw_docs}})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UNmHvZLjA4Rv"
   },
   "source": [
    "## Building the Query Pipeline\n",
    "\n",
    "Let’s create another pipeline to query our application. In this pipeline, we’ll use [ChromaQueryTextRetriever](https://docs.haystack.deepset.ai/docs/chromaqueryretriever) to retrieve relevant information from the ChromaDocumentStore and a Falcon 7B Instruct BF16 model to generate answers with [SagemakerGenerator](https://docs.haystack.deepset.ai/docs/sagemakergenerator).\n",
    "\n",
    "Next, we'll create a prompt for our task using the Retrieval-Augmented Generation (RAG) approach with [PromptBuilder](https://docs.haystack.deepset.ai/docs/promptbuilder). This prompt will help generate answers by considering the provided context. Finally, we'll connect these three components to complete the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "8Q3JYuyShRnQ"
   },
   "outputs": [],
   "source": [
    "from haystack import Pipeline\n",
    "from haystack.components.builders import PromptBuilder\n",
    "from haystack_integrations.components.generators.amazon_sagemaker import SagemakerGenerator\n",
    "from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever\n",
    "\n",
    "# Create pipeline components\n",
    "retriever = ChromaQueryTextRetriever(document_store=document_store, top_k=3)\n",
    "\n",
    "# Initialize the SagemakerGenerator with an Amazon Sagemaker model\n",
    "# You may need to change the model name if it differs from your endpoint name.\n",
    "model = 'jumpstart-dft-hf-llm-falcon-7b-instruct-bf16'\n",
    "generator = SagemakerGenerator(model=model, generation_kwargs={\"max_new_tokens\":256})\n",
    "template = \"\"\"\n",
    "{% for document in documents %}\n",
    "    {{ document.content }}\n",
    "{% endfor %}\n",
    "\n",
    "Answer based on the information above: {{question}}\n",
    "\"\"\"\n",
    "prompt_builder = PromptBuilder(template=template)\n",
    "\n",
    "## Add components to the pipeline\n",
    "rag_pipeline = Pipeline()\n",
    "rag_pipeline.add_component(\"retriever\", retriever)\n",
    "rag_pipeline.add_component(\"prompt_builder\", prompt_builder)\n",
    "rag_pipeline.add_component(\"llm\", generator)\n",
    "\n",
    "## Connect the components to each other\n",
    "rag_pipeline.connect(\"retriever\", \"prompt_builder.documents\")\n",
    "rag_pipeline.connect(\"prompt_builder\", \"llm\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5NywqZKo6msf"
   },
   "source": [
    "Ask your questions and learn about NASA's Mars missions using Amazon Sagemaker models!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "mDYCSRRtiAy5",
    "outputId": "b644aeb8-c9eb-4dbf-ed28-ad3080826410"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Opportunity landed on Mars on January 24, 2004.\n"
     ]
    }
   ],
   "source": [
    "question = \"When did Opportunity land?\"\n",
    "response = rag_pipeline.run({\"retriever\": {\"query\": question}, \"prompt_builder\": {\"question\": question}})\n",
    "\n",
    "print(response[\"llm\"][\"replies\"][0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "giSWajzyAcNp",
    "outputId": "ac240c1a-657d-447a-8f08-8558616d71e9"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Yes, the Ingenuity mission is over. The helicopter made a total of 72 flights over a period of about 3 years until rotor damage sustained in January 2024 forced an end to the mission.\n"
     ]
    }
   ],
   "source": [
    "question = \"Is Ingenuity mission over?\"\n",
    "response = rag_pipeline.run({\"retriever\": {\"query\": question}, \"prompt_builder\": {\"question\": question}})\n",
    "\n",
    "print(response[\"llm\"][\"replies\"][0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ROhJ8VL_JdHc",
    "outputId": "4c77f458-d472-4644-b073-304986bf7a6c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "The first NASA rover to land on Mars was called Sojourner.\n"
     ]
    }
   ],
   "source": [
    "question = \"What was the name of the first NASA rover to land on Mars?\"\n",
    "response = rag_pipeline.run({\"retriever\": {\"query\": question}, \"prompt_builder\": {\"question\": question}})\n",
    "\n",
    "print(response[\"llm\"][\"replies\"][0])"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
