{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YxZBCJn21Ygd"
      },
      "source": [
        "# PDF-Based Question Answering with Amazon Bedrock and Haystack\n",
        "\n",
        "*Notebook by [Bilge Yucel](https://www.linkedin.com/in/bilge-yucel/)*\n",
        "\n",
        "[Amazon Bedrock](https://aws.amazon.com/bedrock/) is a fully managed service that provides high-performing foundation models from leading AI startups and Amazon through a single API. You can choose from various foundation models to find the one best suited for your use case.\n",
        "\n",
        "In this notebook, we'll go through the process of **creating a generative question answering application** tailored for PDF files using the newly added [Amazon Bedrock integration](https://haystack.deepset.ai/integrations/amazon-bedrock) with [Haystack](https://github.com/deepset-ai/haystack) and [OpenSearch](https://haystack.deepset.ai/integrations/opensearch-document-store) to store our documents efficiently. The demo will illustrate the step-by-step development of a QA application designed specifically for the Bedrock documentation, demonstrating the power of Bedrock in the process 🚀\n",
        "\n",
        "## Set Up the Development Environment"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "v5dzhxUV1QwR"
      },
      "source": [
        "### Install dependencies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "EX5oCws-etEH"
      },
      "outputs": [],
      "source": [
        "%%bash\n",
        "\n",
        "pip install -q opensearch-haystack amazon-bedrock-haystack pypdf"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WMJaEllC1Wat"
      },
      "source": [
        "### Download Files\n",
        "\n",
        "For this application, we'll use the Amazon Bedrock User Guide. Amazon provides the [guide as a PDF](https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf). Let's download it!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jS9Gijk-Gemm"
      },
      "outputs": [],
      "source": [
        "!wget \"https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NY1Ik7YGGtJN"
      },
      "source": [
        "> Note: As an alternative, you can run the code below to download the PDF to `/content/bedrock-documentation.pdf` 👇🏼"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "chi-VAhGeuQn"
      },
      "outputs": [],
      "source": [
        "# import os\n",
        "\n",
        "# import boto3\n",
        "# from botocore import UNSIGNED\n",
        "# from botocore.config import Config\n",
        "\n",
        "# s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))\n",
        "# s3.download_file('core-engineering', 'public/blog-posts/bedrock-documentation.pdf', '/content/bedrock-documentation.pdf')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ys3-RvVqqWdD"
      },
      "source": [
        "### Initialize an OpenSearch Instance on Colab\n",
        "\n",
        "[OpenSearch](https://opensearch.org/) is a fully open source search and analytics engine and is compatible with the [Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html) that’s helpful if you’d like to deploy, operate, and scale your OpenSearch cluster later on.\n",
        "\n",
        "Let’s install OpenSearch and start an instance on Colab. For other installation options, check out [OpenSearch documentation](https://opensearch.org/docs/latest/install-and-configure/install-opensearch/index/)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vyWWR3Xye8l_"
      },
      "outputs": [],
      "source": [
        "!wget https://artifacts.opensearch.org/releases/bundle/opensearch/2.11.1/opensearch-2.11.1-linux-x64.tar.gz\n",
        "!tar -xvf opensearch-2.11.1-linux-x64.tar.gz\n",
        "!chown -R daemon:daemon opensearch-2.11.1\n",
        "# Disable security. Be mindful when disabling security in production systems\n",
        "!sudo echo 'plugins.security.disabled: true' >> opensearch-2.11.1/config/opensearch.yml"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "Vaxe75MXkMi2"
      },
      "outputs": [],
      "source": [
        "%%bash --bg\n",
        "cd opensearch-2.11.1 && sudo -u daemon -- ./bin/opensearch"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YuN1y5WQ1jI9"
      },
      "source": [
        "> OpenSearch needs about 30 seconds to start up fully."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "f9gbVwRU_Y5Q"
      },
      "outputs": [],
      "source": [
        "import time\n",
        "\n",
        "time.sleep(30)"
      ]
    },
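    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Alternatively, instead of sleeping for a fixed time, you can poll the OpenSearch REST endpoint until it responds. The sketch below uses only the standard library and assumes the default local endpoint `http://localhost:9200`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import time\n",
        "import urllib.error\n",
        "import urllib.request\n",
        "\n",
        "def wait_for_opensearch(url=\"http://localhost:9200\", timeout=60.0, interval=2.0):\n",
        "    \"\"\"Poll the OpenSearch root endpoint until it answers or the timeout elapses.\"\"\"\n",
        "    deadline = time.time() + timeout\n",
        "    while time.time() < deadline:\n",
        "        try:\n",
        "            with urllib.request.urlopen(url, timeout=interval) as response:\n",
        "                if response.status == 200:\n",
        "                    return True\n",
        "        except (urllib.error.URLError, OSError):\n",
        "            time.sleep(interval)\n",
        "    return False\n",
        "\n",
        "# wait_for_opensearch()  # returns True once the local instance is reachable"
      ]
    },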
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pSBYYgYq1Ij3"
      },
      "source": [
        "### API Keys\n",
        "\n",
        "To use Amazon Bedrock, you need an `aws_access_key_id` and `aws_secret_access_key`, and you need to specify an `aws_region_name`. Once logged into your AWS account, you can find these keys under the IAM user's \"Security Credentials\" section. For detailed guidance, refer to the documentation on [Managing access keys for IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "tZTz7cHwhZ-9",
        "outputId": "4429e7ff-6d6b-4491-cde4-2c91ff04f78f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "aws_access_key_id: ··········\n",
            "aws_secret_access_key: ··········\n",
            "aws_region_name: us-east-1\n"
          ]
        }
      ],
      "source": [
        "import os\n",
        "\n",
        "from getpass import getpass\n",
        "\n",
        "os.environ[\"AWS_ACCESS_KEY_ID\"] = getpass(\"aws_access_key_id: \")\n",
        "os.environ[\"AWS_SECRET_ACCESS_KEY\"] = getpass(\"aws_secret_access_key: \")\n",
        "os.environ[\"AWS_DEFAULT_REGION\"] = input(\"aws_region_name: \")"
      ]
    },
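    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before running the pipelines, you can sanity-check that all required AWS variables are set. The small helper below is just an illustration (not part of Haystack or boto3):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "REQUIRED_AWS_VARS = (\"AWS_ACCESS_KEY_ID\", \"AWS_SECRET_ACCESS_KEY\", \"AWS_DEFAULT_REGION\")\n",
        "\n",
        "def missing_aws_vars(env=os.environ):\n",
        "    \"\"\"Return the names of required AWS variables that are unset or empty.\"\"\"\n",
        "    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]\n",
        "\n",
        "missing = missing_aws_vars()\n",
        "if missing:\n",
        "    print(f\"Missing AWS configuration: {', '.join(missing)}\")"
      ]
    },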
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oa6aH6fB08d_"
      },
      "source": [
        "## Building the Indexing Pipeline\n",
        "\n",
        "Our indexing pipeline will convert the PDF file into a Haystack Document using [PyPDFToDocument](https://docs.haystack.deepset.ai/docs/pypdftodocument) and preprocess it by cleaning and splitting it into chunks before storing them in [OpenSearchDocumentStore](https://docs.haystack.deepset.ai/docs/opensearch-document-store).\n",
        "\n",
        "Let’s run the pipeline below and index our file to our document store:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "id": "SrBctAl5e_Kf"
      },
      "outputs": [],
      "source": [
        "from pathlib import Path\n",
        "\n",
        "from haystack import Pipeline\n",
        "from haystack.components.converters import PyPDFToDocument\n",
        "from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter\n",
        "from haystack.components.writers import DocumentWriter\n",
        "from haystack.document_stores.types import DuplicatePolicy\n",
        "from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore\n",
        "\n",
        "## Initialize the OpenSearchDocumentStore\n",
        "document_store = OpenSearchDocumentStore()\n",
        "\n",
        "## Create pipeline components\n",
        "converter = PyPDFToDocument()\n",
        "cleaner = DocumentCleaner()\n",
        "splitter = DocumentSplitter(split_by=\"sentence\", split_length=10, split_overlap=2)\n",
        "writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)\n",
        "\n",
        "## Add components to the pipeline\n",
        "indexing_pipeline = Pipeline()\n",
        "indexing_pipeline.add_component(\"converter\", converter)\n",
        "indexing_pipeline.add_component(\"cleaner\", cleaner)\n",
        "indexing_pipeline.add_component(\"splitter\", splitter)\n",
        "indexing_pipeline.add_component(\"writer\", writer)\n",
        "\n",
        "## Connect the components to each other\n",
        "indexing_pipeline.connect(\"converter\", \"cleaner\")\n",
        "indexing_pipeline.connect(\"cleaner\", \"splitter\")\n",
        "indexing_pipeline.connect(\"splitter\", \"writer\")"
      ]
    },
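    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see what `split_by=\"sentence\", split_length=10, split_overlap=2` means in practice, here is a simplified pure-Python sketch of the windowing logic (an approximation, not `DocumentSplitter`'s actual implementation): each chunk holds up to 10 sentences, and each new chunk starts 2 sentences before the previous one ended."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def window_chunks(sentences, length=10, overlap=2):\n",
        "    \"\"\"Group sentences into overlapping chunks; each window advances by length - overlap.\"\"\"\n",
        "    step = length - overlap\n",
        "    chunks = []\n",
        "    for start in range(0, len(sentences), step):\n",
        "        chunks.append(sentences[start:start + length])\n",
        "        if start + length >= len(sentences):\n",
        "            break\n",
        "    return chunks\n",
        "\n",
        "sentences = [f\"Sentence {i}.\" for i in range(14)]\n",
        "chunks = window_chunks(sentences)\n",
        "print(len(chunks))   # 14 sentences yield two chunks\n",
        "print(chunks[1][0])  # the second chunk starts at sentence 8, overlapping the first"
      ]
    },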
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oJLXM8nM02AB"
      },
      "source": [
        "Run the pipeline with the PDF file. This might take ~4 minutes if you're running this notebook on a CPU."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "X7HrON1PFHos",
        "outputId": "30e9ea38-0392-43ef-ab38-eb62dbffc7a8"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'writer': {'documents_written': 1060}}"
            ]
          },
          "execution_count": 21,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "indexing_pipeline.run({\"converter\": {\"sources\": [Path(\"/content/bedrock-ug.pdf\")]}})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UNmHvZLjA4Rv"
      },
      "source": [
        "## Building the Query Pipeline\n",
        "\n",
        "Let’s create another pipeline to query our application. In this pipeline, we’ll use [OpenSearchBM25Retriever](https://docs.haystack.deepset.ai/docs/opensearchbm25retriever) to retrieve relevant information from the OpenSearchDocumentStore and the Amazon Nova model `amazon.nova-lite-v1:0` to generate answers with [AmazonBedrockChatGenerator](https://docs.haystack.deepset.ai/docs/amazonbedrockchatgenerator). You can select and test different models using the dropdown on the right.\n",
        "\n",
        "Next, we'll create a prompt for our task using the Retrieval-Augmented Generation (RAG) approach with [ChatPromptBuilder](https://docs.haystack.deepset.ai/docs/chatpromptbuilder). This prompt will help generate answers by considering the provided context. Finally, we'll connect these three components to complete the pipeline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 26,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "8Q3JYuyShRnQ",
        "outputId": "012ff7f3-b879-4cea-b10f-fb0a81f930e6"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "<haystack.core.pipeline.pipeline.Pipeline object at 0x7d1aa6550150>\n",
              "🚅 Components\n",
              "  - retriever: OpenSearchBM25Retriever\n",
              "  - prompt_builder: ChatPromptBuilder\n",
              "  - llm: AmazonBedrockChatGenerator\n",
              "🛤️ Connections\n",
              "  - retriever.documents -> prompt_builder.documents (List[Document])\n",
              "  - prompt_builder.prompt -> llm.messages (List[ChatMessage])"
            ]
          },
          "execution_count": 26,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from haystack.components.builders import ChatPromptBuilder\n",
        "from haystack.dataclasses import ChatMessage\n",
        "from haystack import Pipeline\n",
        "from haystack_integrations.components.generators.amazon_bedrock import AmazonBedrockChatGenerator\n",
        "from haystack_integrations.components.retrievers.opensearch import OpenSearchBM25Retriever\n",
        "\n",
        "## Create pipeline components\n",
        "retriever = OpenSearchBM25Retriever(document_store=document_store, top_k=15)\n",
        "\n",
        "## Initialize the AmazonBedrockChatGenerator with an Amazon Bedrock model\n",
        "bedrock_model = 'amazon.nova-lite-v1:0'\n",
        "generator = AmazonBedrockChatGenerator(model=bedrock_model)\n",
        "template = \"\"\"\n",
        "{% for document in documents %}\n",
        "    {{ document.content }}\n",
        "{% endfor %}\n",
        "\n",
        "Please answer the question based on the given information from Amazon Bedrock documentation.\n",
        "\n",
        "{{query}}\n",
        "\"\"\"\n",
        "prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(template)], required_variables=\"*\")\n",
        "\n",
        "## Add components to the pipeline\n",
        "rag_pipeline = Pipeline()\n",
        "rag_pipeline.add_component(\"retriever\", retriever)\n",
        "rag_pipeline.add_component(\"prompt_builder\", prompt_builder)\n",
        "rag_pipeline.add_component(\"llm\", generator)\n",
        "\n",
        "## Connect the components to each other\n",
        "rag_pipeline.connect(\"retriever\", \"prompt_builder.documents\")\n",
        "rag_pipeline.connect(\"prompt_builder\", \"llm\")"
      ]
    },
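    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see what the LLM actually receives, here is a plain-Python approximation of how the Jinja template above is rendered into a single user message (a sketch of the rendering, not `ChatPromptBuilder`'s real code):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def render_prompt(documents, query):\n",
        "    \"\"\"Mimic the template: list each document's content, then the instruction and the query.\"\"\"\n",
        "    context = \"\\n\".join(f\"    {content}\" for content in documents)\n",
        "    return (\n",
        "        f\"\\n{context}\\n\\n\"\n",
        "        \"Please answer the question based on the given information from Amazon Bedrock documentation.\\n\\n\"\n",
        "        f\"{query}\\n\"\n",
        "    )\n",
        "\n",
        "print(render_prompt([\"Bedrock is a managed service.\", \"It offers foundation models.\"],\n",
        "                    \"What is Amazon Bedrock?\"))"
      ]
    },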
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5NywqZKo6msf"
      },
      "source": [
        "Ask your question and learn about the Amazon Bedrock service using Amazon Bedrock models!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "mDYCSRRtiAy5",
        "outputId": "fc206e26-d198-48a6-bf7c-f471fb0de0a9"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Amazon Bedrock is a fully managed service that makes high-performing foundation models (FMs) from leading AI startups and Amazon available for use through a unified API. Key capabilities include:\n",
            "\n",
            "- Easily experiment with and evaluate top foundation models for various use cases. Models are available from providers like AI21 Labs, Anthropic, Cohere, Meta, and Stability AI.\n",
            "\n",
            "- Privately customize models with your own data using techniques like fine-tuning and retrieval augmented generation (RAG). \n",
            "\n",
            "- Build agents that execute tasks using enterprise systems and data sources.\n",
            "\n",
            "- Serverless experience so you can get started quickly without managing infrastructure.\n",
            "\n",
            "- Integrate customized models into applications using AWS tools.\n",
            "\n",
            "So in summary, Amazon Bedrock provides easy access to top AI models that you can customize and integrate into apps to build intelligent solutions. It's a fully managed service focused on generative AI.\n"
          ]
        }
      ],
      "source": [
        "question = \"What is Amazon Bedrock?\"\n",
        "response = rag_pipeline.run({\"query\": question})\n",
        "\n",
        "print(response[\"llm\"][\"replies\"][0].text)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_OoOKGqjK8ir"
      },
      "source": [
        "### Other Queries\n",
        "\n",
        "You can also try these queries:\n",
        "\n",
        "* How can I set up Amazon Bedrock?\n",
        "* How can I fine-tune foundation models?\n",
        "* How should I form my prompts for Amazon Titan models?\n",
        "* How should I form my prompts for Claude models?"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.7"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
