{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RZ0VPP0Oh9lI"
      },
      "source": [
        "# Building a custom component for RAG pipelines with Haystack\n",
        "\n",
        "*by Tuana Celik: [Twitter](https://twitter.com/tuanacelik), [LinkedIn](https://www.linkedin.com/in/tuanacelik/)*\n",
        "\n",
        "📚 Check out the [**Customizing RAG Pipelines to Summarize Latest Hacker News Posts with Haystack**](https://haystack.deepset.ai/blog/customizing-rag-to-summarize-hacker-news-posts-with-haystack2) article for a detailed run through of this example.\n",
        "\n",
        "### Install dependencies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0dFI9IYf50Mh"
      },
      "outputs": [],
      "source": [
        "!pip install newspaper3k\n",
        "!pip install haystack-ai"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "J6RfNzvOpE7B"
      },
      "source": [
        "## Create a Custom Haystack Component\n",
        "\n",
        "This `HackernewsNewestFetcher` ferches the `last_k` newest posts on Hacker News and returns the contents as a List of Haystack Document objects"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SLjTO0fv4UaB"
      },
      "outputs": [],
      "source": [
        "from typing import List\n",
        "from haystack import component, Document\n",
        "from newspaper import Article\n",
        "import requests\n",
        "\n",
        "@component\n",
        "class HackernewsNewestFetcher():\n",
        "\n",
        "  @component.output_types(articles=List[Document])\n",
        "  def run(self, last_k: int):\n",
        "    newest_list = requests.get(url='https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')\n",
        "    articles = []\n",
        "    for id in newest_list.json()[0:last_k]:\n",
        "      article = requests.get(url=f\"https://hacker-news.firebaseio.com/v0/item/{id}.json?print=pretty\")\n",
        "      if 'url' in article.json():\n",
        "        articles.append(article.json()['url'])\n",
        "\n",
        "    docs = []\n",
        "    for url in articles:\n",
        "      try:\n",
        "        article = Article(url)\n",
        "        article.download()\n",
        "        article.parse()\n",
        "        docs.append(Document(content=article.text, meta={'title': article.title, 'url': url}))\n",
        "      except:\n",
        "        print(f\"Couldn't download {url}, skipped\")\n",
        "    return {'articles': docs}\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8Z9Q-S55pX-z"
      },
      "source": [
        "## Create a Haystack 2.0 RAG Pipeline\n",
        "\n",
        "This pipeline uses the components available in the Haystack 2.0 preview package at time of writing (22 September 2023) as well as the custom component we've created above.\n",
        "\n",
        "The end result is a RAG pipeline designed to provide a list of summaries for each of the `last_k` posts on Hacker News, followes by the source URL."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1nT2-ms3QV0c"
      },
      "outputs": [],
      "source": [
        "from getpass import getpass\n",
        "import os\n",
        "\n",
        "os.environ[\"OPENAI_API_KEY\"] = getpass(\"OpenAI Key: \")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HTxmwMup-5eS"
      },
      "outputs": [],
      "source": [
        "from haystack import Pipeline\n",
        "from haystack.components.builders.prompt_builder import PromptBuilder\n",
        "from haystack.components.generators import OpenAIGenerator\n",
        "\n",
        "prompt_template = \"\"\"\n",
        "You will be provided a few of the latest posts in HackerNews, followed by their URL.\n",
        "For each post, provide a brief summary followed by the URL the full post can be found in.\n",
        "\n",
        "Posts:\n",
        "{% for article in articles %}\n",
        "  {{article.content}}\n",
        "  URL: {{article.meta['url']}}\n",
        "{% endfor %}\n",
        "\"\"\"\n",
        "\n",
        "prompt_builder = PromptBuilder(template=prompt_template)\n",
        "llm = OpenAIGenerator(model=\"gpt-4\")\n",
        "fetcher = HackernewsNewestFetcher()\n",
        "\n",
        "pipe = Pipeline()\n",
        "pipe.add_component(\"hackernews_fetcher\", fetcher)\n",
        "pipe.add_component(\"prompt_builder\", prompt_builder)\n",
        "pipe.add_component(\"llm\", llm)\n",
        "\n",
        "pipe.connect(\"hackernews_fetcher.articles\", \"prompt_builder.articles\")\n",
        "pipe.connect(\"prompt_builder.prompt\", \"llm.prompt\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Ex4yeDWA_xMo"
      },
      "outputs": [],
      "source": [
        "result = pipe.run(data={\"hackernews_fetcher\": {\"last_k\": 3}})\n",
        "print(result['llm']['replies'][0])"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "authorship_tag": "ABX9TyNp8MxX0RRKKrtRWPnGZ1xM",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.9.7"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
