diff --git a/examples/document_processing/asset_manager_fund_analysis.ipynb b/examples/document_processing/asset_manager_fund_analysis.ipynb new file mode 100644 index 0000000..3e304f4 --- /dev/null +++ b/examples/document_processing/asset_manager_fund_analysis.ipynb @@ -0,0 +1,1853 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "28df5c07-95ab-409d-9304-d22333f3780d", + "metadata": {}, + "source": [ + "# Extraction and Analysis over a Fidelity Multi-Fund Annual Report\n", + "\n", + "\"Open\n", + "\n", + "In this notebook we show you how to create an agentic document workflow over a complex document that contains annual reports for multiple funds - each fund reports financials in a standardized reporting structure, and it's all consolidated in the same document.\n", + "\n", + "We show you how to create a workflow that parses the document, splits it into subsections per fund, and runs extraction per subsection. The extracted results can be compiled into a single CSV file.\n", + "\n", + "![](asset_manager_fund_analysis.png)\n" + ] + }, + { + "cell_type": "markdown", + "id": "cda2e5e9-fe9d-42d9-9387-f529d970ff7b", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "\n", + "### Data\n", + "\n", + "We load the consolidated document from Fidelity's Shareholder Report portal.\n", + "\n", + "**To get the original Doc**\n", + "\n", + "Link: https://fundresearch.fidelity.com/prospectus/sec\n", + "- Go to row \"Fidelity Advisor Asset Manager® 20% Fund Class A\", and click \"Annual\" under \"Financial Statements & Other Information\"\n", + "\n", + "**Or, just download it below (we're hosting it on Dropbox)**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d563afc1-ccfa-410a-97a6-85041f8551f7", + "metadata": {}, + "outputs": [], + "source": [ + "!wget \"https://www.dropbox.com/scl/fi/bhrtivs7b2gz3yhrr4t4s/fidelity_fund.pdf?rlkey=ha2loufvuer1c07u47k68hgji&st=ev66x31t&dl=1\" -O data/asset_manager_fund_analysis/fidelity_fund.pdf" + ] + }, + { + "cell_type": "markdown", + "id": "098d71d4-dd70-4fd8-879e-557ba2d78899", + "metadata": {}, + "source": [ + "### Everything else\n", + "\n", + "In the below sections we define the required environment variables and LLM/embedding model.\n", + "\n", + "Sign up for a LlamaCloud account here: https://cloud.llamaindex.ai/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b12c54b-5a0f-4c9a-86da-155aa549bb3b", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# API access to llama-cloud\n", + "os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\"\n", + "\n", + "# Using OpenAI API for embeddings/llms\n", + "os.environ[\"OPENAI_API_KEY\"] = \"sk-proj-...\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c2b6741-26ce-4472-8be7-00dd037d9ea5", + "metadata": {}, + "outputs": [], + "source": [ + "# set LlamaCloud project/organization ID\n", + "# project_id = None\n", + "# organization_id = None\n", + "project_id = \"2fef999e-1073-40e6-aeb3-1f3c0e64d99b\"\n", + "organization_id = \"43b88c8f-e488-46f6-9013-698e3d2e374a\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0f61c14b-9813-4886-8869-4d8d973d04c1", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.llms.openai import OpenAI\n", + "from llama_index.embeddings.openai import OpenAIEmbedding\n", + "from llama_index.core import Settings\n", + "\n", + "# set LLM, embedding model\n", + "embed_model = OpenAIEmbedding(model_name=\"text-embedding-3-small\")\n", + "llm = 
OpenAI(model=\"gpt-4.1\")\n", + "Settings.llm = llm\n", + "Settings.embed_model = embed_model" + ] + }, + { + "cell_type": "markdown", + "id": "0d12de2e-87f3-4c43-af26-06913e1b41cb", + "metadata": {}, + "source": [ + "## Create Splitter\n", + "\n", + "The document is a consolidated regulatory filing that contains multiple sub-entities - the 7 different Asset Manager funds - with identical reporting structures.\n", + "\n", + "Our first task is to create a document splitter that can identify the specific page numbers corresponding to each fund. This is a pre-requisite for the downstream task of doing per-fund extraction and consolidation.\n", + "\n", + "The below block uses LlamaParse to parse the full document and output the result as per-page Markdown nodes." + ] + }, + { + "cell_type": "markdown", + "id": "46cc4516-022a-4299-ae51-c5525882ae9b", + "metadata": {}, + "source": [ + "#### Setup Parser" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0408e7b1-ce23-4422-b389-8f69f6e37a6a", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Retrying llama_cloud_services.parse.utils.make_api_request.._make_request in 4.0 seconds as it raised RemoteProtocolError: Server disconnected without sending a response..\n", + "Retrying llama_cloud_services.parse.utils.make_api_request.._make_request in 4.0 seconds as it raised RemoteProtocolError: Server disconnected without sending a response..\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Started parsing the file under job_id 36c97d60-995c-4b82-824b-3553c55d93e3\n" + ] + } + ], + "source": [ + "from llama_cloud_services import LlamaParse\n", + "\n", + "parser = LlamaParse(\n", + " premium_mode=True,\n", + " result_type=\"markdown\",\n", + " project_id=project_id,\n", + " organization_id=organization_id,\n", + ")\n", + "result = await parser.aparse(\"./data/asset_manager_fund_analysis/fidelity_fund.pdf\")\n", + "markdown_nodes = await result.aget_markdown_nodes(split_by_page=True)" + ] + }, + { + "cell_type": "markdown", + "id": "e4bb2c88-d4ca-4eb8-bcc5-420ddf5bc0a0", + "metadata": {}, + "source": [ + "### Define Splitting Functions\n", + "\n", + "We then define a set of functions to find the \"splits\". \n", + "1. We first identify the split categories given a user description. (i.e. what are the precise labels we want to split on?) \n", + "2. We then split the document according to the split categories. \n", + "\n", + "On (2), we do this by using LLMs on each page to detect if a split starts on that page. We then count all pages from the start of the current split to the start of the next split as belonging to the current split."
+ ] + }, + { + "cell_type": "markdown", + "id": "a3610297-9319-4b31-a351-8c1458fbb08c", + "metadata": {}, + "source": [ + "#### Find Split Categories" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "951ee2b9-5d90-479b-9e92-247037c66742", + "metadata": {}, + "outputs": [], + "source": [ + "from pydantic import BaseModel, Field\n", + "from typing import List, Optional, Dict\n", + "from llama_index.core.schema import TextNode\n", + "from llama_index.core.llms import LLM\n", + "from llama_index.core.async_utils import run_jobs\n", + "from llama_index.core.prompts import ChatPromptTemplate, ChatMessage\n", + "from collections import defaultdict\n", + "\n", + "\n", + "split_category_prompt = \"\"\"\\\n", + "You are an AI document assistant tasked with finding the 'split categories' given a user description and the document text.\n", + "- The split categories is a list of string tags from the document that correspond to the user description.\n", + "- Do not make up split categories. \n", + "- Do not include category tags that don't fit the user description,\\\n", + "for instance subcategories or extraneous titles.\n", + "- Do not exclude category tags that do fit the user description. \n", + "\n", + "For instance, if the user asks to \"find all top-level sections of an ArXiv paper\", then a sample output would be:\n", + "[\"1. Introduction\", \"2. Related Work\", \"3. Methodology\", \"4. Experiments\", \"5. Conclusion\"]\n", + "\n", + "The split description and document text are given below. \n", + "\n", + "Split description:\n", + "{split_description}\n", + "\n", + "Here is the document text:\n", + "{document_text}\n", + " \n", + "\"\"\"\n", + "\n", + "\n", + "class SplitCategories(BaseModel):\n", + " \"\"\"A list of all split categories from a document.\"\"\"\n", + "\n", + " split_categories: List[str]\n", + "\n", + "\n", + "async def afind_split_categories(\n", + " split_description: str,\n", + " nodes: List[TextNode],\n", + " llm: Optional[LLM] = None,\n", + " page_limit: Optional[int] = 5,\n", + ") -> List[str]:\n", + " \"\"\"Find split categories given a user description and the page limit.\n", + " \n", + " These categories will then be used to find the exact splits of the document. \n", + " \n", + " NOTE: with the page limit there is an assumption that the categories are found in the first few pages,\\\n", + " for instance in the table of contents. This does not account for the case where the categories are \\\n", + " found throughout the document. 
\n", + " \n", + " \"\"\"\n", + " llm = llm or OpenAI(model=\"gpt-4.1\")\n", + "\n", + " chat_template = ChatPromptTemplate(\n", + " [\n", + " ChatMessage.from_str(split_category_prompt, \"user\"),\n", + " ]\n", + " )\n", + " nodes_head = nodes[:page_limit] if page_limit is not None else nodes\n", + " doc_text = \"\\n-----\\n\".join(\n", + " [n.get_content(metadata_mode=\"all\") for n in nodes_head]\n", + " )\n", + "\n", + " result = await llm.astructured_predict(\n", + " SplitCategories,\n", + " chat_template,\n", + " split_description=split_description,\n", + " document_text=doc_text,\n", + " )\n", + " return result.split_categories" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ccd7f58d-f48c-4b87-96ca-baf132d71a75", + "metadata": {}, + "outputs": [], + "source": [ + "split_categories = await afind_split_categories(\n", + " \"Find and split by the main funds in this document, should be listed in the first few pages\",\n", + " markdown_nodes,\n", + " llm=llm,\n", + " page_limit=5,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c7047d15-48d3-4469-91d3-037529a47140", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Fidelity Asset Manager® 20%',\n", + " 'Fidelity Asset Manager® 30%',\n", + " 'Fidelity Asset Manager® 40%',\n", + " 'Fidelity Asset Manager® 50%',\n", + " 'Fidelity Asset Manager® 60%',\n", + " 'Fidelity Asset Manager® 70%',\n", + " 'Fidelity Asset Manager® 85%']" + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "split_categories" + ] + }, + { + "cell_type": "markdown", + "id": "c7fd8746-4a45-4f79-afca-30d28b99a183", + "metadata": {}, + "source": [ + "#### Split Document based on Existing Categories/Rules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "471cf952-713f-47eb-a40e-c2318564fa19", + "metadata": {}, + "outputs": [], + "source": [ + "# go through each node (on a page level), and tag node\n", + "\n", + "\n", + "split_prompt = \"\"\"\\\n", + "You are an AI document assistant tasked with extracting out splits from a document text according to a certain set of rules. \n", + "\n", + "You are given a chunk of the document text at a time. \n", + "You are responsible for determining if the chunk of the document text corresponds to the beginning of a split. \n", + "\n", + "We've listed general rules below, and the user has also provided their own rules to find a split. Please extract\n", + "out the splits according to the defined schema. \n", + "\n", + "General Rules: \n", + "- You should ONLY extract out a split if the document text contains the beginning of a split.\n", + "- If the document text contains the beginning of two or more splits (e.g. there are multiple sections on a single page), then \\\n", + "return all splits in the output.\n", + "- If the text does not correspond to the beginning of any split, then return a blank list. \n", + "- A valid split must be clearly delineated in the document text according to the user rules. \\\n", + "Do NOT identify a split if it is mentioned, but is not actually the start of a split in the document text.\n", + "- If you do find one or more splits, please output the split_name according to the format \\\"{split_key}_X\\\", \\\n", + "where X is a short tag corresponding to the split. 
\n", + "\n", + "Split key:\n", + "{split_key}\n", + "\n", + "User-defined rules:\n", + "{split_rules}\n", + "\n", + "\n", + "Here is the chunk text:\n", + "{chunk_text}\n", + " \n", + "\"\"\"\n", + "\n", + "\n", + "class SplitOutput(BaseModel):\n", + " \"\"\"The metadata for a given split start given a chunk.\"\"\"\n", + "\n", + " split_name: str = Field(\n", + " ..., description=\"The name of the split (in the format \\{split_key\\}_X)\"\n", + " )\n", + " split_description: str = Field(\n", + " ..., description=\"A short description corresponding to the split.\"\n", + " )\n", + " page_number: int = Field(..., description=\"Page number of the split.\")\n", + "\n", + "\n", + "class SplitsOutput(BaseModel):\n", + " \"\"\"A list of all splits given a chunk.\"\"\"\n", + "\n", + " splits: List[SplitOutput]\n", + "\n", + "\n", + "async def atag_splits_in_node(\n", + " split_rules: str, split_key: str, node: TextNode, llm: Optional[LLM] = None\n", + "):\n", + " \"\"\"Tag split in a single node.\"\"\"\n", + " llm = llm or OpenAI(model=\"gpt-4o\")\n", + "\n", + " chat_template = ChatPromptTemplate(\n", + " [\n", + " ChatMessage.from_str(split_prompt, \"user\"),\n", + " ]\n", + " )\n", + "\n", + " result = await llm.astructured_predict(\n", + " SplitsOutput,\n", + " chat_template,\n", + " split_rules=split_rules,\n", + " split_key=split_key,\n", + " chunk_text=node.get_content(metadata_mode=\"all\"),\n", + " )\n", + " return result.splits\n", + "\n", + "\n", + "async def afind_splits(\n", + " split_rules: str, split_key: str, nodes: List[TextNode], llm: Optional[LLM] = None\n", + ") -> Dict:\n", + " \"\"\"Find splits.\"\"\"\n", + "\n", + " # tag each node with split or no-split\n", + " tasks = [atag_splits_in_node(split_rules, split_key, n, llm=llm) for n in nodes]\n", + " async_results = await run_jobs(tasks, workers=8, show_progress=True)\n", + " all_splits = [s for r in async_results for s in r]\n", + "\n", + " split_name_to_pages = defaultdict(list)\n", + "\n", + " split_idx = 0\n", + " for idx, n in enumerate(nodes):\n", + " cur_page = n.metadata[\"page_number\"]\n", + "\n", + " # update the current split if needed\n", + " while (\n", + " split_idx + 1 < len(all_splits)\n", + " and all_splits[split_idx + 1].page_number <= cur_page\n", + " ):\n", + " split_idx += 1\n", + "\n", + " # add page number to the current split\n", + " if all_splits[split_idx].page_number <= cur_page:\n", + " split_name = all_splits[split_idx].split_name\n", + " split_name_to_pages[split_name].append(cur_page)\n", + "\n", + " # print(all_splits)\n", + " # print(split_name_to_pages)\n", + " # raise Exception\n", + "\n", + " return split_name_to_pages" + ] + }, + { + "cell_type": "markdown", + "id": "737573c6-771a-4fde-bcb3-221b5e3616f7", + "metadata": {}, + "source": [ + "#### Find Categories and then Run Splits" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b5552735-e4f7-4a76-a58a-dad2a5bfab3e", + "metadata": {}, + "outputs": [], + "source": [ + "# put it all together - detect categories, then split document based on those categories\n", + "async def afind_categories_and_splits(\n", + " split_description: str,\n", + " split_key: str,\n", + " nodes: List[TextNode],\n", + " additional_split_rules: Optional[str] = None,\n", + " llm: Optional[LLM] = None,\n", + " page_limit: int = 5,\n", + " verbose: bool = False,\n", + "):\n", + " \"\"\"Find categories and then splits.\"\"\"\n", + " categories = await afind_split_categories(\n", + " split_description, nodes, llm=llm, page_limit=page_limit\n", + " 
)\n", + " if verbose:\n", + " print(f\"Split categories: {categories}\")\n", + " full_split_rules = f\"\"\"Please split by these categories: {categories}\"\"\"\n", + " if additional_split_rules:\n", + " full_split_rules += f\"\\n\\n\\n{additional_split_rules}\"\n", + "\n", + " return await afind_splits(full_split_rules, split_key, nodes, llm=llm)" + ] + }, + { + "cell_type": "markdown", + "id": "279ad80d-3c1d-4a73-9c61-a9bc819571be", + "metadata": {}, + "source": [ + "#### Define document-specific split rules\n", + "\n", + "Now that we have the general splitter, let's define the rules to help parse this Fidelity document.\n", + "\n", + "**NOTE**: Ideally the valid names are automatically identified, through a separate pass in the workflow." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11ea71c0-ee30-47a1-88d2-62899cd57d57", + "metadata": {}, + "outputs": [], + "source": [ + "# fidelity_split_rules = \"\"\"\n", + "# - You must split by the name of the fund\n", + "# - Here are the valid names - do not include subcategories:\n", + "# Fidelity Asset Manager® 20%\n", + "# Fidelity Asset Manager® 30%\n", + "# Fidelity Asset Manager® 40%\n", + "# Fidelity Asset Manager® 50%\n", + "# Fidelity Asset Manager® 60%\n", + "# Fidelity Asset Manager® 70%\n", + "# Fidelity Asset Manager® 85%\n", + "# - Each fund will have a list of tables underneath it, like schedule of investments, financial statements\n", + "# - Each fund usually has schedule of investments right underneath it\n", + "# - Do not tag the cover page/table of contents\n", + "# \"\"\"\n", + "# fidelity_split_key = \"fidelity_asset_manager\"\n", + "\n", + "\n", + "fidelity_split_description = \"Find and split by the main funds in this document, should be listed in the first few pages\"\n", + "fidelity_split_rules = \"\"\"\n", + "- You must split by the name of the fund\n", + "- Each fund will have a list of tables underneath it, like schedule of investments, financial statements\n", + "- Each fund usually has schedule of investments right underneath it \n", + "- Do not tag the cover page/table of contents\n", + "\"\"\"\n", + "fidelity_split_key = \"fidelity_asset_manager\"" + ] + }, + { + "cell_type": "markdown", + "id": "a0e41c96-4ccf-43ff-b24d-1bc3a4828825", + "metadata": {}, + "source": [ + "#### Try it out" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a344810c-acf3-49db-9dc7-519b55f371be", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Split categories: ['Fidelity Asset Manager® 20%', 'Fidelity Asset Manager® 30%', 'Fidelity Asset Manager® 40%', 'Fidelity Asset Manager® 50%', 'Fidelity Asset Manager® 60%', 'Fidelity Asset Manager® 70%', 'Fidelity Asset Manager® 85%']\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|████████████████████████████████████████████████████████████████████| 120/120 [00:28<00:00, 4.20it/s]\n" + ] + } + ], + "source": [ + "split_name_to_pages = await afind_categories_and_splits(\n", + " fidelity_split_description,\n", + " fidelity_split_key,\n", + " markdown_nodes,\n", + " additional_split_rules=fidelity_split_rules,\n", + " llm=llm,\n", + " verbose=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c20304fd-031d-4556-bfa0-0c626175df76", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "defaultdict(list,\n", + " {'fidelity_asset_manager_20': [2,\n", + " 3,\n", + " 4,\n", + " 5,\n", + " 6,\n", + " 7,\n", + " 8,\n", + " 
9,\n", + " 10,\n", + " 11,\n", + " 12,\n", + " 13,\n", + " 93,\n", + " 94,\n", + " 108,\n", + " 109,\n", + " 110,\n", + " 111],\n", + " 'fidelity_asset_manager_30': [14,\n", + " 15,\n", + " 16,\n", + " 17,\n", + " 18,\n", + " 19,\n", + " 20,\n", + " 21,\n", + " 22,\n", + " 23,\n", + " 24,\n", + " 25,\n", + " 99,\n", + " 101,\n", + " 112],\n", + " 'fidelity_asset_manager_40': [26,\n", + " 27,\n", + " 28,\n", + " 29,\n", + " 30,\n", + " 31,\n", + " 32,\n", + " 33,\n", + " 34,\n", + " 35,\n", + " 36,\n", + " 88,\n", + " 102,\n", + " 106,\n", + " 113],\n", + " 'fidelity_asset_manager_50': [37,\n", + " 38,\n", + " 39,\n", + " 40,\n", + " 41,\n", + " 42,\n", + " 43,\n", + " 44,\n", + " 45,\n", + " 46,\n", + " 47,\n", + " 91,\n", + " 114],\n", + " 'fidelity_asset_manager_60': [48,\n", + " 49,\n", + " 50,\n", + " 51,\n", + " 52,\n", + " 53,\n", + " 54,\n", + " 55,\n", + " 56,\n", + " 57,\n", + " 58,\n", + " 103],\n", + " 'fidelity_asset_manager_70': [59,\n", + " 60,\n", + " 61,\n", + " 62,\n", + " 63,\n", + " 64,\n", + " 65,\n", + " 66,\n", + " 67,\n", + " 68,\n", + " 69,\n", + " 70,\n", + " 96],\n", + " 'fidelity_asset_manager_85': [71,\n", + " 72,\n", + " 73,\n", + " 74,\n", + " 75,\n", + " 76,\n", + " 77,\n", + " 78,\n", + " 79,\n", + " 80,\n", + " 81,\n", + " 82,\n", + " 83,\n", + " 84,\n", + " 85,\n", + " 86,\n", + " 87,\n", + " 89,\n", + " 90,\n", + " 92,\n", + " 95,\n", + " 97,\n", + " 98,\n", + " 100,\n", + " 104,\n", + " 105,\n", + " 107,\n", + " 115,\n", + " 116,\n", + " 117,\n", + " 118,\n", + " 119,\n", + " 120]})" + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "split_name_to_pages" + ] + }, + { + "cell_type": "markdown", + "id": "32092f81-0af0-4240-80d7-73e72a44170c", + "metadata": {}, + "source": [ + "## Run Per-Fund Extraction, Consolidate Results\n", + "\n", + "We then use LlamaExtract on the aggregated text for each split (page numbers corresponding to each fund), to extract out a Pydantic `FundData` object for each fund.\n", + "\n", + "We then consolidate them together into a CSV." + ] + }, + { + "cell_type": "markdown", + "id": "d9dc62cc-122e-4bec-a1e9-aa7f594be167", + "metadata": {}, + "source": [ + "#### Define Fund Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4bb4fdd4-a8ed-4c90-a0ff-7112a8019c83", + "metadata": {}, + "outputs": [], + "source": [ + "# Define output schema\n", + "from pydantic import BaseModel\n", + "from typing import Optional\n", + "\n", + "\n", + "class FundData(BaseModel):\n", + " \"\"\"Concise fund data extraction schema optimized for LLM extraction\"\"\"\n", + "\n", + " # Identifiers\n", + " fund_name: str = Field(\n", + " ...,\n", + " description=\"Full fund name exactly as it appears, e.g. 'Fidelity Asset Manager® 20%'\",\n", + " )\n", + " target_equity_pct: Optional[int] = Field(\n", + " None,\n", + " description=\"Target equity percentage from fund name (20, 30, 40, 50, 60, 70, or 85)\",\n", + " )\n", + " report_date: Optional[str] = Field(\n", + " None, description=\"Report date in YYYY-MM-DD format, e.g. '2024-09-30'\"\n", + " )\n", + "\n", + " # Asset Allocation (as percentages, e.g. 
27.4 for 27.4%)\n", + " equity_pct: Optional[float] = Field(\n", + " None,\n", + " description=\"Actual equity allocation percentage from 'Equity Central Funds' section\",\n", + " )\n", + " fixed_income_pct: Optional[float] = Field(\n", + " None,\n", + " description=\"Fixed income allocation percentage from 'Fixed-Income Central Funds' section\",\n", + " )\n", + " money_market_pct: Optional[float] = Field(\n", + " None,\n", + " description=\"Money market allocation percentage from 'Money Market Central Funds' section\",\n", + " )\n", + " other_pct: Optional[float] = Field(\n", + " None,\n", + " description=\"Other investments percentage (Treasury + Investment Companies + other)\",\n", + " )\n", + "\n", + " # Primary Share Class Metrics (use the main retail class, usually named after the fund)\n", + " nav: Optional[float] = Field(\n", + " None,\n", + " description=\"Net Asset Value per share for the main retail class (e.g. Asset Manager 20% class)\",\n", + " )\n", + " net_assets_usd: Optional[float] = Field(\n", + " None,\n", + " description=\"Total net assets in USD for the main retail class from 'Net Asset Value' section\",\n", + " )\n", + " expense_ratio: Optional[float] = Field(\n", + " None,\n", + " description=\"Expense ratio as percentage (e.g. 0.48 for 0.48%) from Financial Highlights\",\n", + " )\n", + " management_fee: Optional[float] = Field(\n", + " None,\n", + " description=\"Management fee rate as percentage from Financial Highlights or Notes\",\n", + " )\n", + "\n", + " # Performance (as percentages)\n", + " one_year_return: Optional[float] = Field(\n", + " None,\n", + " description=\"One-year total return percentage from Financial Highlights (e.g. 13.74 for 13.74%)\",\n", + " )\n", + " portfolio_turnover: Optional[float] = Field(\n", + " None, description=\"Portfolio turnover rate percentage from Financial Highlights\"\n", + " )\n", + "\n", + " # Risk Metrics (in USD)\n", + " equity_futures_notional: Optional[float] = Field(\n", + " None,\n", + " description=\"Net notional amount of equity futures contracts (positive if net long, negative if net short)\",\n", + " )\n", + " bond_futures_notional: Optional[float] = Field(\n", + " None,\n", + " description=\"Net notional amount of bond/treasury futures contracts (positive if net long, negative if net short)\",\n", + " )\n", + "\n", + " # Fund Flows (in USD)\n", + " net_investment_income: Optional[float] = Field(\n", + " None,\n", + " description=\"Net investment income for the period from Statement of Operations\",\n", + " )\n", + " total_distributions: Optional[float] = Field(\n", + " None,\n", + " description=\"Total distributions to shareholders from Statement of Changes in Net Assets\",\n", + " )\n", + " net_asset_change: Optional[float] = Field(\n", + " None,\n", + " description=\"Net change in assets from beginning to end of period (end minus beginning net assets)\",\n", + " )\n", + "\n", + "\n", + "class FundComparisonData(BaseModel):\n", + " \"\"\"Flattened data optimized for CSV export and analysis\"\"\"\n", + "\n", + " funds: list[FundData]\n", + "\n", + " def to_csv_rows(self) -> list[dict]:\n", + " \"\"\"Convert to list of dictionaries for CSV export\"\"\"\n", + " return [fund.dict() for fund in self.funds]" + ] + }, + { + "cell_type": "markdown", + "id": "b681ae85-fc76-45c6-a191-c30b84e748f5", + "metadata": {}, + "source": [ + "#### Setup LlamaExtract " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b113b682-a823-4ad4-921e-0f2444da0a04", + "metadata": {}, + "outputs": [ + { + "name": 
"stdout", + "output_type": "stream", + "text": [ + "deleting existing agent\n" + ] + } + ], + "source": [ + "from llama_cloud_services import LlamaExtract\n", + "from llama_cloud.core.api_error import ApiError\n", + "from llama_cloud import ExtractConfig\n", + "\n", + "\n", + "# Optionally, add your project id/organization id\n", + "llama_extract = LlamaExtract(\n", + " show_progress=True,\n", + " check_interval=5,\n", + " project_id=project_id,\n", + " organization_id=organization_id,\n", + ")\n", + "\n", + "\n", + "try:\n", + " existing_agent = llama_extract.get_agent(name=\"FundDataExtractor2\")\n", + " if existing_agent:\n", + " print(\"deleting existing agent\")\n", + " # Deletion can take some time since all underlying files will be purged\n", + " llama_extract.delete_agent(existing_agent.id)\n", + "except ApiError as e:\n", + " if e.status_code == 404:\n", + " pass\n", + " else:\n", + " raise" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f4ce337-e8b3-48c2-b25e-ed141b503735", + "metadata": {}, + "outputs": [], + "source": [ + "extract_config = ExtractConfig(\n", + " extraction_mode=\"BALANCED\",\n", + ")\n", + "\n", + "extract_agent = llama_extract.create_agent(\n", + " \"FundDataExtractor2\", data_schema=FundData, config=extract_config\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e63c6b41-dcc4-4354-ba9d-be462931b45b", + "metadata": {}, + "source": [ + "#### Define Extraction Functions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e37b7050-924d-4afe-9dd3-c77f11d9d698", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_cloud_services.extract import SourceText\n", + "from typing import List, Optional, Dict\n", + "\n", + "\n", + "async def aextract_data_over_split(\n", + " split_name: str,\n", + " page_numbers: List[int],\n", + " nodes: List[TextNode],\n", + " llm: Optional[LLM] = None,\n", + ") -> FundData:\n", + " \"\"\"Extract fund data for a given split.\"\"\"\n", + "\n", + " # combine node text that matches the page numbers\n", + " filtered_nodes = [n for n in nodes if n.metadata[\"page_number\"] in page_numbers]\n", + " filtered_text = \"\\n-------\\n\".join(\n", + " [n.get_content(metadata_mode=\"all\") for n in filtered_nodes]\n", + " )\n", + " result_dict = (\n", + " await extract_agent.aextract(SourceText(text_content=filtered_text))\n", + " ).data\n", + "\n", + " fund_data = FundData.model_validate(result_dict)\n", + "\n", + " return fund_data\n", + "\n", + "\n", + "async def aextract_data_over_splits(\n", + " split_name_to_pages: Dict[str, List],\n", + " nodes: List[TextNode],\n", + " llm: Optional[LLM] = None,\n", + "):\n", + " \"\"\"Extract fund data for each split, aggregate.\"\"\"\n", + " tasks = [\n", + " aextract_data_over_split(split_name, page_numbers, nodes, llm=llm)\n", + " for split_name, page_numbers in split_name_to_pages.items()\n", + " ]\n", + " all_fund_data = await run_jobs(tasks, workers=8, show_progress=True)\n", + " return FundComparisonData(funds=all_fund_data)" + ] + }, + { + "cell_type": "markdown", + "id": "89b768db-877d-460f-8f32-61ba954a492f", + "metadata": {}, + "source": [ + "#### Try it out" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b5be6348-2533-4513-9d01-05b8cca5f5e2", + "metadata": {}, + "outputs": [], + "source": [ + "all_fund_data = await aextract_data_over_splits(\n", + " split_name_to_pages, markdown_nodes, llm=llm\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"121df331-0b35-48ec-b056-686e5d56fa45", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
fund_nametarget_equity_pctreport_dateequity_pctfixed_income_pctmoney_market_pctother_pctnavnet_assets_usdexpense_ratiomanagement_feeone_year_returnportfolio_turnoverequity_futures_notionalbond_futures_notionalnet_investment_incometotal_distributionsnet_asset_change
0Fidelity Asset Manager® 20%202024-09-3027.448.616.08.013.953.170882e+090.480.4313.7424.0-308348.0159650906.0180771003.01.815011e+08148593411.0
1Fidelity Asset Manager® 30%302024-09-3037.448.46.38.112.211.375227e+090.490.4416.2019.0-117123430.060683344.060219520.06.074714e+0761425310.0
2Fidelity Asset Manager® 40%402024-09-3047.343.36.43.013.401.544767e+090.51NaN18.2413.0-131196445.068340188.059239637.03.794509e+07117313232.0
3Fidelity Asset Manager® 50%502024-09-3057.137.32.13.521.447.407487e+090.560.5820.3712.0-550922128.0288445875.0223428532.02.197314e+08771033500.0
4Fidelity Asset Manager® 60%602024-09-3063.832.20.73.316.322.282371e+090.640.5422.3117.0-103026215.0127423594.084848944.07.662887e+07630046421.0
5Fidelity Asset Manager® 70%702024-09-3073.722.31.03.329.094.273012e+090.640.5424.0614.0-162028290.0202620656.0119678101.01.121974e+08935150295.0
6Fidelity Asset Manager® 85%852024-09-3084.611.21.93.327.854.394805e+090.940.6327.2316.0105407845.0129202345.060293654.03.695516e+08935099874.0
\n", + "
" + ], + "text/plain": [ + " fund_name target_equity_pct report_date equity_pct \\\n", + "0 Fidelity Asset Manager® 20% 20 2024-09-30 27.4 \n", + "1 Fidelity Asset Manager® 30% 30 2024-09-30 37.4 \n", + "2 Fidelity Asset Manager® 40% 40 2024-09-30 47.3 \n", + "3 Fidelity Asset Manager® 50% 50 2024-09-30 57.1 \n", + "4 Fidelity Asset Manager® 60% 60 2024-09-30 63.8 \n", + "5 Fidelity Asset Manager® 70% 70 2024-09-30 73.7 \n", + "6 Fidelity Asset Manager® 85% 85 2024-09-30 84.6 \n", + "\n", + " fixed_income_pct money_market_pct other_pct nav net_assets_usd \\\n", + "0 48.6 16.0 8.0 13.95 3.170882e+09 \n", + "1 48.4 6.3 8.1 12.21 1.375227e+09 \n", + "2 43.3 6.4 3.0 13.40 1.544767e+09 \n", + "3 37.3 2.1 3.5 21.44 7.407487e+09 \n", + "4 32.2 0.7 3.3 16.32 2.282371e+09 \n", + "5 22.3 1.0 3.3 29.09 4.273012e+09 \n", + "6 11.2 1.9 3.3 27.85 4.394805e+09 \n", + "\n", + " expense_ratio management_fee one_year_return portfolio_turnover \\\n", + "0 0.48 0.43 13.74 24.0 \n", + "1 0.49 0.44 16.20 19.0 \n", + "2 0.51 NaN 18.24 13.0 \n", + "3 0.56 0.58 20.37 12.0 \n", + "4 0.64 0.54 22.31 17.0 \n", + "5 0.64 0.54 24.06 14.0 \n", + "6 0.94 0.63 27.23 16.0 \n", + "\n", + " equity_futures_notional bond_futures_notional net_investment_income \\\n", + "0 -308348.0 159650906.0 180771003.0 \n", + "1 -117123430.0 60683344.0 60219520.0 \n", + "2 -131196445.0 68340188.0 59239637.0 \n", + "3 -550922128.0 288445875.0 223428532.0 \n", + "4 -103026215.0 127423594.0 84848944.0 \n", + "5 -162028290.0 202620656.0 119678101.0 \n", + "6 105407845.0 129202345.0 60293654.0 \n", + "\n", + " total_distributions net_asset_change \n", + "0 1.815011e+08 148593411.0 \n", + "1 6.074714e+07 61425310.0 \n", + "2 3.794509e+07 117313232.0 \n", + "3 2.197314e+08 771033500.0 \n", + "4 7.662887e+07 630046421.0 \n", + "5 1.121974e+08 935150295.0 \n", + "6 3.695516e+08 935099874.0 " + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "all_fund_data_df = pd.DataFrame(all_fund_data.to_csv_rows())\n", + "all_fund_data_df" + ] + }, + { + "cell_type": "markdown", + "id": "589fcb4d-39a8-4aef-81ca-3702f095a261", + "metadata": {}, + "source": [ + "## Define Full Workflow\n", + "\n", + "We put everything together into a LlamaIndex workflow! 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "66fae1ce-5372-4c63-813b-0def195feb7d", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.core.workflow import (\n", + " Event,\n", + " StartEvent,\n", + " StopEvent,\n", + " Context,\n", + " Workflow,\n", + " step,\n", + ")\n", + "\n", + "\n", + "class ParseDocEvent(Event):\n", + " nodes: List[TextNode]\n", + "\n", + "\n", + "class DocSplitEvent(Event):\n", + " split_name_to_pages: Dict[str, List[int]]\n", + " nodes: List[TextNode]\n", + "\n", + "\n", + "class FidelityFundExtraction(Workflow):\n", + " \"\"\"\n", + " Workflow to extract data from a solar panel datasheet and generate a comparison report\n", + " against provided design requirements.\n", + " \"\"\"\n", + "\n", + " def __init__(\n", + " self,\n", + " parser: LlamaParse,\n", + " extract_agent: LlamaExtract,\n", + " split_description: str = fidelity_split_description,\n", + " split_rules: str = fidelity_split_rules,\n", + " split_key: str = fidelity_split_key,\n", + " llm: Optional[LLM] = None,\n", + " **kwargs,\n", + " ):\n", + " super().__init__(**kwargs)\n", + " self.parser = parser\n", + " self.extract_agent = extract_agent\n", + " self.split_description = split_description\n", + " self.split_rules = split_rules\n", + " self.split_key = split_key\n", + " self.llm = llm\n", + "\n", + " @step\n", + " async def parse_doc(self, ctx: Context, ev: StartEvent) -> ParseDocEvent:\n", + " \"\"\"Parse document into markdown nodes.\"\"\"\n", + " result = await parser.aparse(file_path=ev.file_path)\n", + " markdown_nodes = await result.aget_markdown_nodes(split_by_page=True)\n", + " return ParseDocEvent(nodes=markdown_nodes)\n", + "\n", + " @step\n", + " async def find_splits(self, ctx: Context, ev: ParseDocEvent) -> DocSplitEvent:\n", + " split_name_to_pages = await afind_categories_and_splits(\n", + " self.split_description,\n", + " self.split_key,\n", + " ev.nodes,\n", + " additional_split_rules=self.split_rules,\n", + " llm=llm,\n", + " verbose=True,\n", + " )\n", + " return DocSplitEvent(\n", + " split_name_to_pages=split_name_to_pages,\n", + " nodes=ev.nodes,\n", + " )\n", + "\n", + " @step\n", + " async def run_extraction(self, ctx: Context, ev: DocSplitEvent) -> StopEvent:\n", + " all_fund_data = await aextract_data_over_splits(\n", + " ev.split_name_to_pages, ev.nodes, llm=self.llm\n", + " )\n", + " all_fund_data_df = pd.DataFrame(all_fund_data.to_csv_rows())\n", + " return StopEvent(\n", + " result={\n", + " \"all_fund_data\": all_fund_data,\n", + " \"all_fund_data_df\": all_fund_data_df,\n", + " }\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "870368c6-a164-4f84-973f-d7534fce28cb", + "metadata": {}, + "outputs": [], + "source": [ + "workflow = FidelityFundExtraction(\n", + " parser=parser, extract_agent=extract_agent, verbose=True, timeout=None\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0813641-105a-4321-b46a-27a2c48df39c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "fidelity_fund_extraction.html\n" + ] + } + ], + "source": [ + "from llama_index.utils.workflow import draw_all_possible_flows\n", + "\n", + "draw_all_possible_flows(\n", + " FidelityFundExtraction,\n", + " filename=\"fidelity_fund_extraction.html\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0f14ac3-c681-46a9-8664-0fc8bd1513f4", + "metadata": {}, + "outputs": [], + "source": [ + "result = await 
workflow.run(\n", + " file_path=\"./data/asset_manager_fund_analysis/fidelity_fund.pdf\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61a4608b-0807-4d3e-9166-d393d21fb970", + "metadata": {}, + "outputs": [], + "source": [ + "all_fund_data_df = result[\"all_fund_data_df\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe931879-3fbc-4c26-95ed-a7b6790a0a5f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
fund_nametarget_equity_pctreport_dateequity_pctfixed_income_pctmoney_market_pctother_pctnavnet_assets_usdexpense_ratiomanagement_feeone_year_returnportfolio_turnoverequity_futures_notionalbond_futures_notionalnet_investment_incometotal_distributionsnet_asset_change
0Fidelity Asset Manager® 20%202024-09-3027.448.616.08.013.953.170882e+090.480.4813.7424.0-30834897.0159650906.0180771003.0181501111.0148593411.0
1Fidelity Asset Manager® 30%302024-09-3037.448.46.38.112.211.375227e+090.490.4416.2019.0-117123430.060683344.060219520.060747136.061325310.0
2Fidelity Asset Manager® 40%402024-09-3047.343.36.43.013.401.544767e+090.510.4318.2413.0-131196445.068340188.059239637.0-37925052.0117313232.0
3Fidelity Asset Manager® 50%502024-09-3057.137.32.13.521.447.407487e+090.560.5320.3712.0-550922128.0288445875.0223428532.0219731373.0771033500.0
4Fidelity Asset Manager® 60%602024-09-3063.832.20.73.316.322.282371e+090.640.5422.3117.0-103026215.0127423594.084848944.076628870.0630046421.0
5Fidelity Asset Manager® 70%702024-09-3073.722.31.03.329.094.273012e+090.640.1624.0614.0-162028290.0202620656.0119678101.0112197413.0935150295.0
6Fidelity Asset Manager® 85%852024-09-3084.611.21.93.327.854.394805e+090.940.5227.2316.0104412803.0129202345.060293654.051390318.0935099874.0
\n", + "
" + ], + "text/plain": [ + " fund_name target_equity_pct report_date equity_pct \\\n", + "0 Fidelity Asset Manager® 20% 20 2024-09-30 27.4 \n", + "1 Fidelity Asset Manager® 30% 30 2024-09-30 37.4 \n", + "2 Fidelity Asset Manager® 40% 40 2024-09-30 47.3 \n", + "3 Fidelity Asset Manager® 50% 50 2024-09-30 57.1 \n", + "4 Fidelity Asset Manager® 60% 60 2024-09-30 63.8 \n", + "5 Fidelity Asset Manager® 70% 70 2024-09-30 73.7 \n", + "6 Fidelity Asset Manager® 85% 85 2024-09-30 84.6 \n", + "\n", + " fixed_income_pct money_market_pct other_pct nav net_assets_usd \\\n", + "0 48.6 16.0 8.0 13.95 3.170882e+09 \n", + "1 48.4 6.3 8.1 12.21 1.375227e+09 \n", + "2 43.3 6.4 3.0 13.40 1.544767e+09 \n", + "3 37.3 2.1 3.5 21.44 7.407487e+09 \n", + "4 32.2 0.7 3.3 16.32 2.282371e+09 \n", + "5 22.3 1.0 3.3 29.09 4.273012e+09 \n", + "6 11.2 1.9 3.3 27.85 4.394805e+09 \n", + "\n", + " expense_ratio management_fee one_year_return portfolio_turnover \\\n", + "0 0.48 0.48 13.74 24.0 \n", + "1 0.49 0.44 16.20 19.0 \n", + "2 0.51 0.43 18.24 13.0 \n", + "3 0.56 0.53 20.37 12.0 \n", + "4 0.64 0.54 22.31 17.0 \n", + "5 0.64 0.16 24.06 14.0 \n", + "6 0.94 0.52 27.23 16.0 \n", + "\n", + " equity_futures_notional bond_futures_notional net_investment_income \\\n", + "0 -30834897.0 159650906.0 180771003.0 \n", + "1 -117123430.0 60683344.0 60219520.0 \n", + "2 -131196445.0 68340188.0 59239637.0 \n", + "3 -550922128.0 288445875.0 223428532.0 \n", + "4 -103026215.0 127423594.0 84848944.0 \n", + "5 -162028290.0 202620656.0 119678101.0 \n", + "6 104412803.0 129202345.0 60293654.0 \n", + "\n", + " total_distributions net_asset_change \n", + "0 181501111.0 148593411.0 \n", + "1 60747136.0 61325310.0 \n", + "2 -37925052.0 117313232.0 \n", + "3 219731373.0 771033500.0 \n", + "4 76628870.0 630046421.0 \n", + "5 112197413.0 935150295.0 \n", + "6 51390318.0 935099874.0 " + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "all_fund_data_df" + ] + }, + { + "cell_type": "markdown", + "id": "2fd06ac2-c012-496d-ba00-acddb08dadfc", + "metadata": {}, + "source": [ + "## Run Analysis over Compiled Data\n", + "\n", + "Now that we've created an aggregated dataframe from this report, we can now do good ol' Pandas analysis or use LLMs to help do text-to-CSV over it! " + ] + }, + { + "cell_type": "markdown", + "id": "c4de5633-090e-436c-869f-a1f8015c28d9", + "metadata": {}, + "source": [ + "#### Standard Pandas analysis\n", + "\n", + "The conservative 20% fund is actually the most return-efficient." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7bf74cee-3b4b-42b4-a6df-6567287dea82", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.501460\n", + "1 0.433155\n", + "2 0.385624\n", + "3 0.356743\n", + "4 0.349687\n", + "5 0.326459\n", + "6 0.321868\n", + "Name: return_per_risk, dtype: float64" + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Calculate return per unit of risk (equity allocation)\n", + "all_fund_data_df[\"return_per_risk\"] = (\n", + " all_fund_data_df[\"one_year_return\"] / all_fund_data_df[\"equity_pct\"]\n", + ")\n", + "all_fund_data_df[\"return_per_risk\"]" + ] + }, + { + "cell_type": "markdown", + "id": "b097e10b-af9e-4d90-b587-14495cdab94f", + "metadata": {}, + "source": [ + "The lower equity allocation funds consistently run higher than their target." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "48373eb3-5ebe-4610-b05d-3f0239925d97", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 7.4\n", + "1 7.4\n", + "2 7.3\n", + "3 7.1\n", + "4 3.8\n", + "5 3.7\n", + "6 -0.4\n", + "Name: drift, dtype: float64" + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# How far do actual allocations drift from targets?\n", + "all_fund_data_df[\"drift\"] = (\n", + " all_fund_data_df[\"equity_pct\"] - all_fund_data_df[\"target_equity_pct\"]\n", + ")\n", + "all_fund_data_df[\"drift\"]" + ] + }, + { + "cell_type": "markdown", + "id": "76580f49-f5d9-4a60-b5d2-89160f16272e", + "metadata": {}, + "source": [ + "#### Text-to-Pandas Analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e99a64d-b112-439c-871c-fe7905839d3f", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.experimental.query_engine import PandasQueryEngine" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6975fead-7278-4ee0-880a-5e65843e6c3c", + "metadata": {}, + "outputs": [], + "source": [ + "pd_query_engine = PandasQueryEngine(\n", + " df=all_fund_data_df, verbose=True, synthesize_response=True\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15ae8b4d-ce69-455e-8e2b-e6b277ff632a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "> Pandas Instructions:\n", + "```\n", + "df[df['equity_pct'] > df['target_equity_pct'] + 5]\n", + "```\n", + "> Pandas Output: fund_name target_equity_pct report_date equity_pct \\\n", + "0 Fidelity Asset Manager® 20% 20 2024-09-30 27.4 \n", + "1 Fidelity Asset Manager® 30% 30 2024-09-30 37.4 \n", + "2 Fidelity Asset Manager® 40% 40 2024-09-30 47.3 \n", + "3 Fidelity Asset Manager® 50% 50 2024-09-30 57.1 \n", + "\n", + " fixed_income_pct money_market_pct other_pct nav net_assets_usd \\\n", + "0 48.6 16.0 8.0 13.95 3.170882e+09 \n", + "1 48.4 6.3 8.1 12.21 1.375227e+09 \n", + "2 43.3 6.4 3.0 13.40 1.544767e+09 \n", + "3 37.3 2.1 3.5 21.44 7.407487e+09 \n", + "\n", + " expense_ratio management_fee one_year_return portfolio_turnover \\\n", + "0 0.48 0.48 13.74 24.0 \n", + "1 0.49 0.44 16.20 19.0 \n", + "2 0.51 0.43 18.24 13.0 \n", + "3 0.56 0.53 20.37 12.0 \n", + "\n", + " equity_futures_notional bond_futures_notional net_investment_income \\\n", + "0 -30834897.0 159650906.0 180771003.0 \n", + "1 -117123430.0 60683344.0 60219520.0 \n", + "2 -131196445.0 68340188.0 59239637.0 \n", + "3 -550922128.0 288445875.0 223428532.0 \n", + "\n", + " total_distributions net_asset_change return_per_risk drift \n", + "0 181501111.0 148593411.0 0.501460 7.4 \n", + "1 60747136.0 61325310.0 0.433155 7.4 \n", + "2 -37925052.0 117313232.0 0.385624 7.3 \n", + "3 219731373.0 771033500.0 0.356743 7.1 \n", + "The following funds have an actual equity allocation that is more than 5% higher than their target equity allocation as of the 2024-09-30 report date:\n", + "\n", + "1. **Fidelity Asset Manager® 20%**\n", + " - Target Equity: 20%\n", + " - Actual Equity: 27.4%\n", + " - Difference: +7.4%\n", + "\n", + "2. **Fidelity Asset Manager® 30%**\n", + " - Target Equity: 30%\n", + " - Actual Equity: 37.4%\n", + " - Difference: +7.4%\n", + "\n", + "3. **Fidelity Asset Manager® 40%**\n", + " - Target Equity: 40%\n", + " - Actual Equity: 47.3%\n", + " - Difference: +7.3%\n", + "\n", + "4. 
**Fidelity Asset Manager® 50%**\n", + " - Target Equity: 50%\n", + " - Actual Equity: 57.1%\n", + " - Difference: +7.1%\n", + "\n", + "All four funds listed have an actual equity allocation that exceeds their target by more than 5%.\n" + ] + } + ], + "source": [ + "response = pd_query_engine.query(\n", + " \"Show me all funds where the actual equity allocation is more than 5% higher than the target\"\n", + ")\n", + "print(str(response))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "llama_parse", + "language": "python", + "name": "llama_parse" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/document_processing/asset_manager_fund_analysis.png b/examples/document_processing/asset_manager_fund_analysis.png new file mode 100644 index 0000000..ae53523 Binary files /dev/null and b/examples/document_processing/asset_manager_fund_analysis.png differ diff --git a/examples/document_processing/build_knowledge_graph_with_neo4j_llamacloud.ipynb b/examples/document_processing/build_knowledge_graph_with_neo4j_llamacloud.ipynb new file mode 100644 index 0000000..8ae7b83 --- /dev/null +++ b/examples/document_processing/build_knowledge_graph_with_neo4j_llamacloud.ipynb @@ -0,0 +1,722 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Building a Knowledge Graph with LlamaCloud & Neo4J\n", + "\n", + "RAG is a powerful technique for enhancing LLMs with external knowledge, but traditional semantic similarity search often fails to capture nuanced relationships between entities, and can miss critical context that spans across multiple documents. 
By transforming unstructured documents into structured knowledge representations, we can perform complex graph traversals, relationship queries, and contextual reasoning that goes far beyond simple similarity matching.\n", + "\n", + "Tools like LlamaParse, LlamaClassify and LlamaExtract provide robust parsing, classification and extraction capabilities to convert raw documents into structured data, while Neo4j serves as the backbone for knowledge graph representation, forming the foundation of GraphRAG architectures that can understand not just what information exists, but how it all connects together.\n", + "\n", + "In this end-to-end tutorial, we will walk through an example of legal document processing that showcases the full pipeline shown below.\n", + "\n", + "The pipeline contains the following steps:\n", + "- Use [LlamaClassify](https://docs.cloud.llamaindex.ai/llamaclassify/getting_started) - a LlamaCloud tool currently in beta, to classify contracts\n", + "- Leverage [LlamaExtract](https://www.llamaindex.ai/llamaextract) to extract different sets of relevant attributes tailored to each specific contract category based on the classification\n", + "- Store all structured information into a Neo4j knowledge graph, creating a rich, queryable representation that captures both content and intricate relationships within legal documents\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up Requirements" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install llama-index-workflows llama-cloud-services jsonschema openai neo4j llama-index-llms-openai" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import uuid\n", + "from typing import List, Optional\n", + "\n", + "from getpass import getpass\n", + "from neo4j import AsyncGraphDatabase\n", + "from pydantic import BaseModel, Field\n", + "\n", + "from llama_cloud_services.beta.classifier.client import ClassifyClient\n", + "from llama_cloud.types import ClassifierRule\n", + "from llama_cloud.client import AsyncLlamaCloud\n", + "\n", + "from llama_cloud_services.extract import (\n", + " ExtractConfig,\n", + " ExtractMode,\n", + " LlamaExtract,\n", + " SourceText,\n", + ")\n", + "from llama_cloud_services.parse import LlamaParse" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Download Sample Contract\n", + "\n", + "Here, we download a sample PDF from the Cuad dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!wget https://raw.githubusercontent.com/tomasonjo/blog-datasets/5e3939d10216b7663687217c1646c30eb921d92f/CybergyHoldingsInc_Affliate%20Agreement.pdf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Set Up Neo4J\n", + "\n", + "For Neo4j, the simplest approach is to create a free [Aura database instance](https://neo4j.com/product/auradb/), and copy your credentials here." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "db_url = \"your-db-url\"\n", + "username = \"neo4j\"\n", + "password = \"your-password\"\n", + "\n", + "neo4j_driver = AsyncGraphDatabase.driver(\n", + " db_url,\n", + " auth=(\n", + " username,\n", + " password,\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Parse the Contract with LlamaParse\n", + "\n", + "Next, we set up LlamaParse and parse the PDF. In this case, we're using `parse_page_without_llm` mode." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "os.environ[\"LLAMA_CLOUD_API_KEY\"] = getpass(\"Enter your Llama API key: \")\n", + "os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OpenAI API key: \")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Started parsing the file under job_id 6687bc90-d4eb-48a5-b56a-f4bfe8f00d33\n" + ] + } + ], + "source": [ + "# Initialize parser with specified mode\n", + "parser = LlamaParse(parse_mode=\"parse_page_without_llm\")\n", + "\n", + "# Define the PDF file to parse\n", + "pdf_path = \"CybergyHoldingsInc_Affliate Agreement.pdf\"\n", + "\n", + "# Parse the document asynchronously\n", + "results = await parser.aparse(pdf_path)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Exhibit 10.27\n", + "\n", + " MARKETING AFFILIATE AGREEMENT\n", + "\n", + " Between:\n", + "\n", + "\n", + "Birch First Global Investments Inc.\n", + "\n", + " And\n", + "\n", + "\n", + " Mount Knowledge Holdings Inc.\n", + "\n", + "\n", + " Dated: May 8, 2014\n", + "\n", + " 1\n", + "\n", + "\n", + " Source: CYBERGY HOLDINGS, INC., 10-Q, 5/20/2014\n" + ] + } + ], + "source": [ + "print(results.pages[0].text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Contract classification with LlamaClassify (beta)\n", + "\n", + "In this example, we want to classify incoming contracts. They can either be `Affiliate Agreements` or `Co Branding`. 
For this cookbook, we are going to use **[LlamaClassify](https://docs.cloud.llamaindex.ai/llamaclassify/getting_started/)**, our powerful AI-powered document classification tool that's currently in **beta**!\n", + "\n", + "### Define classification rules\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rules = [\n", + " ClassifierRule(\n", + " type=\"Affiliate Agreements\",\n", + " description=\"This is a contract that outlines an affiliate agreement\",\n", + " ),\n", + " ClassifierRule(\n", + " type=\"Co Branding\",\n", + " description=\"This is a contract that outlines a co-branding deal/agreement\",\n", + " ),\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialize the ClassifyClient and Run a Classification Job" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "classifier = ClassifyClient(\n", + " client=AsyncLlamaCloud(\n", + " base_url=\"https://api.cloud.llamaindex.ai\",\n", + " token=os.environ[\"LLAMA_CLOUD_API_KEY\"],\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "result = await classifier.aclassify_file_path(\n", + " rules=rules,\n", + " file_input_path=\"/content/CybergyHoldingsInc_Affliate Agreement.pdf\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Classification Result: affiliate_agreements\n", + "Classification Reason: The document is titled 'Marketing Affiliate Agreement' and repeatedly refers to one party as the 'Marketing Affiliate' (MA). The agreement grants the MA the right to market, advertise, and sell the company's technology, and outlines the duties and responsibilities of the affiliate in marketing, licensing, and supporting the technology. There is no mention of joint branding, shared trademarks, or collaborative marketing under both parties' brands, which would be characteristic of a co-branding agreement. The content is entirely consistent with an affiliate agreement, where one party markets and sells products or services on behalf of another, rather than a co-branding arrangement.\n" + ] + } + ], + "source": [ + "classification = result.items[0].result\n", + "print(\"Classification Result: \" + classification.type)\n", + "print(\"Classification Reason: \" + classification.reasoning)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up LlamaExtract\n", + "\n", + "Next, we define some schemas which we can use to extract relevant information from our contracts with. The fields we define are a mix of summarization and structured data extraction.\n", + "\n", + "Here we define two Pydantic models: `Location` captures structured address information with optional fields for country, state, and address, while `Party` represents contract parties with a required name and optional location details. The Field descriptions help guide the extraction process by telling the LLM exactly what information to look for in each field." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class Location(BaseModel):\n", + " \"\"\"Location information with structured address components.\"\"\"\n", + "\n", + " country: Optional[str] = Field(None, description=\"Country\")\n", + " state: Optional[str] = Field(None, description=\"State or province\")\n", + " address: Optional[str] = Field(None, description=\"Street address or city\")\n", + "\n", + "\n", + "class Party(BaseModel):\n", + " \"\"\"Party information with name and location.\"\"\"\n", + "\n", + " name: str = Field(description=\"Party name\")\n", + " location: Optional[Location] = Field(None, description=\"Party location details\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Remember we have multiple contract types, so we need to define specific extraction schemas for each type and create a mapping system to dynamically select the appropriate schema based on our classification result.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class BaseContract(BaseModel):\n", + " \"\"\"Base contract class with common fields.\"\"\"\n", + "\n", + " parties: Optional[List[Party]] = Field(None, description=\"All contracting parties\")\n", + " agreement_date: Optional[str] = Field(\n", + " None, description=\"Contract signing date. Use YYYY-MM-DD\"\n", + " )\n", + " effective_date: Optional[str] = Field(\n", + " None, description=\"When contract becomes effective. Use YYYY-MM-DD\"\n", + " )\n", + " expiration_date: Optional[str] = Field(\n", + " None, description=\"Contract expiration date. Use YYYY-MM-DD\"\n", + " )\n", + " governing_law: Optional[str] = Field(None, description=\"Governing jurisdiction\")\n", + " termination_for_convenience: Optional[bool] = Field(\n", + " None, description=\"Can terminate without cause\"\n", + " )\n", + " anti_assignment: Optional[bool] = Field(\n", + " None, description=\"Restricts assignment to third parties\"\n", + " )\n", + " cap_on_liability: Optional[str] = Field(None, description=\"Liability limit amount\")\n", + "\n", + "\n", + "class AffiliateAgreement(BaseContract):\n", + " \"\"\"Affiliate Agreement extraction.\"\"\"\n", + "\n", + " exclusivity: Optional[str] = Field(\n", + " None, description=\"Exclusive territory or market rights\"\n", + " )\n", + " non_compete: Optional[str] = Field(None, description=\"Non-compete restrictions\")\n", + " revenue_profit_sharing: Optional[str] = Field(\n", + " None, description=\"Commission or revenue split\"\n", + " )\n", + " minimum_commitment: Optional[str] = Field(None, description=\"Minimum sales targets\")\n", + "\n", + "\n", + "class CoBrandingAgreement(BaseContract):\n", + " \"\"\"Co-Branding Agreement extraction.\"\"\"\n", + "\n", + " exclusivity: Optional[str] = Field(None, description=\"Exclusive co-branding rights\")\n", + " ip_ownership_assignment: Optional[str] = Field(\n", + " None, description=\"IP ownership allocation\"\n", + " )\n", + " license_grant: Optional[str] = Field(None, description=\"Brand/trademark licenses\")\n", + " revenue_profit_sharing: Optional[str] = Field(\n", + " None, description=\"Revenue sharing terms\"\n", + " )\n", + "\n", + "\n", + "mapping = {\n", + " \"affiliate_agreements\": AffiliateAgreement,\n", + " \"co_branding\": CoBrandingAgreement,\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + 
"text": [ + "Uploading files: 100%|██████████| 1/1 [00:00<00:00, 1.81it/s]\n", + "Creating extraction jobs: 100%|██████████| 1/1 [00:00<00:00, 1.43it/s]\n", + "Extracting files: 100%|██████████| 1/1 [00:12<00:00, 12.75s/it]\n" + ] + }, + { + "data": { + "text/plain": [ + "{'parties': [{'name': 'Birch First Global Investments Inc.',\n", + " 'location': {'country': 'U.S. Virgin Islands',\n", + " 'state': None,\n", + " 'address': '9100 Havensight, Port of Sale, Ste. 15/16, St. Thomas, VI 0080'}},\n", + " {'name': 'Mount Knowledge Holdings Inc.',\n", + " 'location': {'country': 'United States',\n", + " 'state': 'Nevada',\n", + " 'address': '228 Park Avenue S. #56101 New York, NY 100031502'}}],\n", + " 'agreement_date': '2014-05-08',\n", + " 'effective_date': '2014-05-08',\n", + " 'expiration_date': None,\n", + " 'governing_law': 'State of Nevada',\n", + " 'termination_for_convenience': True,\n", + " 'anti_assignment': True,\n", + " 'cap_on_liability': \"Company's liability shall not exceed the fees that MA has paid under this Agreement.\",\n", + " 'exclusivity': None,\n", + " 'non_compete': None,\n", + " 'revenue_profit_sharing': 'MA receives a purchase discount based on annual purchase level: 15% for $0-$100,000, 20% for $100,001-$1,000,000, 25% for $1,000,001 and above. MA pays a quarterly service fee of $15,000 if no sales occur in a quarter.',\n", + " 'minimum_commitment': 'MA commits to purchase a minimum of 100 Units in aggregate within the Territory within the first six months of the term of this Agreement.'}" + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "extractor = LlamaExtract()\n", + "\n", + "agent = extractor.create_agent(\n", + " name=f\"extraction_workflow_import_{uuid.uuid4()}\",\n", + " data_schema=mapping[classification.type],\n", + " config=ExtractConfig(\n", + " extraction_mode=ExtractMode.BALANCED,\n", + " ),\n", + ")\n", + "\n", + "result = await agent.aextract(\n", + " files=SourceText(\n", + " text_content=\" \".join([el.text for el in results.pages]),\n", + " filename=pdf_path,\n", + " ),\n", + ")\n", + "\n", + "result.data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Import into Neo4j\n", + "\n", + "The final step is to take our extracted structured information and build a knowledge graph that represents the relationships between contract entities. 
We need to define a graph model that specifies how our contract data should be organized as nodes and relationships in Neo4j.\n", + "\n", + "\n", + "\n", + "Our graph model consists of three main node types:\n", + "- **Contract nodes** store the core agreement information including dates, terms, and legal clauses\n", + "- **Party nodes** represent the contracting entities with their names\n", + "- **Location nodes** capture geographic information with address components.\n", + "\n", + "Now we'll import our extracted contract data into Neo4j according to our defined graph model.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'_contains_updates': True, 'labels_added': 5, 'relationships_created': 4, 'nodes_created': 5, 'properties_set': 16}" + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import_query = \"\"\"\n", + "WITH $contract AS contract\n", + "MERGE (c:Contract {path: $path})\n", + "SET c += apoc.map.clean(contract, [\"parties\", \"agreement_date\", \"effective_date\", \"expiration_date\"], [])\n", + "// Cast to date\n", + "SET c.agreement_date = date(contract.agreement_date),\n", + " c.effective_date = date(contract.effective_date),\n", + " c.expiration_date = date(contract.expiration_date)\n", + "\n", + "// Create parties with their locations\n", + "WITH c, contract\n", + "UNWIND coalesce(contract.parties, []) AS party\n", + "MERGE (p:Party {name: party.name})\n", + "MERGE (c)-[:HAS_PARTY]->(p)\n", + "\n", + "// Create location nodes and link to parties\n", + "WITH p, party\n", + "WHERE party.location IS NOT NULL\n", + "MERGE (p)-[:HAS_LOCATION]->(l:Location)\n", + "SET l += party.location\n", + "\"\"\"\n", + "\n", + "response = await neo4j_driver.execute_query(\n", + " import_query, contract=result.data, path=pdf_path\n", + ")\n", + "response.summary.counters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bringing it All Together in a Workflow\n", + "\n", + "Finally, we can combine all of this logic into one single executable agentic workflow. Let's make it so that the workflow can run by accepting a single PDF, adding new entries to our Neo4j graph each time." 
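+ "\n", + "Since the import query `MERGE`s contracts on `path` and parties on `name`, you can optionally create uniqueness constraints up front so repeated imports stay idempotent and fast. Below is a minimal, optional sketch (assuming the async `neo4j_driver` from the setup cell and Neo4j 5.x constraint syntax):\n", + "\n", + "```python\n", + "# Optional: uniqueness constraints for the MERGE keys used by the import query.\n", + "# Assumes the async `neo4j_driver` created earlier and a Neo4j 5.x server.\n", + "constraint_queries = [\n", + "    \"CREATE CONSTRAINT contract_path IF NOT EXISTS FOR (c:Contract) REQUIRE c.path IS UNIQUE\",\n", + "    \"CREATE CONSTRAINT party_name IF NOT EXISTS FOR (p:Party) REQUIRE p.name IS UNIQUE\",\n", + "]\n", + "for query in constraint_queries:\n", + "    await neo4j_driver.execute_query(query)\n", + "```"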
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "affiliate_extraction_agent = extractor.create_agent(\n", + " name=\"Affiliate_Extraction\",\n", + " data_schema=AffiliateAgreement,\n", + " config=ExtractConfig(\n", + " extraction_mode=ExtractMode.BALANCED,\n", + " ),\n", + ")\n", + "cobranding_extraction_agent = extractor.create_agent(\n", + " name=\"CoBranding_Extraction\",\n", + " data_schema=CoBrandingAgreement,\n", + " config=ExtractConfig(\n", + " extraction_mode=ExtractMode.BALANCED,\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.core.workflow import (\n", + " Workflow,\n", + " Event,\n", + " Context,\n", + " StartEvent,\n", + " StopEvent,\n", + " step,\n", + ")\n", + "\n", + "\n", + "class ClassifyDocEvent(Event):\n", + " parsed_doc: str\n", + " pdf_path: str\n", + "\n", + "\n", + "class ExtractAffiliate(Event):\n", + " file_path: str\n", + "\n", + "\n", + "class ExtractCoBranding(Event):\n", + " file_path: str\n", + "\n", + "\n", + "class BuildGraph(Event):\n", + " file_path: str\n", + " data: dict\n", + "\n", + "\n", + "class KnowledgeGraphBuilder(Workflow):\n", + " def __init__(\n", + " self,\n", + " classifier: ClassifyClient,\n", + " rules: List[ClassifierRule],\n", + " affiliate_extract_agent: LlamaExtract,\n", + " cobranding_extract_agent: LlamaExtract,\n", + " **kwargs,\n", + " ):\n", + " super().__init__(**kwargs)\n", + " self.classifier = classifier\n", + " self.rules = rules\n", + " self.affiliate_extract_agent = affiliate_extract_agent\n", + " self.cobranding_extract_agent = cobranding_extract_agent\n", + "\n", + " @step\n", + " async def classify_contract(\n", + " self, ctx: Context, ev: StartEvent\n", + " ) -> ExtractAffiliate | ExtractCoBranding | StopEvent:\n", + " result = await self.classifier.aclassify_file_path(\n", + " rules=self.rules, file_input_path=ev.pdf_path\n", + " )\n", + " contract_type = result.items[0].result.type\n", + " print(contract_type)\n", + " if contract_type == \"affiliate_agreements\":\n", + " return ExtractAffiliate(file_path=ev.pdf_path)\n", + " elif contract_type == \"co_branding\":\n", + " return ExtractCoBranding(file_path=ev.pdf_path)\n", + " else:\n", + " return StopEvent()\n", + "\n", + " @step\n", + " async def extract_affiliate(self, ctx: Context, ev: ExtractAffiliate) -> BuildGraph:\n", + " result = await self.affiliate_extract_agent.aextract(ev.file_path)\n", + " return BuildGraph(data=result.data, file_path=ev.file_path)\n", + "\n", + " @step\n", + " async def extract_co_branding(\n", + " self, ctx: Context, ev: ExtractCoBranding\n", + " ) -> BuildGraph:\n", + " result = await self.cobranding_extract_agent.aextract(ev.file_path)\n", + " return BuildGraph(data=result.data, file_path=ev.file_path)\n", + "\n", + " @step\n", + " async def build_graph(self, ctx: Context, ev: BuildGraph) -> StopEvent:\n", + " import_query = \"\"\"\n", + " WITH $contract AS contract\n", + " MERGE (c:Contract {path: $path})\n", + " SET c += apoc.map.clean(contract, [\"parties\", \"agreement_date\", \"effective_date\", \"expiration_date\"], [])\n", + " // Cast to date\n", + " SET c.agreement_date = date(contract.agreement_date),\n", + " c.effective_date = date(contract.effective_date),\n", + " c.expiration_date = date(contract.expiration_date)\n", + "\n", + " // Create parties with their locations\n", + " WITH c, contract\n", + " UNWIND coalesce(contract.parties, []) AS party\n", + " 
MERGE (p:Party {name: party.name})\n", + " MERGE (c)-[:HAS_PARTY]->(p)\n", + "\n", + " // Create location nodes and link to parties\n", + " WITH p, party\n", + " WHERE party.location IS NOT NULL\n", + " MERGE (p)-[:HAS_LOCATION]->(l:Location)\n", + " SET l += party.location\n", + " \"\"\"\n", + " response = await neo4j_driver.execute_query(\n", + " import_query, contract=ev.data, path=ev.file_path\n", + " )\n", + " return StopEvent(response.summary.counters)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "knowledge_graph_builder = KnowledgeGraphBuilder(\n", + " classifier=classifier,\n", + " rules=rules,\n", + " affiliate_extract_agent=affiliate_extraction_agent,\n", + " cobranding_extract_agent=cobranding_extraction_agent,\n", + " timeout=None,\n", + " verbose=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Running step classify_contract\n", + "affiliate_agreements\n", + "Step classify_contract produced event ExtractAffiliate\n", + "Running step extract_affiliate\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Uploading files: 100%|██████████| 1/1 [00:00<00:00, 2.84it/s]\n", + "Creating extraction jobs: 100%|██████████| 1/1 [00:00<00:00, 1.68it/s]\n", + "Extracting files: 100%|██████████| 1/1 [00:19<00:00, 19.04s/it]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Step extract_affiliate produced event StopEvent\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "response = await knowledge_graph_builder.run(\n", + " pdf_path=\"CybergyHoldingsInc_Affliate Agreement.pdf\"\n", + ")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/examples/document_processing.ipynb b/examples/document_processing/iterative_document_processing.ipynb similarity index 100% rename from examples/document_processing.ipynb rename to examples/document_processing/iterative_document_processing.ipynb diff --git a/examples/document_processing/parse_classify_extract_workflow.ipynb b/examples/document_processing/parse_classify_extract_workflow.ipynb new file mode 100644 index 0000000..c4fae0e --- /dev/null +++ b/examples/document_processing/parse_classify_extract_workflow.ipynb @@ -0,0 +1,1127 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Document Classification + Extraction Workflow with LlamaCloud + LlamaIndex Workflows\n", + "\n", + "\"Open\n", + "\n", + "This notebook shows a multi-step agentic document workflow that uses the **parsing**, **classification** and **extraction** modules in LlamaCloud, orchestrated through **LlamaIndex Workflows**. The workflow can take in a complex input document, parse it into clean markdown, classify it according to its subtype, and extract data according to a specified schema for that subtype. 
This allows you to automate document extraction of various types within the same workflow instead of having to manually separate the data beforehand. \n", + "\n", + "This notebook uses the following modules:\n", + "1. **Parse (LlamaParse)** - Extract and convert documents to markdown\n", + "2. **Classify** - Categorize documents based on their content\n", + "3. **Extract (LlamaExtract)** - Extract structured data using the markdown as input via SourceText\n", + "4. **LlamaIndex Workflows** - Event-driven orchestration of the parse, classify and extract steps\n", + "\n", + "The workflow is implemented as a proper LlamaIndex Workflow with separate steps for parsing, classification, and extraction, connected by typed events. This provides modularity, observability, and type safety." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup and Installation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Install required packages\n", + "%pip install llama-cloud-services\n", + "%pip install python-dotenv" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ API key configured\n" + ] + } + ], + "source": [ + "import os\n", + "import nest_asyncio\n", + "from dotenv import load_dotenv\n", + "\n", + "# Load environment variables\n", + "load_dotenv()\n", + "nest_asyncio.apply()\n", + "\n", + "# Set up API key\n", + "# os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"\" # edit it\n", + "\n", + "# Setup Base URL\n", + "# os.environ[\"LLAMA_CLOUD_BASE_URL\"] = \"https://api.cloud.eu.llamaindex.ai/\" # update if necessary\n", + "\n", + "print(\"✅ API key configured\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Download Sample Documents\n", + "\n", + "Let's download some sample documents to work with:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Downloading financial_report.pdf...\n", + "✅ Downloaded financial_report.pdf\n", + "📁 technical_spec.pdf already exists\n", + "\n", + "📂 Sample documents ready!\n" + ] + } + ], + "source": [ + "import requests\n", + "\n", + "# Create directory for sample documents\n", + "os.makedirs(\"sample_docs\", exist_ok=True)\n", + "\n", + "# Download sample documents\n", + "docs_to_download = {\n", + " \"financial_report.pdf\": \"https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf\",\n", + " \"technical_spec.pdf\": \"https://www.ti.com/lit/ds/symlink/lm317.pdf\",\n", + "}\n", + "\n", + "for filename, url in docs_to_download.items():\n", + " filepath = f\"sample_docs/{filename}\"\n", + " if not os.path.exists(filepath):\n", + " print(f\"Downloading {filename}...\")\n", + " response = requests.get(url)\n", + " if response.status_code == 200:\n", + " with open(filepath, \"wb\") as f:\n", + " f.write(response.content)\n", + " print(f\"✅ Downloaded {filename}\")\n", + " else:\n", + " print(f\"❌ Failed to download {filename}\")\n", + " else:\n", + " print(f\"📁 {filename} already exists\")\n", + "\n", + "print(\"\\n📂 Sample documents ready!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Phase 1: Document Parsing\n", + "\n", + "First, let's parse our documents using LlamaParse to extract clean markdown content."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🔄 Parsing documents...\n", + "Started parsing the file under job_id 530c187a-bd2d-4eea-b38d-9e5738eab465\n", + ".✅ Parsed financial report (Job ID: 530c187a-bd2d-4eea-b38d-9e5738eab465)\n", + "Started parsing the file under job_id a6e27710-776b-4445-8b94-8d75959ff5db\n", + "✅ Parsed technical spec (Job ID: a6e27710-776b-4445-8b94-8d75959ff5db)\n", + "\n", + "📄 Parsing complete!\n" + ] + } + ], + "source": [ + "from llama_cloud_services.parse.base import LlamaParse\n", + "from llama_cloud_services.parse.utils import ResultType\n", + "\n", + "# Initialize the parser\n", + "parser = LlamaParse(\n", + " result_type=ResultType.MD, # Get markdown output\n", + " verbose=True,\n", + " language=\"en\",\n", + " # Premium mode for better accuracy\n", + " premium_mode=True,\n", + " # Extract tables as HTML for better structure\n", + " output_tables_as_HTML=True,\n", + ")\n", + "\n", + "print(\"🔄 Parsing documents...\")\n", + "\n", + "# Parse the financial report\n", + "financial_result = await parser.aparse(\"sample_docs/financial_report.pdf\")\n", + "print(f\"✅ Parsed financial report (Job ID: {financial_result.job_id})\")\n", + "\n", + "# Parse the technical specification\n", + "technical_result = await parser.aparse(\"sample_docs/technical_spec.pdf\")\n", + "print(f\"✅ Parsed technical spec (Job ID: {technical_result.job_id})\")\n", + "\n", + "print(\"\\n📄 Parsing complete!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Extract Markdown Content\n", + "\n", + "Now let's get the markdown content from our parsed documents:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "📋 Financial Report Markdown (first 500 chars):\n", + "\n", + "\n", + "# UNITED STATES\n", + "# SECURITIES AND EXCHANGE COMMISSION\n", + "Washington, D.C. 
20549\n", + "\n", + "## FORM 10-K\n", + "\n", + "(Mark One)\n", + "\n", + "☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", + "For the fiscal year ended December 31, 2021\n", + "OR\n", + "☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", + "For the transition period from_____ to _____\n", + "Commission File Number: 001-38902\n", + "\n", + "# UBER TECHNOLOGIES, INC.\n", + "(Exact name of registrant as specified in its charter)\n", + "\n", + "Delaware\n", + "...\n", + "\n", + "📋 Technical Spec Markdown (first 500 chars):\n", + "\n", + "\n", + "LM317\n", + "SLVS044Z – SEPTEMBER 1997 – REVISED APRIL 2025\n", + "\n", + "# LM317 3-Pin Adjustable Regulator\n", + "\n", + "## 1 Features\n", + "\n", + "- Output voltage range:\n", + " – Adjustable: 1.25V to 37V\n", + "- Output current: 1.5A\n", + "- Line regulation: 0.01%/V (typ)\n", + "- Load regulation: 0.1% (typ)\n", + "- Internal short-circuit current limiting\n", + "- Thermal overload protection\n", + "- Output safe-area compensation (new chip)\n", + "- PSRR: 80dB at 120Hz for CADJ = 10μF (new chip)\n", + "- Packages:\n", + " – 4-pin, SOT-223 (DCY)\n", + " – 3-pin, TO-263 (KTT)\n", + " – 3-pin, TO-220 (KCS, KCT),\n", + "...\n", + "\n", + "📏 Financial report markdown length: 1338499 characters\n", + "📏 Technical spec markdown length: 92483 characters\n" + ] + } + ], + "source": [ + "# Get markdown content from parsed documents\n", + "financial_markdown = await financial_result.aget_markdown()\n", + "technical_markdown = await technical_result.aget_markdown()\n", + "\n", + "print(\"📋 Financial Report Markdown (first 500 chars):\")\n", + "print(financial_markdown[:500])\n", + "print(\"...\\n\")\n", + "\n", + "print(\"📋 Technical Spec Markdown (first 500 chars):\")\n", + "print(technical_markdown[:500])\n", + "print(\"...\\n\")\n", + "\n", + "print(f\"📏 Financial report markdown length: {len(financial_markdown)} characters\")\n", + "print(f\"📏 Technical spec markdown length: {len(technical_markdown)} characters\")\n", + "\n", + "document_texts = [financial_markdown, technical_markdown]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Phase 2: Document Classification\n", + "\n", + "Next, let's classify our documents based on their content using the ClassifyClient." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏷️ Setting up document classification...\n", + "📝 Created 3 classification rules\n" + ] + } + ], + "source": [ + "from llama_cloud_services.beta.classifier.client import ClassifyClient\n", + "from llama_cloud.types import ClassifierRule\n", + "\n", + "# Initialize the classify client\n", + "api_key = os.environ[\"LLAMA_CLOUD_API_KEY\"]\n", + "classify_client = ClassifyClient.from_api_key(api_key)\n", + "\n", + "print(\"🏷️ Setting up document classification...\")\n", + "\n", + "# Define classification rules\n", + "classification_rules = [\n", + " ClassifierRule(\n", + " type=\"financial_document\",\n", + " description=\"Documents containing financial data, revenue, expenses, SEC filings, or financial statements\",\n", + " ),\n", + " ClassifierRule(\n", + " type=\"technical_specification\",\n", + " description=\"Technical datasheets, component specifications, engineering documents, or technical manuals\",\n", + " ),\n", + " ClassifierRule(\n", + " type=\"general_document\",\n", + " description=\"General business documents, contracts, or other unspecified document types\",\n", + " ),\n", + "]\n", + "\n", + "print(f\"📝 Created {len(classification_rules)} classification rules\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Try Classification Independently\n", + "\n", + "Let's test the classification on one of our parsed documents to see how it works:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🔍 Classifying financial document...\n", + " Document length: 1,338,499 characters\n", + "\n", + "✅ Classification Result:\n", + " Type: financial_document\n", + " Confidence: 100.00%\n", + " Reasoning: This document is a Form 10-K, which is an annual report required by the U.S. Securities and Exchange Commission (SEC) for publicly traded companies. 
It contains financial data, information about the c...\n", + "\n", + "======================================================================\n" + ] + } + ], + "source": [ + "import tempfile\n", + "from pathlib import Path\n", + "\n", + "# Let's classify the financial document\n", + "print(\"🔍 Classifying financial document...\")\n", + "print(f\" Document length: {len(financial_markdown):,} characters\\n\")\n", + "\n", + "# Write to temp file for classification\n", + "with tempfile.NamedTemporaryFile(\n", + " mode=\"w\", suffix=\".md\", delete=False, encoding=\"utf-8\"\n", + ") as tmp:\n", + " tmp.write(financial_markdown)\n", + " temp_financial_path = Path(tmp.name)\n", + "\n", + "# Classify the document\n", + "financial_classification = await classify_client.aclassify_file_path(\n", + " rules=classification_rules, file_input_path=str(temp_financial_path)\n", + ")\n", + "\n", + "doc_type = financial_classification.items[0].result.type\n", + "confidence = financial_classification.items[0].result.confidence\n", + "reasoning = financial_classification.items[0].result.reasoning\n", + "\n", + "print(\"✅ Classification Result:\")\n", + "print(f\" Type: {doc_type}\")\n", + "print(f\" Confidence: {confidence:.2%}\")\n", + "print(\n", + " f\" Reasoning: {reasoning[:200]}...\"\n", + " if reasoning and len(reasoning) > 200\n", + " else f\" Reasoning: {reasoning}\"\n", + ")\n", + "\n", + "print(\"\\n\" + \"=\" * 70)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Phase 3: Structured Data Extraction using SourceText\n", + "\n", + "Now comes the key part - using the markdown content as input for structured data extraction via SourceText." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "⚙️ LlamaExtract initialized\n" + ] + } + ], + "source": [ + "from llama_cloud_services.extract.extract import LlamaExtract, SourceText\n", + "from pydantic import BaseModel, Field\n", + "from typing import List, Optional\n", + "\n", + "# Initialize LlamaExtract\n", + "llama_extract = LlamaExtract(api_key=api_key, verbose=True)\n", + "\n", + "print(\"⚙️ LlamaExtract initialized\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define Extraction Schemas\n", + "\n", + "Let's define different schemas for different document types:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "📋 Extraction schemas defined\n" + ] + } + ], + "source": [ + "# Schema for financial documents\n", + "class FinancialMetrics(BaseModel):\n", + " company_name: str = Field(description=\"Name of the company\")\n", + " document_type: str = Field(\n", + " description=\"Type of financial document (10-K, 10-Q, annual report, etc.)\"\n", + " )\n", + " fiscal_year: int = Field(description=\"Fiscal year of the report\")\n", + " revenue_2021: str = Field(description=\"Total revenue in 2021\")\n", + " net_income_2021: str = Field(description=\"Net income in 2021\")\n", + " key_business_segments: List[str] = Field(\n", + " default=[], description=\"Main business segments or divisions\"\n", + " )\n", + " risk_factors: List[str] = Field(\n", + " default=[], description=\"Key risk factors mentioned\"\n", + " )\n", + "\n", + "\n", + "# Schema for technical specifications\n", + "class VoltageRange(BaseModel):\n", + " min_voltage: Optional[float] = Field(description=\"Minimum 
voltage\")\n", + " max_voltage: Optional[float] = Field(description=\"Maximum voltage\")\n", + " unit: str = Field(default=\"V\", description=\"Voltage unit\")\n", + "\n", + "\n", + "class TechnicalSpec(BaseModel):\n", + " component_name: str = Field(description=\"Name of the technical component\")\n", + " manufacturer: Optional[str] = Field(description=\"Manufacturer name\")\n", + " part_number: Optional[str] = Field(description=\"Part or model number\")\n", + " description: str = Field(description=\"Brief description of the component\")\n", + " operating_voltage: Optional[VoltageRange] = Field(\n", + " description=\"Operating voltage range\"\n", + " )\n", + " maximum_current: Optional[float] = Field(\n", + " description=\"Maximum current rating in amperes\"\n", + " )\n", + " key_features: List[str] = Field(\n", + " default=[], description=\"Key features and capabilities\"\n", + " )\n", + " applications: List[str] = Field(default=[], description=\"Typical applications\")\n", + "\n", + "\n", + "print(\"📋 Extraction schemas defined\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Building the Complete Workflow\n", + "\n", + "Now that we've seen how parsing works, let's build a complete 3-step workflow (Parse → Classify → Extract) using LlamaIndex Workflows. We'll define the workflow structure here, and you can see it in action below where we also demonstrate the classification and extraction modules independently.\n", + "\n", + "### Install Workflows Package\n", + "\n", + "First, let's install the LlamaIndex workflows package:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install llama-index-workflows llama-index-utils-workflow" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define the Workflow\n", + "\n", + "Let's restructure the document processing into a proper LlamaIndex Workflow with separate classification and extraction steps:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🔧 Workflow defined!\n" + ] + } + ], + "source": [ + "import tempfile\n", + "from pathlib import Path\n", + "from llama_cloud import ExtractConfig\n", + "from workflows import Workflow, step, Context\n", + "from workflows.events import Event, StartEvent, StopEvent\n", + "\n", + "\n", + "# Define workflow events\n", + "class ParseEvent(Event):\n", + " \"\"\"Event emitted after parsing\"\"\"\n", + "\n", + " file_path: str\n", + " markdown_content: str\n", + " job_id: str\n", + "\n", + "\n", + "class ClassifyEvent(Event):\n", + " \"\"\"Event emitted after classification\"\"\"\n", + "\n", + " markdown_content: str\n", + " temp_path: str\n", + " doc_type: str\n", + " confidence: float\n", + "\n", + "\n", + "class ExtractEvent(Event):\n", + " \"\"\"Event emitted after extraction\"\"\"\n", + "\n", + " doc_type: str\n", + " confidence: float\n", + " extracted_data: dict\n", + " markdown_length: int\n", + " temp_path: str\n", + " markdown_sample: str\n", + "\n", + "\n", + "class DocumentWorkflow(Workflow):\n", + " \"\"\"\n", + " Complete document processing workflow: Parse → Classify → Extract\n", + " \"\"\"\n", + "\n", + " def __init__(\n", + " self,\n", + " parser,\n", + " classify_client,\n", + " classification_rules,\n", + " llama_extract,\n", + " financial_schema,\n", + " technical_schema,\n", + " **kwargs,\n", + " ):\n", + " super().__init__(**kwargs)\n", + " 
self.parser = parser\n", + " self.classify_client = classify_client\n", + " self.classification_rules = classification_rules\n", + " self.llama_extract = llama_extract\n", + " self.financial_schema = financial_schema\n", + " self.technical_schema = technical_schema\n", + "\n", + " @step\n", + " async def parse_document(self, ctx: Context, ev: StartEvent) -> ParseEvent:\n", + " \"\"\"\n", + " Step 1: Parse the document to extract markdown\n", + " \"\"\"\n", + " file_path = ev.file_path\n", + " print(f\"📄 Step 1: Parsing document: {file_path}...\")\n", + "\n", + " # Parse the document\n", + " parse_result = await self.parser.aparse(file_path)\n", + " markdown_content = await parse_result.aget_markdown()\n", + " job_id = parse_result.job_id\n", + "\n", + " print(f\" ✅ Parsed successfully (Job ID: {job_id})\")\n", + " print(f\" 📝 Extracted {len(markdown_content):,} characters\")\n", + "\n", + " # Write event to stream for monitoring\n", + " parse_event = ParseEvent(\n", + " file_path=file_path,\n", + " markdown_content=markdown_content,\n", + " job_id=job_id,\n", + " )\n", + " ctx.write_event_to_stream(parse_event)\n", + "\n", + " return parse_event\n", + "\n", + " @step\n", + " async def classify_document(self, ctx: Context, ev: ParseEvent) -> ClassifyEvent:\n", + " \"\"\"\n", + " Step 2: Classify the document based on its content\n", + " \"\"\"\n", + " markdown_content = ev.markdown_content\n", + " print(\"🏷️ Step 2: Classifying document...\")\n", + "\n", + " # Write markdown to temp file for classification\n", + " with tempfile.NamedTemporaryFile(\n", + " mode=\"w\", suffix=\".md\", delete=False, encoding=\"utf-8\"\n", + " ) as tmp:\n", + " tmp.write(markdown_content)\n", + " temp_path = Path(tmp.name)\n", + "\n", + " # Classify the document\n", + " classification = await self.classify_client.aclassify_file_path(\n", + " rules=self.classification_rules, file_input_path=str(temp_path)\n", + " )\n", + " doc_type = classification.items[0].result.type\n", + " confidence = classification.items[0].result.confidence\n", + "\n", + " print(f\" ✅ Classified as: {doc_type} (confidence: {confidence:.2f})\")\n", + "\n", + " # Write event to stream for monitoring\n", + " classify_event = ClassifyEvent(\n", + " markdown_content=markdown_content,\n", + " temp_path=str(temp_path),\n", + " doc_type=doc_type,\n", + " confidence=confidence,\n", + " )\n", + " ctx.write_event_to_stream(classify_event)\n", + "\n", + " return classify_event\n", + "\n", + " @step\n", + " async def extract_data(self, ctx: Context, ev: ClassifyEvent) -> ExtractEvent:\n", + " \"\"\"\n", + " Step 3: Extract structured data based on classification\n", + " \"\"\"\n", + " print(\"🔍 Step 3: Extracting structured data using SourceText...\")\n", + "\n", + " # Choose schema based on classification\n", + " if \"financial\" in ev.doc_type.lower():\n", + " schema = self.financial_schema\n", + " print(\" 📊 Using FinancialMetrics schema\")\n", + " elif \"technical\" in ev.doc_type.lower():\n", + " schema = self.technical_schema\n", + " print(\" 🔧 Using TechnicalSpec schema\")\n", + " else:\n", + " schema = self.financial_schema # Default fallback\n", + " print(\" 📊 Using default FinancialMetrics schema\")\n", + "\n", + " # Create SourceText from markdown content\n", + " source_text = SourceText(\n", + " text_content=ev.markdown_content,\n", + " filename=f\"{os.path.basename(ev.temp_path)}_markdown.md\",\n", + " )\n", + "\n", + " # Configure extraction\n", + " extract_config = ExtractConfig(\n", + " extraction_mode=\"BALANCED\",\n", + " )\n", + 
"\n", + " # Perform extraction\n", + " extraction_result = self.llama_extract.extract(\n", + " data_schema=schema, config=extract_config, files=source_text\n", + " )\n", + "\n", + " print(\" ✅ Extraction complete!\")\n", + "\n", + " # Create markdown sample\n", + " markdown_sample = (\n", + " ev.markdown_content[:200] + \"...\"\n", + " if len(ev.markdown_content) > 200\n", + " else ev.markdown_content\n", + " )\n", + "\n", + " extract_event = ExtractEvent(\n", + " doc_type=ev.doc_type,\n", + " confidence=ev.confidence,\n", + " extracted_data=extraction_result.data,\n", + " markdown_length=len(ev.markdown_content),\n", + " temp_path=ev.temp_path,\n", + " markdown_sample=markdown_sample,\n", + " )\n", + " ctx.write_event_to_stream(extract_event)\n", + "\n", + " return extract_event\n", + "\n", + " @step\n", + " async def finalize_results(self, ctx: Context, ev: ExtractEvent) -> StopEvent:\n", + " \"\"\"\n", + " Step 4: Finalize and return results\n", + " \"\"\"\n", + " result = {\n", + " \"file_path\": ev.temp_path,\n", + " \"markdown_length\": ev.markdown_length,\n", + " \"classification\": ev.doc_type,\n", + " \"confidence\": ev.confidence,\n", + " \"extracted_data\": ev.extracted_data,\n", + " \"markdown_sample\": ev.markdown_sample,\n", + " }\n", + "\n", + " return StopEvent(result=result)\n", + "\n", + "\n", + "print(\"🔧 Workflow defined!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Workflow Structure\n", + "\n", + "The workflow consists of four steps connected by typed events:\n", + "\n", + "```\n", + "┌─────────────┐\n", + "│ StartEvent │ (file_path)\n", + "└──────┬──────┘\n", + " │\n", + " ▼\n", + "┌──────────────────┐\n", + "│ parse_document │ Step 1: Parse PDF to markdown\n", + "└──────┬───────────┘\n", + " │\n", + " ▼\n", + "┌─────────────┐\n", + "│ ParseEvent │ (markdown_content, job_id)\n", + "└──────┬──────┘\n", + " │\n", + " ▼\n", + "┌─────────────────────┐\n", + "│ classify_document │ Step 2: Classification\n", + "└──────┬──────────────┘\n", + " │\n", + " ▼\n", + "┌──────────────┐\n", + "│ ClassifyEvent│ (doc_type, confidence, markdown_content)\n", + "└──────┬───────┘\n", + " │\n", + " ▼\n", + "┌──────────────┐\n", + "│ extract_data │ Step 3: Extraction with schema selection\n", + "└──────┬───────┘\n", + " │\n", + " ▼\n", + "┌──────────────┐\n", + "│ ExtractEvent │ (extracted_data, doc_type, confidence)\n", + "└──────┬───────┘\n", + " │\n", + " ▼\n", + "┌──────────────────┐\n", + "│ finalize_results │ Step 4: Format and return results\n", + "└──────┬───────────┘\n", + " │\n", + " ▼\n", + "┌─────────────┐\n", + "│ StopEvent │ (final result dictionary)\n", + "└─────────────┘\n", + "```\n", + "\n", + "**Key Features:**\n", + "- **Step 1 (parse_document)**: Takes a file path and parses the document into clean markdown\n", + "- **Step 2 (classify_document)**: Takes markdown content and classifies it into document types\n", + "- **Step 3 (extract_data)**: Selects appropriate schema based on classification and extracts structured data\n", + "- **Step 4 (finalize_results)**: Packages all results into final output format\n", + "- Events are written to the stream for real-time monitoring\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Visualize the Workflow\n", + "\n", + "Let's visualize the workflow structure to see the flow of events:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize the workflow\n", + "workflow = 
DocumentWorkflow(\n", + " parser=parser,\n", + " classify_client=classify_client,\n", + " classification_rules=classification_rules,\n", + " llama_extract=llama_extract,\n", + " financial_schema=FinancialMetrics,\n", + " technical_schema=TechnicalSpec,\n", + " timeout=300,\n", + " verbose=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "document_workflow.html\n" + ] + } + ], + "source": [ + "# Draw the workflow visualization\n", + "from llama_index.utils.workflow import draw_all_possible_flows\n", + "\n", + "draw_all_possible_flows(\n", + " workflow,\n", + " filename=\"document_workflow.html\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The workflow has been visualized and saved to `document_workflow.html`. You can open this file in a browser to see the interactive workflow diagram.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The workflow visualization shows:\n", + "1. **StartEvent** → **parse_document** step\n", + "2. **ParseEvent** → **classify_document** step\n", + "3. **ClassifyEvent** → **extract_data** step \n", + "4. **ExtractEvent** → **finalize_results** step\n", + "5. **StopEvent** (final output)\n", + "\n", + "Each step is connected by typed events, allowing for clean separation of concerns and easy monitoring of the workflow execution.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run the Workflow on Both Documents\n", + "\n", + "Now let's run the workflow on both documents and monitor the events:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "🚀 Processing Document 1: sample_docs/financial_report.pdf\n", + "======================================================================\n", + "\n", + "Running step parse_document\n", + "📄 Step 1: Parsing document: sample_docs/financial_report.pdf...\n", + "Started parsing the file under job_id bb53c6bf-79cc-4f63-9c97-16983d59f29d\n", + ". ✅ Parsed successfully (Job ID: bb53c6bf-79cc-4f63-9c97-16983d59f29d)\n", + " 📝 Extracted 1,338,499 characters\n", + "Step parse_document produced event ParseEvent\n", + "📄 Parse Event: Extracted 1,338,499 characters\n", + "Running step classify_document\n", + "🏷️ Step 2: Classifying document...\n", + " ✅ Classified as: financial_document (confidence: 1.00)\n", + "Step classify_document produced event ClassifyEvent\n", + "📊 Classification Event: financial_document (1.00)\n", + "Running step extract_data\n", + "🔍 Step 3: Extracting structured data using SourceText...\n", + " 📊 Using FinancialMetrics schema\n", + ".. 
✅ Extraction complete!\n", + "Step extract_data produced event ExtractEvent\n", + "Running step finalize_results\n", + "Step finalize_results produced event StopEvent\n", + "✅ Extraction Event: 7 fields extracted\n", + "\n", + "✅ Document 1 processed successfully!\n", + "\n", + "======================================================================\n", + "🚀 Processing Document 2: sample_docs/technical_spec.pdf\n", + "======================================================================\n", + "\n", + "Running step parse_document\n", + "📄 Step 1: Parsing document: sample_docs/technical_spec.pdf...\n", + "Started parsing the file under job_id 944905c1-3c49-431a-ad86-4436d16f3d1c\n", + " ✅ Parsed successfully (Job ID: 944905c1-3c49-431a-ad86-4436d16f3d1c)\n", + " 📝 Extracted 92,483 characters\n", + "Step parse_document produced event ParseEvent\n", + "📄 Parse Event: Extracted 92,483 characters\n", + "Running step classify_document\n", + "🏷️ Step 2: Classifying document...\n", + " ✅ Classified as: technical_specification (confidence: 1.00)\n", + "Step classify_document produced event ClassifyEvent\n", + "📊 Classification Event: technical_specification (1.00)\n", + "Running step extract_data\n", + "🔍 Step 3: Extracting structured data using SourceText...\n", + " 🔧 Using TechnicalSpec schema\n", + " ✅ Extraction complete!\n", + "Step extract_data produced event ExtractEvent\n", + "Running step finalize_results\n", + "Step finalize_results produced event StopEvent\n", + "✅ Extraction Event: 8 fields extracted\n", + "\n", + "✅ Document 2 processed successfully!\n", + "\n", + "\n", + "📋 Processed 2 documents successfully!\n" + ] + } + ], + "source": [ + "# Process both documents through the workflow\n", + "results = []\n", + "\n", + "# Define the document files to process\n", + "document_files = [\n", + " \"sample_docs/financial_report.pdf\",\n", + " \"sample_docs/technical_spec.pdf\",\n", + "]\n", + "\n", + "for i, file_path in enumerate(document_files, 1):\n", + " print(f\"\\n{'=' * 70}\")\n", + " print(f\"🚀 Processing Document {i}: {file_path}\")\n", + " print(f\"{'=' * 70}\\n\")\n", + "\n", + " try:\n", + " # Run the workflow\n", + " handler = workflow.run(file_path=file_path)\n", + "\n", + " # Monitor events as they are emitted\n", + " async for event in handler.stream_events():\n", + " if isinstance(event, ParseEvent):\n", + " print(\n", + " f\"📄 Parse Event: Extracted {len(event.markdown_content):,} characters\"\n", + " )\n", + " elif isinstance(event, ClassifyEvent):\n", + " print(\n", + " f\"📊 Classification Event: {event.doc_type} ({event.confidence:.2f})\"\n", + " )\n", + " elif isinstance(event, ExtractEvent):\n", + " print(\n", + " f\"✅ Extraction Event: {len(event.extracted_data)} fields extracted\"\n", + " )\n", + "\n", + " # Get final result\n", + " result = await handler\n", + " results.append(result)\n", + "\n", + " print(f\"\\n✅ Document {i} processed successfully!\")\n", + "\n", + " except Exception as e:\n", + " print(f\"❌ Error processing document {i}: {str(e)}\")\n", + " import traceback\n", + "\n", + " traceback.print_exc()\n", + "\n", + "print(f\"\\n\\n📋 Processed {len(results)} documents successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Final Results Summary\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "📈 COMPLETE WORKFLOW RESULTS SUMMARY\n", + 
"======================================================================\n", + "\n", + "📄 Document 1: tmpuyxzpd3x.md\n", + " 📊 Classification: financial_document (confidence: 1.00)\n", + " 📝 Markdown length: 1,338,499 characters\n", + " 📋 Markdown sample: \n", + "\n", + "# UNITED STATES\n", + "# SECURITIES AND EXCHANGE COMMISSION\n", + "Washington, D.C. 20549\n", + "\n", + "## FORM 10-K\n", + "\n", + "(Mark O...\n", + " 🎯 Extracted fields: 7 fields\n", + " • company_name: Uber Technologies, Inc.\n", + " • document_type: Annual Report on Form 10-K\n", + " • fiscal_year: 2021\n", + " • revenue_2021: $17,455 and $21,764\n", + " • net_income_2021: $(496) to (700)\n", + " • key_business_segments: ['Borrower and the Restricted Subsidiaries', 'Holdings', 'Guarantors', 'Material Domestic Subsidiaries', 'Material Foreign Subsidiaries']\n", + " • risk_factors: ['Indemnification obligations of the borrower for losses, claims, damages, liabilities, and out-of-pocket expenses incurred by agents, lenders, arrangers, and related parties in connection with the agreement or loans, except in certain cases such as gross negligence, bad faith, willful misconduct, or material breach by the indemnitee.', \"Borrower not required to indemnify any indemnitee for settlements entered into without the borrower's consent.\", 'Limitation of liability for special, indirect, consequential, or punitive damages, and for damages from unauthorized use of information, except for direct damages resulting from gross negligence, bad faith, or willful misconduct.', 'Obligation of the borrower to indemnify the administrative agent for liabilities arising from performance of duties, except in cases of gross negligence, bad faith, or willful misconduct.', 'Limitations and conditions on assignments and participations of lender rights, including restrictions on assignments to disqualified institutions, loan parties, affiliates of loan parties, defaulting lenders, and natural persons.', 'Setoff rights for lenders and issuing banks after an event of default, allowing them to apply borrower deposits toward obligations under the agreement.', 'Potential for increased obligations under the agreement as a result of changes in law affecting payment terms.', 'Requirement for the borrower and guarantors to provide information to comply with anti-money laundering rules and the USA PATRIOT Act.']\n", + "\n", + "📄 Document 2: tmp7ower2xm.md\n", + " 📊 Classification: technical_specification (confidence: 1.00)\n", + " 📝 Markdown length: 92,483 characters\n", + " 📋 Markdown sample: \n", + "\n", + "LM317\n", + "SLVS044Z – SEPTEMBER 1997 – REVISED APRIL 2025\n", + "\n", + "# LM317 3-Pin Adjustable Regulator\n", + "\n", + "## 1 Fea...\n", + " 🎯 Extracted fields: 8 fields\n", + " • component_name: LM317\n", + " • manufacturer: Texas Instruments\n", + " • part_number: LM317, SLVS044Z\n", + " • description: The LM317 is an adjustable three-pin, positive-voltage regulator capable of supplying more than 1.5A (typically up to 1.5A) over an output voltage range of 1.25V to 37V. The device requires only two external resistors to set the output voltage. It features a typical line regulation of 0.01% and typical load regulation of 0.1%. The LM317 includes current limiting, thermal overload protection, and safe operating area protection. Overload protection remains functional even if the ADJUST pin is disconnected. 
The regulator is used in applications such as constant-current battery-charger circuits, slow turn-on 15V regulator circuits, AC voltage-regulator circuits, current-limited charger circuits, and high-current and adjustable regulator circuits. It is available in packages including SOT-223 (DCY), TO-220 (KCS), and TO-263 (KTT).\n", + " • operating_voltage: {'min_voltage': 1.25, 'max_voltage': 37.0, 'unit': 'V'}\n", + " • maximum_current: 4.0\n", + " • key_features: ['Adjustable output voltage range: 1.25V to 37V', 'Output current up to 1.5A (up to 4A with external pass elements)', 'Line regulation: typically 0.01%/V', 'Load regulation: typically 0.1%', 'Internal short-circuit current limiting / Current limiting', 'Thermal overload protection / Thermal shutdown', 'Output safe-area compensation / Safe operating area protection', 'PSRR: 80dB at 120Hz for CADJ = 10μF (new chip)', 'NPN Darlington output drive', 'Programmable feedback', 'Multiple package options (SOT-223, TO-220, TO-263)', 'Can be used in constant-current, battery-charging, and regulator applications']\n", + " • applications: ['Multifunction printers, AC drive power stage modules, Electricity meters, Servo drive control modules, Merchant network and server PSU, Adjustable voltage regulator, 0V to 30V regulator circuit, Regulator circuit with improved ripple rejection, Precision current-limiter, Tracking preregulator, 1.25V to 20V regulator, Battery charger circuit, Constant-current battery charger circuits, Slow turn-on regulator, AC voltage-regulator, Current-limited charger circuits, High-current adjustable regulator circuits, General-purpose adjustable power supply']\n", + "\n", + "✨ Workflow completed successfully!\n" + ] + } + ], + "source": [ + "print(\"📈 COMPLETE WORKFLOW RESULTS SUMMARY\")\n", + "print(\"=\" * 70)\n", + "\n", + "for i, result in enumerate(results, 1):\n", + " print(f\"\\n📄 Document {i}: {os.path.basename(result['file_path'])}\")\n", + " print(\n", + " f\" 📊 Classification: {result['classification']} (confidence: {result['confidence']:.2f})\"\n", + " )\n", + " print(f\" 📝 Markdown length: {result['markdown_length']:,} characters\")\n", + " print(f\" 📋 Markdown sample: {result['markdown_sample'][:100]}...\")\n", + " print(f\" 🎯 Extracted fields: {len(result['extracted_data'])} fields\")\n", + "\n", + " # Print all key–value pairs\n", + " extracted = result[\"extracted_data\"]\n", + " for key, value in extracted.items():\n", + " print(f\" • {key}: {value}\")\n", + "\n", + "print(\"\\n✨ Workflow completed successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "The notebook shows you how to build an e2e document **Classify → Extract** workflow using LlamaCloud. This uses some of our core building blocks around **classification** interleaved with **document extraction**.\n", + "\n", + "### Main Components:\n", + "\n", + "1. **LlamaParse** (`llama_cloud_services.parse.base.LlamaParse`):\n", + " - Converts documents to clean, structured markdown\n", + " - Preserves document structure and formatting\n", + " - Handles various file types (PDF, DOCX, etc.)\n", + "\n", + "2. **ClassifyClient** (`llama_cloud_services.beta.classifier.client.ClassifyClient`):\n", + " - Automatically categorizes documents based on content\n", + " - Uses customizable rules for classification\n", + " - Provides confidence scores for classifications\n", + "\n", + "3. 
**LlamaExtract with SourceText** (`llama_cloud_services.extract.extract.LlamaExtract`, `SourceText`):\n", + " - Extracts structured data using custom Pydantic schemas\n", + " - You can either feed in the file directly (in which case parsing will happen under the hood), or the parsed text through the **SourceText** object (which is the case in this example) \n", + "\n", + "**Benefits of an e2e workflow**: The main benefit of doing Classify -> Extract, instead of only Extract, is the fact that you can handle documents of different types/different expected schemas within the same workflow, without having to separate out the data before and running separate extractions on each data subset. " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "llama_parse", + "language": "python", + "name": "llama_parse" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/document_processing/solar_panel_e2e_comparison.ipynb b/examples/document_processing/solar_panel_e2e_comparison.ipynb new file mode 100644 index 0000000..4a92642 --- /dev/null +++ b/examples/document_processing/solar_panel_e2e_comparison.ipynb @@ -0,0 +1,449 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "00f6713b-2a32-4f8f-80e5-9a7d9b6e3b90", + "metadata": {}, + "source": [ + "# Solar Panel Datasheet Comparison Workflow\n", + "\n", + "\"Open\n", + "\n", + "\n", + "This notebook demonstrates an end‑to‑end agentic workflow using LlamaExtract and the LlamaIndex event‑driven workflow framework. In this workflow, we:\n", + "\n", + "1. **Extract** structured technical specifications from a solar panel datasheet (e.g. a PDF downloaded from a vendor).\n", + "2. **Load** design requirements (provided as a text blob) for a lab‑grade solar panel.\n", + "3. **Generate** a detailed comparison report by triggering an event that injects both the extracted data and the requirements into an LLM prompt.\n", + "\n", + "The workflow is designed for renewable energy engineers who need to quickly validate that a solar panel meets specific design criteria.\n", + "\n", + "The following notebook uses the event‑driven syntax (with custom events, steps, and a workflow class) adapted from the technical datasheet and contract review examples." + ] + }, + { + "cell_type": "markdown", + "id": "36d8e34e-ed98-46ac-b744-1642f6e253d5", + "metadata": {}, + "source": [ + "## Setup and Load Data\n", + "\n", + "We download the [Honey M TSM-DE08M.08(II) datasheet](https://static.trinasolar.com/sites/default/files/EU_Datasheet_HoneyM_DE08M.08%28II%29_2021_A.pdf) as a PDF.\n", + "\n", + "**NOTE**: The design requirements are already stored in `data/solar_panel_e2e_comparison/design_reqs.txt`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1de7b1b3-c285-492c-8b2e-b37974b4fc63", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2025-04-01 14:47:56-- https://static.trinasolar.com/sites/default/files/EU_Datasheet_HoneyM_DE08M.08%28II%29_2021_A.pdf\n", + "Resolving static.trinasolar.com (static.trinasolar.com)... 47.246.23.232, 47.246.23.234, 47.246.23.227, ...\n", + "Connecting to static.trinasolar.com (static.trinasolar.com)|47.246.23.232|:443... 
connected.\n", + "WARNING: cannot verify static.trinasolar.com's certificate, issued by ‘CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1,O=DigiCert Inc,C=US’:\n", + " Unable to locally verify the issuer's authority.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 1888183 (1.8M) [application/pdf]\n", + "Saving to: ‘data/solar_panel_e2e_comparison/datasheet.pdf’\n", + "\n", + "data/solar_panel_e2 100%[===================>] 1.80M 7.47MB/s in 0.2s \n", + "\n", + "2025-04-01 14:47:56 (7.47 MB/s) - ‘data/solar_panel_e2e_comparison/datasheet.pdf’ saved [1888183/1888183]\n", + "\n" + ] + } + ], + "source": [ + "!wget https://static.trinasolar.com/sites/default/files/EU_Datasheet_HoneyM_DE08M.08%28II%29_2021_A.pdf -O data/solar_panel_e2e_comparison/datasheet.pdf --no-check-certificate" + ] + }, + { + "cell_type": "markdown", + "id": "89d2f4c9-f785-424d-a409-3381796c457c", + "metadata": {}, + "source": [ + "## Define the Structured Extraction Schema\n", + "\n", + "We define a new, rich schema called `SolarPanelSchema` to capture key technical details from the datasheet. This schema includes:\n", + "\n", + "- **PowerRange:** Structured as minimum and maximum power output (in Watts).\n", + "- **SolarPanelSpec:** Includes module name, power output range, maximum efficiency, certifications, and a mapping of page citations.\n", + "\n", + "This schema replaces the earlier LM317 schema and will be used when creating our extraction agent." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bfb40d48-36e0-4b1c-97a1-32a1704c582b", + "metadata": {}, + "outputs": [], + "source": [ + "from pydantic import BaseModel, Field\n", + "from typing import List\n", + "\n", + "\n", + "class PowerRange(BaseModel):\n", + " min_power: float = Field(..., description=\"Minimum power output in Watts\")\n", + " max_power: float = Field(..., description=\"Maximum power output in Watts\")\n", + " unit: str = Field(\"W\", description=\"Power unit\")\n", + "\n", + "\n", + "class SolarPanelSpec(BaseModel):\n", + " module_name: str = Field(..., description=\"Name or model of the solar panel module\")\n", + " power_output: PowerRange = Field(..., description=\"Power output range\")\n", + " maximum_efficiency: float = Field(\n", + " ..., description=\"Maximum module efficiency in percentage\"\n", + " )\n", + " temperature_coefficient: float = Field(\n", + " ..., description=\"Temperature coefficient in %/°C\"\n", + " )\n", + " certifications: List[str] = Field([], description=\"List of certifications\")\n", + " page_citations: dict = Field(\n", + " ..., description=\"Mapping of each extracted field to its page numbers\"\n", + " )\n", + "\n", + "\n", + "class SolarPanelSchema(BaseModel):\n", + " specs: List[SolarPanelSpec] = Field(\n", + " ..., description=\"List of extracted solar panel specifications\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "19dc309e-7cec-43c1-8f6c-72e14df58f8f", + "metadata": {}, + "source": [ + "## Initialize Extraction Agent\n", + "\n", + "Here we initialize our extraction agent that will be responsible for extracting the schema from the solar panel datasheet." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9d9f4a2-2e14-493d-8a7e-d01159d38b8f", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_cloud_services import LlamaExtract\n", + "from llama_cloud.core.api_error import ApiError\n", + "from llama_cloud import ExtractConfig\n", + "\n", + "# Initialize the LlamaExtract client\n", + "llama_extract = LlamaExtract(\n", + " project_id=\"2fef999e-1073-40e6-aeb3-1f3c0e64d99b\",\n", + " organization_id=\"43b88c8f-e488-46f6-9013-698e3d2e374a\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec0eb2a7-6e02-45da-a6af-227e2f7c81f2", + "metadata": {}, + "outputs": [], + "source": [ + "try:\n", + " existing_agent = llama_extract.get_agent(name=\"solar-panel-datasheet\")\n", + " if existing_agent:\n", + " llama_extract.delete_agent(existing_agent.id)\n", + "except ApiError as e:\n", + " if e.status_code == 404:\n", + " pass\n", + " else:\n", + " raise\n", + "\n", + "extract_config = ExtractConfig(\n", + " extraction_mode=\"BALANCED\",\n", + ")\n", + "\n", + "agent = llama_extract.create_agent(\n", + " name=\"solar-panel-datasheet\", data_schema=SolarPanelSchema, config=extract_config\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "b4d7bb60-0456-4a2d-8d48-14f9bb3e71d2", + "metadata": {}, + "source": [ + "## Workflow Overview\n", + "\n", + "The workflow consists of three main steps:\n", + "\n", + "1. **parse_datasheet:** Runs the extraction agent over the solar panel datasheet (PDF) to produce structured data (with page citations).\n", + "2. **load_requirements:** Loads the design requirements (as a text blob) that will be injected into the prompt.\n", + "3. **generate_comparison_report:** Constructs a prompt from the extracted datasheet content and the design requirements, triggers the LLM to generate a comparison report, and returns the report as the workflow’s result via a `StopEvent`.\n", + "\n", + "Each step is implemented as an asynchronous function decorated with `@step`, and the workflow is built by subclassing `Workflow`. If this event‑driven syntax is new to you, a minimal toy example of the pattern is sketched below, before the full implementation." + ] + },
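+ { + "cell_type": "markdown", + "id": "f7e8d9c0-1a2b-4c3d-8e4f-5a6b7c8d9e0f", + "metadata": {}, + "source": [ + "The sketch below is a minimal, self-contained illustration of the event‑driven pattern (the toy event, step, and workflow names are ours, not part of this example). Kwargs passed to `.run()` become attributes on the `StartEvent`, each `@step` consumes one event type and returns the next, and returning a `StopEvent` ends the workflow.\n", + "\n", + "```python\n", + "from llama_index.core.workflow import Event, StartEvent, StopEvent, Workflow, step\n", + "\n", + "\n", + "class GreetEvent(Event):\n", + "    name: str\n", + "\n", + "\n", + "class ToyWorkflow(Workflow):\n", + "    @step\n", + "    async def make_greeting(self, ev: StartEvent) -> GreetEvent:\n", + "        # kwargs passed to .run() are available as attributes on StartEvent\n", + "        return GreetEvent(name=ev.name)\n", + "\n", + "    @step\n", + "    async def finish(self, ev: GreetEvent) -> StopEvent:\n", + "        # StopEvent ends the workflow; `result` becomes the return value of .run()\n", + "        return StopEvent(result=f\"Hello, {ev.name}!\")\n", + "\n", + "\n", + "# result = await ToyWorkflow(timeout=10).run(name=\"engineer\")\n", + "```" + ] + },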
+ { + "cell_type": "code", + "execution_count": null, + "id": "7c482e3a-66b4-4e1b-8d2d-9a9c6b3967f3", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.core.workflow import (\n", + " Event,\n", + " StartEvent,\n", + " StopEvent,\n", + " Context,\n", + " Workflow,\n", + " step,\n", + ")\n", + "from llama_index.llms.openai import OpenAI\n", + "from llama_index.core.prompts import ChatPromptTemplate\n", + "from llama_cloud_services import LlamaExtract\n", + "from llama_cloud.core.api_error import ApiError\n", + "from pydantic import BaseModel, Field\n", + "from typing import List\n", + "\n", + "\n", + "# Define output schema for the comparison report (for reference)\n", + "class ComparisonReportOutput(BaseModel):\n", + " component_name: str = Field(\n", + " ..., description=\"The name of the component being evaluated.\"\n", + " )\n", + " meets_requirements: bool = Field(\n", + " ...,\n", + " description=\"Overall indicator of whether the component meets the design criteria.\",\n", + " )\n", + " summary: str = Field(..., description=\"A brief summary of the evaluation results.\")\n", + " details: dict = Field(\n", + " ..., description=\"Detailed comparisons for each key parameter.\"\n", + " )\n", + "\n", + "\n", + "# Define custom events\n", + "\n", + "\n", + "class DatasheetParseEvent(Event):\n", + " datasheet_content: dict\n", + "\n", + "\n", + "class RequirementsLoadEvent(Event):\n", + " requirements_text: str\n", + "\n", + "\n", + "class ComparisonReportEvent(Event):\n", + " report: ComparisonReportOutput\n", + "\n", + "\n", + "class LogEvent(Event):\n", + " msg: str\n", + " delta: bool = False\n", + "\n", + "\n", + "# For our demonstration, LlamaExtract is used to pull structured data out of the datasheet.\n", + "# We'll also use OpenAI (via LlamaIndex) as our LLM for generating the report.\n", + "\n", + "llm = OpenAI(model=\"gpt-4o\") # or your preferred model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67a0c391-c7f5-4b93-8d6b-9e31b2d7a817", + "metadata": {}, + "outputs": [], + "source": [ + "class SolarPanelComparisonWorkflow(Workflow):\n", + " \"\"\"\n", + " Workflow to extract data from a solar panel datasheet and generate a comparison report\n", + " against provided design requirements.\n", + " \"\"\"\n", + "\n", + " def __init__(self, agent: LlamaExtract, requirements_path: str, **kwargs):\n", + " super().__init__(**kwargs)\n", + " self.agent = agent\n", + " # Load design requirements from file as a text blob\n", + " with open(requirements_path, \"r\") as f:\n", + " self.requirements_text = f.read()\n", + "\n", + " @step\n", + " async def parse_datasheet(\n", + " self, ctx: Context, ev: StartEvent\n", + " ) -> DatasheetParseEvent:\n", + " # datasheet_path is provided in the StartEvent\n", + " datasheet_path = (\n", + " ev.datasheet_path\n", + " ) # e.g., \"./data/solar_panel_e2e_comparison/datasheet.pdf\"\n", + " extraction_result = await self.agent.aextract(datasheet_path)\n", + " datasheet_dict = (\n", + " extraction_result.data\n", + " ) # a dict following SolarPanelSchema, including page citations\n", + " await ctx.set(\"datasheet_content\", datasheet_dict)\n", + " ctx.write_event_to_stream(LogEvent(msg=\"Datasheet parsed successfully.\"))\n", + " return DatasheetParseEvent(datasheet_content=datasheet_dict)\n", + "\n", + " @step\n", + " async def load_requirements(\n", + " self, ctx: Context, ev: DatasheetParseEvent\n", + " ) -> RequirementsLoadEvent:\n", + " # Use the pre-loaded requirements text from __init__\n", + " req_text = 
self.requirements_text\n", + " ctx.write_event_to_stream(LogEvent(msg=\"Design requirements loaded.\"))\n", + " return RequirementsLoadEvent(requirements_text=req_text)\n", + "\n", + " @step\n", + " async def generate_comparison_report(\n", + " self, ctx: Context, ev: RequirementsLoadEvent\n", + " ) -> StopEvent:\n", + " # Build a prompt that injects both the extracted datasheet content and the design requirements\n", + " datasheet_content = await ctx.get(\"datasheet_content\")\n", + " prompt_str = \"\"\"\n", + "You are an expert renewable energy engineer.\n", + "\n", + "Compare the following solar panel datasheet information with the design requirements.\n", + "\n", + "Design Requirements:\n", + "{requirements_text}\n", + "\n", + "Extracted Datasheet Information:\n", + "{datasheet_content}\n", + "\n", + "Generate a detailed comparison report in JSON format with the following schema:\n", + " - component_name: string\n", + " - meets_requirements: boolean\n", + " - summary: string\n", + " - details: dictionary of comparisons for each parameter\n", + "\n", + "For each parameter (Maximum Power, Open-Circuit Voltage, Short-Circuit Current, Efficiency, Temperature Coefficient),\n", + "indicate PASS or FAIL and provide brief explanations and recommendations.\n", + "\"\"\"\n", + "\n", + " # Build the chat prompt from the template string\n", + " prompt = ChatPromptTemplate.from_messages([(\"user\", prompt_str)])\n", + "\n", + " # Call the LLM to generate the report using the prompt\n", + " report_output = await llm.astructured_predict(\n", + " ComparisonReportOutput,\n", + " prompt,\n", + " requirements_text=ev.requirements_text,\n", + " datasheet_content=str(datasheet_content),\n", + " )\n", + " ctx.write_event_to_stream(LogEvent(msg=\"Comparison report generated.\"))\n", + " return StopEvent(\n", + " result={\"report\": report_output, \"datasheet_content\": datasheet_content}\n", + " )" + ] + },
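+ { + "cell_type": "markdown", + "id": "0b1c2d3e-4f5a-4b6c-9d7e-8f9a0b1c2d3e", + "metadata": {}, + "source": [ + "Note that each step emits progress messages through `ctx.write_event_to_stream(LogEvent(...))`. In the run below we simply await the final result, but if you want to surface those log messages as they happen, a minimal sketch (using the `workflow` instance created in the next section, and assuming the standard workflow handler interface where `.run()` returns a handler you can stream from and then await) would look like this:\n", + "\n", + "```python\n", + "# Illustrative only: stream LogEvent messages while the workflow runs.\n", + "handler = workflow.run(datasheet_path=\"./data/solar_panel_e2e_comparison/datasheet.pdf\")\n", + "async for event in handler.stream_events():\n", + "    if isinstance(event, LogEvent):\n", + "        print(event.msg)\n", + "result = await handler  # final StopEvent result\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "d205f532-1a11-4a48-b5a8-87a7f85e9ce7", + "metadata": {}, + "source": [ + "## Running the Workflow\n", + "\n", + "Below, we instantiate and run the workflow. 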
We inject the design requirements as a text blob (no custom code to load) and pass the path to the solar panel datasheet (the HoneyM datasheet from Trina).\n", + "\n", + "The design requirements are:\n", + "\n", + "```\n", + "Solar Panel Design Requirements:\n", + "- Power Output Range: ≥ 350 W\n", + "- Maximum Efficiency: ≥ 18%\n", + "- Certifications: Must include IEC61215 and UL1703\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b24fa61-a2f5-4ebb-84eb-1c9b48683b1b", + "metadata": {}, + "outputs": [], + "source": [ + "import nest_asyncio\n", + "\n", + "nest_asyncio.apply()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "be3ebad5-1f70-4671-a2ec-17bf9e4d788f", + "metadata": {}, + "outputs": [], + "source": [ + "# Path to design requirements file (e.g., a text file with design criteria for solar panels)\n", + "requirements_path = \"./data/solar_panel_e2e_comparison/design_reqs.txt\"\n", + "\n", + "# Instantiate the workflow\n", + "workflow = SolarPanelComparisonWorkflow(\n", + " agent=agent, requirements_path=requirements_path, verbose=True, timeout=120\n", + ")\n", + "\n", + "# Run the workflow; pass the datasheet path in the StartEvent\n", + "result = await workflow.run(\n", + " datasheet_path=\"./data/solar_panel_e2e_comparison/datasheet.pdf\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1e61f1e-8701-4acc-8f99-cc89d8aae535", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "********Final Comparison Report:********\n", + "\n", + "{\n", + " \"component_name\": \"TSM-DE08M.08(II)\",\n", + " \"meets_requirements\": true,\n", + " \"summary\": \"The solar panel TSM-DE08M.08(II) meets all the design requirements, making it a suitable choice for the intended application.\",\n", + " \"details\": {\n", + " \"Maximum Power Output\": \"PASS - The panel's power output ranges from 360 W to 385 W, exceeding the minimum requirement of 350 W.\",\n", + " \"Open-Circuit Voltage\": \"PASS - The datasheet does not specify Voc, but the panel meets other critical requirements. Verification of Voc is recommended.\",\n", + " \"Short-Circuit Current\": \"PASS - The datasheet does not specify Isc, but the panel meets other critical requirements. Verification of Isc is recommended.\",\n", + " \"Efficiency\": \"PASS - The panel's efficiency is 21.0%, which is above the required 18%.\",\n", + " \"Temperature Coefficient\": \"PASS - The temperature coefficient is -0.34%/°C, which is better than the maximum allowable -0.5%/°C.\"\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "print(\"\\n********Final Comparison Report:********\\n\")\n", + "print(result[\"report\"].model_dump_json(indent=4))\n", + "# print(\"\\n********Datasheet Content:********\\n\", result[\"datasheet_content\"])" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "llama_parse", + "language": "python", + "name": "llama_parse" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}