The LEDAA project is about building a conversational AI assistant for FRAGMENT documentation. Towards this purpose, the LEDAA Text Splitter is mainly intended to be used in the data preparation stage of a RAG pipeline, before data ingestion.
To learn more, check: Building AI Assistant for FRAGMENT documentation
Below is the process flow for how ground truth data updates are handled and how the knowledge base is effectively updated.
- `ledaa_updates_scanner` Lambda function monitors for changes in the content of the documentation.
- On detecting changes, it triggers the `ledaa_load_data` Lambda function, passing it the URL of the webpage.
- `ledaa_load_data` Lambda function invokes the `ledaa_text_splitter` Lambda function to initiate the process of scraping data from the given URL and to get back a list of strings (representing text chunks or documents) which will be used in data ingestion (a sketch of this call follows the list).
- `ledaa_text_splitter` Lambda function invokes the `ledaa_web_scrapper` Lambda function to scrape the URL and store the processed markdown data in S3.
- `ledaa_web_scrapper` function also stores a hash of the processed data in DynamoDB, which is later compared by the `ledaa_updates_scanner` function to detect changes.
- On receiving the processed document chunks back, the `ledaa_load_data` Lambda function stores the data in the vector store.
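To make the flow concrete, here is a minimal sketch of the synchronous Lambda-to-Lambda call from `ledaa_load_data` to `ledaa_text_splitter`. The payload shape (a JSON object with a `url` key) and the response format are assumptions made for illustration; they are not specified in this README.

```python
import json

import boto3

lambda_client = boto3.client("lambda")

def invoke_text_splitter(url: str) -> list[str]:
    """Sketch: synchronously invoke ledaa_text_splitter for a documentation URL."""
    response = lambda_client.invoke(
        FunctionName="ledaa_text_splitter",
        InvocationType="RequestResponse",  # synchronous: wait for the chunks
        Payload=json.dumps({"url": url}),  # assumed payload shape
    )
    # Assumed: the function returns a JSON-encoded list of chunk strings.
    return json.loads(response["Payload"].read())
```

Using `RequestResponse` (rather than the asynchronous `Event` invocation type) is what lets `ledaa_load_data` receive the chunk list directly in the invocation response, matching the flow described above.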
The main functionality is implemented in the `core.py` file; a rough sketch of the flow follows the steps below.
- Execution: The `ledaa_load_data` Lambda function synchronously invokes the `ledaa_text_splitter` Lambda function to initiate the process of scraping data from a given documentation URL and to return the processed list of strings (representing text chunks or documents), which can be used further in the data ingestion pipeline for building the RAG knowledge base.
- Web Scraping: Invokes the `ledaa_web_scrapper` Lambda function to scrape the URL and store the markdown data in S3.
- Data Fetching: Fetches the markdown data for the URL from S3.
- Data Preprocessing: Splits the markdown text into document chunks and adds the URL as metadata to each chunk.
- Return Data: Returns the processed list of strings (representing text chunks or documents) to the `ledaa_load_data` Lambda function, which can then use them for embeddings generation and storage in the vector store.
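Putting these steps together, a minimal sketch of what the `core.py` handler could look like is shown below. The splitting library (LangChain's `MarkdownHeaderTextSplitter`), the bucket name, the S3 key scheme, and the event fields are all assumptions for illustration; the actual `core.py` may differ.

```python
import json

import boto3
# Assumption: a LangChain markdown splitter; the README does not name the library.
from langchain_text_splitters import MarkdownHeaderTextSplitter

lambda_client = boto3.client("lambda")
s3 = boto3.client("s3")

BUCKET = "ledaa-data"  # hypothetical bucket name

def handler(event, context):
    url = event["url"]  # assumed event shape

    # Web Scraping: ledaa_web_scrapper scrapes the URL and stores markdown in S3.
    lambda_client.invoke(
        FunctionName="ledaa_web_scrapper",
        InvocationType="RequestResponse",
        Payload=json.dumps({"url": url}),
    )

    # Data Fetching: read back the markdown stored for this URL.
    key = url.replace("https://", "").replace("/", "_") + ".md"  # hypothetical key scheme
    markdown = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read().decode("utf-8")

    # Data Preprocessing: split into chunks, attaching the URL as metadata.
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2")]
    )
    docs = splitter.split_text(markdown)

    # Return Data: serialize each chunk (text plus URL metadata) as a string
    # for the caller (ledaa_load_data) to embed and store in the vector store.
    return [
        json.dumps({"text": d.page_content, "metadata": {**d.metadata, "url": url}})
        for d in docs
    ]
```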
We deploy the function to AWS Lambda using Terraform. The Terraform configuration files can be found in the `terraform` directory. The configuration file creates:

- Appropriate AWS role and policy for the Lambda function.
- AWS Lambda Layer for the Lambda function, using the pre-built compressed Lambda layer zip file (present in `terraform/packages`, created using `create_lambda_layer.sh`).
- Data archive file for the core code (`core.py`).
- AWS Lambda function using the data archive file, the Lambda Layer, and the appropriate role.
- Lambda function configuration granting appropriate access to S3.
There are certain scripts in the `terraform` directory, like `apply.sh` and `plan.sh`, which can be used to apply and plan the Terraform configuration respectively. These scripts extract the necessary environment variables from the `.env` file and pass them to Terraform.
This project is licensed under the MIT License - see the LICENSE file for details.