Commit 62d3de8

agoncal authored and sinedied committed
Document Ingestor
1 parent a554036 commit 62d3de8

File tree: 9 files changed (+308 -103 lines)


.gitignore (+1 -1)

````diff
@@ -17,7 +17,7 @@ lerna-debug.log*
 *.iml

 # Deployment
-*.env
+*.env*
 .azure

 # DB Storage
````
(binary file, 89.3 KB)

docs/sections/java-quarkus/00-welcome.md (+10 -19)

````diff
@@ -18,7 +18,7 @@ banner_url: assets/banner.jpg
 duration_minutes: 120
 audience: students, devs
 level: intermediate
-tags: chatgpt, openai, langchain4j, retrieval-augmented-generation, azure, containers, docker, static web apps, java, quarkus, azure ai search, azure container apps
+tags: chatgpt, openai, langchain4j, retrieval-augmented-generation, azure, containers, docker, static web apps, java, quarkus, azure ai search, azure container apps, qdrant, vector database
 published: false
 wt_id: javaquarkus-0000-cxa
 sections_title:
@@ -37,27 +37,18 @@ In this workshop, we'll explore the fundamentals of custom ChatGPT experiences b
 - Use [Azure OpenAI](https://azure.microsoft.com/products/ai-services/openai-service) models and [LangChain4j](https://langchain4j.github.io/langchain4j/) to generate answers based on a prompt.
 - Query a vector database and augment a prompt to generate responses.
 - Connect your Web API to a ChatGPT-like website.
-- Deploy your application on Azure.
+- (optionally) Deploy your application to Azure.

 ## Prerequisites

-| | |
-|----------------------------|----------------------------------------------------------------------|
-| GitHub account             | [Get a free GitHub account](https://github.com/join)                  |
-| Azure account              | [Get a free Azure account](https://azure.microsoft.com/free)          |
-| Access to Azure OpenAI API | [Request access to Azure OpenAI](https://aka.ms/oaiapply)             |
-| A Web browser              | [Get Microsoft Edge](https://www.microsoft.com/edge)                  |
-| Java knowledge             | [Java tutorial on W3schools](https://www.w3schools.com/java/)         |
-| Quarkus knowledge          | [Quarkus Getting Started](https://quarkus.io/guides/getting-started)  |
+| | |
+|-------------------|----------------------------------------------------------------------|
+| GitHub account    | [Get a free GitHub account](https://github.com/join)                  |
+| A Web browser     | [Get Microsoft Edge](https://www.microsoft.com/edge)                  |
+| An HTTP client    | [For example curl](https://curl.se/)                                  |
+| Java knowledge    | [Java tutorial on W3schools](https://www.w3schools.com/java/)         |
+| Quarkus knowledge | [Quarkus Getting Started](https://quarkus.io/guides/getting-started)  |

-We'll use [GitHub Codespaces](https://github.com/features/codespaces) to have an instant dev environment already prepared for this workshop.
+As for development, you can either use your local environment or [GitHub Codespaces](https://github.com/features/codespaces). Thanks to GitHub Codespaces you can have an instant dev environment already prepared for this workshop.

 If you prefer to work locally, we'll also provide instructions to setup a local dev environment using either VS Code with a [dev container](https://aka.ms/vscode/ext/devcontainer) or a manual install of the needed tools with your favourite IDE (Intellij IDEA, VS Code, etc.).
-
-<div class="info" data-title="note">
-
-> Your Azure account must have `Microsoft.Authorization/roleAssignments/write` permissions, such as [Role Based Access Control Administrator](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#role-based-access-control-administrator-preview), [User Access Administrator](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#user-access-administrator), or [Owner](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#owner). Your account also needs `Microsoft.Resources/deployments/write` permissions at a subscription level to allow deployment of Azure resources.
->
-> If you have your own personal Azure subscription, you should be good to go. If you're using an Azure subscription provided by your company, you may need to contact your IT department to ensure you have the necessary permissions.
-
-</div>
````

docs/sections/java-quarkus/02.1-additional-setup.md (+32)

````diff
@@ -5,3 +5,35 @@ To complete the template setup, please run the following command in a terminal,
 ```bash
 ./scripts/setup-template.sh java-quarkus
 ```
+
+### Using a local proxy
+
+<div data-visible="$$proxy$$">
+
+We have deployed an OpenAI proxy service for you, so you can use it to work on this workshop locally before deploying anything to Azure.
+
+Create a `.env` file at the root of the project, and add the following content:
+
+```
+AZURE_OPENAI_URL=$$proxy$$
+QDRANT_URL=http://localhost:6333
+```
+
+</div>
+
+### Deploy to Azure
+
+If you want to deploy your application to Azure, you will need an Azure account (more on that later).
+
+| | |
+|----------------------------|----------------------------------------------------------------------|
+| Azure account              | [Get a free Azure account](https://azure.microsoft.com/free)          |
+| Access to Azure OpenAI API | [Request access to Azure OpenAI](https://aka.ms/oaiapply)             |
+
+<div class="info" data-title="note">
+
+> Your Azure account must have `Microsoft.Authorization/roleAssignments/write` permissions, such as [Role Based Access Control Administrator](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#role-based-access-control-administrator-preview), [User Access Administrator](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#user-access-administrator), or [Owner](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#owner). Your account also needs `Microsoft.Resources/deployments/write` permissions at a subscription level to allow deployment of Azure resources.
+>
+> If you have your own personal Azure subscription, you should be good to go. If you're using an Azure subscription provided by your company, you may need to contact your IT department to ensure you have the necessary permissions.
+
+</div>
````
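
If the tooling you use does not load `.env` files automatically, one generic way to make these values visible in your current shell is a plain `source` (a sketch, not part of the workshop scripts; it assumes the file only contains simple `KEY=value` lines like the ones above):

```bash
# Export every variable defined in .env into the current shell session
set -a
source .env
set +a

# Quick check that the values are visible
echo "$AZURE_OPENAI_URL"
echo "$QDRANT_URL"
```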

docs/sections/java-quarkus/03-overview.md (+1 -1)

````diff
@@ -5,7 +5,7 @@ The project template you've forked is a monorepo, which means it's a single repo
 ```sh
 .devcontainer/    # Configuration for the development container
 data/             # Sample PDFs to serve as custom data
-infra/            # Templates and scripts for Azure infrastructure
+infra/            # Templates and scripts for Docker and Azure infrastructure
 scripts/          # Utility scripts for document ingestion
 src/              # Source code for the application's services
 ├── backend-java-quarkus/    # The Chat API developed with Quarkus
````

docs/sections/java-quarkus/04-vector-db.md (+15 -4)

````diff
@@ -31,13 +31,20 @@ For this workshop, we'll use Qdrant as our vector database as it works well with

 ### Running Qdrant locally

-To start Qdrant locally, you can use the following command:
+To start Qdrant locally, we have set up a Docker Compose file. You can use the following command from the root of the project:

 ```bash
-docker run -p 6333:6333 -v $(pwd)/.qdrant:/qdrant/storage:z qdrant/qdrant:v1.7.3
+docker compose -f infra/docker-compose/qdrant.yml up
 ```

-This will pull the Docker image, start Qdrant on port `6333` and mount a volume to store the data in the `.qdrant` folder.
+This will pull the Docker image, start Qdrant on port `6333` and mount a volume to store the data in the `.qdrant` folder. You should see logs that look like:
+
+```text
+qdrant-1 | INFO qdrant::actix: Qdrant HTTP listening on 6333
+qdrant-1 | INFO actix_server::builder: Starting 9 workers
+qdrant-1 | INFO qdrant::tonic: Qdrant gRPC listening on
+qdrant-1 | INFO actix_server::server: Actix runtime found; starting in Actix runtime
+```

 You can test that Qdrant is running by opening the following URL in your browser: [http://localhost:6333/dashboard](http://localhost:6333/dashboard).

@@ -48,4 +55,8 @@ You can test that Qdrant is running by opening the following URL in your browser

 </div>

-Once you tested that Qdrant is running correctly, you can stop it by pressing `CTRL+C` in your terminal.
+Once you have tested that Qdrant is running correctly, you can stop it by pressing `CTRL+C` in your terminal or by executing the following command from the root directory of the project:
+
+```bash
+docker compose -f infra/docker-compose/qdrant.yml down
+```
````
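
The `infra/docker-compose/qdrant.yml` file itself does not appear in this diff. Based on the `docker run` command it replaces (image `qdrant/qdrant:v1.7.3`, HTTP port `6333`, storage mounted from `.qdrant`) and the gRPC port `6334` used later by the ingestor, a minimal equivalent might look like the sketch below; the real file in the repository may differ.

```yaml
# Hypothetical sketch of infra/docker-compose/qdrant.yml (not the actual file).
services:
  qdrant:
    image: qdrant/qdrant:v1.7.3
    ports:
      - "6333:6333"   # HTTP API and dashboard
      - "6334:6334"   # gRPC API used by the Java ingestor
    volumes:
      # Relative to infra/docker-compose/, so ../../.qdrant is the project root's .qdrant folder
      - ../../.qdrant:/qdrant/storage:z
```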

docs/sections/java-quarkus/05-ingestion.md (+181 -37)

````diff
@@ -16,68 +16,212 @@ PDFs files, which are stored in the `data` folder, will be read by the `Document

 </div>

+Create the `DocumentIngestor` class under the `src/main/java` directory, inside the `ai.azure.openai.rag.workshop.ingestion` package. The `main` method of the `DocumentIngestor` class looks like the following:
+
+```java
+public class DocumentIngestor {
+
+  private static final Logger log = LoggerFactory.getLogger(DocumentIngestor.class);
+
+  public static void main(String[] args) {
+
+    // Setup Qdrant store for embeddings storage and retrieval
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+
+    System.exit(0);
+  }
+}
+```
+
+LangChain4j uses [TinyLog](https://tinylog.org) as a logging framework. Create the `src/ingestion-java/src/main/resources/tinylog.properties` file and set the log level to `info` (you can also set it to `debug` if you want more logs):
+
+```properties
+writer.level = info
+```
+
+#### Set up the Qdrant client
+
+Now that we have the `DocumentIngestor` class, we need to set up the Qdrant client to interact with the vector database. We'll use the `QdrantEmbeddingStore` class from LangChain4j to interact with Qdrant. Notice the name of the collection (`rag-workshop-collection`), the host (`localhost` as Qdrant is running locally) and the gRPC port (`6334`):
+
+```java
+public class DocumentIngestor {
+
+  public static void main(String[] args) {
+
+    // Setup Qdrant store for embeddings storage and retrieval
+    log.info("### Setup Qdrant store for embeddings storage and retrieval");
+    EmbeddingStore<TextSegment> qdrantEmbeddingStore = QdrantEmbeddingStore.builder()
+      .collectionName("rag-workshop-collection")
+      .host("localhost")
+      .port(6334)
+      .build();
+
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+
+    System.exit(0);
+  }
+}
+```
+
 #### Reading the PDF files content

-The content the PDFs files will be used as part of the *Retriever* component of the RAG architecture, to generate answers to your questions using the GPT model.
+The content of the PDF files will be used as part of the *Retriever* component of the RAG architecture, to generate answers to your questions using the GPT model. To read these files we need to iterate through the PDF files located under the classpath. We'll use the `findPdfFiles()` method to get the list of PDF files and then load them with the `FileSystemDocumentLoader` from LangChain4j:
+
+```java
+public class DocumentIngestor {
+
+  public static void main(String[] args) {
+
+    // Setup Qdrant store for embeddings storage and retrieval
+
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+    log.info("### Read all the PDFs");
+    List<Path> pdfFiles = findPdfFiles();
+    for (Path pdfFile : pdfFiles) {
+
+      log.info("### Load PDF: {}", pdfFile.toAbsolutePath());
+      Document document = FileSystemDocumentLoader.loadDocument(pdfFile, new ApachePdfBoxDocumentParser());
+
+      // ...
+    }
+
+    System.exit(0);
+  }
+
+  public static List<Path> findPdfFiles() {
+    try {
+      return Files.walk(Paths.get("./"))
+        .filter(path -> path.toString().endsWith(".pdf"))
+        .collect(Collectors.toList());
+    } catch (IOException e) {
+      throw new RuntimeException("Error reading files from directory", e);
+    }
+  }
+}
+```
+
+#### Split the document into segments
+
+Now that the PDF files are loaded, we need to split each PDF file (thanks to `DocumentSplitter`) into smaller chunks, called `TextSegment`:
+
+
+```java
+public class DocumentIngestor {
+
+  public static void main(String[] args) {

-Text from the PDF files is extracted in the `DocumentIngestor` using LangChain4j. You can have a look at code of the `extractTextFromPdf()` method if you're curious about how it works.
+    // Setup Qdrant store for embeddings storage and retrieval
+
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+    for (Path pdfFile : pdfFiles) {
+
+      // ...
+      log.info("### Split document into segments 100 tokens each");
+      DocumentSplitter splitter = DocumentSplitters.recursive(100, 0, new OpenAiTokenizer(GPT_3_5_TURBO));
+      List<TextSegment> segments = splitter.split(document);
+
+      // ...
+    }
+
+    System.exit(0);
+  }
+}
+```

 #### Computing the embeddings

-After the text is extracted, it's then transformed into embeddings using the [OpenAI JavaScript library](https://github.com/openai/openai-node):
+After the text is split into segments, they are transformed into embeddings using the [AllMiniLmL6V2EmbeddingModel](https://github.com/langchain4j/langchain4j-embeddings) from LangChain4j. This model runs locally in memory (no need to connect to a remote LLM) and generates embeddings for each segment:
+
+```java
+public class DocumentIngestor {
+
+  public static void main(String[] args) {
+
+    // Setup Qdrant store for embeddings storage and retrieval

-```ts
-async createEmbedding(text: string): Promise<number[]> {
-  const embeddingsClient = await this.openai.getEmbeddings();
-  const result = await embeddingsClient.create({ input: text, model: this.embeddingModelName });
-  return result.data[0].embedding;
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+    for (Path pdfFile : pdfFiles) {
+
+      // ...
+
+      log.info("### Embed segments (convert them into vectors that represent the meaning) using embedding model");
+      EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
+      List<Embedding> embeddings = embeddingModel.embedAll(segments).content();
+
+      // ...
+    }
+  }
 }
 ```

-#### Adding the documents to the vector database
-
-The embeddings along with the original texts are then added to the vector database using the [Qdrant JavaScript client library](https://www.npmjs.com/package/@qdrant/qdrant-js). This process is done in batches, to improve performance and limit the number of requests:
-
-```ts
-const points = sections.map((section) => ({
-  // ID must be either a 64-bit integer or a UUID
-  id: getUuid(section.id, 5),
-  vector: section.embedding!,
-  payload: {
-    id: section.id,
-    content: section.content,
-    category: section.category,
-    sourcepage: section.sourcepage,
-    sourcefile: section.sourcefile,
-  },
-}));
-
-await this.qdrantClient.upsert(indexName, { points });
+#### Adding the embeddings to the vector database
+
+The embeddings along with the original texts are then added to the vector database using the `QdrantEmbeddingStore` API:
+
+```java
+public class DocumentIngestor {
+
+  public static void main(String[] args) {
+
+    // Setup Qdrant store for embeddings storage and retrieval
+
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+    for (Path pdfFile : pdfFiles) {
+
+      // ...
+
+      log.info("### Store embeddings into Qdrant store for further search / retrieval");
+      qdrantEmbeddingStore.addAll(embeddings, segments);
+    }
+  }
+}
 ```

 ### Running the ingestion process

-Let's now execute this process. First, you need to make sure you have Qdrant and the indexer service running locally. We'll use Docker Compose to run both services at the same time. Run the following command in a terminal (**make sure you stopped the Qdrant container before!**):
+Let's now execute this process. First, you need to make sure you have Qdrant running locally and all set up. Run the following command in a terminal to start up Qdrant (**make sure you stopped the Qdrant container before!**):

 ```bash
-docker compose up
+docker compose -f infra/docker-compose/qdrant.yml up
 ```

-This will start both Qdrant and the indexer service locally. This may takes a few minutes the first time, as Docker needs to download the images.
+This will start Qdrant locally. Make sure you can access the Qdrant dashboard at the URL http://localhost:6333/dashboard. Then, create a new collection named `rag-workshop-collection` with the following cURL command:

-<div class="tip" data-title="tip">
+```bash
+curl -X PUT 'http://localhost:6333/collections/rag-workshop-collection' \
+  -H 'Content-Type: application/json' \
+  --data-raw '{
+    "vectors": {
+      "size": 384,
+      "distance": "Dot"
+    }
+  }'
+```

-> You can look at the `docker-compose.yml` file at the root of the project to see how the services are configured. Docker Compose automatically loads the `.env` file, so we can use the environment variables exposed there. To learn more about Docker Compose, check out the [official documentation](https://docs.docker.com/compose/).
+You should see the collection in the dashboard:

-</div>
+![Collection listed in the Qdrant dashboard](./assets/qdrant-dashboard-collection.png)
+
+You can also use a few cURL commands to visualize the collection:

-Once all services are started, you can run the ingestion process by opening a new terminal and running the `./scripts/index-data.sh` script on Linux or macOS, or `./scripts/index-data.ps1` on Windows:
+```bash
+curl http://localhost:6333/collections
+curl http://localhost:6333/collections/rag-workshop-collection | jq
+```
+
+Once Qdrant is started and the collection is created, you can run the ingestion process by opening a new terminal and running the following Maven command under the `src/ingestion-java` folder. This will compile the code and run the `DocumentIngestor`:

 ```bash
-./scripts/index-data.sh
+mvn clean compile exec:java
 ```

-![Screenshot of the indexer CLI](./assets/indexer-cli.png)
+<div class="tip" data-title="tip">
+
+> If you want to increase the logs, you can set the level to `debug` instead of `info` in the `src/main/resources/tinylog.properties` file:
+
+> writer.level = debug
+
+</div>

 Once this process is executed, a new collection will be available in your database, where you can see the documents that were ingested.

````
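In the collection creation command above, the vector `size` is set to 384; that matches the output dimension of the all-MiniLM-L6-v2 model behind `AllMiniLmL6V2EmbeddingModel`. If you want to double-check this yourself, a small throwaway snippet along these lines should print 384 (a sketch reusing the classes already shown in the diff; not part of the workshop code):

```java
// Throwaway sanity check (not part of the workshop code): the local embedding
// model should produce vectors whose length matches the Qdrant collection
// "size" of 384 used in the curl command above.
EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
Embedding embedding = embeddingModel.embed("Hello, RAG workshop!").content();
System.out.println(embedding.vector().length); // expected: 384
```
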
````diff
@@ -91,7 +235,7 @@ Open the Qdrant dashboard again by opening the following URL in your browser: [h

 </div>

-You should see the collection named `kbindex` in the list:
+You should see the collection named `rag-workshop-collection` in the list:

 ![Screenshot of the Qdrant dashboard](./assets/qdrant-dashboard.png)

````
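
Once the Maven command has finished, a quick way to confirm that the segments were actually stored is to ask Qdrant for the collection's point count. This assumes `jq` is installed and relies on the `points_count` field returned by Qdrant's collection info endpoint:

```bash
# Number of points (one per embedded segment) stored in the collection
curl -s http://localhost:6333/collections/rag-workshop-collection | jq '.result.points_count'
```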

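For reference, assembling the snippets from the diff above into a single class gives something like the sketch below. The method bodies come straight from the commit; the package declaration matches the one mentioned in the text, while the import statements are assumptions that depend on the LangChain4j version and modules declared in the workshop's `pom.xml`:

```java
package ai.azure.openai.rag.workshop.ingestion;

// NOTE: these imports are best-effort assumptions; exact packages vary across
// LangChain4j versions and the modules declared in the workshop's pom.xml.
import static dev.langchain4j.model.openai.OpenAiModelName.GPT_3_5_TURBO;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.apache.pdfbox.ApachePdfBoxDocumentParser;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiTokenizer;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.qdrant.QdrantEmbeddingStore;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class DocumentIngestor {

  private static final Logger log = LoggerFactory.getLogger(DocumentIngestor.class);

  public static void main(String[] args) {

    // Setup Qdrant store for embeddings storage and retrieval
    log.info("### Setup Qdrant store for embeddings storage and retrieval");
    EmbeddingStore<TextSegment> qdrantEmbeddingStore = QdrantEmbeddingStore.builder()
      .collectionName("rag-workshop-collection")
      .host("localhost")
      .port(6334)
      .build();

    // Load all the PDFs, compute embeddings and store them in Qdrant store
    log.info("### Read all the PDFs");
    List<Path> pdfFiles = findPdfFiles();
    for (Path pdfFile : pdfFiles) {

      log.info("### Load PDF: {}", pdfFile.toAbsolutePath());
      Document document = FileSystemDocumentLoader.loadDocument(pdfFile, new ApachePdfBoxDocumentParser());

      log.info("### Split document into segments 100 tokens each");
      DocumentSplitter splitter = DocumentSplitters.recursive(100, 0, new OpenAiTokenizer(GPT_3_5_TURBO));
      List<TextSegment> segments = splitter.split(document);

      log.info("### Embed segments (convert them into vectors that represent the meaning) using embedding model");
      EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
      List<Embedding> embeddings = embeddingModel.embedAll(segments).content();

      log.info("### Store embeddings into Qdrant store for further search / retrieval");
      qdrantEmbeddingStore.addAll(embeddings, segments);
    }

    System.exit(0);
  }

  public static List<Path> findPdfFiles() {
    try {
      return Files.walk(Paths.get("./"))
        .filter(path -> path.toString().endsWith(".pdf"))
        .collect(Collectors.toList());
    } catch (IOException e) {
      throw new RuntimeException("Error reading files from directory", e);
    }
  }
}
```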