Commit 62d3de8

agoncal authored and sinedied committed
Document Ingestor
1 parent a554036 commit 62d3de8

File tree: 9 files changed (+308 -103 lines)


.gitignore (+1 -1)

````diff
@@ -17,7 +17,7 @@ lerna-debug.log*
 *.iml

 # Deployment
-*.env
+*.env*
 .azure

 # DB Storage
````
(binary file, 89.3 KB)

docs/sections/java-quarkus/00-welcome.md (+10 -19)

````diff
@@ -18,7 +18,7 @@ banner_url: assets/banner.jpg
 duration_minutes: 120
 audience: students, devs
 level: intermediate
-tags: chatgpt, openai, langchain4j, retrieval-augmented-generation, azure, containers, docker, static web apps, java, quarkus, azure ai search, azure container apps
+tags: chatgpt, openai, langchain4j, retrieval-augmented-generation, azure, containers, docker, static web apps, java, quarkus, azure ai search, azure container apps, qdrant, vector database
 published: false
 wt_id: javaquarkus-0000-cxa
 sections_title:
@@ -37,27 +37,18 @@ In this workshop, we'll explore the fundamentals of custom ChatGPT experiences b
 - Use [Azure OpenAI](https://azure.microsoft.com/products/ai-services/openai-service) models and [LangChain4j](https://langchain4j.github.io/langchain4j/) to generate answers based on a prompt.
 - Query a vector database and augment a prompt to generate responses.
 - Connect your Web API to a ChatGPT-like website.
-- Deploy your application on Azure.
+- (optionally) Deploy your application to Azure.

 ## Prerequisites

-| | |
-|----------------------------|----------------------------------------------------------------------|
-| GitHub account             | [Get a free GitHub account](https://github.com/join)                  |
-| Azure account              | [Get a free Azure account](https://azure.microsoft.com/free)          |
-| Access to Azure OpenAI API | [Request access to Azure OpenAI](https://aka.ms/oaiapply)             |
-| A Web browser              | [Get Microsoft Edge](https://www.microsoft.com/edge)                  |
-| Java knowledge             | [Java tutorial on W3schools](https://www.w3schools.com/java/)         |
-| Quarkus knowledge          | [Quarkus Getting Started](https://quarkus.io/guides/getting-started)  |
+| | |
+|-------------------|----------------------------------------------------------------------|
+| GitHub account    | [Get a free GitHub account](https://github.com/join)                  |
+| A Web browser     | [Get Microsoft Edge](https://www.microsoft.com/edge)                  |
+| An HTTP client    | [For example curl](https://curl.se/)                                  |
+| Java knowledge    | [Java tutorial on W3schools](https://www.w3schools.com/java/)         |
+| Quarkus knowledge | [Quarkus Getting Started](https://quarkus.io/guides/getting-started)  |

-We'll use [GitHub Codespaces](https://github.com/features/codespaces) to have an instant dev environment already prepared for this workshop.
+As for development, you can either use your local environment or [GitHub Codespaces](https://github.com/features/codespaces). Thanks to GitHub Codespaces you can have an instant dev environment already prepared for this workshop.

 If you prefer to work locally, we'll also provide instructions to setup a local dev environment using either VS Code with a [dev container](https://aka.ms/vscode/ext/devcontainer) or a manual install of the needed tools with your favourite IDE (Intellij IDEA, VS Code, etc.).
-
-<div class="info" data-title="note">
-
-> Your Azure account must have `Microsoft.Authorization/roleAssignments/write` permissions, such as [Role Based Access Control Administrator](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#role-based-access-control-administrator-preview), [User Access Administrator](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#user-access-administrator), or [Owner](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#owner). Your account also needs `Microsoft.Resources/deployments/write` permissions at a subscription level to allow deployment of Azure resources.
->
-> If you have your own personal Azure subscription, you should be good to go. If you're using an Azure subscription provided by your company, you may need to contact your IT department to ensure you have the necessary permissions.
-
-</div>
````

docs/sections/java-quarkus/02.1-additional-setup.md (+32)

````diff
@@ -5,3 +5,35 @@ To complete the template setup, please run the following command in a terminal,
 ```bash
 ./scripts/setup-template.sh java-quarkus
 ```
+
+### Using a local proxy
+
+<div data-visible="$$proxy$$">
+
+We have deployed an OpenAI proxy service for you, so you can use it to work on this workshop locally before deploying anything to Azure.
+
+Create a `.env` file at the root of the project, and add the following content:
+
+```
+AZURE_OPENAI_URL=$$proxy$$
+QDRANT_URL=http://localhost:6333
+```
+
+</div>
+
+### Deploy to Azure
+
+If you want to deploy your application to Azure, you will need an Azure account (more on that later).
+
+| | |
+|----------------------------|----------------------------------------------------------------------|
+| Azure account              | [Get a free Azure account](https://azure.microsoft.com/free)          |
+| Access to Azure OpenAI API | [Request access to Azure OpenAI](https://aka.ms/oaiapply)             |
+
+<div class="info" data-title="note">
+
+> Your Azure account must have `Microsoft.Authorization/roleAssignments/write` permissions, such as [Role Based Access Control Administrator](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#role-based-access-control-administrator-preview), [User Access Administrator](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#user-access-administrator), or [Owner](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#owner). Your account also needs `Microsoft.Resources/deployments/write` permissions at a subscription level to allow deployment of Azure resources.
+>
+> If you have your own personal Azure subscription, you should be good to go. If you're using an Azure subscription provided by your company, you may need to contact your IT department to ensure you have the necessary permissions.
+
+</div>
````
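
If the tooling you use does not load `.env` files automatically, one generic way to make these values visible in your current shell is a plain `source` (a sketch, not part of the workshop scripts; it assumes the file only contains simple `KEY=value` lines like the ones above):

```bash
# Export every variable defined in .env into the current shell session
set -a
source .env
set +a

# Quick check that the values are visible
echo "$AZURE_OPENAI_URL"
echo "$QDRANT_URL"
```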

docs/sections/java-quarkus/03-overview.md (+1 -1)

````diff
@@ -5,7 +5,7 @@ The project template you've forked is a monorepo, which means it's a single repo
 ```sh
 .devcontainer/    # Configuration for the development container
 data/             # Sample PDFs to serve as custom data
-infra/            # Templates and scripts for Azure infrastructure
+infra/            # Templates and scripts for Docker and Azure infrastructure
 scripts/          # Utility scripts for document ingestion
 src/              # Source code for the application's services
 ├── backend-java-quarkus/    # The Chat API developed with Quarkus
````

docs/sections/java-quarkus/04-vector-db.md (+15 -4)

````diff
@@ -31,13 +31,20 @@ For this workshop, we'll use Qdrant as our vector database as it works well with

 ### Running Qdrant locally

-To start Qdrant locally, you can use the following command:
+To start Qdrant locally, we have set up a Docker Compose file. You can use the following command from the root of the project:

 ```bash
-docker run -p 6333:6333 -v $(pwd)/.qdrant:/qdrant/storage:z qdrant/qdrant:v1.7.3
+docker compose -f infra/docker-compose/qdrant.yml up
 ```

-This will pull the Docker image, start Qdrant on port `6333` and mount a volume to store the data in the `.qdrant` folder.
+This will pull the Docker image, start Qdrant on port `6333` and mount a volume to store the data in the `.qdrant` folder. You should see logs that look like:
+
+```text
+qdrant-1 | INFO qdrant::actix: Qdrant HTTP listening on 6333
+qdrant-1 | INFO actix_server::builder: Starting 9 workers
+qdrant-1 | INFO qdrant::tonic: Qdrant gRPC listening on
+qdrant-1 | INFO actix_server::server: Actix runtime found; starting in Actix runtime
+```

 You can test that Qdrant is running by opening the following URL in your browser: [http://localhost:6333/dashboard](http://localhost:6333/dashboard).

@@ -48,4 +55,8 @@ You can test that Qdrant is running by opening the following URL in your browser

 </div>

-Once you tested that Qdrant is running correctly, you can stop it by pressing `CTRL+C` in your terminal.
+Once you have tested that Qdrant is running correctly, you can stop it by pressing `CTRL+C` in your terminal or by executing the following command from the root directory of the project:
+
+```bash
+docker compose -f infra/docker-compose/qdrant.yml down
+```
````
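
The `infra/docker-compose/qdrant.yml` file itself does not appear in this diff. Based on the `docker run` command it replaces (image `qdrant/qdrant:v1.7.3`, HTTP port `6333`, storage mounted from `.qdrant`) and the gRPC port `6334` used later by the ingestor, a minimal equivalent might look like the sketch below; the real file in the repository may differ.

```yaml
# Hypothetical sketch of infra/docker-compose/qdrant.yml (not the actual file).
services:
  qdrant:
    image: qdrant/qdrant:v1.7.3
    ports:
      - "6333:6333"   # HTTP API and dashboard
      - "6334:6334"   # gRPC API used by the Java ingestor
    volumes:
      # Relative to infra/docker-compose/, so ../../.qdrant is the project root's .qdrant folder
      - ../../.qdrant:/qdrant/storage:z
```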

docs/sections/java-quarkus/05-ingestion.md (+181 -37)

````diff
@@ -16,68 +16,212 @@ PDFs files, which are stored in the `data` folder, will be read by the `Document

 </div>

+Create the `DocumentIngestor` class under the `src/main/java` directory, inside the `ai.azure.openai.rag.workshop.ingestion` package. The `main` method of the `DocumentIngestor` class looks like the following:
+
+```java
+public class DocumentIngestor {
+
+  private static final Logger log = LoggerFactory.getLogger(DocumentIngestor.class);
+
+  public static void main(String[] args) {
+
+    // Setup Qdrant store for embeddings storage and retrieval
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+
+    System.exit(0);
+  }
+}
+```
+
+LangChain4j uses [TinyLog](https://tinylog.org) as a logging framework. Create the `src/ingestion-java/src/main/resources/tinylog.properties` file and set the log level to `info` (you can also set it to `debug` if you want more logs):
+
+```properties
+writer.level = info
+```
+
+#### Set up the Qdrant client
+
+Now that we have the `DocumentIngestor` class, we need to set up the Qdrant client to interact with the vector database. We'll use the `QdrantEmbeddingStore` class from LangChain4j to interact with Qdrant. Notice the name of the collection (`rag-workshop-collection`), the host (`localhost` as Qdrant is running locally) and the gRPC port (`6334`):
+
+```java
+public class DocumentIngestor {
+
+  public static void main(String[] args) {
+
+    // Setup Qdrant store for embeddings storage and retrieval
+    log.info("### Setup Qdrant store for embeddings storage and retrieval");
+    EmbeddingStore<TextSegment> qdrantEmbeddingStore = QdrantEmbeddingStore.builder()
+      .collectionName("rag-workshop-collection")
+      .host("localhost")
+      .port(6334)
+      .build();
+
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+
+    System.exit(0);
+  }
+}
+```
+
 #### Reading the PDF files content

-The content the PDFs files will be used as part of the *Retriever* component of the RAG architecture, to generate answers to your questions using the GPT model.
+The content of the PDF files will be used as part of the *Retriever* component of the RAG architecture, to generate answers to your questions using the GPT model. To read these files we need to iterate through the PDF files located under the classpath. We'll use the `findPdfFiles()` method to get the list of PDF files and then load them with the `FileSystemDocumentLoader` from LangChain4j:
+
+```java
+public class DocumentIngestor {
+
+  public static void main(String[] args) {
+
+    // Setup Qdrant store for embeddings storage and retrieval
+
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+    log.info("### Read all the PDFs");
+    List<Path> pdfFiles = findPdfFiles();
+    for (Path pdfFile : pdfFiles) {
+
+      log.info("### Load PDF: {}", pdfFile.toAbsolutePath());
+      Document document = FileSystemDocumentLoader.loadDocument(pdfFile, new ApachePdfBoxDocumentParser());
+
+      // ...
+    }
+
+    System.exit(0);
+  }
+
+  public static List<Path> findPdfFiles() {
+    try {
+      return Files.walk(Paths.get("./"))
+        .filter(path -> path.toString().endsWith(".pdf"))
+        .collect(Collectors.toList());
+    } catch (IOException e) {
+      throw new RuntimeException("Error reading files from directory", e);
+    }
+  }
+}
+```
+
+#### Split the document into segments
+
+Now that the PDF files are loaded, we need to split each PDF file (thanks to `DocumentSplitter`) into smaller chunks, called `TextSegment`:
+
+
+```java
+public class DocumentIngestor {
+
+  public static void main(String[] args) {

-Text from the PDF files is extracted in the `DocumentIngestor` using LangChain4j. You can have a look at code of the `extractTextFromPdf()` method if you're curious about how it works.
+    // Setup Qdrant store for embeddings storage and retrieval
+
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+    for (Path pdfFile : pdfFiles) {
+
+      // ...
+      log.info("### Split document into segments 100 tokens each");
+      DocumentSplitter splitter = DocumentSplitters.recursive(100, 0, new OpenAiTokenizer(GPT_3_5_TURBO));
+      List<TextSegment> segments = splitter.split(document);
+
+      // ...
+    }
+
+    System.exit(0);
+  }
+}
+```

 #### Computing the embeddings

-After the text is extracted, it's then transformed into embeddings using the [OpenAI JavaScript library](https://github.com/openai/openai-node):
+After the text is split into segments, they are transformed into embeddings using the [AllMiniLmL6V2EmbeddingModel](https://github.com/langchain4j/langchain4j-embeddings) from LangChain4j. This model runs locally in memory (no need to connect to a remote LLM) and generates embeddings for each segment:
+
+```java
+public class DocumentIngestor {
+
+  public static void main(String[] args) {
+
+    // Setup Qdrant store for embeddings storage and retrieval

-```ts
-async createEmbedding(text: string): Promise<number[]> {
-  const embeddingsClient = await this.openai.getEmbeddings();
-  const result = await embeddingsClient.create({ input: text, model: this.embeddingModelName });
-  return result.data[0].embedding;
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+    for (Path pdfFile : pdfFiles) {
+
+      // ...
+
+      log.info("### Embed segments (convert them into vectors that represent the meaning) using embedding model");
+      EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
+      List<Embedding> embeddings = embeddingModel.embedAll(segments).content();
+
+      // ...
+    }
+  }
 }
 ```

-#### Adding the documents to the vector database
-
-The embeddings along with the original texts are then added to the vector database using the [Qdrant JavaScript client library](https://www.npmjs.com/package/@qdrant/qdrant-js). This process is done in batches, to improve performance and limit the number of requests:
-
-```ts
-const points = sections.map((section) => ({
-  // ID must be either a 64-bit integer or a UUID
-  id: getUuid(section.id, 5),
-  vector: section.embedding!,
-  payload: {
-    id: section.id,
-    content: section.content,
-    category: section.category,
-    sourcepage: section.sourcepage,
-    sourcefile: section.sourcefile,
-  },
-}));
-
-await this.qdrantClient.upsert(indexName, { points });
+#### Adding the embeddings to the vector database
+
+The embeddings along with the original texts are then added to the vector database using the `QdrantEmbeddingStore` API:
+
+```java
+public class DocumentIngestor {
+
+  public static void main(String[] args) {
+
+    // Setup Qdrant store for embeddings storage and retrieval
+
+    // Load all the PDFs, compute embeddings and store them in Qdrant store
+    for (Path pdfFile : pdfFiles) {
+
+      // ...
+
+      log.info("### Store embeddings into Qdrant store for further search / retrieval");
+      qdrantEmbeddingStore.addAll(embeddings, segments);
+    }
+  }
+}
 ```

 ### Running the ingestion process

-Let's now execute this process. First, you need to make sure you have Qdrant and the indexer service running locally. We'll use Docker Compose to run both services at the same time. Run the following command in a terminal (**make sure you stopped the Qdrant container before!**):
+Let's now execute this process. First, you need to make sure you have Qdrant running locally and all set up. Run the following command in a terminal to start up Qdrant (**make sure you stopped the Qdrant container before!**):

 ```bash
-docker compose up
+docker compose -f infra/docker-compose/qdrant.yml up
 ```

-This will start both Qdrant and the indexer service locally. This may takes a few minutes the first time, as Docker needs to download the images.
+This will start Qdrant locally. Make sure you can access the Qdrant dashboard at the URL http://localhost:6333/dashboard. Then, create a new collection named `rag-workshop-collection` with the following cURL command:

-<div class="tip" data-title="tip">
+```bash
+curl -X PUT 'http://localhost:6333/collections/rag-workshop-collection' \
+  -H 'Content-Type: application/json' \
+  --data-raw '{
+    "vectors": {
+      "size": 384,
+      "distance": "Dot"
+    }
+  }'
+```

-> You can look at the `docker-compose.yml` file at the root of the project to see how the services are configured. Docker Compose automatically loads the `.env` file, so we can use the environment variables exposed there. To learn more about Docker Compose, check out the [official documentation](https://docs.docker.com/compose/).
+You should see the collection in the dashboard:

-</div>
+![Collection listed in the Qdrant dashboard](./assets/qdrant-dashboard-collection.png)
+
+You can also use a few cURL commands to visualize the collection:

-Once all services are started, you can run the ingestion process by opening a new terminal and running the `./scripts/index-data.sh` script on Linux or macOS, or `./scripts/index-data.ps1` on Windows:
+```bash
+curl http://localhost:6333/collections
+curl http://localhost:6333/collections/rag-workshop-collection | jq
+```
+
+Once Qdrant is started and the collection is created, you can run the ingestion process by opening a new terminal and running the following Maven command under the `src/ingestion-java` folder. This will compile the code and run the `DocumentIngestor`:

 ```bash
-./scripts/index-data.sh
+mvn clean compile exec:java
 ```

-![Screenshot of the indexer CLI](./assets/indexer-cli.png)
+<div class="tip" data-title="tip">
+
+> If you want to increase the logs, you can set the level to `debug` instead of `info` in the `src/main/resources/tinylog.properties` file:
+
+> writer.level = debug
+
+</div>

 Once this process is executed, a new collection will be available in your database, where you can see the documents that were ingested.

````
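In the collection creation command above, the vector `size` is set to 384; that matches the output dimension of the all-MiniLM-L6-v2 model behind `AllMiniLmL6V2EmbeddingModel`. If you want to double-check this yourself, a small throwaway snippet along these lines should print 384 (a sketch reusing the classes already shown in the diff; not part of the workshop code):

```java
// Throwaway sanity check (not part of the workshop code): the local embedding
// model should produce vectors whose length matches the Qdrant collection
// "size" of 384 used in the curl command above.
EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
Embedding embedding = embeddingModel.embed("Hello, RAG workshop!").content();
System.out.println(embedding.vector().length); // expected: 384
```
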
````diff
@@ -91,7 +235,7 @@ Open the Qdrant dashboard again by opening the following URL in your browser: [h

 </div>

-You should see the collection named `kbindex` in the list:
+You should see the collection named `rag-workshop-collection` in the list:

 ![Screenshot of the Qdrant dashboard](./assets/qdrant-dashboard.png)

````
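
Once the Maven command has finished, a quick way to confirm that the segments were actually stored is to ask Qdrant for the collection's point count. This assumes `jq` is installed and relies on the `points_count` field returned by Qdrant's collection info endpoint:

```bash
# Number of points (one per embedded segment) stored in the collection
curl -s http://localhost:6333/collections/rag-workshop-collection | jq '.result.points_count'
```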

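For reference, assembling the snippets from the diff above into a single class gives something like the sketch below. The method bodies come straight from the commit; the package declaration matches the one mentioned in the text, while the import statements are assumptions that depend on the LangChain4j version and modules declared in the workshop's `pom.xml`:

```java
package ai.azure.openai.rag.workshop.ingestion;

// NOTE: these imports are best-effort assumptions; exact packages vary across
// LangChain4j versions and the modules declared in the workshop's pom.xml.
import static dev.langchain4j.model.openai.OpenAiModelName.GPT_3_5_TURBO;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.apache.pdfbox.ApachePdfBoxDocumentParser;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiTokenizer;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.qdrant.QdrantEmbeddingStore;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class DocumentIngestor {

  private static final Logger log = LoggerFactory.getLogger(DocumentIngestor.class);

  public static void main(String[] args) {

    // Setup Qdrant store for embeddings storage and retrieval
    log.info("### Setup Qdrant store for embeddings storage and retrieval");
    EmbeddingStore<TextSegment> qdrantEmbeddingStore = QdrantEmbeddingStore.builder()
      .collectionName("rag-workshop-collection")
      .host("localhost")
      .port(6334)
      .build();

    // Load all the PDFs, compute embeddings and store them in Qdrant store
    log.info("### Read all the PDFs");
    List<Path> pdfFiles = findPdfFiles();
    for (Path pdfFile : pdfFiles) {

      log.info("### Load PDF: {}", pdfFile.toAbsolutePath());
      Document document = FileSystemDocumentLoader.loadDocument(pdfFile, new ApachePdfBoxDocumentParser());

      log.info("### Split document into segments 100 tokens each");
      DocumentSplitter splitter = DocumentSplitters.recursive(100, 0, new OpenAiTokenizer(GPT_3_5_TURBO));
      List<TextSegment> segments = splitter.split(document);

      log.info("### Embed segments (convert them into vectors that represent the meaning) using embedding model");
      EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
      List<Embedding> embeddings = embeddingModel.embedAll(segments).content();

      log.info("### Store embeddings into Qdrant store for further search / retrieval");
      qdrantEmbeddingStore.addAll(embeddings, segments);
    }

    System.exit(0);
  }

  public static List<Path> findPdfFiles() {
    try {
      return Files.walk(Paths.get("./"))
        .filter(path -> path.toString().endsWith(".pdf"))
        .collect(Collectors.toList());
    } catch (IOException e) {
      throw new RuntimeException("Error reading files from directory", e);
    }
  }
}
```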