|
5 | 5 | "id": "7b33678f-67d2-48a1-801f-302622e43e0f",
|
6 | 6 | "metadata": {},
|
7 | 7 | "source": [
|
8 |
| - "## Goal\n", |
| 8 | + "## Chunking\n", |
9 | 9 | "The goal of chunking for InstructLab SDG is to provide the teacher model small and logical pieces of the source document to generate data off of.\n",
|
10 | 10 | "\n",
|
11 | 11 | "In this notebook we are doing chunking with Docling[https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking].\n",
|
12 | 12 | "\n",
|
13 |
| - "First let's ensure docling is installed." |
| 13 | + "The input to this notebook is a docling JSON file created after a docling conversion, or a directory of docling JSON files." |
| 14 | + ] |
| 15 | + }, |
| 16 | + { |
| 17 | + "cell_type": "markdown", |
| 18 | + "id": "d9f268fd-35d2-4c7a-8cfa-47630de00837", |
| 19 | + "metadata": {}, |
| 20 | + "source": [ |
| 21 | + "### Dependencies" |
14 | 22 | ]
|
15 | 23 | },
|
16 | 24 | {
|
|
272 | 280 | " c = dict(chunk=chunk, file=file.stem)\n",
|
273 | 281 | " all_chunks.append(c)\n",
|
274 | 282 | " except ConversionError as e:\n",
|
275 |
| - " print(f\"Skipping file {file}\")\n", |
276 |
| - "# print(all_chunks)" |
| 283 | + " print(f\"Skipping file {file}\")" |
277 | 284 | ]
|
278 | 285 | },
|
279 | 286 | {
|
|
286 | 293 | "To view the chunks, run through the following cell. As you can see the document is broken into small pieces with metadata about the chunk based on the document's format"
|
287 | 294 | ]
|
288 | 295 | },
|
| 296 | + { |
| 297 | + "cell_type": "code", |
| 298 | + "execution_count": 1, |
| 299 | + "id": "ff88cf5c-1315-4eca-afcd-25706eaf7d6b", |
| 300 | + "metadata": {}, |
| 301 | + "outputs": [], |
| 302 | + "source": [ |
| 303 | + "# print(all_chunks)" |
| 304 | + ] |
| 305 | + }, |
289 | 306 | {
|
290 | 307 | "cell_type": "markdown",
|
291 | 308 | "id": "84826055-a7f1-4334-a12b-bbc07a523199",
|
292 | 309 | "metadata": {
|
293 | 310 | "tags": []
|
294 | 311 | },
|
295 | 312 | "source": [
|
296 |
| - "## Save the chunks to a text file each" |
| 313 | + "## Save the chunks to a text file for each chunk\n", |
| 314 | + "\n", |
| 315 | + "Each chunk is saved to an individual text file in the format: `{docling-json-file-name}-{chunk #}.txt`. Having chunking in this format is important as an input to create-sdg-seed-data notebook." |
297 | 316 | ]
|
298 | 317 | },
|
299 | 318 | {
|
|
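The naming scheme described in the added markdown cell, `{docling-json-file-name}-{chunk #}.txt`, could be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: `save_chunks`, `sample_chunks`, and the choice to number chunks from 1 per source file are assumptions, and each chunk is represented here as a plain string rather than a Docling chunk object.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def save_chunks(all_chunks, output_dir):
    """Write each chunk to `{file}-{chunk #}.txt` and return the paths written.

    Hypothetical helper: assumes every entry is a dict with a `file` key
    (the source file stem) and a `chunk` key holding the chunk text.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    counters = defaultdict(int)  # running chunk count per source file
    written = []
    for entry in all_chunks:
        stem = entry["file"]
        counters[stem] += 1  # assumption: numbering starts at 1
        path = output_dir / f"{stem}-{counters[stem]}.txt"
        path.write_text(entry["chunk"], encoding="utf-8")
        written.append(path)
    return written


# Hypothetical sample data shaped like the `dict(chunk=..., file=...)` entries in the diff.
sample_chunks = [
    {"chunk": "First section text.", "file": "report"},
    {"chunk": "Second section text.", "file": "report"},
    {"chunk": "Only section text.", "file": "notes"},
]

with tempfile.TemporaryDirectory() as tmp:
    saved_names = [p.name for p in save_chunks(sample_chunks, tmp)]
```

Keeping the source file stem in the file name lets a downstream step, such as the create-sdg-seed-data notebook mentioned in the diff, group chunks by their originating document without extra metadata.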