
Conversation

@nerpaula commented Dec 22, 2025

Description


Note

Restructures GraphRAG Importer documentation into a dedicated section and expands capabilities documentation.

  • New reference/importer/ section with pages: LLM Configuration, Import Files, Semantic Units, Verify and Explore, and Parameters
  • Documents multi-file import via POST /v1/import-multiple with streaming progress events
  • Adds semantic units/image processing options and related collections/fields
  • Adds detailed parameter and vector index configuration references; clarifies LLM setup (OpenAI-compatible and Triton)
  • Updates site-wide links from reference/importer.md to reference/importer/ and fixes anchors (incl. Triton page)
  • Updates Web/Technical Overview/AI Orchestrator/Retriever pages to align with new structure and features; removes single-file limitation

Written by Cursor Bugbot for commit cb6ea66. This will update automatically on new commits.

@arangodb-docs-automation

Deploy Preview Available Via
https://deploy-preview-856--docs-hugo.netlify.app


bluepal-pavan-kothapalli commented Dec 23, 2025

Thanks for your work, @nerpaula. I have added one comment; please resolve it. Otherwise, LGTM.
Also, there are some missing parameters that need to be included, such as `vector_index_n_lists`, `vector_index_metric`, and `vector_index_use_hnsw`. I would suggest getting a review from @aMahanna, since he implemented most of these parameters.

Full example JSON payload for `ImportMultipleFilesRequest` (`POST /v1/import-multiple`):

```json
{
  "files": [
    {
      "name": "document1.txt",
      "content": "VGhpcyBpcyBkb2MxIGNvbnRlbnQgaW4gYmFzZTY0Lg==",
      "citable_url": "https://example.com/doc1"
    },
    {
      "name": "document2.pdf",
      "content": "VGhpcyBpcyBkb2MyIGNvbnRlbnQgaW4gYmFzZTY0Lg==",
      "citable_url": "https://example.com/doc2"
    }
  ],
  "store_in_s3": false,
  "batch_size": 1000,
  "enable_chunk_embeddings": true,
  "enable_edge_embeddings": true,
  "chunk_token_size": 1000,
  "chunk_overlap_token_size": 200,
  "entity_types": [
    "PERSON",
    "ORGANIZATION",
    "LOCATION",
    "TECHNOLOGY"
  ],
  "relationship_types": [
    "RELATED_TO",
    "PART_OF",
    "USES",
    "LOCATED_IN"
  ],
  "community_report_num_findings": "5-10",
  "community_report_instructions": "Focus on key entities, relationships, and risk-related findings.",
  "partition_id": "my_partition_id_001",
  "enable_semantic_units": true,
  "process_images": true,
  "store_image_data": true,
  "chunk_min_token_size": 50,
  "chunk_custom_separators": [
    "\n\n",
    "---",
    "###"
  ],
  "preserve_chunk_separator": true,
  "smart_graph_attribute": "region",
  "shard_count": 3,
  "is_disjoint": false,
  "satellite_collections": [
    "sat_col_1",
    "sat_col_2"
  ],
  "enable_strict_types": true,
  "entity_extract_max_gleaning": 1,
  "vector_index_n_lists": 2048,
  "vector_index_metric": "cosine",
  "vector_index_use_hnsw": true,
  "enable_community_embeddings": true
}
```
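As a sketch of how a client might assemble such a payload: the snippet below base64-encodes file contents into the `content` field, using only a subset of the fields from the example above. The base URL is a made-up placeholder, and the commented-out `requests.post` call is an assumption about typical client usage, not part of this documentation.

```python
import base64

# Hypothetical service address; substitute your GraphRAG Importer endpoint.
BASE_URL = "http://localhost:8080"

def file_entry(name: str, raw: bytes, citable_url: str) -> dict:
    # The API expects file contents base64-encoded in the `content` field.
    return {
        "name": name,
        "content": base64.b64encode(raw).decode("ascii"),
        "citable_url": citable_url,
    }

payload = {
    "files": [
        file_entry("document1.txt", b"This is doc1 content in base64.",
                   "https://example.com/doc1"),
    ],
    "batch_size": 1000,
    "enable_chunk_embeddings": True,
}

# To submit (requires e.g. the `requests` package and a running service):
#   requests.post(f"{BASE_URL}/v1/import-multiple", json=payload, stream=True)
print(payload["files"][0]["content"])
```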

@diegomendez40 left a comment

Thanks for your work @nerpaula . I have added some comments. Please feel free to reach out for any clarification.

Comment on lines +83 to +91
## Multi-File Import

Use multi-file import when you need to process multiple documents into a single
Knowledge Graph. This API provides streaming progress updates, making it
ideal for batch processing and long-running imports where you need to track progress.

```
POST /v1/import-multiple
```
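A consumer of the streaming progress updates might look like the sketch below. This is illustrative only: it assumes (the page does not confirm) that events arrive as newline-delimited JSON, and the event field names are made up.

```python
import json
from typing import Iterable, List

def parse_progress_events(lines: Iterable[str]) -> List[dict]:
    """Collect streamed progress events, assumed here to be newline-delimited JSON."""
    events = []
    for line in lines:
        line = line.strip()
        if line:
            events.append(json.loads(line))
    return events

# Simulated stream; a real client would iterate over the HTTP response body
# (e.g. `response.iter_lines()` after posting to /v1/import-multiple).
stream = [
    '{"status": "processing", "file": "document1.txt"}',
    '{"status": "completed", "files_imported": 2}',
]
for event in parse_progress_events(stream):
    print(event["status"])
```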
Contributor

There is no mention of a reference architecture or of the AutoGraph. Our recommended way to ingest a huge corpus is to use the AutoGraph to create different clusters (mini-topics), then ingest each of them with the multi-file importer into its own GraphRAG partition.

Contributor Author

This is expected right now. It will be enhanced with a reference architecture and documentation for AutoGraph; I will treat that in a separate task/PR.

Comment on lines +160 to +162
- `vector_index_metric`: Distance metric for vector similarity search. The supported values are `"cosine"` (default), `"l2"`, and `"innerProduct"`.
- `vector_index_n_lists`: Number of lists for approximate search (optional). If not set, it is automatically computed as `8 * sqrt(collection_size)`. This parameter is ignored when using HNSW.
- `vector_index_use_hnsw`: Whether to use HNSW (Hierarchical Navigable Small World) index instead of the default inverted index (default: `false`).
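A quick sanity check of the documented `vector_index_n_lists` fallback of `8 * sqrt(collection_size)`. Whether the service truncates or rounds the result is an assumption; this sketch truncates.

```python
import math

def default_n_lists(collection_size: int) -> int:
    # Documented fallback: 8 * sqrt(collection_size), truncated to int here.
    return int(8 * math.sqrt(collection_size))

print(default_n_lists(10_000))  # 8 * 100
```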
Contributor

I am not sure these params will reach February release. I will check and come back to you.

Comment on lines +203 to +206
- `smart_graph_attribute`: SmartGraph attribute for graph sharding.
- `shard_count`: Number of shards for the collections.
- `is_disjoint`: Whether the graphs must be disjoint.
- `satellite_collections`: An array of collection names to create as Satellite Collections.
Contributor

I am not sure these params will reach February release. I will check and come back to you.

Comment on lines +150 to +163
## Performance Considerations

### Size Guidelines

- **Small Documents** (< 1MB): All features enabled with minimal impact.
- **Medium Documents** (1-10MB): Consider disabling `store_image_data` for large images.
- **Large Documents** (> 10MB): Use `enable_semantic_units=true, process_images=false, store_image_data=false` for basic URL extraction.

### LLM Compatibility

The semantic units processing works with all LLM providers:
- **OpenAI**: GPT-4o, GPT-4o-mini (all models supported).
- **OpenRouter**: Gemini Flash, Claude Sonnet (all models supported).
- **Triton**: Mistral-Nemo-Instruct (all models supported).
Contributor

Where did this come from? I don't think most of this information is correct. Semantic Units require multimodal LLMs.


Comment on lines +132 to +147
### Semantic Units Collection

- **Purpose**: Stores semantic units extracted from documents, including image
references and web URLs. This collection is only created when `enable_semantic_units`
is set to `true`.
- **Key Fields**:
- `_key`: Unique identifier for the semantic unit.
- `type`: Type of semantic unit (always "image" for image references).
- `image_url`: URL or reference to the image/web resource.
- `is_storage_url`: Boolean indicating if the URL is a storage URL (base64/S3) or web URL.
- `import_number`: Import batch number for tracking.
- `source_chunk_id`: Reference to the chunk where this semantic unit was found.

{{< info >}}
Learn more about semantic units in the [Semantic Units guide](semantic-units.md).
{{< /info >}}
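For illustration, a semantic-unit document with these fields might look like the following. All values are made up; only the field names come from the list in the excerpt above.

```python
import json

# Hypothetical document following the documented field list.
semantic_unit = {
    "_key": "su_000123",            # unique identifier for the semantic unit
    "type": "image",                # always "image" for image references
    "image_url": "https://example.com/figures/architecture.png",
    "is_storage_url": False,        # web URL rather than base64/S3 storage
    "import_number": 1,
    "source_chunk_id": "chunks/42",
}
print(json.dumps(semantic_unit, indent=2))
```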
Contributor

If this is intended for immediate release, it makes sense. If we want to use it for the February release, then we should also add that Semantic Units can also store other sources, such as DB entities via the VirtualGraph.

Contributor Author

This PR is intended for immediate release. VirtualGraph will be treated in a separate task/PR.
