Refresh 'Import' documentation #114
base: main
Conversation
Walkthrough
Rewrote import documentation to center on file-based imports (local files, URLs, AWS S3, Azure, MongoDB) with a unified "File Import" flow, updated images and form terminology, clarified S3/Azure and multi-file (wildcard) guidance, renamed schema-evolution controls to "Allow schema evolution", and added File Format Limitations and sample-data guidance.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/cluster/import.md (1)
111-123: Fix double space in Schema Evolution section.
Line 115 contains a formatting inconsistency: a double space before "evolution" in "Allow  schema evolution".
✏️ Proposed fix
-import process. It can be toggled via the 'Allow  schema evolution' checkbox
+import process. It can be toggled via the 'Allow schema evolution' checkbox
🤖 Fix all issues with AI agents
In @docs/cluster/import.md:
- Around line 11-15: Update the typo in the import formats list by replacing the
incorrect word "Paquet" with "Parquet" in the bullet list (the line that
currently reads "Paquet"); ensure the list remains: CSV, JSON (JSON-Lines, JSON
Arrays and JSON Documents), Parquet, MongoDB collection.
- Around line 52-58: Update the sentence in the S3 import documentation to
correct the typo "file form bucket" to "file from bucket" (in the paragraph
describing CrateDB Cloud imports in docs/cluster/import.md) so the sentence
reads "To import a file from a bucket, provide the name of your bucket, and path
to the file."; ensure only the typo is changed and punctuation remains
consistent.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (9)
- docs/_assets/img/cluster-import-file-form.png is excluded by !**/*.png
- docs/_assets/img/cluster-import-globbing.png is excluded by !**/*.png
- docs/_assets/img/cluster-import-tab-azure.png is excluded by !**/*.png
- docs/_assets/img/cluster-import-tab-choice.png is excluded by !**/*.png
- docs/_assets/img/cluster-import-tab-file.png is excluded by !**/*.png
- docs/_assets/img/cluster-import-tab-s3.png is excluded by !**/*.png
- docs/_assets/img/cluster-import-tab-url.png is excluded by !**/*.png
- docs/_assets/img/cluster-import-tab.png is excluded by !**/*.png
- docs/_assets/img/cluster-import.png is excluded by !**/*.png
📒 Files selected for processing (1)
docs/cluster/import.md
🧰 Additional context used
🪛 LanguageTool
docs/cluster/import.md
[grammar] ~13-~13: Ensure spelling is correct
Context: ...N-Lines, JSON Arrays and JSON Documents) - Paquet - MongoDB collection :::{note} If you don...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~53-~53: Ensure spelling is correct
Context: ...S3-compatible storage. To import a file form bucket, provide the name of your bucket...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🔇 Additional comments (4)
docs/cluster/import.md (4)
30-47: File Import section looks good. Clear and concise explanation of the import process with appropriate image reference. The schema evolution toggle description is accurate and helpful.
86-100: Azure subsection is well-structured. Clear documentation of Azure Blob Storage imports with appropriate guidance on secrets and file path configuration. Wildcard notation and file size limits are clearly documented.
102-105: Integration section reference is appropriate. Brief section appropriately directs users to more comprehensive integration documentation. Cross-reference syntax is correct.
126-165: File Format Limitations section is well-documented. Comprehensive coverage of format-specific limitations with clear code examples for JSON variants. Documentation effectively demonstrates expected input formats.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @docs/cluster/import.md:
- Line 115: Fix the double space in the checkbox label by replacing "Allow  schema evolution" (double space) with "Allow schema evolution", and make capitalization consistent across the doc (e.g., use "Schema Evolution" or "schema evolution" everywhere); update the string used in the text to the chosen capitalization so the checkbox reference matches other occurrences.
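To keep the capitalization consistent across the whole doc, a quick case-insensitive search along these lines (ripgrep; the pattern is only a sketch) would list every occurrence that needs aligning:

rg -n -i 'allow\s+schema\s+evolution' docs/cluster/import.md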
🧹 Nitpick comments (2)
docs/cluster/import.md (2)
11-15: Consolidate duplicate data format listings. The supported data formats are listed twice: once in the introduction (lines 12-14) and again in the File Import section (lines 41-43). Consider removing the duplicate in the File Import section or consolidating into a single reference, as this may confuse users about whether there are different format constraints for different import methods.
Also applies to: 39-45
126-165: File Format Limitations section is well-documented for JSON and CSV. The examples are clear and helpful. However, Parquet format is listed as supported but has no details or examples, unlike CSV and JSON. Consider adding Parquet documentation for completeness.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/cluster/import.md
🔇 Additional comments (2)
docs/cluster/import.md (2)
30-47: File Import section is clear and well-structured. The simplified flow with file format, source, and table selection is straightforward. The schema evolution explanation is helpful context at this level.
100-100: Add missing period. Line 100 ends without a period.
Fix missing punctuation
-File size limitation for imports is 10 GiB per file.
+File size limitation for imports is 10 GiB per file.

Likely an incorrect or invalid review comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/cluster/import.md (1)
126-165: File Format Limitations section is incomplete; missing Parquet and MongoDB documentation. The section documents CSV and JSON formats well with helpful examples, but is missing documentation for two formats that are prominently promoted earlier:
- Parquet is mentioned in the introduction (Line 14) and File Import section (Line 43) but has no entry in File Format Limitations.
- MongoDB collection is mentioned in the introduction (Line 15) but has no entry in File Format Limitations.
Additionally, the JSON code examples use :::{code} console language tags (Lines 138, 147, 162), which may be semantically incorrect; these should likely use :::{code} json for proper syntax highlighting.
Please add documentation for Parquet and MongoDB collection formats, explaining:
- Parquet: any schema/type handling specifics, nested structure behavior
- MongoDB collection: connection requirements, field mapping behavior
Also review the code block language tags for semantic accuracy.
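For illustration, the corrected directive would look roughly like this (the JSON body is the one-line sample quoted later in this review; the surrounding prose in the docs is unchanged):

:::{code} json
{ "id":1, "text": "example" }
:::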
🤖 Fix all issues with AI agents
In @docs/cluster/import.md:
- Around line 4-9: Add three new subsections under the "File Import" area
mirroring the scope of the existing "S3" and "Azure" sections: "Local file",
"URL", and "MongoDB". For "Local file" document the upload process (steps to
upload via UI/CLI/API), accepted file formats, size limits, and any
preprocessing or format requirements; for "URL" document supported URL schemes
(http/https), authentication options (basic, bearer, signed URLs), timeout/retry
behavior, and how to reference the URL in import commands; for "MongoDB"
document connection string format, required drivers/versions, auth mechanisms,
how to select a database/collection, field mapping/transform examples, and any
batching/throughput limits. Follow the same tone/structure and examples used in
the "S3 Import" and "Azure Import" sections so the new subsections are
consistent and include sample commands, configuration keys, and known
limitations.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/cluster/import.md
🔇 Additional comments (6)
docs/cluster/import.md (6)
4-21: Verify import history UI reference and image asset. Lines 23-24 reference an "Import history" tab, but this feature description appears after the introduction and before explaining how to perform imports. For new users, the logical flow would benefit from explaining the basic import process first before mentioning historical references. Additionally, verify that the referenced image at Line 28 (cluster-import.png) exists and correctly depicts the current UI.
30-48: Verify image asset and source documentation completeness. The File Import section provides a clear overview of the unified workflow. However, verify that:
- The image cluster-import-file-form.png (Line 47) exists and correctly depicts the current file import form.
- All five import sources mentioned in the introduction (local file, URL, AWS S3, Azure, MongoDB) are documented in dedicated sections. Currently, only S3 (Lines 49-85) and Azure (Lines 86-101) have subsections; local file, URL, and MongoDB guidance is missing.
49-85: S3 guidance is comprehensive. The AWS S3 section provides clear instructions including bucket/path requirements, authentication, wildcard support for multi-file imports, and relevant IAM policy examples. The 10 GiB file size limit and egress cost warning are appropriately documented.
Please verify that the IAM policy example (Lines 68-83) reflects current AWS S3 best practices and that no additional S3-specific permissions (e.g., s3:ListBucket for prefix matching) are required for wildcard imports to function correctly.
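As a point of reference only (this is not the policy from the docs, and the bucket name below is a placeholder), a read-only policy covering both object access and prefix listing typically has this shape:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::example-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::example-bucket"
    }
  ]
}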
86-101: Azure guidance is clear and consistent with S3 structure. The Azure section appropriately documents secret-based authentication, path format, wildcard support, and file size limits. The mention of admin-level secret management is important operational guidance.
102-106: Clarify the Integration section purpose. The Integration section defers entirely to another documentation page via cross-reference. This is acceptable if comprehensive data integration guidance exists elsewhere, but the section feels incomplete for users reading the import documentation. Consider adding 1-2 sentences explaining what integrations are (e.g., "Integrations allow connecting external data sources for continuous sync") before the cross-reference to provide better context.
Also, verify that the reference {ref}`cluster-integrations` is correct and that the target page exists and is appropriately maintained.
108-124: Schema evolution section is well-documented with good examples. The explanation of schema evolution behavior is clear, limitations are explicit, and the type-mismatch example effectively illustrates edge cases. The toggle naming is consistent with the File Import section (Line 36).
Confirm that the described schema evolution behavior (automatic column addition only, type mismatch failures) matches the current product implementation. Also verify whether there are additional limitations (e.g., constraints on column types, handling of nested JSON structures) that should be documented.
docs/cluster/import.md
Outdated
You can import data into your CrateDB directly from sources like:
- local file
- URL
- AWS S3 bucket
- Azure storage
- MongoDB
Missing documentation for three import sources: local file, URL, and MongoDB.
The introduction lists five import sources (local file, URL, AWS S3, Azure, MongoDB), and the File Import section implies all are equally accessible. However, only S3 (Lines 49-85) and Azure (Lines 86-101) have dedicated documentation sections. Local file, URL, and MongoDB sources lack any guidance.
For users trying to use these sources, there is no information about:
- Local file: upload process, file size limits, format requirements
- URL: authentication (if needed), supported URL schemes, timeout behavior
- MongoDB: connection string format, collection selection, authentication, field mapping
Please add subsections (similar in scope to the S3 and Azure sections) for each missing source, documenting their specific requirements, limitations, and any relevant configuration details.
Also applies to: 30-48
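Purely as a hypothetical sketch of the shape such a subsection could take (the anchor name, heading, and wording below are invented, and the actual behavior and limits need to come from the product team), mirroring the existing S3/Azure structure:

(cluster-import-url)=
### URL

To import a file that is reachable over HTTP(S), provide the full URL of the
file in the import form. <!-- placeholder: confirm supported schemes,
authentication options, and size limits before publishing -->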
like date and status.
By navigating to "Show details", you can display details of a particular
import job.
You can import data into your CrateDB directly from various sources, including:
General style question: This page has two very different styles of addressing the user/capabilities:
"You can import …" vs. "CrateDB Cloud allows …" further down.
Do we have a generally recommended style or guidelines?
cc @bmunkholm
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
docs/cluster/import.md (2)
39-45: Remove duplicate format listing. The supported data formats are already documented in the introduction (lines 11-15). This duplication should be removed to avoid maintenance issues if formats change in the future.
Proposed refactor to remove duplication
-The following data formats are supported:
-
-- CSV (all variants)
-- JSON (JSON-Lines, JSON Arrays and JSON Documents)
-- Parquet
-
-Gzip compressed files are also supported.

Instead, reference the intro section or consolidate format details into a single location. If compression support is important to highlight here, consider: "All formats listed above support Gzip compression."
132-165: Fix JSON code block directives and document Parquet format limitations. Two issues:
Code block directive: JSON examples use :::{code} console (lines 138, 147, 162), which may not provide proper syntax highlighting. Use json instead for better readability (:::{code} json).
Incomplete format coverage: The intro mentions both Parquet (line 14) and MongoDB collection (line 15) as supported formats, but the File Format Limitations section omits both. Add subsections documenting Parquet behavior and MongoDB-specific limitations or behaviors.
Proposed fix for JSON code block directive
-:::{code} console
+:::{code} json
 { "id":1, "text": "example" }
-:::
+:::

Repeat for lines 147 and 162 as well.
🤖 Fix all issues with AI agents
In @docs/cluster/import.md:
- Around line 102-105: The Integration section titled
"(cluster-import-integration) ## Integration" is too minimal and causes
confusion about how integrations differ from File Import and whether MongoDB is
supported as a one-time source; expand this section to briefly define
"integrations" vs "one-time file imports", explicitly state that MongoDB CDC
covers real-time sync while noting whether one-time MongoDB imports are
supported (and link to the MongoDB CDC page), and either move this expanded
section to follow the File Import subsections or add a clear
cross-reference/note under "File Import" clarifying that "MongoDB collection" in
formats refers to format support (not source import) unless one-time MongoDB
import is implemented—if one-time MongoDB import exists, add a short how-to
summary or link to its docs.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/cluster/import.md
🔇 Additional comments (5)
docs/cluster/import.md (5)
4-21: LGTM! The introduction effectively summarizes available sources and formats, and the sample data note is a nice UX touch for users.
49-84: LGTM! The S3 section is comprehensive, with clear guidance on permissions and multi-file imports via wildcards. The JSON policy example is helpful.
86-100: LGTM! The Azure section covers the key requirements and includes multi-file wildcard support. The 10 GiB per-file limit is consistently stated.
115-116: Verify schema evolution checkbox label matches the UI. The documentation states the feature "can be toggled via the 'Allow schema evolution' checkbox" (lines 115-116). Confirm this exact label matches the current UI control name, as checkbox labels are frequently targets of UX refinement and may differ from documentation.
28-28: All image asset paths are correct. Both cluster-import.png (line 28) and cluster-import-file-form.png (line 47) exist in docs/_assets/img/ and are properly referenced in the documentation.
(cluster-import-integration)=
## Integration

As with other imports, the supported file types are CSV, JSON, and
Parquet.

(cluster-import-file)=
## File

Uploading directly from your computer offers more control over your
data. From the security point of view, you don't have to share the data
on the internet just to be able to import it to your cluster. You also
have more control over who has access to your data. Your files are
temporarily uploaded to a secure location managed by Crate (an S3 bucket
in AWS) which is not publicly accessible. The files are automatically
deleted after 3 days. You may re-import the same file into multiple
tables without having to re-upload it within those 3 days. Up to 5 files
may be uploaded at the same time, with the oldest ones being
automatically deleted if you upload more.



As with other import, the supported file formats are:

- CSV (all variants)
- JSON (JSON-Lines, JSON Arrays and JSON Documents)
- Parquet
{ref}`More info about data integration. <cluster-integrations>`
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and get its size
wc -l docs/cluster/import.md
Repository: crate/cloud-docs
Length of output: 86
🏁 Script executed:
# Read the entire file to understand structure and content
cat -n docs/cluster/import.md
Repository: crate/cloud-docs
Length of output: 5804
🏁 Script executed:
# Search for MongoDB references in the file
rg -n "MongoDB|mongodb" docs/cluster/import.mdRepository: crate/cloud-docs
Length of output: 104
🏁 Script executed:
# Search for S3 and Azure sections to compare with MongoDB
rg -n "S3|Azure|azure" docs/cluster/import.mdRepository: crate/cloud-docs
Length of output: 769
🏁 Script executed:
# Search for MongoDB in related documentation
fd -e md -e rst --search-path docs/cluster | head -20
Repository: crate/cloud-docs
Length of output: 277
🏁 Script executed:
# Search for cluster-integrations reference target
rg -n "cluster-integrations" docs/cluster/import.md docs/cluster/integrations/Repository: crate/cloud-docs
Length of output: 211
🏁 Script executed:
# Check the integrations index file
cat -n docs/cluster/integrations/index.md
Repository: crate/cloud-docs
Length of output: 2214
🏁 Script executed:
# Check the mongo-cdc file to see if it documents MongoDB import
head -50 docs/cluster/integrations/mongo-cdc.md
Repository: crate/cloud-docs
Length of output: 1864
Integration section is minimalist and should provide more context.
The Integration section (lines 102-105) contains only a single cross-reference. While the reference points to the appropriate integrations documentation, the section itself reads as a placeholder and lacks any explanatory context about what integrations are or how they differ from File Import.
Additionally, there is a terminology confusion: the intro mentions "MongoDB database" and line 15 lists "MongoDB collection" as a supported format, but no guidance exists in the File Import section for importing from MongoDB as a data source (unlike S3 and Azure, which have dedicated subsections). The "MongoDB collection" format reference relates to data format support in other imports, not MongoDB-as-source capability. MongoDB import/sync guidance exists only in the separate Integrations section (MongoDB CDC), which describes continuous real-time synchronization rather than one-time imports.
Consider one of the following:
- Expanding the Integration section with a brief explanation of what integrations are and how they differ from one-time file imports (a rough sketch follows below), or
- Relocating this section to appear after the File Import subsections with clearer separation of concerns
- Clarifying whether one-time MongoDB imports are supported in File Import (beyond CDC) and documenting them accordingly
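As one possible shape for that first option (the wording is only a suggestion; the explanatory sentence is taken from the earlier review note, and whether one-time MongoDB imports exist still needs confirming):

(cluster-import-integration)=
## Integration

Integrations connect external data sources for continuous synchronization,
in contrast to the one-time file imports described above. For example,
MongoDB CDC keeps a CrateDB table in sync with a MongoDB collection in
real time.

{ref}`More info about data integration. <cluster-integrations>`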
What's Inside
The import flow has been revised, so the documentation page is no longer accurate as it stands.
The new version of the page has new screenshots, and the text matches what the user sees on screen as well as the available features and settings.
Preview
See https://crate-cloud--114.org.readthedocs.build/en/114/
Highlights
Checklist