Skip to content

feat(knowledge): allow Excel file indexing with 2MB size limit#805

Open
parabala wants to merge 3 commits intowecode-ai:mainfrom
parabala:feature/allow-index-xlxs
Open

feat(knowledge): allow Excel file indexing with 2MB size limit#805
parabala wants to merge 3 commits intowecode-ai:mainfrom
parabala:feature/allow-index-xlxs

Conversation

@parabala
Copy link
Collaborator

@parabala parabala commented Mar 20, 2026

  • Remove .xls and .xlsx from RAG_INDEXING_DISABLED_EXTENSIONS
  • Add EXCEL_FILE_SIZE_LIMIT (2MB) and EXCEL_EXTENSIONS constants
  • Add file_size parameter to get_rag_indexing_skip_reason function
  • Return error code format for i18n support: EXCEL_FILE_SIZE_EXCEEDED|ext|limit|size
  • Update orchestrator to pass file_size to indexing functions
  • Update frontend DocumentItem to show 'Indexing' status when reindexing
  • Show 'Index Unavailable' when not reindexing and is_active=false
  • Add i18n translations for Excel file size exceeded error (zh-CN/en)
  • Update tests to verify Excel files within 2MB limit are allowed

Summary by CodeRabbit

  • New Features

    • Excel files are now indexable when ≤2MB; files >2MB are skipped with a size-specific reason.
    • Added localized messages (EN/ZH) that include extension, limit, and actual size for size-exceeded errors.
  • Bug Fixes / UX

    • Reindexing failure toasts and document tooltips now surface the Excel size-limit reason and show a clear, formatted message.
  • Tests

    • Updated tests to reflect size-dependent Excel indexing behavior.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 20, 2026

📝 Walkthrough

Walkthrough

Excel files are no longer unconditionally excluded by extension; indexing skips Excel only when file_size is known and exceeds a 2MB limit. Backend APIs, orchestrator flows, frontend UI, tests, and i18n were updated to propagate and surface this size-based skip reason.

Changes

Cohort / File(s) Summary
Backend Indexing Logic
backend/app/services/knowledge/indexing.py
Replace extension-only exclusion with size check. Add EXCEL_FILE_SIZE_LIMIT (2MB) and EXCEL_EXTENSIONS. Update get_rag_indexing_skip_reason(source_type, file_extension, file_size) to return `EXCEL_FILE_SIZE_EXCEEDED
Backend Orchestrator Integration
backend/app/services/knowledge/orchestrator.py
Pass file_size into get_rag_indexing_skip_reason(...) for create/index and reindex flows (uses data.file_size or document.file_size).
Backend Tests
backend/tests/services/knowledge/test_orchestrator.py
Adjust tests to treat Excel skipping as size-dependent (>2MB). Fixtures set file_size=3MB where expecting skip/error; assertions updated to expect EXCEL_FILE_SIZE_EXCEEDED token.
Frontend Document Components
frontend/src/features/knowledge/document/components/DocumentItem.tsx, frontend/src/features/knowledge/document/components/DocumentList.tsx
Detect oversized Excel files (isExcelExceedingSizeLimit) and show Excel-specific tooltip. DocumentList parses `EXCEL_FILE_SIZE_EXCEEDED
Frontend Internationalization
frontend/src/i18n/locales/en/knowledge.json, frontend/src/i18n/locales/zh-CN/knowledge.json
Add document.excelFileSizeExceeded translation keys with {{extension}}, {{limit}}, and {{size}} placeholders.

Sequence Diagram(s)

sequenceDiagram
  participant UI as DocumentList / DocumentItem (Frontend)
  participant Orch as Orchestrator (Backend)
  participant IndexSvc as Indexing Service
  participant DB as KnowledgeDocument (DB)

  UI->>Orch: request reindex / trigger indexing
  Orch->>DB: fetch metadata (file_extension, file_size)
  Orch->>IndexSvc: start indexing (passes source_type, file_extension, file_size)
  IndexSvc->>IndexSvc: get_rag_indexing_skip_reason(source_type, file_extension, file_size)
  alt Excel and file_size > 2MB
    IndexSvc-->>Orch: EXCEL_FILE_SIZE_EXCEEDED|ext|limit|size (skip)
    Orch-->>UI: return error token (frontend shows localized toast)
  else allowed (size unknown or <= limit)
    IndexSvc->>IndexSvc: schedule/perform indexing
    IndexSvc-->>Orch: indexing scheduled/started
    Orch-->>UI: success
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰🥕 I hopped through sheets both wide and deep,
Two megs the fence I now must keep.
If XLSX grows beyond the gate,
A kindly hop will mark its fate.
Shrink or split — then I will leap!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main feature: enabling Excel file indexing with a 2MB size limit, which aligns with the core change across backend and frontend.
Docstring Coverage ✅ Passed Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can disable sequence diagrams in the walkthrough.

Disable the reviews.sequence_diagrams setting to disable sequence diagrams in the walkthrough.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/knowledge/indexing.py (1)

502-529: ⚠️ Potential issue | 🟠 Major

Don't rely solely on KnowledgeDocument.file_size for the Excel gate.

When document_id is present, this path never consults the attachment size. Older rows can still have file_size as NULL/0, so the new check fails open and a >2MB Excel file can still be indexed/reindexed even though attachment_id is available here. Please fall back to attachment metadata before calling get_rag_indexing_skip_reason(), and fail closed for Excel if neither size is trustworthy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/knowledge/indexing.py` around lines 502 - 529, When
resolving file metadata before calling get_rag_indexing_skip_reason, don't trust
KnowledgeDocument.file_size alone: in the document_id branch (KnowledgeDocument)
if document.file_size is None or 0 and attachment_id is present, query
SubtaskContext (filtering ContextType.ATTACHMENT) and override
file_size/file_extension with the attachment's values; otherwise if after both
checks file_size is still missing/zero and source_type indicates Excel treat
size as "too large" (fail closed) so get_rag_indexing_skip_reason will skip
indexing. Ensure you still populate source_type from KnowledgeDocument when
available and only fall back to attachment metadata for
file_size/file_extension.
🧹 Nitpick comments (1)
frontend/src/features/knowledge/document/components/DocumentItem.tsx (1)

126-129: Extract the Excel limit/tooltip logic into a shared helper.

The 2MB policy is duplicated here in both layouts (EXCEL_FILE_SIZE_LIMIT, limit: 2, and the repeated size formatting). A small helper/shared constant would keep the compact and table UIs in sync and reduce drift from the backend policy over time.

Also applies to: 234-239, 442-447

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontend/src/features/knowledge/document/components/DocumentItem.tsx` around
lines 126 - 129, The Excel size-limit logic (EXCEL_FILE_SIZE_LIMIT, isExcel,
isExcelExceedingSizeLimit and the repeated formatted "2 MB" text) is duplicated
across multiple UI components; extract a single shared helper that exports the
numeric limit constant (e.g., EXCEL_FILE_SIZE_LIMIT = 2 * 1024 * 1024) and
utility functions like isExcel(fileExtension) and isExcelExceedingLimit(file)
plus a formattedLimit getter used by tooltips; replace local uses of
EXCEL_FILE_SIZE_LIMIT, isExcel and isExcelExceedingSizeLimit in DocumentItem and
the other occurrences with imports from that helper so both compact and table
UIs share the same source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/tests/services/knowledge/test_orchestrator.py`:
- Around line 422-426: The test
test_excel_documents_within_size_limit_are_allowed never exercises the size
check because it omits the file_size argument; update the test to call
get_rag_indexing_skip_reason("file", "xlsx", file_size=2 * 1024 * 1024) so the
2MB boundary is evaluated and assert the result is None, and optionally add an
additional assertion for a size just above the limit (e.g., file_size=2 * 1024 *
1024 + 1) to verify the skip reason is returned.

In `@frontend/src/features/knowledge/document/components/DocumentList.tsx`:
- Around line 382-400: In the catch block inside DocumentList.tsx that builds
the toast error (the code setting errorMessage and using
t('document.document.reindexFailed')), remove the else branch that assigns the
raw backend message (errorMessage = message) so we do not surface unlocalized
server text; keep the existing EXCEL_FILE_SIZE_EXCEEDED parsing and i18n
mapping, and for any other error leave errorMessage as the generic
t('document.document.reindexFailed') fallback unless you add explicit
error-code-to-i18n mappings later.

---

Outside diff comments:
In `@backend/app/services/knowledge/indexing.py`:
- Around line 502-529: When resolving file metadata before calling
get_rag_indexing_skip_reason, don't trust KnowledgeDocument.file_size alone: in
the document_id branch (KnowledgeDocument) if document.file_size is None or 0
and attachment_id is present, query SubtaskContext (filtering
ContextType.ATTACHMENT) and override file_size/file_extension with the
attachment's values; otherwise if after both checks file_size is still
missing/zero and source_type indicates Excel treat size as "too large" (fail
closed) so get_rag_indexing_skip_reason will skip indexing. Ensure you still
populate source_type from KnowledgeDocument when available and only fall back to
attachment metadata for file_size/file_extension.

---

Nitpick comments:
In `@frontend/src/features/knowledge/document/components/DocumentItem.tsx`:
- Around line 126-129: The Excel size-limit logic (EXCEL_FILE_SIZE_LIMIT,
isExcel, isExcelExceedingSizeLimit and the repeated formatted "2 MB" text) is
duplicated across multiple UI components; extract a single shared helper that
exports the numeric limit constant (e.g., EXCEL_FILE_SIZE_LIMIT = 2 * 1024 *
1024) and utility functions like isExcel(fileExtension) and
isExcelExceedingLimit(file) plus a formattedLimit getter used by tooltips;
replace local uses of EXCEL_FILE_SIZE_LIMIT, isExcel and
isExcelExceedingSizeLimit in DocumentItem and the other occurrences with imports
from that helper so both compact and table UIs share the same source of truth.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 35414538-0e85-41ef-8b70-84e004b3c793

📥 Commits

Reviewing files that changed from the base of the PR and between 45958dc and d54e2bb.

📒 Files selected for processing (7)
  • backend/app/services/knowledge/indexing.py
  • backend/app/services/knowledge/orchestrator.py
  • backend/tests/services/knowledge/test_orchestrator.py
  • frontend/src/features/knowledge/document/components/DocumentItem.tsx
  • frontend/src/features/knowledge/document/components/DocumentList.tsx
  • frontend/src/i18n/locales/en/knowledge.json
  • frontend/src/i18n/locales/zh-CN/knowledge.json

Comment on lines +422 to +426
def test_excel_documents_within_size_limit_are_allowed(self):
"""Test Excel extensions within 2MB size limit are allowed for RAG indexing."""
reason = get_rag_indexing_skip_reason("file", "xlsx")

assert reason == "Excel documents (.xlsx) are excluded from RAG indexing"
assert reason is None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

This test never hits the "within 2MB" branch.

Omitting file_size makes get_rag_indexing_skip_reason() return None regardless of the workbook size, so the new size boundary is not exercised here. Pass 2 * 1024 * 1024 explicitly, and ideally add an exact-boundary assertion.

🧪 Suggested fix
-        reason = get_rag_indexing_skip_reason("file", "xlsx")
+        reason = get_rag_indexing_skip_reason("file", "xlsx", 2 * 1024 * 1024)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/services/knowledge/test_orchestrator.py` around lines 422 -
426, The test test_excel_documents_within_size_limit_are_allowed never exercises
the size check because it omits the file_size argument; update the test to call
get_rag_indexing_skip_reason("file", "xlsx", file_size=2 * 1024 * 1024) so the
2MB boundary is evaluated and assert the result is None, and optionally add an
additional assertion for a size just above the limit (e.g., file_size=2 * 1024 *
1024 + 1) to verify the skip reason is returned.

Comment on lines +382 to +400
} catch (err) {
// Parse error message and use i18n translation if it's a known error code
let errorMessage = t('document.document.reindexFailed')
if (err instanceof Error) {
const message = err.message
// Check if it's an EXCEL_FILE_SIZE_EXCEEDED error code
// Format: EXCEL_FILE_SIZE_EXCEEDED|extension|limit|size
if (message.startsWith('EXCEL_FILE_SIZE_EXCEEDED|')) {
const parts = message.split('|')
if (parts.length === 4) {
errorMessage = t('document.document.excelFileSizeExceeded', {
extension: parts[1],
limit: parts[2],
size: parts[3],
})
}
} else {
// Use the original error message for other errors
errorMessage = message
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Don't surface raw backend errors in the reindex toast.

Only EXCEL_FILE_SIZE_EXCEEDED is mapped to i18n here. The else branch now shows raw backend messages for every other failure, which regresses localization and leaks server wording straight into user-facing toasts. Keep the generic reindexFailed fallback unless you add explicit error-code mappings for those cases.

💡 Suggested change
-        } else {
-          // Use the original error message for other errors
-          errorMessage = message
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} catch (err) {
// Parse error message and use i18n translation if it's a known error code
let errorMessage = t('document.document.reindexFailed')
if (err instanceof Error) {
const message = err.message
// Check if it's an EXCEL_FILE_SIZE_EXCEEDED error code
// Format: EXCEL_FILE_SIZE_EXCEEDED|extension|limit|size
if (message.startsWith('EXCEL_FILE_SIZE_EXCEEDED|')) {
const parts = message.split('|')
if (parts.length === 4) {
errorMessage = t('document.document.excelFileSizeExceeded', {
extension: parts[1],
limit: parts[2],
size: parts[3],
})
}
} else {
// Use the original error message for other errors
errorMessage = message
} catch (err) {
// Parse error message and use i18n translation if it's a known error code
let errorMessage = t('document.document.reindexFailed')
if (err instanceof Error) {
const message = err.message
// Check if it's an EXCEL_FILE_SIZE_EXCEEDED error code
// Format: EXCEL_FILE_SIZE_EXCEEDED|extension|limit|size
if (message.startsWith('EXCEL_FILE_SIZE_EXCEEDED|')) {
const parts = message.split('|')
if (parts.length === 4) {
errorMessage = t('document.document.excelFileSizeExceeded', {
extension: parts[1],
limit: parts[2],
size: parts[3],
})
}
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontend/src/features/knowledge/document/components/DocumentList.tsx` around
lines 382 - 400, In the catch block inside DocumentList.tsx that builds the
toast error (the code setting errorMessage and using
t('document.document.reindexFailed')), remove the else branch that assigns the
raw backend message (errorMessage = message) so we do not surface unlocalized
server text; keep the existing EXCEL_FILE_SIZE_EXCEEDED parsing and i18n
mapping, and for any other error leave errorMessage as the generic
t('document.document.reindexFailed') fallback unless you add explicit
error-code-to-i18n mappings later.

const message = err.message
// Check if it's an EXCEL_FILE_SIZE_EXCEEDED error code
// Format: EXCEL_FILE_SIZE_EXCEEDED|extension|limit|size
if (message.startsWith('EXCEL_FILE_SIZE_EXCEEDED|')) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我刚刚改权限那边,加了个 error_code 好像可以复用

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已修改

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/knowledge/indexing.py (1)

91-114: ⚠️ Potential issue | 🟠 Major

Make file_size mandatory to avoid Excel size-limit bypass.

Line 91 makes file_size optional; any caller that omits it will bypass the Excel limit check at Line 113. That weakens the 2MB enforcement contract.

🔧 Proposed fix
 def get_rag_indexing_skip_reason(
     source_type: Optional[str],
     file_extension: Optional[str],
-    file_size: Optional[int] = None,
+    file_size: int,
 ) -> Optional[str]:
@@
-    if normalized_extension in EXCEL_EXTENSIONS:
-        if file_size is not None and file_size > EXCEL_FILE_SIZE_LIMIT:
+    if normalized_extension in EXCEL_EXTENSIONS and file_size > EXCEL_FILE_SIZE_LIMIT:
             size_mb = file_size / (1024 * 1024)
             limit_mb = EXCEL_FILE_SIZE_LIMIT / (1024 * 1024)
             return f"EXCEL_FILE_SIZE_EXCEEDED|{normalized_extension}|{limit_mb:.0f}|{size_mb:.2f}"
-        file_size = None
+        file_size = 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/knowledge/indexing.py` around lines 91 - 114, The
file_size parameter is optional which lets callers omit it and bypass the Excel
2MB check; make file_size mandatory (remove Optional and the default None) in
the function signature that returns the skip reason, update all callers to
provide the document size, and ensure runtime validation around
EXCEL_EXTENSIONS/EXCEL_FILE_SIZE_LIMIT (e.g., raise or return an error/skip
reason if size is not provided for Excel files) so the Excel size-limit cannot
be bypassed; reference the file_size parameter, EXCEL_EXTENSIONS,
EXCEL_FILE_SIZE_LIMIT, and the skip-reason function when making changes.
🧹 Nitpick comments (1)
backend/app/services/knowledge/indexing.py (1)

425-440: run_document_indexing is doing too much in one function.

This function mixes lookup, policy checks, indexing orchestration, status persistence, and summary triggering; splitting into smaller helpers will improve maintainability and testability.

As per coding guidelines, "Function length: Max 50 lines per function (preferred)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/knowledge/indexing.py` around lines 425 - 440,
run_document_indexing is too large and mixes lookup, policy checks,
orchestration, persistence and summary triggering; break it into small helpers:
extract the database/KB lookup logic into a function like
fetch_kb_and_attachment(knowledge_base_id, attachment_id, db), policy checks
into validate_indexing_permissions(user_id, knowledge_base, attachment),
indexing orchestration into orchestrate_indexing(...), status persistence into
persist_index_status(kb_index_info, status, db), and summary logic into
trigger_summary_if_needed(...); then simplify run_document_indexing to call
these helpers in sequence, pass only needed parameters, and move any complex
branching into the new functions to keep run_document_indexing concise and under
~50 lines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@backend/app/services/knowledge/indexing.py`:
- Around line 91-114: The file_size parameter is optional which lets callers
omit it and bypass the Excel 2MB check; make file_size mandatory (remove
Optional and the default None) in the function signature that returns the skip
reason, update all callers to provide the document size, and ensure runtime
validation around EXCEL_EXTENSIONS/EXCEL_FILE_SIZE_LIMIT (e.g., raise or return
an error/skip reason if size is not provided for Excel files) so the Excel
size-limit cannot be bypassed; reference the file_size parameter,
EXCEL_EXTENSIONS, EXCEL_FILE_SIZE_LIMIT, and the skip-reason function when
making changes.

---

Nitpick comments:
In `@backend/app/services/knowledge/indexing.py`:
- Around line 425-440: run_document_indexing is too large and mixes lookup,
policy checks, orchestration, persistence and summary triggering; break it into
small helpers: extract the database/KB lookup logic into a function like
fetch_kb_and_attachment(knowledge_base_id, attachment_id, db), policy checks
into validate_indexing_permissions(user_id, knowledge_base, attachment),
indexing orchestration into orchestrate_indexing(...), status persistence into
persist_index_status(kb_index_info, status, db), and summary logic into
trigger_summary_if_needed(...); then simplify run_document_indexing to call
these helpers in sequence, pass only needed parameters, and move any complex
branching into the new functions to keep run_document_indexing concise and under
~50 lines.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3824eb33-ed79-49c3-8b03-cd8e3a24cd12

📥 Commits

Reviewing files that changed from the base of the PR and between d54e2bb and f5c6078.

📒 Files selected for processing (1)
  • backend/app/services/knowledge/indexing.py

- Remove .xls and .xlsx from RAG_INDEXING_DISABLED_EXTENSIONS
- Add EXCEL_FILE_SIZE_LIMIT (2MB) and EXCEL_EXTENSIONS constants
- Add file_size parameter to get_rag_indexing_skip_reason function
- Return error code format for i18n support: EXCEL_FILE_SIZE_EXCEEDED|ext|limit|size
- Update orchestrator to pass file_size to indexing functions
- Update frontend DocumentItem to show 'Indexing' status when reindexing
- Show 'Index Unavailable' when not reindexing and is_active=false
- Add i18n translations for Excel file size exceeded error (zh-CN/en)
- Update tests to verify Excel files within 2MB limit are allowed
…nstant

Remove the unused RAG_INDEXING_DISABLED_EXTENSIONS frozenset constant
and its associated extension check logic from the indexing service.
The constant was empty and served no functional purpose.
…ing in DocumentList

Replace string prefix matching with ApiError.errorCode field check for
EXCEL_FILE_SIZE_EXCEEDED error handling. This leverages the existing
ApiError class from client.ts for more robust and type-safe error handling.
@parabala parabala force-pushed the feature/allow-index-xlxs branch from f5c6078 to d9e8064 Compare March 22, 2026 07:33
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
frontend/src/features/knowledge/document/components/DocumentItem.tsx (1)

268-276: Remove redundant ternary branch in unavailable tooltip

At Line 274 and Line 500, both branches return the same indexStatus.unavailableHint. This can be simplified.

Proposed cleanup
-                      {isExcelExceedingSizeLimit
-                        ? t('knowledge:document.document.excelFileSizeExceeded', {
-                            extension: document.file_extension,
-                            limit: 2,
-                            size: (document.file_size / (1024 * 1024)).toFixed(2),
-                          })
-                        : !ragConfigured
-                          ? t('knowledge:document.document.indexStatus.unavailableHint')
-                          : t('knowledge:document.document.indexStatus.unavailableHint')}
+                      {isExcelExceedingSizeLimit
+                        ? t('knowledge:document.document.excelFileSizeExceeded', {
+                            extension: document.file_extension,
+                            limit: 2,
+                            size: (document.file_size / (1024 * 1024)).toFixed(2),
+                          })
+                        : t('knowledge:document.document.indexStatus.unavailableHint')}

Also applies to: 494-503

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontend/src/features/knowledge/document/components/DocumentItem.tsx` around
lines 268 - 276, The JSX in DocumentItem.tsx uses a redundant nested ternary
where both the true and false branches return the same
t('knowledge:document.document.indexStatus.unavailableHint') value; update the
expression that checks isExcelExceedingSizeLimit and ragConfigured (used in the
JSX around the unavailable tooltip) to remove the duplicated branch so it only
renders the excel-size message when isExcelExceedingSizeLimit is true, otherwise
render t('knowledge:document.document.indexStatus.unavailableHint'); keep the
existing variables isExcelExceedingSizeLimit and ragConfigured and the
translation key unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@frontend/src/features/knowledge/document/components/DocumentItem.tsx`:
- Around line 149-152: The Excel detection misses dot-prefixed extensions;
update the normalization used by isExcel so document.file_extension is sanitized
(trim, remove leading '.' if present, and toLowerCase()) before checking against
['xls','xlsx']. Keep EXCEL_FILE_SIZE_LIMIT and isExcelExceedingSizeLimit logic
but compute isExcel from the normalizedExtension (derived from
document.file_extension) to ensure '.xlsx' and 'xlsx' are both recognized.
- Around line 149-152: The UI computes EXCEL_FILE_SIZE_LIMIT, isExcel and
isExcelExceedingSizeLimit but never uses it to prevent user actions; update the
DocumentItem action rendering logic (both the compact and normal action areas
where reindex is offered) to check isExcelExceedingSizeLimit and either hide or
disable the reindex action/button (and show an explanatory tooltip/message) when
true so users cannot trigger guaranteed-to-fail reindex requests for oversized
Excel files.

---

Nitpick comments:
In `@frontend/src/features/knowledge/document/components/DocumentItem.tsx`:
- Around line 268-276: The JSX in DocumentItem.tsx uses a redundant nested
ternary where both the true and false branches return the same
t('knowledge:document.document.indexStatus.unavailableHint') value; update the
expression that checks isExcelExceedingSizeLimit and ragConfigured (used in the
JSX around the unavailable tooltip) to remove the duplicated branch so it only
renders the excel-size message when isExcelExceedingSizeLimit is true, otherwise
render t('knowledge:document.document.indexStatus.unavailableHint'); keep the
existing variables isExcelExceedingSizeLimit and ragConfigured and the
translation key unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8cc521b2-8f19-4821-94aa-7094517ba7db

📥 Commits

Reviewing files that changed from the base of the PR and between f5c6078 and d9e8064.

📒 Files selected for processing (7)
  • backend/app/services/knowledge/indexing.py
  • backend/app/services/knowledge/orchestrator.py
  • backend/tests/services/knowledge/test_orchestrator.py
  • frontend/src/features/knowledge/document/components/DocumentItem.tsx
  • frontend/src/features/knowledge/document/components/DocumentList.tsx
  • frontend/src/i18n/locales/en/knowledge.json
  • frontend/src/i18n/locales/zh-CN/knowledge.json
✅ Files skipped from review due to trivial changes (1)
  • frontend/src/i18n/locales/en/knowledge.json
🚧 Files skipped from review as they are similar to previous changes (5)
  • frontend/src/i18n/locales/zh-CN/knowledge.json
  • backend/app/services/knowledge/orchestrator.py
  • frontend/src/features/knowledge/document/components/DocumentList.tsx
  • backend/tests/services/knowledge/test_orchestrator.py
  • backend/app/services/knowledge/indexing.py

Comment on lines +149 to +152
// Check if Excel file exceeds size limit (2MB)
const EXCEL_FILE_SIZE_LIMIT = 2 * 1024 * 1024 // 2MB
const isExcel = ['xls', 'xlsx'].includes(document.file_extension?.toLowerCase() || '')
const isExcelExceedingSizeLimit = isExcel && document.file_size > EXCEL_FILE_SIZE_LIMIT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Normalize extension before Excel detection to avoid false negatives

At Line 151, the check only matches 'xls'/'xlsx'. If document.file_extension is dot-prefixed (e.g. .xlsx, consistent with backend normalization), Excel files will not be recognized and Line 152 won’t flag size overflow.

Proposed fix
-  const EXCEL_FILE_SIZE_LIMIT = 2 * 1024 * 1024 // 2MB
-  const isExcel = ['xls', 'xlsx'].includes(document.file_extension?.toLowerCase() || '')
-  const isExcelExceedingSizeLimit = isExcel && document.file_size > EXCEL_FILE_SIZE_LIMIT
+  const EXCEL_FILE_SIZE_LIMIT = 2 * 1024 * 1024 // 2MB
+  const normalizedExtension = (document.file_extension || '').toLowerCase().replace(/^\./, '')
+  const isExcel = normalizedExtension === 'xls' || normalizedExtension === 'xlsx'
+  const isExcelExceedingSizeLimit =
+    isExcel && typeof document.file_size === 'number' && document.file_size > EXCEL_FILE_SIZE_LIMIT
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Check if Excel file exceeds size limit (2MB)
const EXCEL_FILE_SIZE_LIMIT = 2 * 1024 * 1024 // 2MB
const isExcel = ['xls', 'xlsx'].includes(document.file_extension?.toLowerCase() || '')
const isExcelExceedingSizeLimit = isExcel && document.file_size > EXCEL_FILE_SIZE_LIMIT
// Check if Excel file exceeds size limit (2MB)
const EXCEL_FILE_SIZE_LIMIT = 2 * 1024 * 1024 // 2MB
const normalizedExtension = (document.file_extension || '').toLowerCase().replace(/^\./, '')
const isExcel = normalizedExtension === 'xls' || normalizedExtension === 'xlsx'
const isExcelExceedingSizeLimit =
isExcel && typeof document.file_size === 'number' && document.file_size > EXCEL_FILE_SIZE_LIMIT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontend/src/features/knowledge/document/components/DocumentItem.tsx` around
lines 149 - 152, The Excel detection misses dot-prefixed extensions; update the
normalization used by isExcel so document.file_extension is sanitized (trim,
remove leading '.' if present, and toLowerCase()) before checking against
['xls','xlsx']. Keep EXCEL_FILE_SIZE_LIMIT and isExcelExceedingSizeLimit logic
but compute isExcel from the normalizedExtension (derived from
document.file_extension) to ensure '.xlsx' and 'xlsx' are both recognized.

⚠️ Potential issue | 🟠 Major

Use isExcelExceedingSizeLimit to block known-failing reindex actions

You already compute the oversize condition, but reindex actions are still exposed in compact/normal action areas. That allows user-triggered requests that are guaranteed to fail for oversized Excel files.

Proposed fix
+  const canReindex = ragConfigured && !document.is_active && !isTable && !isExcelExceedingSizeLimit

-                  {ragConfigured && !document.is_active && !isTable && onReindex && (
+                  {canReindex && onReindex && (
                     <DropdownMenuItem onClick={handleReindex} disabled={isReindexing}>
                       ...
                     </DropdownMenuItem>
                   )}

-          {ragConfigured && !document.is_active && !isTable && onReindex && (
+          {canReindex && onReindex && (
             <button
               ...
             >
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontend/src/features/knowledge/document/components/DocumentItem.tsx` around
lines 149 - 152, The UI computes EXCEL_FILE_SIZE_LIMIT, isExcel and
isExcelExceedingSizeLimit but never uses it to prevent user actions; update the
DocumentItem action rendering logic (both the compact and normal action areas
where reindex is offered) to check isExcelExceedingSizeLimit and either hide or
disable the reindex action/button (and show an explanatory tooltip/message) when
true so users cannot trigger guaranteed-to-fail reindex requests for oversized
Excel files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants