Evaluation: Upload Dataset Improvements #450
Walkthrough

Removes the async CSV upload function.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as Evaluation API
    participant Lang as Langfuse Adapter/SDK
    participant DB as Database
    Client->>API: POST upload (CSV file)
    API->>API: parse CSV -> normalize headers (case-insensitive), map question/answer cols, build items list
    API->>Lang: upload_dataset_to_langfuse(items, name, duplication_factor)
    loop for each original item
        Lang->>Lang: create duplicated dataset items
        rect rgba(200,230,200,0.18)
            Lang->>Lang: per-original-item flush
        end
    end
    Lang->>Lang: final flush after all items
    API->>DB: commit & refresh evaluation run
    alt commit succeeds
        DB-->>API: ok
    else commit fails
        DB--x API: error
        API->>DB: rollback
    end
    API-->>Client: response with dataset/run details
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 1
🧹 Nitpick comments (1)
backend/app/crud/evaluations/langfuse.py (1)
318-323: Per-item flush strategy looks good.

The per-item flush after processing each original item's duplicates, combined with a final flush, addresses potential batching issues in the Langfuse SDK. This pattern ensures proper separation of Q&A pairs.

Note: If datasets grow significantly beyond the current 1MB limit, consider batching flushes (e.g., every N items) to balance reliability and performance.
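The batching suggestion above can be sketched as follows. This is a hedged sketch, not the project's implementation: `client` stands in for the Langfuse wrapper, and `create_dataset_item`/`flush` mirror the SDK surface assumed by this review.

```python
def upload_with_batched_flush(client, items, dataset_name,
                              duplication_factor, flush_every=10):
    """Flush every `flush_every` original items instead of after each one."""
    total = 0
    for index, item in enumerate(items, start=1):
        for _ in range(duplication_factor):
            client.create_dataset_item(
                dataset_name=dataset_name,
                input={"question": item["question"]},
                expected_output={"answer": item["answer"]},
            )
            total += 1
        if index % flush_every == 0:
            client.flush()  # bound the number of buffered events
    client.flush()  # final flush covers any tail batch
    return total
```

With 25 items and `flush_every=10`, this flushes three times (after items 10 and 20, plus the final flush) instead of 26 times under the per-item strategy.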
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
backend/app/crud/evaluations/core.py (7 hunks)
backend/app/crud/evaluations/langfuse.py (1 hunks)
backend/app/tests/crud/evaluations/test_langfuse.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/crud/evaluations/langfuse.py
backend/app/tests/crud/evaluations/test_langfuse.py
backend/app/crud/evaluations/core.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
backend/app/crud/evaluations/langfuse.py
backend/app/crud/evaluations/core.py
🧠 Learnings (1)
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/core/langfuse/**/*.py : Place Langfuse observability integration under backend/app/core/langfuse/
Applied to files:
backend/app/crud/evaluations/langfuse.py
🧬 Code graph analysis (3)
backend/app/crud/evaluations/langfuse.py (1)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
backend/app/tests/crud/evaluations/test_langfuse.py (1)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
backend/app/crud/evaluations/core.py (2)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
backend/app/tests/crud/collections/collection/test_crud_collection_read_all.py (1)
refresh(32-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (9)
backend/app/tests/crud/evaluations/test_langfuse.py (3)
419-420: Test assertions correctly updated for per-item + final flush pattern.

The flush count assertion (4 = 3 items + 1 final) aligns with the implementation changes in backend/app/crud/evaluations/langfuse.py.
492-493: LGTM: Flush count reflects valid items only.

Correctly expects 3 flushes (2 valid items + 1 final), as invalid rows are skipped before the per-item flush.
545-546: LGTM: Flush count consistent with duplication_factor=1 scenario.

The assertion (4 = 3 items + 1 final) correctly verifies the flush pattern when duplication_factor is 1.
backend/app/crud/evaluations/core.py (6)
38-46: Duplication factor validation is well-implemented.

The bounds check (1-100) with descriptive error messages provides good guardrails for users.

48-56: CSV size limit appropriately enforced.

The 1MB limit with clear error messaging helps prevent resource exhaustion. The size calculation and formatting are correct.
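The two upfront guards described in these comments can be sketched in one helper. The function name `validate_upload` and the exact messages are illustrative, not the code under review:

```python
def validate_upload(csv_content: bytes, duplication_factor: int):
    """Return (ok, error_message) for the two upfront guards."""
    # Guard 1: duplication factor must fall within 1-100
    if not 1 <= duplication_factor <= 100:
        return False, "duplication_factor must be between 1 and 100"
    # Guard 2: CSV payload capped at 1MB
    max_size_bytes = 1_048_576  # 1MB
    if len(csv_content) > max_size_bytes:
        size_mb = len(csv_content) / max_size_bytes
        return False, (
            f"CSV file too large ({size_mb:.2f}MB). Maximum allowed is 1MB"
        )
    return True, None
```

Running both checks before any parsing keeps resource-exhaustion failures cheap and gives the caller a single, descriptive error string.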
139-144: Per-item flush + final flush pattern is consistent.

The implementation matches the pattern in backend/app/crud/evaluations/langfuse.py and addresses batching concerns mentioned in the comments.

206-212: Robust error handling with rollback added.

The try/except wrapper with explicit rollback ensures database consistency when commit fails. This is a best practice for transactional operations.

343-349: LGTM: Database error handling with rollback.

Consistent with the pattern in create_evaluation_run. The error handling properly maintains database integrity.

16-16: All callers have been correctly updated for the sync conversion.

Verification confirms that upload_dataset_to_langfuse and its wrapper upload_dataset_to_langfuse_from_csv are both properly defined as sync functions (using def, not async def). The call site in evaluation.py:240 correctly omits the await keyword, and no remaining await calls to either function were found in the codebase. The conversion has been completed correctly.
Actionable comments posted: 0
♻️ Duplicate comments (1)
backend/app/crud/evaluations/core.py (1)
67-98: Duplicate header detection correctly implemented.

The single-pass normalization with duplicate detection addresses the previous review concern. The implementation correctly:

- Maps normalized names to lists of originals to detect conflicts
- Returns a descriptive error showing which headers conflict
- Uses the first occurrence (originals[0]) for the clean mapping

This prevents silent data loss from case-variant duplicates like "Question" and "question".
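The three bullet points above can be illustrated with a small sketch. The helper name and error wording here are hypothetical; the real implementation lives in core.py:

```python
def normalize_headers(fieldnames):
    """Map lowercased header names to originals, flagging case-variant duplicates."""
    normalized_to_originals = {}
    for field in fieldnames:
        normalized = field.strip().lower()
        normalized_to_originals.setdefault(normalized, []).append(field)
    # Any normalized key with more than one original is a conflict
    duplicates = {
        key: originals
        for key, originals in normalized_to_originals.items()
        if len(originals) > 1
    }
    if duplicates:
        detail = "; ".join(
            f"{key}: {', '.join(originals)}"
            for key, originals in duplicates.items()
        )
        return None, f"Duplicate headers after normalization: {detail}"
    # First (and only) occurrence becomes the clean mapping
    return (
        {key: originals[0] for key, originals in normalized_to_originals.items()},
        None,
    )
```

For example, `["Question", "question"]` is rejected with a message naming both conflicting originals, while `["Question", "Answer"]` yields the clean mapping `{"question": "Question", "answer": "Answer"}`.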
🧹 Nitpick comments (1)
backend/app/crud/evaluations/core.py (1)
48-56: Consider reusing the constant for consistency.

The size limit logic is correct. Minor suggestion: reuse max_size_bytes in the calculation to avoid duplicating the magic number.

```diff
     # Validate CSV file size (max 1MB)
     max_size_bytes = 1_048_576  # 1MB
     if len(csv_content) > max_size_bytes:
-        size_mb = len(csv_content) / 1_048_576
+        size_mb = len(csv_content) / max_size_bytes
         return (
             False,
             None,
             f"CSV file too large ({size_mb:.2f}MB). Maximum allowed is 1MB",
         )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/app/crud/evaluations/core.py(7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/crud/evaluations/core.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
backend/app/crud/evaluations/core.py
🧬 Code graph analysis (1)
backend/app/crud/evaluations/core.py (2)
backend/app/models/collection.py (1)
norm(92-98)backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (6)
backend/app/crud/evaluations/core.py (6)
38-46: LGTM on duplication_factor validation.

The validation correctly enforces the 1-100 range per PR objectives. Minor nit: the docstring on line 31 mentions "default 5" but there's no default value in the signature; consider updating the docstring to say "Number of times to duplicate each item (1-100)".
101-134: LGTM on case-insensitive header matching.

The approach correctly validates using normalized names while accessing row data with original field names. DictReader keys match the original headers exactly, so row.get(golden_question) works correctly.
148-174: Per-item flush trade-off is acceptable given the constraints.

The flush-per-original-item approach correctly prevents batching race conditions as noted in the PR objectives. Given the 1MB file limit and max 100x duplication, the performance overhead is reasonable.

Note that partial uploads are possible if individual create_dataset_item calls fail: the function logs errors and continues, returning total_uploaded, which may be less than expected. This is a reasonable graceful-degradation strategy, though callers should be aware.
236-242: LGTM on database rollback handling.

The try/except pattern with rollback, logging (including stack trace), and re-raise correctly ensures transactional integrity while maintaining visibility into failures.
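The rollback pattern praised here can be sketched as follows. This is an illustrative sketch, assuming `session` exposes SQLAlchemy-style `commit`/`rollback`/`refresh`; the function name is hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def commit_or_rollback(session, run):
    """Commit the evaluation run, rolling back and re-raising on failure."""
    try:
        session.commit()
        session.refresh(run)
    except Exception:
        # Roll back so the session stays usable, log with stack trace, re-raise
        session.rollback()
        logger.exception("Failed to commit evaluation run")
        raise
    return run
```

Re-raising after the rollback lets the API layer decide how to surface the failure, while the rollback keeps the session in a consistent state for subsequent requests.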
373-379: Consistent rollback pattern applied.

Matches the create_evaluation_run pattern; maintains consistency across CRUD operations.
189-189: Verify id attribute presence on Langfuse Dataset objects.

The web search indicates create_dataset() returns a langfuse.api.Dataset object with attributes including name, description, and metadata, but does not definitively confirm whether id is present. The defensive hasattr check appears reasonable given incomplete SDK documentation. If the Langfuse SDK always includes an id attribute on successfully created datasets, direct access would be cleaner; however, this requires manual verification against the actual SDK implementation or official API documentation to confirm the guaranteed contract.
Actionable comments posted: 0
🧹 Nitpick comments (2)
backend/app/crud/evaluations/core.py (2)
140-165: Per-item flushing addresses race conditions but has performance implications.

The per-item flush after each original item's duplicates (line 162) plus the final flush (line 165) aligns with the PR objective of preventing race conditions in Langfuse SDK's batching. However, for CSVs with many rows, this could significantly increase upload time due to frequent network round-trips.

Consider documenting this trade-off or adding a comment explaining why per-item flushing is necessary for correctness.
58-63: Consider more specific error handling for CSV decoding failures.

If the CSV file isn't UTF-8 encoded, the decode("utf-8") call will raise UnicodeDecodeError, which gets caught by the generic exception handler at line 185. The resulting error message may not clearly indicate the encoding issue to users.

```diff
 try:
-    csv_text = csv_content.decode("utf-8")
+    try:
+        csv_text = csv_content.decode("utf-8")
+    except UnicodeDecodeError:
+        return False, None, "CSV file must be UTF-8 encoded"
+
     csv_reader = csv.DictReader(io.StringIO(csv_text))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/app/crud/evaluations/core.py(7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/crud/evaluations/core.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
backend/app/crud/evaluations/core.py
🧬 Code graph analysis (1)
backend/app/crud/evaluations/core.py (1)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (6)
backend/app/crud/evaluations/core.py (6)
16-23: Function signature is well-typed and appropriately synchronous.

The conversion from async to sync aligns with the PR objective of preventing race conditions. Type hints are comprehensive.

38-46: Duplication factor validation looks good.

The range validation (1-100) is correctly implemented with clear error messages.

48-56: CSV file size validation is well-implemented.

The 1MB limit with a clear error message showing the actual file size provides good user feedback.

65-102: Header normalization and duplicate detection are well-implemented.

The approach correctly identifies all original headers that normalize to the same lowercase key and provides a descriptive error message. The required header validation (question/answer) uses case-insensitive matching as intended.

227-233: Database error handling with rollback is correctly implemented.

The try/except pattern properly rolls back on failure, logs the error with stack trace, and re-raises to allow caller handling.

364-370: Consistent error handling pattern for database updates.

The rollback-on-failure pattern matches create_evaluation_run, maintaining consistency across CRUD operations.
backend/app/crud/evaluations/core.py (Outdated)

```python
csv_reader = csv.DictReader(io.StringIO(csv_text))

if not csv_reader.fieldnames:
    return False, None, "CSV file has no headers"
```
Shouldn't this error be raised instead of returned?
backend/app/crud/evaluations/core.py (Outdated)

```python
return False, None, "CSV file has no headers"

normalized_to_originals = {}
for field in csv_reader.fieldnames:
```
Won't a tuple instead of hashmap do here if we want to remove duplicate headers?
backend/app/crud/evaluations/core.py (Outdated)

```python
if normalized not in normalized_to_originals:
    normalized_to_originals[normalized] = []
normalized_to_originals[normalized].append(field)
```
LOC 62 to 103 could be extracted into a CSV sanitizing and validation util.
Actionable comments posted: 0
♻️ Duplicate comments (2)
backend/app/api/routes/evaluation.py (2)
222-224: Preserve explicit HTTPException validation errors.

As noted in a previous review, the broad except Exception catches HTTPException raised by header validation (lines 182, 191, 210) and re-wraps it, mangling the message. Let HTTPException propagate unchanged:

```diff
+    except HTTPException:
+        # Let explicit validation errors propagate as-is
+        raise
     except Exception as e:
         logger.error(f"[upload_dataset] Failed to parse CSV | {e}", exc_info=True)
-        raise HTTPException(status_code=422, detail=f"Invalid CSV file: {e}")
+        raise HTTPException(status_code=422, detail="Invalid CSV file")
```
203-207: Row values may be None, so .strip() can fail.

As noted in a previous review, csv.DictReader can return None for column values (e.g., when a row has fewer columns than headers). The current code will raise AttributeError: 'NoneType' object has no attribute 'strip' in that case.

Apply this fix:

```diff
 for row in csv_reader:
-    question = row.get(question_col, "").strip()
-    answer = row.get(answer_col, "").strip()
+    raw_question = row.get(question_col)
+    raw_answer = row.get(answer_col)
+    question = (raw_question or "").strip()
+    answer = (raw_answer or "").strip()
     if question and answer:
         original_items.append({"question": question, "answer": answer})
```
🧹 Nitpick comments (1)
backend/app/crud/evaluations/langfuse.py (1)
252-271: Consider defensive key access for item dictionaries.

The docstring states items are "already validated," but if a malformed item dict is passed (missing question or answer key), this would raise a KeyError at lines 258-259 or 261. Since this is an internal function, the current approach may be acceptable, but you could add defensive access:

```diff
-    input={"question": item["question"]},
-    expected_output={"answer": item["answer"]},
+    input={"question": item.get("question", "")},
+    expected_output={"answer": item.get("answer", "")},
```

Alternatively, ensure the caller always validates items before calling this function.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
backend/app/api/routes/evaluation.py (6 hunks)
backend/app/crud/evaluations/__init__.py (2 hunks)
backend/app/crud/evaluations/langfuse.py (2 hunks)
backend/app/tests/crud/evaluations/test_langfuse.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/crud/evaluations/langfuse.py
backend/app/tests/crud/evaluations/test_langfuse.py
backend/app/crud/evaluations/__init__.py
backend/app/api/routes/evaluation.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
backend/app/crud/evaluations/langfuse.py
backend/app/crud/evaluations/__init__.py
backend/app/api/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Expose FastAPI REST endpoints under backend/app/api/ organized by domain
Files:
backend/app/api/routes/evaluation.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/core/langfuse/**/*.py : Place Langfuse observability integration under backend/app/core/langfuse/
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/core/langfuse/**/*.py : Place Langfuse observability integration under backend/app/core/langfuse/
Applied to files:
backend/app/crud/evaluations/langfuse.py
backend/app/tests/crud/evaluations/test_langfuse.py
backend/app/crud/evaluations/__init__.py
🧬 Code graph analysis (4)
backend/app/crud/evaluations/langfuse.py (1)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
backend/app/tests/crud/evaluations/test_langfuse.py (2)
backend/app/crud/evaluations/langfuse.py (1)
upload_dataset_to_langfuse(220-295)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
backend/app/crud/evaluations/__init__.py (1)
backend/app/crud/evaluations/langfuse.py (1)
upload_dataset_to_langfuse(220-295)
backend/app/api/routes/evaluation.py (2)
backend/app/crud/evaluations/langfuse.py (1)
upload_dataset_to_langfuse(220-295)
backend/app/models/evaluation.py (1)
DatasetUploadResponse(25-44)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (11)
backend/app/crud/evaluations/langfuse.py (2)
220-225: LGTM! Clean refactor to item-based workflow.

The function signature change from csv_content: bytes to items: list[dict[str, str]] is a good separation of concerns, moving CSV parsing responsibility to the API layer.
273-278: Per-item flush addresses race conditions but may impact performance.

The per-item flush strategy is a reasonable mitigation for Langfuse SDK batching issues. Note that for large datasets, this synchronous flushing pattern may increase upload latency. The final flush on line 278 is technically redundant when items exist but serves as a safety net for edge cases.
backend/app/crud/evaluations/__init__.py (1)
25-29: LGTM! Public API export correctly updated.

The import and __all__ export are properly aligned with the renamed function in langfuse.py.

backend/app/tests/crud/evaluations/test_langfuse.py (3)
386-393: LGTM! Test fixture correctly updated for item-based workflow.

The valid_items fixture properly provides the expected data structure with question and answer keys.
418-419: LGTM! Flush count assertions correctly match implementation.

The test correctly expects 3 items + 1 final = 4 flush calls, accurately reflecting the per-item flush pattern in the implementation.
489-511: LGTM! Error handling test correctly validates partial success scenario.

The test properly validates that item creation errors are logged but don't stop processing, and the returned total_items reflects only successful uploads.

backend/app/api/routes/evaluation.py (5)
44-54: LGTM! Good DRY improvement with response helper.

The _dataset_to_response helper eliminates duplication across the list and get endpoints. The type hint is present as required by coding guidelines.
184-199: LGTM! Clean implementation of case-insensitive header matching.

The approach of building a lowercase-to-original mapping (clean_headers) and then using the original column names for row access is correct and maintains compatibility with csv.DictReader.
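A self-contained sketch of that lowercase-to-original mapping technique against csv.DictReader (the function name parse_qa_rows is hypothetical, and the None-safe row access reflects the fix suggested elsewhere in this review):

```python
import csv
import io


def parse_qa_rows(csv_text: str) -> list[dict[str, str]]:
    """Resolve question/answer columns case-insensitively, then read rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Lowercase -> original header mapping; row access uses the original name
    clean_headers = {h.strip().lower(): h for h in (reader.fieldnames or [])}
    question_col = clean_headers.get("question")
    answer_col = clean_headers.get("answer")
    if question_col is None or answer_col is None:
        raise ValueError("CSV must contain 'question' and 'answer' columns")
    items = []
    for row in reader:
        question = (row.get(question_col) or "").strip()  # values may be None
        answer = (row.get(answer_col) or "").strip()
        if question and answer:
            items.append({"question": question, "answer": answer})
    return items
```

Because DictReader keys are the original header strings, looking up `row.get(question_col)` with the original name is what keeps the case-insensitive mapping compatible with the reader.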
118-123: Verify duplication factor range: code says 1-5, PR says 1-100.

The PR description states the duplication factor range should be "1-100", but the implementation constrains it to ge=1, le=5 (max 5). Please verify the intended range and update either the code or the PR description.

If the intended max is 100:

```diff
 duplication_factor: int = Form(
     default=5,
     ge=1,
-    le=5,
-    description="Number of times to duplicate each item (min: 1, max: 5)",
+    le=100,
+    description="Number of times to duplicate each item (min: 1, max: 100)",
 ),
```
261-266: LGTM! Correct usage of refactored upload function.

The call to upload_dataset_to_langfuse correctly passes the pre-parsed original_items list, aligning with the new item-based signature.
340-340: LGTM! Consistent response formatting across endpoints.

Both the list and get endpoints now use _dataset_to_response, ensuring consistent response structure.

Also applies to: 371-371
Prajna1999
left a comment
LGTM functionality-wise. Maybe we can take up nitpicks/refactors later.
Summary
Target issue is #449
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
Run fastapi run --reload app/main.py or docker compose up in the repository root and test.

Notes
Summary by CodeRabbit
New Features
Improvements
Tests