Evaluation: Upload Dataset Improvements #450
Walkthrough

Removes the async CSV upload function.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as Evaluation API
    participant Lang as Langfuse Adapter/SDK
    participant DB as Database
    Client->>API: POST upload (CSV file)
    API->>API: parse CSV -> normalize headers (case-insensitive), map question/answer cols, build items list
    API->>Lang: upload_dataset_to_langfuse(items, name, duplication_factor)
    loop for each original item
        Lang->>Lang: create duplicated dataset items
        rect rgba(200,230,200,0.18)
            Lang->>Lang: per-original-item flush
        end
    end
    Lang->>Lang: final flush after all items
    API->>DB: commit & refresh evaluation run
    alt commit succeeds
        DB-->>API: ok
    else commit fails
        DB--x API: error
        API->>DB: rollback
    end
    API-->>Client: response with dataset/run details
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 1
🧹 Nitpick comments (1)
backend/app/crud/evaluations/langfuse.py (1)
318-323: Per-item flush strategy looks good.

The per-item flush after processing each original item's duplicates, combined with a final flush, addresses potential batching issues in the Langfuse SDK. This pattern ensures proper separation of Q&A pairs.

Note: If datasets grow significantly beyond the current 1MB limit, consider batching flushes (e.g., every N items) to balance reliability and performance.
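The batching suggestion above can be sketched as follows. This is a hedged sketch, not the project's implementation: `client` stands in for the Langfuse wrapper, and `create_dataset_item`/`flush` mirror the SDK surface assumed by this review.

```python
def upload_with_batched_flush(client, items, dataset_name,
                              duplication_factor, flush_every=10):
    """Flush every `flush_every` original items instead of after each one."""
    total = 0
    for index, item in enumerate(items, start=1):
        for _ in range(duplication_factor):
            client.create_dataset_item(
                dataset_name=dataset_name,
                input={"question": item["question"]},
                expected_output={"answer": item["answer"]},
            )
            total += 1
        if index % flush_every == 0:
            client.flush()  # bound the number of buffered events
    client.flush()  # final flush covers any tail batch
    return total
```

With 25 items and `flush_every=10`, this flushes three times (after items 10 and 20, plus the final flush) instead of 26 times under the per-item strategy.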
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
backend/app/crud/evaluations/core.py (7 hunks)
backend/app/crud/evaluations/langfuse.py (1 hunks)
backend/app/tests/crud/evaluations/test_langfuse.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/crud/evaluations/langfuse.py
backend/app/tests/crud/evaluations/test_langfuse.py
backend/app/crud/evaluations/core.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
backend/app/crud/evaluations/langfuse.py
backend/app/crud/evaluations/core.py
🧠 Learnings (1)
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/core/langfuse/**/*.py : Place Langfuse observability integration under backend/app/core/langfuse/
Applied to files:
backend/app/crud/evaluations/langfuse.py
🧬 Code graph analysis (3)
backend/app/crud/evaluations/langfuse.py (1)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
backend/app/tests/crud/evaluations/test_langfuse.py (1)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
backend/app/crud/evaluations/core.py (2)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
backend/app/tests/crud/collections/collection/test_crud_collection_read_all.py (1)
refresh(32-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (9)
backend/app/tests/crud/evaluations/test_langfuse.py (3)
419-420: Test assertions correctly updated for per-item + final flush pattern.

The flush count assertion (4 = 3 items + 1 final) aligns with the implementation changes in backend/app/crud/evaluations/langfuse.py.
492-493: LGTM: Flush count reflects valid items only.

Correctly expects 3 flushes (2 valid items + 1 final), as invalid rows are skipped before the per-item flush.
545-546: LGTM: Flush count consistent with duplication_factor=1 scenario.

The assertion (4 = 3 items + 1 final) correctly verifies the flush pattern when duplication_factor is 1.
backend/app/crud/evaluations/core.py (6)
38-46: Duplication factor validation is well-implemented.

The bounds check (1-100) with descriptive error messages provides good guardrails for users.

48-56: CSV size limit appropriately enforced.

The 1MB limit with clear error messaging helps prevent resource exhaustion. The size calculation and formatting are correct.
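The two upfront guards described in these comments can be sketched in one helper. The function name `validate_upload` and the exact messages are illustrative, not the code under review:

```python
def validate_upload(csv_content: bytes, duplication_factor: int):
    """Return (ok, error_message) for the two upfront guards."""
    # Guard 1: duplication factor must fall within 1-100
    if not 1 <= duplication_factor <= 100:
        return False, "duplication_factor must be between 1 and 100"
    # Guard 2: CSV payload capped at 1MB
    max_size_bytes = 1_048_576  # 1MB
    if len(csv_content) > max_size_bytes:
        size_mb = len(csv_content) / max_size_bytes
        return False, (
            f"CSV file too large ({size_mb:.2f}MB). Maximum allowed is 1MB"
        )
    return True, None
```

Running both checks before any parsing keeps resource-exhaustion failures cheap and gives the caller a single, descriptive error string.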
139-144: Per-item flush + final flush pattern is consistent.

The implementation matches the pattern in backend/app/crud/evaluations/langfuse.py and addresses batching concerns mentioned in the comments.

206-212: Robust error handling with rollback added.

The try/except wrapper with explicit rollback ensures database consistency when commit fails. This is a best practice for transactional operations.

343-349: LGTM: Database error handling with rollback.

Consistent with the pattern in create_evaluation_run. The error handling properly maintains database integrity.

16-16: All callers have been correctly updated for the sync conversion.

Verification confirms that upload_dataset_to_langfuse and its wrapper upload_dataset_to_langfuse_from_csv are both properly defined as sync functions (using def, not async def). The call site in evaluation.py:240 correctly omits the await keyword, and no remaining await calls to either function were found in the codebase. The conversion has been completed correctly.
Actionable comments posted: 0
♻️ Duplicate comments (1)
backend/app/crud/evaluations/core.py (1)
67-98: Duplicate header detection correctly implemented.

The single-pass normalization with duplicate detection addresses the previous review concern. The implementation correctly:

- Maps normalized names to lists of originals to detect conflicts
- Returns a descriptive error showing which headers conflict
- Uses the first occurrence (originals[0]) for the clean mapping

This prevents silent data loss from case-variant duplicates like "Question" and "question".
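The three bullet points above can be illustrated with a small sketch. The helper name and error wording here are hypothetical; the real implementation lives in core.py:

```python
def normalize_headers(fieldnames):
    """Map lowercased header names to originals, flagging case-variant duplicates."""
    normalized_to_originals = {}
    for field in fieldnames:
        normalized = field.strip().lower()
        normalized_to_originals.setdefault(normalized, []).append(field)
    # Any normalized key with more than one original is a conflict
    duplicates = {
        key: originals
        for key, originals in normalized_to_originals.items()
        if len(originals) > 1
    }
    if duplicates:
        detail = "; ".join(
            f"{key}: {', '.join(originals)}"
            for key, originals in duplicates.items()
        )
        return None, f"Duplicate headers after normalization: {detail}"
    # First (and only) occurrence becomes the clean mapping
    return (
        {key: originals[0] for key, originals in normalized_to_originals.items()},
        None,
    )
```

For example, `["Question", "question"]` is rejected with a message naming both conflicting originals, while `["Question", "Answer"]` yields the clean mapping `{"question": "Question", "answer": "Answer"}`.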
🧹 Nitpick comments (1)
backend/app/crud/evaluations/core.py (1)
48-56: Consider reusing the constant for consistency.

The size limit logic is correct. Minor suggestion: reuse max_size_bytes in the calculation to avoid duplicating the magic number.

```diff
     # Validate CSV file size (max 1MB)
     max_size_bytes = 1_048_576  # 1MB
     if len(csv_content) > max_size_bytes:
-        size_mb = len(csv_content) / 1_048_576
+        size_mb = len(csv_content) / max_size_bytes
         return (
             False,
             None,
             f"CSV file too large ({size_mb:.2f}MB). Maximum allowed is 1MB",
         )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/app/crud/evaluations/core.py(7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/crud/evaluations/core.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
backend/app/crud/evaluations/core.py
🧬 Code graph analysis (1)
backend/app/crud/evaluations/core.py (2)
backend/app/models/collection.py (1)
norm(92-98)backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (6)
backend/app/crud/evaluations/core.py (6)
38-46: LGTM on duplication_factor validation.

The validation correctly enforces the 1-100 range per PR objectives. Minor nit: the docstring on line 31 mentions "default 5" but there's no default value in the signature; consider updating the docstring to say "Number of times to duplicate each item (1-100)".
101-134: LGTM on case-insensitive header matching.

The approach correctly validates using normalized names while accessing row data with original field names. DictReader keys match the original headers exactly, so row.get(golden_question) works correctly.
148-174: Per-item flush trade-off is acceptable given the constraints.

The flush-per-original-item approach correctly prevents batching race conditions as noted in the PR objectives. Given the 1MB file limit and max 100x duplication, the performance overhead is reasonable.

Note that partial uploads are possible if individual create_dataset_item calls fail: the function logs errors and continues, returning total_uploaded, which may be less than expected. This is a reasonable graceful-degradation strategy, though callers should be aware.
236-242: LGTM on database rollback handling.

The try/except pattern with rollback, logging (including stack trace), and re-raise correctly ensures transactional integrity while maintaining visibility into failures.
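The rollback pattern praised here can be sketched as follows. This is an illustrative sketch, assuming `session` exposes SQLAlchemy-style `commit`/`rollback`/`refresh`; the function name is hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def commit_or_rollback(session, run):
    """Commit the evaluation run, rolling back and re-raising on failure."""
    try:
        session.commit()
        session.refresh(run)
    except Exception:
        # Roll back so the session stays usable, log with stack trace, re-raise
        session.rollback()
        logger.exception("Failed to commit evaluation run")
        raise
    return run
```

Re-raising after the rollback lets the API layer decide how to surface the failure, while the rollback keeps the session in a consistent state for subsequent requests.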
373-379: Consistent rollback pattern applied.

Matches the create_evaluation_run pattern; maintains consistency across CRUD operations.
189-189: Verify id attribute presence on Langfuse Dataset objects.

The web search indicates create_dataset() returns a langfuse.api.Dataset object with attributes including name, description, and metadata, but does not definitively confirm whether id is present. The defensive hasattr check appears reasonable given incomplete SDK documentation. If the Langfuse SDK always includes an id attribute on successfully created datasets, direct access would be cleaner; however, this requires manual verification against the actual SDK implementation or official API documentation to confirm the guaranteed contract.
Actionable comments posted: 0
🧹 Nitpick comments (2)
backend/app/crud/evaluations/core.py (2)
140-165: Per-item flushing addresses race conditions but has performance implications.

The per-item flush after each original item's duplicates (line 162) plus the final flush (line 165) aligns with the PR objective of preventing race conditions in Langfuse SDK's batching. However, for CSVs with many rows, this could significantly increase upload time due to frequent network round-trips.

Consider documenting this trade-off or adding a comment explaining why per-item flushing is necessary for correctness.
58-63: Consider more specific error handling for CSV decoding failures.

If the CSV file isn't UTF-8 encoded, the decode("utf-8") call will raise UnicodeDecodeError, which gets caught by the generic exception handler at line 185. The resulting error message may not clearly indicate the encoding issue to users.

```diff
 try:
-    csv_text = csv_content.decode("utf-8")
+    try:
+        csv_text = csv_content.decode("utf-8")
+    except UnicodeDecodeError:
+        return False, None, "CSV file must be UTF-8 encoded"
+
     csv_reader = csv.DictReader(io.StringIO(csv_text))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/app/crud/evaluations/core.py(7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/crud/evaluations/core.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
backend/app/crud/evaluations/core.py
🧬 Code graph analysis (1)
backend/app/crud/evaluations/core.py (1)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (6)
backend/app/crud/evaluations/core.py (6)
16-23: Function signature is well-typed and appropriately synchronous.

The conversion from async to sync aligns with the PR objective of preventing race conditions. Type hints are comprehensive.

38-46: Duplication factor validation looks good.

The range validation (1-100) is correctly implemented with clear error messages.

48-56: CSV file size validation is well-implemented.

The 1MB limit with a clear error message showing the actual file size provides good user feedback.

65-102: Header normalization and duplicate detection are well-implemented.

The approach correctly identifies all original headers that normalize to the same lowercase key and provides a descriptive error message. The required header validation (question/answer) uses case-insensitive matching as intended.

227-233: Database error handling with rollback is correctly implemented.

The try/except pattern properly rolls back on failure, logs the error with stack trace, and re-raises to allow caller handling.

364-370: Consistent error handling pattern for database updates.

The rollback-on-failure pattern matches create_evaluation_run, maintaining consistency across CRUD operations.
backend/app/crud/evaluations/core.py (Outdated)

```python
csv_reader = csv.DictReader(io.StringIO(csv_text))

if not csv_reader.fieldnames:
    return False, None, "CSV file has no headers"
```
Shouldn't this error be raised instead of returned?
backend/app/crud/evaluations/core.py (Outdated)

```python
return False, None, "CSV file has no headers"

normalized_to_originals = {}
for field in csv_reader.fieldnames:
```
Won't a tuple instead of hashmap do here if we want to remove duplicate headers?
backend/app/crud/evaluations/core.py (Outdated)

```python
if normalized not in normalized_to_originals:
    normalized_to_originals[normalized] = []
normalized_to_originals[normalized].append(field)
```
LOC 62 to 103 could be extracted into a CSV sanitizing and validation util.
Actionable comments posted: 0
♻️ Duplicate comments (2)
backend/app/api/routes/evaluation.py (2)
222-224: Preserve explicit HTTPException validation errors.

As noted in a previous review, the broad except Exception catches HTTPException raised by header validation (lines 182, 191, 210) and re-wraps it, mangling the message. Let HTTPException propagate unchanged:

```diff
+    except HTTPException:
+        # Let explicit validation errors propagate as-is
+        raise
     except Exception as e:
         logger.error(f"[upload_dataset] Failed to parse CSV | {e}", exc_info=True)
-        raise HTTPException(status_code=422, detail=f"Invalid CSV file: {e}")
+        raise HTTPException(status_code=422, detail="Invalid CSV file")
```
203-207: Row values may be None, so .strip() can fail.

As noted in a previous review, csv.DictReader can return None for column values (e.g., when a row has fewer columns than headers). The current code will raise AttributeError: 'NoneType' object has no attribute 'strip' in that case.

Apply this fix:

```diff
 for row in csv_reader:
-    question = row.get(question_col, "").strip()
-    answer = row.get(answer_col, "").strip()
+    raw_question = row.get(question_col)
+    raw_answer = row.get(answer_col)
+    question = (raw_question or "").strip()
+    answer = (raw_answer or "").strip()
     if question and answer:
         original_items.append({"question": question, "answer": answer})
```
🧹 Nitpick comments (1)
backend/app/crud/evaluations/langfuse.py (1)
252-271: Consider defensive key access for item dictionaries.

The docstring states items are "already validated," but if a malformed item dict is passed (missing question or answer key), this would raise a KeyError at lines 258-259 or 261. Since this is an internal function, the current approach may be acceptable, but you could add defensive access:

```diff
-    input={"question": item["question"]},
-    expected_output={"answer": item["answer"]},
+    input={"question": item.get("question", "")},
+    expected_output={"answer": item.get("answer", "")},
```

Alternatively, ensure the caller always validates items before calling this function.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
backend/app/api/routes/evaluation.py (6 hunks)
backend/app/crud/evaluations/__init__.py (2 hunks)
backend/app/crud/evaluations/langfuse.py (2 hunks)
backend/app/tests/crud/evaluations/test_langfuse.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/crud/evaluations/langfuse.py
backend/app/tests/crud/evaluations/test_langfuse.py
backend/app/crud/evaluations/__init__.py
backend/app/api/routes/evaluation.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
backend/app/crud/evaluations/langfuse.py
backend/app/crud/evaluations/__init__.py
backend/app/api/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Expose FastAPI REST endpoints under backend/app/api/ organized by domain
Files:
backend/app/api/routes/evaluation.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/core/langfuse/**/*.py : Place Langfuse observability integration under backend/app/core/langfuse/
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/core/langfuse/**/*.py : Place Langfuse observability integration under backend/app/core/langfuse/
Applied to files:
backend/app/crud/evaluations/langfuse.py
backend/app/tests/crud/evaluations/test_langfuse.py
backend/app/crud/evaluations/__init__.py
🧬 Code graph analysis (4)
backend/app/crud/evaluations/langfuse.py (1)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
backend/app/tests/crud/evaluations/test_langfuse.py (2)
backend/app/crud/evaluations/langfuse.py (1)
upload_dataset_to_langfuse(220-295)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
backend/app/crud/evaluations/__init__.py (1)
backend/app/crud/evaluations/langfuse.py (1)
upload_dataset_to_langfuse(220-295)
backend/app/api/routes/evaluation.py (2)
backend/app/crud/evaluations/langfuse.py (1)
upload_dataset_to_langfuse(220-295)
backend/app/models/evaluation.py (1)
DatasetUploadResponse(25-44)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (11)
backend/app/crud/evaluations/langfuse.py (2)
220-225: LGTM! Clean refactor to item-based workflow.

The function signature change from csv_content: bytes to items: list[dict[str, str]] is a good separation of concerns, moving CSV parsing responsibility to the API layer.
273-278: Per-item flush addresses race conditions but may impact performance.

The per-item flush strategy is a reasonable mitigation for Langfuse SDK batching issues. Note that for large datasets, this synchronous flushing pattern may increase upload latency. The final flush on line 278 is technically redundant when items exist but serves as a safety net for edge cases.
backend/app/crud/evaluations/__init__.py (1)
25-29: LGTM! Public API export correctly updated.

The import and __all__ export are properly aligned with the renamed function in langfuse.py.

backend/app/tests/crud/evaluations/test_langfuse.py (3)
386-393: LGTM! Test fixture correctly updated for item-based workflow.

The valid_items fixture properly provides the expected data structure with question and answer keys.
418-419: LGTM! Flush count assertions correctly match implementation.

The test correctly expects 3 items + 1 final = 4 flush calls, accurately reflecting the per-item flush pattern in the implementation.
489-511: LGTM! Error handling test correctly validates partial success scenario.

The test properly validates that item creation errors are logged but don't stop processing, and the returned total_items reflects only successful uploads.

backend/app/api/routes/evaluation.py (5)
44-54: LGTM! Good DRY improvement with response helper.

The _dataset_to_response helper eliminates duplication across the list and get endpoints. The type hint is present as required by coding guidelines.
184-199: LGTM! Clean implementation of case-insensitive header matching.

The approach of building a lowercase-to-original mapping (clean_headers) and then using the original column names for row access is correct and maintains compatibility with csv.DictReader.
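A self-contained sketch of that lowercase-to-original mapping technique against csv.DictReader (the function name parse_qa_rows is hypothetical, and the None-safe row access reflects the fix suggested elsewhere in this review):

```python
import csv
import io


def parse_qa_rows(csv_text: str) -> list[dict[str, str]]:
    """Resolve question/answer columns case-insensitively, then read rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Lowercase -> original header mapping; row access uses the original name
    clean_headers = {h.strip().lower(): h for h in (reader.fieldnames or [])}
    question_col = clean_headers.get("question")
    answer_col = clean_headers.get("answer")
    if question_col is None or answer_col is None:
        raise ValueError("CSV must contain 'question' and 'answer' columns")
    items = []
    for row in reader:
        question = (row.get(question_col) or "").strip()  # values may be None
        answer = (row.get(answer_col) or "").strip()
        if question and answer:
            items.append({"question": question, "answer": answer})
    return items
```

Because DictReader keys are the original header strings, looking up `row.get(question_col)` with the original name is what keeps the case-insensitive mapping compatible with the reader.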
118-123: Verify duplication factor range: code says 1-5, PR says 1-100.

The PR description states the duplication factor range should be "1-100", but the implementation constrains it to ge=1, le=5 (max 5). Please verify the intended range and update either the code or the PR description.

If the intended max is 100:

```diff
 duplication_factor: int = Form(
     default=5,
     ge=1,
-    le=5,
-    description="Number of times to duplicate each item (min: 1, max: 5)",
+    le=100,
+    description="Number of times to duplicate each item (min: 1, max: 100)",
 ),
```
261-266: LGTM! Correct usage of refactored upload function.

The call to upload_dataset_to_langfuse correctly passes the pre-parsed original_items list, aligning with the new item-based signature.
340-340: LGTM! Consistent response formatting across endpoints.

Both the list and get endpoints now use _dataset_to_response, ensuring consistent response structure.

Also applies to: 371-371
Prajna1999
left a comment
LGTM functionality-wise. Maybe we can take up nitpicks/refactors later.
Summary
Target issue is #449
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
Run fastapi run --reload app/main.py or docker compose up in the repository root and test.

Notes
Summary by CodeRabbit
New Features
Improvements
Tests