fix: add threadpool based concurrency to speeden up langfuse dataset upload functionality #461
base: main
Conversation
Walkthrough

Adds a new Celery architecture and operations guide and refactors Langfuse dataset uploading to use concurrent uploads via ThreadPoolExecutor, with aggregated success counting and a single final flush.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning) · ✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (4)
backend/app/crud/evaluations/langfuse.py (3)
251-270: Add return type annotation to inner function.

Per coding guidelines, type hints should be used throughout. The `upload_item` helper is missing its return type annotation:

```python
def upload_item(item: dict[str, str], duplicate_num: int) -> bool:
```

The current signature already has the return type — disregard if Black reformats it correctly. Otherwise, ensure the `-> bool` annotation is preserved.
285-285: Consider making `max_workers` configurable.

The hardcoded `max_workers=4` may not be optimal for all environments. Consider exposing this as a configuration parameter or deriving it from settings (similar to `CELERY_WORKER_CONCURRENCY`).

```diff
+from app.core.config import settings
+
+# In function or at module level
+MAX_UPLOAD_WORKERS = getattr(settings, "LANGFUSE_UPLOAD_WORKERS", 4)
+
 # Then use:
-    with ThreadPoolExecutor(max_workers=4) as executor:
+    with ThreadPoolExecutor(max_workers=MAX_UPLOAD_WORKERS) as executor:
```
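If the team goes this route, a minimal sketch of the settings side, assuming the project's `Settings` class uses pydantic-settings and that `LANGFUSE_UPLOAD_WORKERS` is a hypothetical field name:

```python
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Hypothetical knob for the Langfuse upload pool; 4 mirrors the current hardcoded value.
    LANGFUSE_UPLOAD_WORKERS: int = 4


settings = Settings()

# Call site: fall back to 4 if the field is ever absent.
max_workers = getattr(settings, "LANGFUSE_UPLOAD_WORKERS", 4)
```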
287-290: Simplify by using `executor.map` or a list comprehension with `submit`.

The current pattern of appending futures to a list in a loop can be simplified; see the sketch after this comment.

```diff
-        futures = []
-        for item, dup_num in upload_tasks:
-            future = executor.submit(upload_item, item, dup_num)
-            futures.append(future)
+        futures = [executor.submit(upload_item, item, dup_num) for item, dup_num in upload_tasks]
```
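For comparison, a minimal sketch of the `executor.map` variant mentioned above; `upload_item` and `upload_tasks` are simplified stand-ins for the objects in langfuse.py:

```python
from concurrent.futures import ThreadPoolExecutor


def upload_item(item: dict[str, str], duplicate_num: int) -> bool:
    """Simplified stand-in: pretend to upload one (input, output) pair."""
    return bool(item)


upload_tasks = [({"input": "q1", "output": "a1"}, 0), ({"input": "q2", "output": "a2"}, 1)]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in input order; the lambda unpacks each (item, dup_num) tuple.
    results = executor.map(lambda args: upload_item(*args), upload_tasks)
    total_uploaded = sum(1 for ok in results if ok)

print(total_uploaded)  # 2
```

Note that iterating over `executor.map` re-raises the first worker exception, so the `as_completed` loop with per-future error handling (discussed in the actionable comment below) remains the more robust choice here.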
backend/CELERY_OVERVIEW.md (1)

15-25: Add language specifier to fenced code block.

The file structure code block is missing a language identifier, which triggers markdown lint warning MD040. Use `text` or `plaintext` for directory listings.

````diff
-```
+```text
 app/celery/
 ├── __init__.py    # Package initialization, exports celery_app
````
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
backend/CELERY_OVERVIEW.md (1 hunks)
backend/app/crud/evaluations/langfuse.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/crud/evaluations/langfuse.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
backend/app/crud/evaluations/langfuse.py
🧠 Learnings (2)
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/celery/**/*.py : Keep Celery app configuration (priority queues, beat scheduler, workers) under backend/app/celery/
Applied to files:
backend/CELERY_OVERVIEW.md
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/celery/tasks/**/*.py : Define Celery tasks under backend/app/celery/tasks/
Applied to files:
backend/CELERY_OVERVIEW.md
🧬 Code graph analysis (1)
backend/app/crud/evaluations/langfuse.py (1)
backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
🪛 GitHub Actions: AI Platform CI
backend/app/crud/evaluations/langfuse.py
[error] 1-1: Trailing whitespace detected and removed by pre-commit hook 'trailing-whitespace'. Re-run pre-commit to finalize changes.
[error] 1-1: Black formatter reformatted the file. Re-run pre-commit to commit the changes.
🪛 LanguageTool
backend/CELERY_OVERVIEW.md
[uncategorized] ~171-~171: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...Identical to high priority but uses the low priority queue get_task_status() (lines 7...
(EN_COMPOUND_ADJECTIVE_INTERNAL)
[grammar] ~555-~555: Use a hyphen to join words.
Context: ...vice** (app/services/llm/) - Uses high priority queue for real-time API calls -...
(QB_NEW_EN_HYPHEN)
🪛 markdownlint-cli2 (0.18.1)
backend/CELERY_OVERVIEW.md
15-15: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (2)
backend/CELERY_OVERVIEW.md (1)
1-588: Documentation looks comprehensive and well-structured.

This is a valuable addition that clearly explains the Celery architecture, queue priorities, task routing, and best practices. The end-to-end flow example and integration patterns are particularly helpful for onboarding developers. Based on learnings, the file structure aligns with project conventions for Celery configuration under `backend/app/celery/`.

backend/app/crud/evaluations/langfuse.py (1)

283-295: Langfuse client thread-safety is properly handled.

The Langfuse Python SDK is thread-safe for concurrent `create_dataset_item` calls when using a single client instance, which your code does. The internal queue and background worker handle batching, and `langfuse.flush()` at line 298 correctly ensures all items are delivered before the function returns.
Code under discussion (backend/app/crud/evaluations/langfuse.py, lines 292-295):

```python
for future in as_completed(futures):
    upload_successful = future.result()
    if upload_successful:
        total_uploaded += 1
```
Handle exceptions from future.result() to avoid silent failures.
If upload_item raises an unexpected exception (not caught internally), future.result() will propagate it here and abort the loop. Consider wrapping in a try-except to ensure all futures are processed and failures are logged.
```diff
 for future in as_completed(futures):
-    upload_successful = future.result()
-    if upload_successful:
-        total_uploaded += 1
+    try:
+        if future.result():
+            total_uploaded += 1
+    except Exception as e:
+        logger.error(f"[upload_dataset_to_langfuse] Unexpected error in upload task | {e}")
```

🤖 Prompt for AI Agents
In backend/app/crud/evaluations/langfuse.py around lines 292 to 295, the loop
calling future.result() does not catch exceptions from the completed futures
which can abort processing; wrap the future.result() call in a try-except block,
log the exception (with context such as which item or index failed), and treat
that future as a failed upload (do not increment total_uploaded). Ensure the
except block continues the loop so all futures are processed and consider
returning/recording failure details for higher-level handling.
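A hedged sketch of what that suggestion could look like, keeping a future-to-task mapping so the log can name the failing item; the function body, logger name, and log format are illustrative, not the repository's actual code:

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)


def upload_item(item: dict[str, str], duplicate_num: int) -> bool:
    ...  # real upload logic lives in langfuse.py


def run_uploads(upload_tasks: list[tuple[dict[str, str], int]]) -> int:
    total_uploaded = 0
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Map each future back to its (item, dup_num) so errors can name the failing row.
        future_to_task = {
            executor.submit(upload_item, item, dup_num): (item, dup_num)
            for item, dup_num in upload_tasks
        }
        for future in as_completed(future_to_task):
            item, dup_num = future_to_task[future]
            try:
                if future.result():
                    total_uploaded += 1
            except Exception as e:
                logger.error(
                    "[upload_dataset_to_langfuse] Upload failed for item %r (duplicate %d): %s",
                    item,
                    dup_num,
                    e,
                )
    return total_uploaded
```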
Summary
Target issue is #460
Adding a threadpool-based concurrency model that allows a maximum of 4 worker threads to take up the (input, output) pair inserts instead of one blocking process doing all the heavy lifting. This enhances the endpoint's ability to insert CSV files of up to 1000 line items within 60 seconds.
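A rough end-to-end sketch of the described approach, under the assumption that the Langfuse client exposes `create_dataset_item` and `flush()` as referenced in the review; the dataset name and field names are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

from langfuse import Langfuse

# One shared client; the SDK queues items internally and delivers them in the background.
langfuse = Langfuse()


def upload_item(item: dict[str, str]) -> bool:
    try:
        # Keyword arguments are illustrative; the real call site passes more fields.
        langfuse.create_dataset_item(
            dataset_name="my-eval-dataset",
            input=item["input"],
            expected_output=item["output"],
        )
        return True
    except Exception:
        return False


def upload_all(items: list[dict[str, str]]) -> int:
    total_uploaded = 0
    # At most 4 rows are uploaded concurrently instead of one blocking loop.
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(upload_item, item) for item in items]
        for future in as_completed(futures):
            if future.result():
                total_uploaded += 1
    # Single final flush: block until every queued item has been delivered.
    langfuse.flush()
    return total_uploaded
```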
Chore: also added a Celery doc explaining how Celery is implemented in the repo. Largely unrelated to the core task.
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.

Summary by CodeRabbit
Documentation
Improvements