Add LanguageDetectProcessor for automatic language detection #34

Open
Solventerritory wants to merge 6 commits into google-gemini:main from Solventerritory:Add-LanguageDetectProcessor-for-automatic-lang-detection-#14

Solventerritory commented Dec 18, 2025

Description

Closes #14

This PR adds a new LanguageDetectProcessor to the contrib module that automatically detects the language of text parts and adds the detected language code to the part's metadata.

Changes

  • Added genai_processors/contrib/language_detect_processor.py with the LanguageDetectProcessor class
  • Added comprehensive test suite in tests/contrib/test_language_detect_processor.py (8 tests)
  • Updated genai_processors/contrib/__init__.py to export the new processor
  • Updated pyproject.toml to include the langdetect>=1.0.9 dependency

Features

  • Detects language using the langdetect library
  • Returns ISO 639-1 language codes (e.g., "en", "fr", "zh")
  • Handles edge cases: empty text, short text, non-text parts
  • Configurable metadata key, unknown label, and minimum text length
  • Preserves existing metadata
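
The edge-case handling listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the function name `detect_language_label`, the parameter names, the default `min_text_length` of 3, and the `DetectionError` and `fake_detect` helpers are all hypothetical. The detector is passed in as a callable so the sketch runs without langdetect installed.

```python
from typing import Callable, Tuple, Type


class DetectionError(Exception):
    """Hypothetical stand-in for the detector's failure exception."""


def detect_language_label(
    text: str,
    detect: Callable[[str], str],
    unknown_label: str = "unknown",
    min_text_length: int = 3,  # assumed default, not taken from the PR
    detect_errors: Tuple[Type[Exception], ...] = (DetectionError,),
) -> str:
    """Return a language code for `text`, or `unknown_label` if none can be found."""
    stripped = text.strip()
    # Empty or very short input gives the detector too little signal.
    if len(stripped) < min_text_length:
        return unknown_label
    try:
        return detect(stripped)
    except detect_errors:
        # Only the detector's own failure type is swallowed; other bugs surface.
        return unknown_label


def fake_detect(text: str) -> str:
    """Toy detector for illustration only; real detection is statistical."""
    if not any(ch.isalpha() for ch in text):
        raise DetectionError("no alphabetic features")
    return "fr" if "bonjour" in text.lower() else "en"


print(detect_language_label("Bonjour tout le monde", fake_detect))
print(detect_language_label("hi", fake_detect))
print(detect_language_label("12345", fake_detect))
```

With langdetect installed, `langdetect.detect` and `langdetect.lang_detect_exception.LangDetectException` could be passed as `detect` and `detect_errors`. Note that langdetect reports Chinese as "zh-cn" or "zh-tw" rather than a bare "zh", and that its results are only deterministic if `DetectorFactory.seed` is set beforehand.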

Testing

All 8 tests passing:

  • ✅ English text detection
  • ✅ French text detection
  • ✅ Bengali text detection
  • ✅ Short text handling (returns "unknown")
  • ✅ Empty text handling (returns "unknown")
  • ✅ Non-text parts remain unchanged
  • ✅ Preserves existing metadata
  • ✅ Multiple parts processing
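
Three of the behaviours tested above (non-text parts left unchanged, existing metadata preserved, several parts in one stream) can be illustrated with a self-contained sketch. The `Part` dataclass is a hypothetical stand-in for the library's part type, and `annotate_language`, `stream`, `collect`, and `run_demo` are illustrative names rather than anything from the PR.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Callable, List, Optional


@dataclass
class Part:
    """Hypothetical stand-in for a processor part: optional text plus metadata."""
    text: Optional[str] = None
    metadata: dict = field(default_factory=dict)


async def annotate_language(
    parts: AsyncIterator[Part],
    detect: Callable[[str], str],
    metadata_key: str = "language",
) -> AsyncIterator[Part]:
    """Yield every part, adding a language code to the metadata of text parts."""
    async for part in parts:
        if part.text is None:
            # Non-text parts pass through untouched.
            yield part
            continue
        # Build a new dict so existing metadata keys survive.
        merged = {**part.metadata, metadata_key: detect(part.text)}
        yield Part(text=part.text, metadata=merged)


async def stream(items: List[Part]) -> AsyncIterator[Part]:
    """Turn a list into an async stream for the demo."""
    for item in items:
        yield item


async def collect(parts: AsyncIterator[Part]) -> List[Part]:
    """Drain an async stream into a list."""
    return [p async for p in parts]


def run_demo() -> List[Part]:
    parts = [
        Part("Hello world", {"source": "user"}),
        Part(None, {"mime": "image/png"}),  # a non-text part
    ]
    # A constant detector keeps the demo deterministic.
    return asyncio.run(collect(annotate_language(stream(parts), lambda text: "en")))


for p in run_demo():
    print(p.metadata)
```

Passing the detector as a callable keeps the streaming logic testable on its own, separate from whichever detection library is used.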

Usage Example

from genai_processors.contrib import LanguageDetectProcessor

processor = LanguageDetectProcessor()

# part_stream: an async iterable of parts; run this loop inside a coroutine.
async for part in processor(part_stream):
    print(part.metadata["language"])  # e.g., "en", "fr", "bn"

- Changed condition from `part.text` to `part.text is not None`
- Ensures empty strings are processed and marked as "unknown" language
- All tests now passing (8/8)
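
The difference between the two conditions comes down to Python's truthiness rules: an empty string is falsy, so a bare `part.text` check skips it, while `part.text is not None` lets it through to be labelled. A small sketch (the helper names are illustrative only):

```python
from typing import Optional


def passes_truthy_check(text: Optional[str]) -> bool:
    """Earlier condition: an empty string is falsy, so it is skipped."""
    return bool(text)


def passes_not_none_check(text: Optional[str]) -> bool:
    """Updated condition: only a missing text value (None) is skipped."""
    return text is not None


for value in ("hello", "", None):
    print(repr(value), passes_truthy_check(value), passes_not_none_check(value))
```

Only the empty string is treated differently by the two checks, which is why the empty-text test depended on this change.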
@gemini-code-assist

Summary of Changes

Hello @Solventerritory, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new LanguageDetectProcessor to the genai-processors library, enabling automatic language detection for text-based ProcessorPart objects. This enhancement allows the system to enrich text parts with their detected language, facilitating downstream multilingual processing and improving data handling capabilities.

Highlights

  • New Processor Added: Introduced a new LanguageDetectProcessor to the contrib module for automatic language detection.
  • Language Detection Functionality: The processor utilizes the langdetect library to identify the language of text parts and appends the detected ISO 639-1 language code (e.g., 'en', 'fr') to the part's metadata.
  • Robust Handling of Edge Cases: The processor is designed to handle various scenarios, including empty text, short text, and non-text parts, either labeling them as 'unknown' or passing them through without modification.
  • Configurable Options: Users can customize the metadata key for storing the language, the label used for unknown languages, and the minimum text length required for detection.
  • Dependency Updates: The pyproject.toml file has been updated to include langdetect as a contrib dependency and pytest-asyncio for development, along with pytest configuration for asyncio mode.
  • Comprehensive Testing: A new, comprehensive test suite with 8 tests has been added to ensure the reliability and correctness of the LanguageDetectProcessor across different languages and edge cases.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a LanguageDetectProcessor for automatic language detection, which is a great addition to the contrib module. The implementation is solid and comes with a comprehensive test suite. My review includes a few suggestions to better align the new processor with the existing framework, mainly by inheriting from the base Processor class and correcting an import path. I've also pointed out a minor improvement in exception handling. These changes will improve maintainability and ensure seamless integration.

Solventerritory and others added 5 commits December 19, 2025 02:33
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>