Skip to content

Conversation

@NehanPathan
Copy link
Contributor


🎯 Objective:

This pull request (PR) optimizes the SmartCn dictionary loading process and introduces unit tests to ensure correctness and maintainability.


🔥 Key Changes:

1. Dictionary Optimization:

  • Replaced ByteBuffer with BinaryReader.ReadInt32() for faster and more efficient data reading.
  • Implemented ReadOnlySpan<char> to minimize memory usage and improve overall performance.

2. Comprehensive Unit Tests Added:

  • Test File: DictionaryTests.cs

    • Contains tests for loading dictionaries and verifying dictionary operations.
  • BigramDictionary Tests:

    • GetInstance() method to ensure correct singleton instantiation.
    • LoadFromFile() method to verify successful loading of the dictionary from bigramDict.dct.
    • GetFrequency() method to test frequency retrieval of valid and non-existent entries.
  • WordDictionary Tests:

    • GetInstance() to confirm proper instantiation.
    • Future tests can be added if LoadMainDataFromFile() becomes accessible (Currently it is private method).

3. Resource Files Added:

  • Location: Lucene.Net.Tests.Analysis.SmartCn.Resources
    • bigramDict.dct
    • coreDict.dct

4. Embedded Resource Loading:

  • Embedded both .dct files as resources in the test assembly to eliminate external dependencies.
  • Created a utility in LuceneTestCase to extract these resources as temporary files during tests.

🧪 Testing Details:

📂 Test Scenarios:

  • Validated successful loading of both bigramDict.dct and coreDict.dct from embedded resources.
  • Checked frequency retrieval for valid entries (hello, world) and ensured non-existent entries return 0.
  • Verified that the GetInstance() method returns a non-null singleton instance.

Assertions Included:

  • Frequency correctness for known entries.
  • Proper dictionary instantiation.
  • No regression in dictionary functionality.

🚀 Why These Changes?

💡 Performance Improvements:

  • Faster dictionary loading with reduced memory overhead.

💡 Increased Test Coverage:

  • Ensures that dictionary operations work correctly and efficiently.

💡 Simplified Testing Workflow:

  • Embedded resource handling eliminates file path dependencies.

📝 Future Considerations:

🔍 Testing WordDictionary:

  • Currently limited to GetInstance() due to private access of LoadMainDataFromFile().
  • Additional tests can be added when the method’s visibility is updated.

🚀 Performance Enhancements:

  • Future work may include further performance optimization of dictionary lookups and hash collision handling.

📂 Issue Reference:

Fixes #1153


🔎 Checklist:


📄 How to Run Tests:

  1. Build the solution using dotnet build.
  2. Run tests using dotnet test to verify that all dictionary operations work correctly.

- Replaced ByteBuffer with BinaryReader for efficiency.
- Used ReadOnlySpan<char> in BigramDictionary.
- Added tests for dictionary loading from embedded resources.
- Embedded bigramDict.dct and coreDict.dct.
@NehanPathan NehanPathan force-pushed the feat/optimize-smartcn-dictionaries branch from cdcc306 to 12223a4 Compare April 2, 2025 06:01
@NehanPathan
Copy link
Contributor Author

Hey Shad [@NightOwl888],

I recently updated my commit because I forgot to add comments in my code earlier.

  • I also noticed that the LoadMainDataFromFile() method remained internal instead of private—this was unintentional, as I had temporarily changed it for experimental testing but forgot to revert it back.
  • In yesterday’s commit, I reverted it back to private and added a comment explaining its purpose to avoid confusion in the future.
  • The force push was only to include this minor correction, and no other changes were made to the PR.

I’d appreciate it if you could review this PR when you get a chance. Let me know if any further modifications are needed!

Also, I’m planning to draft my Google Summer of Code (GSoC) proposal for Lucene.NET. If any maintainers are available, I’d love to share it for feedback. Please let me know if that would be possible.

Looking forward to your response. Thanks! 😊🚀

@paulirwin paulirwin self-assigned this Apr 8, 2025
@paulirwin paulirwin self-requested a review April 8, 2025 02:01
@paulirwin paulirwin added the notes:performance-improvement Contains a performance improvement label Apr 8, 2025
Copy link
Contributor

@paulirwin paulirwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! I think most of the core logic looks correct, but there are some changes needed to adopt our practices with how we maintain existing code and match the upstream Lucene code. In particular, try to avoid adding unnecessary comments that do not exist upstream, unless it's to point out something lucenenet-specific that deviates from upstream, or to clarify unclear logic. Avoid adding comments to every line of code, as the code should generally be self-explanatory. Also, please make sure to not remove comments that existed previously in Lucene.NET that still apply, or comments that exist upstream.

These changes should all be pretty small and quick to resolve, but my apologies for the quantity of them. I just wanted to help clarify where comments should be removed, added, or restored.

@NehanPathan NehanPathan force-pushed the feat/optimize-smartcn-dictionaries branch from 0334ead to 76d55f6 Compare April 8, 2025 09:08
@NehanPathan NehanPathan force-pushed the feat/optimize-smartcn-dictionaries branch from 76d55f6 to fa8fc77 Compare April 8, 2025 09:17
@NehanPathan
Copy link
Contributor Author

NehanPathan commented Apr 8, 2025


[@paulirwin]
Hi Paul, thanks for the detailed feedback!

All suggested changes implemented:

  • Updated code comments to align with upstream Lucene and Lucene.NET standards.
  • Restored previously removed comments where appropriate, and added // LUCENENET tags for custom logic or deviations.
  • Lowercased bigramdict.dct filename to match upstream expectations and ensure compatibility on case-sensitive file systems.
  • Added [LuceneNetSpecific] attribute to custom test classes.

🛠️ About the try-catch (EndOfStreamException) block:
We retained a minimal try-catch only around the ReadInt32() line within the character loop to handle EndOfStreamException. This exception arises when the file ends naturally (not due to corruption), which is expected behavior for our test data.
Without this handling, DictionaryTests fail because the test file contains fewer entries than the full GB2312 character set (6768). This pattern reflects how earlier Lucene.NET logic handled the end of file — exiting cleanly once the data is exhausted.

✅ All tests are passing now.


🔁 Force-pushed twice for clarity:

  1. First push: Removed the inline comment // Reached end of file — assume remaining chars are missing above the break; to reduce verbosity.
  2. Second push: Re-added a shorter version // Reached end of file to retain context without clutter.

This keeps the codebase clean, but still understandable for future maintainers.


📢 GSoC Update:
I’ve officially submitted my Google Summer of Code 2025 proposal for Lucene.NET. Tomorrow is the final day for mentor feedback — if you get a moment, I’d truly appreciate any suggestions or thoughts you might have!

📄 Proposal (Google Docs, comments enabled):
👉 [Google Summer of Code 2025 Proposal](https://docs.google.com/document/d/1uWD45L8iNZ1A102Gxbf23ZABdGjHmh5Xm6SAZyWQHO0/edit?usp=sharing)

Thanks again for your time and support! 🙏


@paulirwin paulirwin requested a review from NightOwl888 April 15, 2025 13:13
Copy link
Contributor

@paulirwin paulirwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the feedback! This looks good to me, but I know @NightOwl888 wanted to review as well.

Copy link
Contributor

@NightOwl888 NightOwl888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. My apologies for the delay.

Given that this is a port from Java, there are a few practices to keep in mind when making changes such as these. We won't be porting from scratch again. We will be upgrading this code multiple times to keep up with the upstream Java code, so it is important that we follow similar practices as were done in Java to make this upgrade process as painless as possible. So, in short, "maintainable" for us means something very different than most .NET projects. We want to stay closely aligned with the upstream code rather than moving things around to make them more "readable", which would hinder the upgrade process.

  1. Variable declarations should be left the same as in Java unless we have a specific reason to change them.
  2. Loop style should match the upstream code (even if it looks strange), so we don't have to re-analyze the business logic every time we apply future changes from Lucene.
  3. In general, file formats should be portable between .NET and Java. There are some exceptions where in Java they were serializing objects using a proprietary Java format where we have deviated from this (including the bigramdict.mem and coredict.mem files in this project), but these files are not serviceable by users, anyway.
  4. Temp files should always be created using LuceneTestCase.CreateTempFile() or (as in this case) grouped together using LuceneTestCase.CreateTempDir() to ensure that they are not left on disk when the tests are finished running.
  5. New files that do not exist upstream should all go into a Support folder, but Support should generally not be part of the namespace. Comparing directories on disk is the simplest way to determine if we have a matching set of files as upstream, but these extra files should be physically separated.

I have left several comments inline with more detail, but do note that many of them are very repetitive.

@NehanPathan
Copy link
Contributor Author

[@NightOwl888 ]

Hey Shad! 😊

Just wanted to share that I’ve pushed the final changes based on your suggestions — thank you so much for the clear and helpful feedback throughout. Here's a quick overview of what’s been improved:

  • Refactored AbstractDictionary, WordDictionary, and BigramDictionary to enhance code structure, clarity, and maintainability, following your suggestions for the upstream.
  • Cleaned up and finalized TestBuildDictionary.cs to validate both dictionary types correctly
  • Added the zipped file to keep things tidy and lightweight

Thanks again for all your support — your feedback really helped me sharpen the code and improve my understanding! 🙌
If there's anything else to update or improve, let me know....


{
cnt = reader.ReadInt32(); // LUCENENET: Use BinaryReader methods instead of ByteBuffer
}
catch (EndOfStreamException)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the length is hard coded as a constant instead of being inferred or read from the file, I guess we can leave this here. It seems like the file must contain exactly this set of words and nothing more or less, but it can be customized to change the frequencies. Since it is like this upstream, I guess it is fine for now.

@NehanPathan
Copy link
Contributor Author


@NightOwl888

Hi Shad,

Hope you're doing well. Just a gentle follow-up on this PR—I've made suggested changes and really appreciate your guidance. Totally understand if you're busy, no rush at all. Just wanted to make sure it stays on your radar.

Thanks again for your time! 🙏
—Nehan


@NehanPathan NehanPathan requested a review from NightOwl888 May 10, 2025 16:23
@paulirwin paulirwin requested a review from Copilot August 20, 2025 02:27
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR optimizes the SmartCn dictionary loading process by replacing ByteBuffer operations with BinaryReader for improved performance, and introduces comprehensive unit tests to ensure correctness. The changes focus on memory efficiency improvements through ReadOnlySpan usage and proper test coverage for dictionary operations.

Key changes:

  • Replaced ByteBuffer with BinaryReader.ReadInt32() for more efficient little-endian data reading
  • Changed char[] parameters to ReadOnlySpan to reduce memory allocations
  • Added comprehensive unit tests for BigramDictionary and WordDictionary loading and operations

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
Lucene.Net.Tests.Analysis.SmartCn.csproj Added embedded resource configuration for test data
TestBuildDictionary.cs New comprehensive test class for dictionary loading and functionality
Lucene.Net.Analysis.SmartCn.csproj Added InternalsVisibleTo for test access and removed empty lines
WordDictionary.cs Optimized file reading with BinaryReader and improved comments
BigramDictionary.cs Enhanced with BinaryReader, ReadOnlySpan usage, and robust error handling
AbstractDictionary.cs Updated hash methods to use ReadOnlySpan and added encoding fallbacks

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@NehanPathan
Copy link
Contributor Author

@paulirwin @NightOwl888
If we adopt the
[ length > 0 && reader.BaseStream.Position + length <= dctFile.Lengt ]and [reader.BaseStream.Seek(4, SeekOrigin.Current) ] approach in BigramDictionary, it makes sense to apply the same change in WordDictionary as well for consistency and slightly improved efficiency.

@NehanPathan
Copy link
Contributor Author

@paulirwin @NightOwl888
done the last suggested changes locally
Just wanted to confirm if both of you are good with these changes before I push the final commit...

@paulirwin
Copy link
Contributor

@NehanPathan Added comments to the Copilot comments above. Let's just do the stream seek change for now as that seems like an obvious (if small) improvement. Let's skip the other change as I doubt it is actually a problem.

@NehanPathan
Copy link
Contributor Author

@paulirwin @NightOwl888,

I’ve tried adding final newlines and removing trailing whitespace in the most of files I could identify, but the check-editorconfig CI is still failing.

Is there a recommended way to apply this fix across all files in the repo safely so the CI passes, without accidentally touching unrelated code?
beacause same test fail in #1182 pr also

@NightOwl888
Copy link
Contributor

@paulirwin @NightOwl888,

I’ve tried adding final newlines and removing trailing whitespace in the most of files I could identify, but the check-editorconfig CI is still failing.

Is there a recommended way to apply this fix across all files in the repo safely so the CI passes, without accidentally touching unrelated code? beacause same test fail in #1182 pr also

Which text editor are you using to modify the file? Most editors that are widely used for development either have native support for .editorconfig files or there is a plugin to download (and possibly enable in settings). Once it is enabled, the editor should respect the settings and add the newline automatically.

It also helps to have line numbers and whitespace enabled so they are visible in the editor window. Here is how I have Notepad++ configured:

image

As you can see, the final newline is missing.

I would be happy to fix it for you, but I think it would be best if we figure out how to set up your editor so you can quickly work through these errors.

Copy link
Contributor

@NightOwl888 NightOwl888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one more issue to fix and this one is finally finished! Thanks Nehan for sticking with it.

I ended up changing the test data one more time, since the data we were using to test the file loading was identical to the data that is loaded by default. While we could verify the data was being loaded manually, the tests wouldn't catch it if we somehow loaded the default data instead of running it with custom data.

So, I made a quick console app to convert the frequency values to specific values that differ from the original data and swapped in those values. I ended up dropping the code for that console app into a comment in the test, because it was basically a one-time thing. But, if we ever need to convert another dataset, we have the code to do it.

There are some more optimizations that can be done, but I think those should be done in another PR.

@NightOwl888 NightOwl888 self-requested a review October 8, 2025 04:19
@NehanPathan NehanPathan merged commit 03af992 into apache:master Oct 8, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

notes:performance-improvement Contains a performance improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Eliminate usage of ByteBuffer in SmartCn

3 participants