feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests #1154
- Replaced ByteBuffer with BinaryReader for efficiency.
- Used ReadOnlySpan<char> in BigramDictionary.
- Added tests for dictionary loading from embedded resources.
- Embedded bigramDict.dct and coreDict.dct.
Force-pushed: cdcc306 to 12223a4
Hey Shad [@NightOwl888], I recently updated my commit because I forgot to add comments in my code earlier.
I’d appreciate it if you could review this PR when you get a chance. Let me know if any further modifications are needed! Also, I’m planning to draft my Google Summer of Code (GSoC) proposal for Lucene.NET. If any maintainers are available, I’d love to share it for feedback. Please let me know if that would be possible. Looking forward to your response. Thanks! 😊🚀
paulirwin
left a comment
Thanks for this PR! I think most of the core logic looks correct, but there are some changes needed to adopt our practices with how we maintain existing code and match the upstream Lucene code. In particular, try to avoid adding unnecessary comments that do not exist upstream, unless it's to point out something lucenenet-specific that deviates from upstream, or to clarify unclear logic. Avoid adding comments to every line of code, as the code should generally be self-explanatory. Also, please make sure to not remove comments that existed previously in Lucene.NET that still apply, or comments that exist upstream.
These changes should all be pretty small and quick to resolve, but my apologies for the quantity of them. I just wanted to help clarify where comments should be removed, added, or restored.
Force-pushed: 0334ead to 76d55f6
Force-pushed: 76d55f6 to fa8fc77
[@paulirwin] ✅ All suggested changes implemented:
🛠️ About the
✅ All tests are passing now.
🔁 Force-pushed twice for clarity:
This keeps the codebase clean, but still understandable for future maintainers.
📢 GSoC Update:
📄 Proposal (Google Docs, comments enabled):
Thanks again for your time and support! 🙏
src/Lucene.Net.Tests.Analysis.SmartCn/Lucene.Net.Tests.Analysis.SmartCn.csproj
Outdated
paulirwin
left a comment
Thanks for addressing the feedback! This looks good to me, but I know @NightOwl888 wanted to review as well.
NightOwl888
left a comment
Thanks for the PR. My apologies for the delay.
Given that this is a port from Java, there are a few practices to keep in mind when making changes such as these. We won't be porting from scratch again. We will be upgrading this code multiple times to keep up with the upstream Java code, so it is important that we follow similar practices as were done in Java to make this upgrade process as painless as possible. So, in short, "maintainable" for us means something very different than most .NET projects. We want to stay closely aligned with the upstream code rather than moving things around to make them more "readable", which would hinder the upgrade process.
- Variable declarations should be left the same as in Java unless we have a specific reason to change them.
- Loop style should match the upstream code (even if it looks strange), so we don't have to re-analyze the business logic every time we apply future changes from Lucene.
- In general, file formats should be portable between .NET and Java. There are some exceptions where in Java they were serializing objects using a proprietary Java format where we have deviated from this (including the `bigramdict.mem` and `coredict.mem` files in this project), but these files are not serviceable by users, anyway.
- Temp files should always be created using `LuceneTestCase.CreateTempFile()` or (as in this case) grouped together using `LuceneTestCase.CreateTempDir()` to ensure that they are not left on disk when the tests are finished running.
- New files that do not exist upstream should all go into a `Support` folder, but `Support` should generally not be part of the namespace. Comparing directories on disk is the simplest way to determine if we have a matching set of files as upstream, but these extra files should be physically separated.
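To illustrate the temp-file point above, here is a minimal sketch of the "group everything in one directory and remove it afterwards" idea using only base class library calls. It is not the Lucene.NET helper: in the actual tests `LuceneTestCase.CreateTempDir()` should be used instead, and the `TempDirSketch` class, the `CreateScratchDir` method, and the file contents are hypothetical names and data chosen for illustration.

```csharp
using System;
using System.IO;

public static class TempDirSketch
{
    // Hypothetical stand-in for a test-framework helper: creates a uniquely
    // named directory under the system temp path.
    public static DirectoryInfo CreateScratchDir(string prefix)
    {
        string path = Path.Combine(
            Path.GetTempPath(),
            prefix + "-" + Guid.NewGuid().ToString("N"));
        return Directory.CreateDirectory(path);
    }

    public static void Main()
    {
        DirectoryInfo dir = CreateScratchDir("smartcn");
        try
        {
            // Every file the test needs lives under the one directory.
            File.WriteAllBytes(Path.Combine(dir.FullName, "coreDict.dct"), new byte[] { 1, 2, 3 });
            File.WriteAllBytes(Path.Combine(dir.FullName, "bigramDict.dct"), new byte[] { 4, 5, 6 });
            Console.WriteLine(dir.GetFiles().Length);
        }
        finally
        {
            // A single recursive delete removes all grouped files.
            dir.Delete(recursive: true);
        }
        Console.WriteLine(Directory.Exists(dir.FullName));
    }
}
```

Grouping the files means cleanup is one operation rather than one per file, which is the property the review comment asks for.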
I have left several comments inline with more detail, but do note that many of them are very repetitive.
…igram and WordDictionary
[@NightOwl888] Hey Shad! 😊 Just wanted to share that I’ve pushed the final changes based on your suggestions. Thank you so much for the clear and helpful feedback throughout. Here's a quick overview of what’s been improved:
Thanks again for all your support. Your feedback really helped me sharpen the code and improve my understanding! 🙌
    {
        cnt = reader.ReadInt32(); // LUCENENET: Use BinaryReader methods instead of ByteBuffer
    }
    catch (EndOfStreamException)
Given that the length is hard coded as a constant instead of being inferred or read from the file, I guess we can leave this here. It seems like the file must contain exactly this set of words and nothing more or less, but it can be customized to change the frequencies. Since it is like this upstream, I guess it is fine for now.
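As a rough sketch of the pattern under discussion, the following reads a fixed, known number of little-endian 32-bit integers with `BinaryReader.ReadInt32()` and reacts to a stream that ends early. The `LittleEndianReadSketch` class, the `ReadCounts` method, and the choice to rethrow as `InvalidDataException` are assumptions made for illustration; the dictionary classes in this PR may handle the early end differently.

```csharp
using System;
using System.IO;
using System.Text;

public static class LittleEndianReadSketch
{
    // Reads exactly expectedCount 32-bit integers from the stream.
    // BinaryReader.ReadInt32() decodes little-endian on every platform,
    // so no explicit byte-order handling is needed.
    public static int[] ReadCounts(Stream stream, int expectedCount)
    {
        var result = new int[expectedCount];
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);
        for (int i = 0; i < expectedCount; i++)
        {
            try
            {
                result[i] = reader.ReadInt32();
            }
            catch (EndOfStreamException e)
            {
                // Assumption: a short file is treated as corrupt data.
                throw new InvalidDataException(
                    $"Expected {expectedCount} entries but the stream ended at entry {i}.", e);
            }
        }
        return result;
    }

    public static void Main()
    {
        // The values 1 and 258 (0x00000102) encoded little-endian.
        byte[] data = { 0x01, 0x00, 0x00, 0x00, 0x02, 0x01, 0x00, 0x00 };
        int[] counts = ReadCounts(new MemoryStream(data), 2);
        Console.WriteLine(string.Join(", ", counts)); // prints "1, 258"
    }
}
```

Because the expected count is a constant rather than a value stored in the file, a file with fewer entries can only be detected when the read runs off the end, which is why the `EndOfStreamException` handler is the only available signal.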
src/Lucene.Net.Tests.Analysis.SmartCn/Hhmm/TestBuildDictionary.cs
Outdated
…fine maxlength usage
Hi Shad, Hope you're doing well. Just a gentle follow-up on this PR. I've made the suggested changes and really appreciate your guidance. Totally understand if you're busy, no rush at all. Just wanted to make sure it stays on your radar. Thanks again for your time! 🙏
Pull Request Overview
This PR optimizes the SmartCn dictionary loading process by replacing ByteBuffer operations with BinaryReader for improved performance, and introduces comprehensive unit tests to ensure correctness. The changes focus on memory efficiency improvements through ReadOnlySpan<char> usage and proper test coverage for dictionary operations.
Key changes:
- Replaced ByteBuffer with BinaryReader.ReadInt32() for more efficient little-endian data reading
- Changed char[] parameters to ReadOnlySpan<char> to reduce memory allocations
- Added comprehensive unit tests for BigramDictionary and WordDictionary loading and operations
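To show why a `ReadOnlySpan<char>` parameter can reduce allocations, here is a small sketch of a hash function that accepts a span, so a caller can hash part of a larger buffer without copying it into a new array. The `SpanHashSketch` class and its 32-bit FNV-1a hash are illustrative assumptions and are not claimed to match the hash methods in `AbstractDictionary`.

```csharp
using System;

public static class SpanHashSketch
{
    // 32-bit FNV-1a over UTF-16 code units. Illustrative only; not
    // necessarily the hash that the SmartCn dictionaries use.
    public static uint Hash(ReadOnlySpan<char> chars)
    {
        uint hash = 2166136261;
        foreach (char c in chars)
        {
            hash ^= c;
            hash *= 16777619; // wraps on overflow in the default unchecked context
        }
        return hash;
    }

    public static void Main()
    {
        char[] buffer = "xxhelloxx".ToCharArray();

        // AsSpan(2, 5) is a view over the middle of the array: no copy is made.
        uint fromSlice = Hash(buffer.AsSpan(2, 5));
        uint fromString = Hash("hello".AsSpan());

        Console.WriteLine(fromSlice == fromString); // prints "True"
    }
}
```

With a `char[]` parameter, hashing the middle of `buffer` would require allocating a five-element array first; the span overload accepts arrays, strings, and slices of either through the same signature.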
Reviewed Changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| Lucene.Net.Tests.Analysis.SmartCn.csproj | Added embedded resource configuration for test data |
| TestBuildDictionary.cs | New comprehensive test class for dictionary loading and functionality |
| Lucene.Net.Analysis.SmartCn.csproj | Added InternalsVisibleTo for test access and removed empty lines |
| WordDictionary.cs | Optimized file reading with BinaryReader and improved comments |
| BigramDictionary.cs | Enhanced with BinaryReader, ReadOnlySpan usage, and robust error handling |
| AbstractDictionary.cs | Updated hash methods to use ReadOnlySpan and added encoding fallbacks |
@paulirwin @NightOwl888
@NehanPathan Added comments to the Copilot comments above. Let's just do the stream seek change for now as that seems like an obvious (if small) improvement. Let's skip the other change as I doubt it is actually a problem.
I’ve tried adding final newlines and removing trailing whitespace in most of the files I could identify, but the check-editorconfig CI is still failing. Is there a recommended way to apply this fix across all files in the repo safely so the CI passes, without accidentally touching unrelated code?
Which text editor are you using to modify the file? Most editors that are widely used for development either have native support for It also helps to have line numbers and whitespace enabled so they are visible in the editor window. Here is how I have Notepad++ configured:
As you can see, the final newline is missing. I would be happy to fix it for you, but I think it would be best if we figure out how to set up your editor so you can quickly work through these errors.
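For reference, the two whitespace problems named in this exchange (trailing whitespace and a missing final newline) can be described in a few lines of code. The sketch below is a hypothetical helper, not part of the repository or its CI check: `WhitespaceFixSketch` and `Normalize` are made-up names, and the repository's own `.editorconfig` remains the authority on what the check requires.

```csharp
using System;

public static class WhitespaceFixSketch
{
    // Strips trailing spaces and tabs from each line and ensures the text
    // ends with a newline, keeping the newline style already in use.
    public static string Normalize(string text)
    {
        if (text.Length == 0)
        {
            return text; // leave empty files untouched
        }

        string newline = text.Contains("\r\n") ? "\r\n" : "\n";
        string[] lines = text.Replace("\r\n", "\n").Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            lines[i] = lines[i].TrimEnd(' ', '\t');
        }

        string joined = string.Join(newline, lines);
        return joined.EndsWith(newline, StringComparison.Ordinal)
            ? joined
            : joined + newline;
    }

    public static void Main()
    {
        string fixedText = Normalize("int a = 1;  \r\nint b = 2;\t");
        Console.WriteLine(fixedText == "int a = 1;\r\nint b = 2;\r\n"); // prints "True"
    }
}
```

Applying such a helper only to the files changed in the PR, rather than to the whole repository, keeps the diff free of unrelated edits.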
…est data with known frequency values to verify the custom data set is loaded.
NightOwl888
left a comment
Just one more issue to fix and this one is finally finished! Thanks Nehan for sticking with it.
I ended up changing the test data one more time, since the data we were using to test the file loading was identical to the data that is loaded by default. While we could verify the data was being loaded manually, the tests wouldn't catch it if we somehow loaded the default data instead of running it with custom data.
So, I made a quick console app to convert the frequency values to specific values that differ from the original data and swapped in those values. I ended up dropping the code for that console app into a comment in the test, because it was basically a one-time thing. But, if we ever need to convert another dataset, we have the code to do it.
There are some more optimizations that can be done, but I think those should be done in another PR.

🎯 Objective:
This pull request (PR) optimizes the SmartCn dictionary loading process and introduces unit tests to ensure correctness and maintainability.
🔥 Key Changes:
✅ 1. Dictionary Optimization:
- Replaced `ByteBuffer` with `BinaryReader.ReadInt32()` for faster and more efficient data reading.
- Used `ReadOnlySpan<char>` to minimize memory usage and improve overall performance.
✅ 2. Comprehensive Unit Tests Added:
Test File: `DictionaryTests.cs`
BigramDictionary Tests:
- `GetInstance()` method to ensure correct singleton instantiation.
- `LoadFromFile()` method to verify successful loading of the dictionary from `bigramDict.dct`.
- `GetFrequency()` method to test frequency retrieval of valid and non-existent entries.
WordDictionary Tests:
- `GetInstance()` to confirm proper instantiation.
- `LoadMainDataFromFile()` becomes accessible (currently it is a private method).
✅ 3. Resource Files Added:
- `Lucene.Net.Tests.Analysis.SmartCn.Resources`
- `bigramDict.dct`
- `coreDict.dct`
✅ 4. Embedded Resource Loading:
- Embedded `.dct` files as resources in the test assembly to eliminate external dependencies.
- `LuceneTestCase` to extract these resources as temporary files during tests.
🧪 Testing Details:
📂 Test Scenarios:
- `bigramDict.dct` and `coreDict.dct` from embedded resources.
- (`hello`, `world`) and ensured non-existent entries return `0`.
- `GetInstance()` method returns a non-null singleton instance.
✅ Assertions Included:
🚀 Why These Changes?
💡 Performance Improvements:
💡 Increased Test Coverage:
💡 Simplified Testing Workflow:
📝 Future Considerations:
🔍 Testing
- `WordDictionary`: `GetInstance()` due to private access of `LoadMainDataFromFile()`.
🚀 Performance Enhancements:
📂 Issue Reference:
Fixes #1153
🔎 Checklist:
📄 How to Run Tests:
- `dotnet build`.
- `dotnet test` to verify that all dictionary operations work correctly.