feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests #1154

NehanPathan · 2025-03-31T17:29:02Z

🎯 Objective:

This pull request (PR) optimizes the SmartCn dictionary loading process and introduces unit tests to ensure correctness and maintainability.

🔥 Key Changes:

✅ 1. Dictionary Optimization:

Replaced ByteBuffer with BinaryReader.ReadInt32() for faster and more efficient data reading.
Implemented ReadOnlySpan<char> to minimize memory usage and improve overall performance.

✅ 2. Comprehensive Unit Tests Added:

Test File: DictionaryTests.cs
- Contains tests for loading dictionaries and verifying dictionary operations.
BigramDictionary Tests:
- GetInstance() method to ensure correct singleton instantiation.
- LoadFromFile() method to verify successful loading of the dictionary from bigramDict.dct.
- GetFrequency() method to test frequency retrieval of valid and non-existent entries.
WordDictionary Tests:
- GetInstance() to confirm proper instantiation.
- Future tests can be added if LoadMainDataFromFile() becomes accessible (Currently it is private method).

✅ 3. Resource Files Added:

Location: Lucene.Net.Tests.Analysis.SmartCn.Resources
- bigramDict.dct
- coreDict.dct

✅ 4. Embedded Resource Loading:

Embedded both .dct files as resources in the test assembly to eliminate external dependencies.
Created a utility in LuceneTestCase to extract these resources as temporary files during tests.

🧪 Testing Details:

📂 Test Scenarios:

Validated successful loading of both bigramDict.dct and coreDict.dct from embedded resources.
Checked frequency retrieval for valid entries (hello, world) and ensured non-existent entries return 0.
Verified that the GetInstance() method returns a non-null singleton instance.

✅ Assertions Included:

Frequency correctness for known entries.
Proper dictionary instantiation.
No regression in dictionary functionality.

🚀 Why These Changes?

💡 Performance Improvements:

Faster dictionary loading with reduced memory overhead.

💡 Increased Test Coverage:

Ensures that dictionary operations work correctly and efficiently.

💡 Simplified Testing Workflow:

Embedded resource handling eliminates file path dependencies.

📝 Future Considerations:

🔍 Testing WordDictionary:

Currently limited to GetInstance() due to private access of LoadMainDataFromFile().
Additional tests can be added when the method’s visibility is updated.

🚀 Performance Enhancements:

Future work may include further performance optimization of dictionary lookups and hash collision handling.

📂 Issue Reference:

Fixes #1153

🔎 Checklist:

Read and followed the [Contributor Guide](https://github.com/apache/lucenenet/blob/main/CONTRIBUTING.md) and [Code of Conduct](https://www.apache.org/foundation/policies/conduct.html).
Included relevant unit or integration tests.
Added inline documentation where applicable.
Created an open issue and linked it to this PR.

📄 How to Run Tests:

Build the solution using dotnet build.
Run tests using dotnet test to verify that all dictionary operations work correctly.

- Replaced ByteBuffer with BinaryReader for efficiency. - Used ReadOnlySpan<char> in BigramDictionary. - Added tests for dictionary loading from embedded resources. - Embedded bigramDict.dct and coreDict.dct.

NehanPathan · 2025-04-03T06:40:55Z

Hey Shad [@NightOwl888],

I recently updated my commit because I forgot to add comments in my code earlier.

I also noticed that the LoadMainDataFromFile() method remained internal instead of private—this was unintentional, as I had temporarily changed it for experimental testing but forgot to revert it back.
In yesterday’s commit, I reverted it back to private and added a comment explaining its purpose to avoid confusion in the future.
The force push was only to include this minor correction, and no other changes were made to the PR.

I’d appreciate it if you could review this PR when you get a chance. Let me know if any further modifications are needed!

Also, I’m planning to draft my Google Summer of Code (GSoC) proposal for Lucene.NET. If any maintainers are available, I’d love to share it for feedback. Please let me know if that would be possible.

Looking forward to your response. Thanks! 😊🚀

paulirwin

Thanks for this PR! I think most of the core logic looks correct, but there are some changes needed to adopt our practices with how we maintain existing code and match the upstream Lucene code. In particular, try to avoid adding unnecessary comments that do not exist upstream, unless it's to point out something lucenenet-specific that deviates from upstream, or to clarify unclear logic. Avoid adding comments to every line of code, as the code should generally be self-explanatory. Also, please make sure to not remove comments that existed previously in Lucene.NET that still apply, or comments that exist upstream.

These changes should all be pretty small and quick to resolve, but my apologies for the quantity of them. I just wanted to help clarify where comments should be removed, added, or restored.

src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs

src/Lucene.Net.Analysis.SmartCn/Hhmm/WordDictionary.cs

src/Lucene.Net.Tests.Analysis.SmartCn/DictionaryTests.cs

NehanPathan · 2025-04-08T09:23:43Z

[@paulirwin]
Hi Paul, thanks for the detailed feedback!

✅ All suggested changes implemented:

Updated code comments to align with upstream Lucene and Lucene.NET standards.
Restored previously removed comments where appropriate, and added // LUCENENET tags for custom logic or deviations.
Lowercased bigramdict.dct filename to match upstream expectations and ensure compatibility on case-sensitive file systems.
Added [LuceneNetSpecific] attribute to custom test classes.

🛠️ About the try-catch (EndOfStreamException) block:
We retained a minimal try-catch only around the ReadInt32() line within the character loop to handle EndOfStreamException. This exception arises when the file ends naturally (not due to corruption), which is expected behavior for our test data.
Without this handling, DictionaryTests fail because the test file contains fewer entries than the full GB2312 character set (6768). This pattern reflects how earlier Lucene.NET logic handled the end of file — exiting cleanly once the data is exhausted.

✅ All tests are passing now.

🔁 Force-pushed twice for clarity:

First push: Removed the inline comment // Reached end of file — assume remaining chars are missing above the break; to reduce verbosity.
Second push: Re-added a shorter version // Reached end of file to retain context without clutter.

This keeps the codebase clean, but still understandable for future maintainers.

📢 GSoC Update:
I’ve officially submitted my Google Summer of Code 2025 proposal for Lucene.NET. Tomorrow is the final day for mentor feedback — if you get a moment, I’d truly appreciate any suggestions or thoughts you might have!

📄 Proposal (Google Docs, comments enabled):
👉 [Google Summer of Code 2025 Proposal](https://docs.google.com/document/d/1uWD45L8iNZ1A102Gxbf23ZABdGjHmh5Xm6SAZyWQHO0/edit?usp=sharing)

Thanks again for your time and support! 🙏

src/Lucene.Net.Tests.Analysis.SmartCn/Lucene.Net.Tests.Analysis.SmartCn.csproj

…sensitive OSes

paulirwin

Thanks for addressing the feedback! This looks good to me, but I know @NightOwl888 wanted to review as well.

NightOwl888

Thanks for the PR. My apologies for the delay.

Given that this is a port from Java, there are a few practices to keep in mind when making changes such as these. We won't be porting from scratch again. We will be upgrading this code multiple times to keep up with the upstream Java code, so it is important that we follow similar practices as were done in Java to make this upgrade process as painless as possible. So, in short, "maintainable" for us means something very different than most .NET projects. We want to stay closely aligned with the upstream code rather than moving things around to make them more "readable", which would hinder the upgrade process.

Variable declarations should be left the same as in Java unless we have a specific reason to change them.
Loop style should match the upstream code (even if it looks strange), so we don't have to re-analyze the business logic every time we apply future changes from Lucene.
In general, file formats should be portable between .NET and Java. There are some exceptions where in Java they were serializing objects using a proprietary Java format where we have deviated from this (including the bigramdict.mem and coredict.mem files in this project), but these files are not serviceable by users, anyway.
Temp files should always be created using LuceneTestCase.CreateTempFile() or (as in this case) grouped together using LuceneTestCase.CreateTempDir() to ensure that they are not left on disk when the tests are finished running.
New files that do not exist upstream should all go into a Support folder, but Support should generally not be part of the namespace. Comparing directories on disk is the simplest way to determine if we have a matching set of files as upstream, but these extra files should be physically separated.

I have left several comments inline with more detail, but do note that many of them are very repetitive.

src/Lucene.Net.Analysis.SmartCn/Hhmm/WordDictionary.cs

src/Lucene.Net.Tests.Analysis.SmartCn/DictionaryTests.cs

src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs

src/Lucene.Net.Tests.Analysis.SmartCn/DictionaryTests.cs

…igram and WordDictionary

NehanPathan · 2025-04-24T13:09:07Z

[@NightOwl888 ]

Hey Shad! 😊

Just wanted to share that I’ve pushed the final changes based on your suggestions — thank you so much for the clear and helpful feedback throughout. Here's a quick overview of what’s been improved:

Refactored AbstractDictionary, WordDictionary, and BigramDictionary to enhance code structure, clarity, and maintainability, following your suggestions for the upstream.
Cleaned up and finalized TestBuildDictionary.cs to validate both dictionary types correctly
Added the zipped file to keep things tidy and lightweight

Thanks again for all your support — your feedback really helped me sharpen the code and improve my understanding! 🙌
If there's anything else to update or improve, let me know....

src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs

NightOwl888 · 2025-04-24T15:41:04Z

src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs

+                {
+                   cnt = reader.ReadInt32();  // LUCENENET: Use BinaryReader methods instead of ByteBuffer
+                }
+                catch (EndOfStreamException)


Given that the length is hard coded as a constant instead of being inferred or read from the file, I guess we can leave this here. It seems like the file must contain exactly this set of words and nothing more or less, but it can be customized to change the frequencies. Since it is like this upstream, I guess it is fine for now.

src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs

src/Lucene.Net.Analysis.SmartCn/Hhmm/AbstractDictionary.cs

src/Lucene.Net.Analysis.SmartCn/Hhmm/WordDictionary.cs

src/Lucene.Net.Tests.Analysis.SmartCn/Hhmm/TestBuildDictionary.cs

…fine maxlength usage

NehanPathan · 2025-04-30T11:26:03Z

@NightOwl888

Hi Shad,

Hope you're doing well. Just a gentle follow-up on this PR—I've made suggested changes and really appreciate your guidance. Totally understand if you're busy, no rush at all. Just wanted to make sure it stays on your radar.

Thanks again for your time! 🙏
—Nehan

Copilot

Pull Request Overview

This PR optimizes the SmartCn dictionary loading process by replacing ByteBuffer operations with BinaryReader for improved performance, and introduces comprehensive unit tests to ensure correctness. The changes focus on memory efficiency improvements through ReadOnlySpan usage and proper test coverage for dictionary operations.

Key changes:

Replaced ByteBuffer with BinaryReader.ReadInt32() for more efficient little-endian data reading
Changed char[] parameters to ReadOnlySpan to reduce memory allocations
Added comprehensive unit tests for BigramDictionary and WordDictionary loading and operations

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
Lucene.Net.Tests.Analysis.SmartCn.csproj	Added embedded resource configuration for test data
TestBuildDictionary.cs	New comprehensive test class for dictionary loading and functionality
Lucene.Net.Analysis.SmartCn.csproj	Added InternalsVisibleTo for test access and removed empty lines
WordDictionary.cs	Optimized file reading with BinaryReader and improved comments
BigramDictionary.cs	Enhanced with BinaryReader, ReadOnlySpan usage, and robust error handling
AbstractDictionary.cs	Updated hash methods to use ReadOnlySpan and added encoding fallbacks

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs

src/Lucene.Net.Analysis.SmartCn/Hhmm/AbstractDictionary.cs

NehanPathan · 2025-09-25T09:36:06Z

@paulirwin @NightOwl888
If we adopt the
[ length > 0 && reader.BaseStream.Position + length <= dctFile.Lengt ]and [reader.BaseStream.Seek(4, SeekOrigin.Current) ] approach in BigramDictionary, it makes sense to apply the same change in WordDictionary as well for consistency and slightly improved efficiency.

NehanPathan · 2025-10-06T13:57:04Z

@paulirwin @NightOwl888
done the last suggested changes locally
Just wanted to confirm if both of you are good with these changes before I push the final commit...

paulirwin · 2025-10-06T16:27:15Z

@NehanPathan Added comments to the Copilot comments above. Let's just do the stream seek change for now as that seems like an obvious (if small) improvement. Let's skip the other change as I doubt it is actually a problem.

NehanPathan · 2025-10-06T17:17:57Z

@paulirwin @NightOwl888,

I’ve tried adding final newlines and removing trailing whitespace in the most of files I could identify, but the check-editorconfig CI is still failing.

Is there a recommended way to apply this fix across all files in the repo safely so the CI passes, without accidentally touching unrelated code?
beacause same test fail in #1182 pr also

NightOwl888 · 2025-10-06T17:39:30Z

@paulirwin @NightOwl888,

I’ve tried adding final newlines and removing trailing whitespace in the most of files I could identify, but the check-editorconfig CI is still failing.

Is there a recommended way to apply this fix across all files in the repo safely so the CI passes, without accidentally touching unrelated code? beacause same test fail in #1182 pr also

Which text editor are you using to modify the file? Most editors that are widely used for development either have native support for .editorconfig files or there is a plugin to download (and possibly enable in settings). Once it is enabled, the editor should respect the settings and add the newline automatically.

It also helps to have line numbers and whitespace enabled so they are visible in the editor window. Here is how I have Notepad++ configured:

As you can see, the final newline is missing.

I would be happy to fix it for you, but I think it would be best if we figure out how to set up your editor so you can quickly work through these errors.

…hitespace

…est data with known frequency values to verify the custom data set is loaded.

NightOwl888

Just one more issue to fix and this one is finally finished! Thanks Nehan for sticking with it.

I ended up changing the test data one more time, since the data we were using to test the file loading was identical to the data that is loaded by default. While we could verify the data was being loaded manually, the tests wouldn't catch it if we somehow loaded the default data instead of running it with custom data.

So, I made a quick console app to convert the frequency values to specific values that differ from the original data and swapped in those values. I ended up dropping the code for that console app into a comment in the test, because it was basically a one-time thing. But, if we ever need to convert another dataset, we have the code to do it.

There are some more optimizations that can be done, but I think those should be done in another PR.

src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs

feat: Optimize SmartCn dictionaries and add loading tests

12223a4

- Replaced ByteBuffer with BinaryReader for efficiency. - Used ReadOnlySpan<char> in BigramDictionary. - Added tests for dictionary loading from embedded resources. - Embedded bigramDict.dct and coreDict.dct.

NehanPathan force-pushed the feat/optimize-smartcn-dictionaries branch from cdcc306 to 12223a4 Compare April 2, 2025 06:01

paulirwin self-assigned this Apr 8, 2025

paulirwin self-requested a review April 8, 2025 02:01

paulirwin added the notes:performance-improvement Contains a performance improvement label Apr 8, 2025

paulirwin requested changes Apr 8, 2025

View reviewed changes

NehanPathan force-pushed the feat/optimize-smartcn-dictionaries branch from 0334ead to 76d55f6 Compare April 8, 2025 09:08

refactor: apply review suggestions for SmartCn dictionary classes

fa8fc77

NehanPathan force-pushed the feat/optimize-smartcn-dictionaries branch from 76d55f6 to fa8fc77 Compare April 8, 2025 09:17

paulirwin reviewed Apr 9, 2025

View reviewed changes

src/Lucene.Net.Tests.Analysis.SmartCn/Lucene.Net.Tests.Analysis.SmartCn.csproj Outdated Show resolved Hide resolved

Fix casing for bigramdict.dct and coredict.dct to lowercase for case-…

539d714

…sensitive OSes

paulirwin requested a review from NightOwl888 April 15, 2025 13:13

paulirwin approved these changes Apr 15, 2025

View reviewed changes

NightOwl888 requested changes Apr 19, 2025

View reviewed changes

NightOwl888 reviewed Apr 24, 2025

View reviewed changes

src/Lucene.Net.Tests.Analysis.SmartCn/DictionaryTests.cs Outdated Show resolved Hide resolved

Revert breaking changes and restore compatibility; update tests for B…

cb6d24c

…igram and WordDictionary

NightOwl888 requested changes Apr 24, 2025

View reviewed changes

Improve SmartCN tests: Replace file existence checks with asserts, re…

2fc4987

…fine maxlength usage

NehanPathan requested a review from NightOwl888 May 10, 2025 16:23

paulirwin requested a review from Copilot August 20, 2025 02:27

Copilot AI reviewed Aug 20, 2025

View reviewed changes

src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs Outdated Show resolved Hide resolved

src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs Outdated Show resolved Hide resolved

src/Lucene.Net.Analysis.SmartCn/Hhmm/AbstractDictionary.cs Show resolved Hide resolved

NehanPathan added 2 commits October 6, 2025 22:22

Optimize dictionary loading: skip unused handle with Stream.Seek

8821689

Fix: add final newline and remove trailing whitespace

8c7b714

NehanPathan added 2 commits October 6, 2025 22:37

Fix again: add final newline and remove trailing whitespace

7c78868

Merge branch 'master' into feat/optimize-smartcn-dictionaries

61a4c92

NehanPathan and others added 5 commits October 6, 2025 23:16

last file fix: add final newline and remove trailing whitespace

941fe08

file fix: setup extension for add final newline and remove trailing w…

4ebd65a

…hitespace

Update SmartCn dictionary tests and BigramDictionary loading

a14bf81

Update BigramDictionary ie Updated LoadFromFile to throw IOException

c1bb6a5

Lucene.Net.Analysis.Cn.Smart.Hhmm.TestBuildDictionary: Modified the t…

b83ed00

…est data with known frequency values to verify the custom data set is loaded.

NightOwl888 requested changes Oct 7, 2025

View reviewed changes

src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs Outdated Show resolved Hide resolved

Revert LoadFromFile length check to match upstream Lucene behavior

965a088

paulirwin approved these changes Oct 8, 2025

View reviewed changes

NightOwl888 self-requested a review October 8, 2025 04:19

NightOwl888 approved these changes Oct 8, 2025

View reviewed changes

NehanPathan merged commit 03af992 into apache:master Oct 8, 2025
17 checks passed

feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests #1154

feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests #1154

Uh oh!

Conversation

NehanPathan commented Mar 31, 2025

🎯 Objective:

🔥 Key Changes:

🧪 Testing Details:

🚀 Why These Changes?

📝 Future Considerations:

📂 Issue Reference:

🔎 Checklist:

📄 How to Run Tests:

Uh oh!

NehanPathan commented Apr 3, 2025

Uh oh!

paulirwin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NehanPathan commented Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

paulirwin left a comment

Choose a reason for hiding this comment

Uh oh!

NightOwl888 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NehanPathan commented Apr 24, 2025

[@NightOwl888 ]

Uh oh!

Uh oh!

NightOwl888 Apr 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NehanPathan commented Apr 30, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NehanPathan commented Sep 25, 2025

Uh oh!

NehanPathan commented Oct 6, 2025

Uh oh!

paulirwin commented Oct 6, 2025

Uh oh!

NehanPathan commented Oct 6, 2025

NehanPathan commented Apr 8, 2025 •

edited

Loading