feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests (#1154)
* feat: Optimize SmartCn dictionaries and add loading tests
- Replaced ByteBuffer with BinaryReader for efficiency.
- Used ReadOnlySpan<char> in BigramDictionary.
- Added tests for dictionary loading from embedded resources.
- Embedded bigramDict.dct and coreDict.dct.
* refactor: apply review suggestions for SmartCn dictionary classes
* Fix casing for bigramdict.dct and coredict.dct to lowercase for case-sensitive OSes
* Revert breaking changes and restore compatibility; update tests for Bigram and WordDictionary
* Improve SmartCN tests: Replace file existence checks with asserts, refine maxlength usage
* Optimize dictionary loading: skip unused handle with Stream.Seek
* Fix: add final newline and remove trailing whitespace in multiple files
* Update SmartCn dictionary tests and BigramDictionary loading
* Update BigramDictionary: LoadFromFile now throws IOException
* Lucene.Net.Analysis.Cn.Smart.Hhmm.TestBuildDictionary: Modified the test data with known frequency values to verify the custom data set is loaded.
* Revert LoadFromFile length check to match upstream Lucene behavior
---------
Co-authored-by: Shad Storhaug <[email protected]>
// LUCENENET: Use BinaryReader to decode little endian instead of ByteBuffer, since this is the default in .NET
+ buffer[0] = reader.ReadInt32(); // frequency
+ buffer[1] = reader.ReadInt32(); // length
+ reader.BaseStream.Seek(4, SeekOrigin.Current); // Skip handle value (unused)

  length = buffer[1];
  if (length > 0)
  {
-     byte[] lchBuffer = new byte[length];
-     dctFile.Read(lchBuffer, 0, lchBuffer.Length);
+     byte[] lchBuffer = reader.ReadBytes(length); // LUCENENET: Use BinaryReader to decode little endian instead of ByteBuffer, since this is the default in .NET
+
      //tmpword = new String(lchBuffer, "GB2312");
      tmpword = gb2312Encoding.GetString(lchBuffer); // LUCENENET specific: use cached encoding instance from base class
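The record layout read in the hunk above (a frequency, a length, an unused handle, then the payload bytes) can be sketched in isolation as follows. This is a minimal illustration, not the library's actual loader: `DictRecordSketch` and `ReadRecord` are hypothetical names, and the in-memory record is made-up test data rather than content from the real `.dct` files.

```csharp
using System;
using System.IO;
using System.Text;

public static class DictRecordSketch
{
    // Hypothetical helper (not part of Lucene.NET): reads one record laid out as
    // [int32 frequency][int32 length][int32 handle][length bytes of payload].
    public static (int Frequency, byte[] Payload) ReadRecord(BinaryReader reader)
    {
        int frequency = reader.ReadInt32();            // BinaryReader decodes little-endian
        int length = reader.ReadInt32();
        reader.BaseStream.Seek(4, SeekOrigin.Current); // skip the unused handle value
        byte[] payload = length > 0 ? reader.ReadBytes(length) : Array.Empty<byte>();
        return (frequency, payload);
    }

    public static void Main()
    {
        // Build a single in-memory record so the sketch runs without any dictionary file.
        using var stream = new MemoryStream();
        using (var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
        {
            writer.Write(42);                        // frequency (made-up value)
            writer.Write(2);                         // payload length in bytes
            writer.Write(0);                         // handle, ignored on read
            writer.Write(new byte[] { 0xD6, 0xD0 }); // two payload bytes
        }
        stream.Position = 0;

        using var reader = new BinaryReader(stream);
        var (frequency, payload) = ReadRecord(reader);
        Console.WriteLine($"{frequency} {payload.Length}"); // prints "42 2"
    }
}
```

Two caveats apply to this sketch. `Seek` requires a seekable stream, and `ReadBytes` can return fewer bytes than requested when the stream ends early, so a stricter loader might compare the returned length against the expected one.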
src/Lucene.Net.Analysis.SmartCn/package.md (3 additions & 4 deletions)
@@ -25,12 +25,11 @@ Analyzer for Simplified Chinese, which indexes words.

  Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.

- * StandardAnalyzer: Index unigrams (individual Chinese characters) as a token.
+ - StandardAnalyzer: Index unigrams (individual Chinese characters) as a token.

- * CJKAnalyzer (in the <xref:Lucene.Net.Analysis.Cjk> namespace of <xref:Lucene.Net.Analysis.Common>): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
-
- * SmartChineseAnalyzer (in this package): Index words (attempt to segment Chinese text into words) as tokens.
+ - CJKAnalyzer (in the <xref:Lucene.Net.Analysis.Cjk> namespace of <xref:Lucene.Net.Analysis.Common>): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.

+ - SmartChineseAnalyzer (in this package): Index words (attempt to segment Chinese text into words) as tokens.