Attempt to support both Linux and Apple ARM #9
|
I feel mangled |
|
@wumpus sorry, the initial comment of this PR was confused (too late in the evening) - I added more information. I need this library to be able to build and test Nutch locally; I've got a couple of issues: commoncrawl/nutch#40 (PR: here) and commoncrawl/nutch#39 |
|
Thanks, @lfoppiano! Great that it can now run on MacOS. However, I've failed to get it to run on Linux - "Ubuntu 25.10" (
The major point is to resolve the issue with C++ name mangling. There are tiny differences between MacOS and Linux, or rather between the compilers used. When demangling, the
There are more challenges ahead:
For now: what about first sharing the procedure to make it run on MacOS in the README? |
|
Hi @lfoppiano, great! Thanks! The tests with the "full" version need adaptation. There are two different results. This needs a closer look. Otherwise:
TODOs (not necessarily for now, but before we use this in production):
|
I found an easy trick to run the test based on the library, however I'm not sure which results to expect for the full library.
That's great! 😅
This needs to be assessed, and there might be viable alternatives, e.g. the WSL (https://learn.microsoft.com/en-us/windows/wsl/install), which may work with the Linux version. Not having Windows makes supporting it very challenging.
Adding the libraries was a quick fix to be able to test Nutch. I will remove them and perhaps study a way to build the JAR from source, as you suggested, with the library as a git submodule (or using the Debian package - as in TODO 1, a lightweight JAR). |
|
@sebastian-nagel I printed out the mismatches. I can understand that covering more languages (from 80 to 160 languages) may reduce the overall accuracy (e.g. see below: Galician confused with Portuguese, two very similar languages). @pjox any clue why the full CLD2 library, which supports 160+ languages, has those mismatches? Maybe the context is too small for the full CLD2 library? Plain text detection: html: |
|
So for the
In any case, none of these examples seems that problematic to me: each is either a pair of very close languages, a case of CLD2 not properly supporting the language, or a case of it not supporting the language at all. So everything seems normal. I'm also tagging @laurieburchell in case they have some additional insights. |
|
I can at least shed some light on the Kurdish: depending on the microlanguage, Kurdish (a macrolanguage) can be written in both Latin and Arabic script. CLD2 is trained on Kurdish in Arabic script, hence the confusion. Where are you getting your test cases from? EDIT: you might find this table of results from the CLD2 repo useful: https://github.com/CLD2Owners/cld2/blob/master/docs/evaluate_cld2_large_20140122.txt. It lets you look up what language/script combinations CLD2 covers, as well as giving you an (optimistic) upper bound for performance. |
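To check which script a given test case is written in before comparing CLD2 results, something along these lines could help. This is only a minimal sketch: the class name `DominantScript` is hypothetical and not part of the library, the sample strings are purely illustrative, and it relies solely on the standard `Character.UnicodeScript` API.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper (not part of language-detection-cld2): reports the
// Unicode script used by most letters of a text, e.g. LATIN or ARABIC.
final class DominantScript {
    private DominantScript() {}

    static Character.UnicodeScript of(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        // Count letters per script; digits, spaces and punctuation are ignored.
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));

        // Pick the script with the highest letter count; UNKNOWN if no letters.
        Character.UnicodeScript best = Character.UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative inputs only, not taken from the test suite.
        System.out.println(of("Silav"));
        System.out.println(of("سلام"));
    }
}
```

Grouping the mismatching test cases by the script returned here would show whether the failures cluster on Latin-script input.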
|
OK, on this point I noticed that Kurdish is not included in the 80 languages supported by the standard library https://github.com/CLD2Owners/cld2?tab=readme-ov-file#supported-languages so I don't understand why the test using that model passes while the test using the full version of the model does not (rephrased for @pjox and @laurieburchell - sorry it was not clear before - maybe it still isn't). FYI the two "versions" of the library are described here: https://github.com/commoncrawl/language-detection-cld2/tree/master?tab=readme-ov-file#using-the-cld2-full-version-160-languages |
|
@sebastian-nagel Finally, this works, with some caveats:
|
I've managed to fix the build of the bindings for both Linux and Apple ARM.
The main problem is that Apple deprecated libstdc++ in favour of libc++. The JNA bindings were written against the mangled symbols produced by the former, which cannot be used on Apple.
What I did:
This makes it possible to build the Java package on Apple, and so to build and test Nutch on Apple ARM.
I'm not sure this is an acceptable change; however, we can keep this branch hidden.
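
The symbol mismatch described above could be handled on the Java side by choosing the mangled name per platform. The following is a minimal sketch, not the code in this PR: `MangledSymbolChooser` and its methods are hypothetical, the symbol strings are placeholders rather than real exported names, and it assumes the Linux build links against libstdc++ while the macOS build links against libc++ (a Linux build using clang with libc++ would break that assumption).

```java
import java.util.Locale;

// Hypothetical helper (not part of this PR): selects which mangled C++
// symbol name to bind, depending on the operating system.
final class MangledSymbolChooser {
    private MangledSymbolChooser() {}

    // True if the value of the "os.name" system property denotes macOS.
    static boolean isMac(String osName) {
        String n = osName.toLowerCase(Locale.ROOT);
        return n.contains("mac") || n.contains("darwin");
    }

    // Assumption: macOS builds export libc++-style names, everything else
    // exports libstdc++-style names.
    static String pick(String osName, String libstdcxxSymbol, String libcxxSymbol) {
        return isMac(osName) ? libcxxSymbol : libstdcxxSymbol;
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name", "");
        // Placeholders: the real names would be read from the built library,
        // for example by listing its exported symbols.
        System.out.println(pick(os,
                "<symbol exported by the Linux build>",
                "<symbol exported by the macOS build>"));
    }
}
```

Keeping the selection in one place means only this mapping changes if another compiler or standard library has to be supported later.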