
Attempt to support both Linux and Apple ARM #9

Open
lfoppiano wants to merge 23 commits into master from feature/support-arm-mac

Conversation

@lfoppiano
Contributor

@lfoppiano lfoppiano commented Feb 23, 2026

I've managed to fix the build of the bindings for both Linux and Apple ARM.

The main problem is that Apple deprecated libstdc++ in favour of libc++. The JNA bindings referenced the mangled symbols produced with the former, which cannot be used on Apple platforms.

What I did:

  1. I created a fork of the CLD2 library (https://github.com/lfoppiano/cld2) and compiled it for all supported platforms (Linux, macOS) against libc++ using GitHub Actions (for Windows, I'm not sure it's possible with the runners available for free).
  2. I manually committed the resulting .so / .dylib files into this repository
  3. I updated the JNA symbols in the Java classes

This makes it possible to build the Java package on Apple hardware, and to build and test Nutch on Apple ARM.

I'm not sure this is an acceptable change, however we can keep this branch hidden

@lfoppiano lfoppiano linked an issue Feb 23, 2026 that may be closed by this pull request
@wumpus
Member

wumpus commented Feb 24, 2026

I feel mangled

@lfoppiano
Contributor Author

lfoppiano commented Feb 24, 2026

@wumpus sorry, the initial comment of this PR was confusing (written too late in the evening) - I've added more information.

I need this library to be able to build and test nutch locally, I've got a couple of issues commoncrawl/nutch#40 (PR: here) and commoncrawl/nutch#39

@sebastian-nagel
Contributor

Thanks, @lfoppiano! Great that it can now run on macOS.

However, I've failed to get it to run on Linux - "Ubuntu 25.10" (lsb_release -d) "x86_64" (uname -m) - in three configurations:

  1. with the package libcld2-dev installed:
    $> mvn clean verify
    ...
    java.lang.UnsatisfiedLinkError: Error looking up function '_ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPdPNSt3__16vectorINS_11ResultChunkENS9_9allocatorISB_EEEES7_Pb': /lib/x86_64-linux-gnu/libcld2.so: undefined symbol: _ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPdPNSt3__16vectorINS_11ResultChunkENS9_9allocatorISB_EEEES7_Pb
    
    The change in Cld2Library does not fit with the shared object shipped:
    $> readelf -Ws --dyn-syms src/main/resources/linux-amd64/libcld2.so | grep -c _ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPdPSt6vectorINS_11ResultChunkESaISA_EES7_Pb
    2
    
    See also trial 3, it's related.
  2. After removing the packages libcld2-0 and libcld2-dev:
    $> mvn clean verify
    ...
    Native library (linux-x86-64/libcld2.so) not found in resource path
    
    Something's wrong with selecting the right resource path.
  3. Same as 2, but manually pointing to the right path:
    $> LD_LIBRARY_PATH=$PWD/target/classes/linux-amd64 mvn clean verify
    ...
    java.lang.UnsatisfiedLinkError: Error looking up function '_ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPdPNSt3__16vectorINS_11ResultChunkENS9_9allocatorISB_EEEES7_Pb': /mnt/data/wastl/proj/cc/git/language-detection-cld2/target/classes/linux-amd64/libcld2.so: undefined symbol: _ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPdPNSt3__16vectorINS_11ResultChunkENS9_9allocatorISB_EEEES7_Pb
    
    The symbol is indeed not defined in the shared object:
    $> readelf -Ws --dyn-syms src/main/resources/linux-amd64/libcld2.so | grep -c _ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPdPNSt3__16vectorINS_11ResultChunkENS9_9allocatorISB_EEEES7_Pb
    0
    
    (It is in the MacOS objects)
    $> strings src/main/resources/darwin-amd64/libcld2.dylib | grep -c _ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPdPNSt3__16vectorINS_11ResultChunkENS9_9allocatorISB_EEEES7_Pb
    1
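
The mismatch in trial 2 (the loader asks for linux-x86-64 while the resources live under linux-amd64) can be illustrated with a small sketch. This is only a simplified approximation of how a JNA-style resource prefix might be derived from os.name and os.arch; the class and method names are hypothetical and this is not JNA's actual code.

```java
import java.util.Locale;

/** Hypothetical helper: simplified approximation of a JNA-style resource prefix. */
final class ResourcePrefix {

    private ResourcePrefix() {}

    /** Maps different spellings of the 64-bit x86 architecture to one canonical name. */
    static String canonicalArch(String osArch) {
        String arch = osArch.toLowerCase(Locale.ROOT).trim();
        switch (arch) {
            case "x86_64":
            case "amd64":
                // Assumption: both spellings end up as "x86-64", as the error message suggests.
                return "x86-64";
            default:
                return arch;
        }
    }

    /** Builds "<os family>-<canonical arch>", e.g. the directory searched on the classpath. */
    static String prefix(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String family;
        if (os.startsWith("linux")) {
            family = "linux";
        } else if (os.startsWith("mac") || os.startsWith("darwin")) {
            family = "darwin";
        } else {
            family = os.replace(' ', '-');
        }
        return family + "-" + canonicalArch(osArch);
    }
}
```

With this sketch, ResourcePrefix.prefix("Linux", "amd64") yields linux-x86-64, so a shared object stored under linux-amd64 would not be found unless the directory is renamed or the path is set explicitly.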
    

The major point is to resolve the issue with C++ name mangling. There are small differences between macOS and Linux, or rather between the compilers and C++ standard libraries used. When demangling, the std namespace looks different:

# Darwin
$> echo _ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPdPNSt3__16vectorINS_11ResultChunkENS9_9allocatorISB_EEEES7_Pb \
  | c++filt
CLD2::ExtDetectLanguageSummary(char const*, int, bool, CLD2::CLDHints const*, int, CLD2::Language*, int*, double*, std::__1::vector<CLD2::ResultChunk, std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*)

# Linux
$> echo _ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPdPSt6vectorINS_11ResultChunkESaISA_EES7_Pb \
  | c++filt
CLD2::ExtDetectLanguageSummary(char const*, int, bool, CLD2::CLDHints const*, int, CLD2::Language*, int*, double*, std::vector<CLD2::ResultChunk, std::allocator<CLD2::ResultChunk> >*, int*, bool*)
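
One possible way to handle both manglings on the Java side is to pick the symbol name at runtime. The following is a minimal sketch, not the code of this PR: the class and method names are hypothetical, and using the operating system as a proxy for the C++ standard library (libc++ on macOS, libstdc++ on Linux) is an assumption that breaks if a Linux library is built against libc++.

```java
import java.util.Locale;

/** Hypothetical helper: selects the mangled name of ExtDetectLanguageSummary. */
final class Cld2Symbols {

    /** Mangled name as produced with libstdc++ (std::vector). */
    static final String LIBSTDCXX =
        "_ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPd"
        + "PSt6vectorINS_11ResultChunkESaISA_EES7_Pb";

    /** Mangled name as produced with libc++ (std::__1::vector). */
    static final String LIBCXX =
        "_ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPd"
        + "PNSt3__16vectorINS_11ResultChunkENS9_9allocatorISB_EEEES7_Pb";

    private Cld2Symbols() {}

    /** Assumption: macOS libraries use libc++, everything else uses libstdc++. */
    static String extDetectLanguageSummary(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        boolean mac = os.startsWith("mac") || os.startsWith("darwin");
        return mac ? LIBCXX : LIBSTDCXX;
    }
}
```

A caller could pass System.getProperty("os.name") and use the returned string when looking up the native function. A more robust variant might try both names and keep the one that resolves.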

There are more challenges ahead:

  • We definitely need to make it work for both the 80- and 160-language variants of CLD2. See the instructions in the README.
  • If we ship binaries, we need a reliable way to generate and update them. This could be a GitHub workflow. We then release the jar (maybe even on Maven Central), but I'd prefer not to keep the native libraries under version control, or only as secondary artifacts. See for example, how JNA itself deals with shared objects.
  • The matrix of operating systems and CPU architectures is big, see the JNA Platform class. It's almost impossible to cover even a non-trivial part of it.

For now: how about first sharing the procedure to make it run on macOS in the README?

@sebastian-nagel
Contributor

Hi @lfoppiano, great! Thanks!

The tests with the "full" version need adaptation. There are two different results. This needs a closer look.

Otherwise:

TODOs (not necessarily for now, but before we use this in production):

  • The included shared objects significantly increase the package size. This then also affects the Nutch job package size. Maybe allow to build a small jar without the shared objects?
  • The Ubuntu / Debian shared objects are smaller, likely because they are stripped. See https://code.launchpad.net/ubuntu/+source/cld2. There are also patches applied.
  • Still thinking that the native library build should be in this repository, linking to CLD2 as a Git submodule. It must be reproducible and transparent.

@lfoppiano
Contributor Author

lfoppiano commented Feb 26, 2026

> Hi @lfoppiano, great! Thanks!
>
> The tests with the "full" version need adaptation. There are two different results. Needs a closer look.

I found an easy trick to run the tests based on the library; however, I'm not sure which results to expect for the full library.

> Otherwise:
>
> * yes, this works on Linux as expected.

That's great! 😅

> * great to have (after many years) a solution for the old issue [java.lang.UnsatisfiedLinkError: Error looking up function on Mac #3](https://github.com/commoncrawl/language-detection-cld2/issues/3)
>
> * and the potential option to release this as a package ([Cannot find package from Maven repository #2](https://github.com/commoncrawl/language-detection-cld2/issues/2)). But this might cause more work to support Windows and other platforms, if anybody really starts to use it.

This needs to be assessed, and there might be viable alternatives, e.g. WSL (https://learn.microsoft.com/en-us/windows/wsl/install), which may work with the Linux version. Not having a Windows machine makes supporting it very challenging.

> TODOs (not necessarily for now, but before we use this in production):
>
> * The included shared objects significantly increase the package size. This then also affects the Nutch job package size. Maybe allow to build a small jar without the shared objects?
>
> * The Ubuntu / Debian shared objects are smaller, likely because they are stripped. See https://code.launchpad.net/ubuntu/+source/cld2. There are also patches applied.
>
> * Still thinking that the native library build should be in this repository, linking to CLD2 as a Git submodule. It must be reproducible and transparent.

Adding the libraries was a quick fix to be able to test Nutch. I will remove them and perhaps study a way to build the JAR from source, as you suggested, with the library as a Git submodule (or using the Debian package, i.e. a lightweight JAR as in TODO 1).

@lfoppiano
Contributor Author

lfoppiano commented Feb 26, 2026

@sebastian-nagel I printed out the mismatches. I can understand that covering more languages (160 instead of 80) may reduce the overall accuracy (e.g. see below Galician confused with Portuguese - very similar languages).
I don't get why there is a complete mismatch with Kurdish (the library responds Unknown). For the sake of those tests, should I just remove the faulty cases?

@pjox any clue why the full CLD2 library, which supports 160+ languages, has those mismatches? Maybe the context is too small for the full CLD2 library?

Plain text detection:

Wrong language (expected: ku, predicted: un) for test 144: Dema ew wenda bu, ew ramiya "xweziya min guhdarî shîret a wê kir baya"
Wrong language ISO-639-3 (expected: kur, predicted: null) for test 144: Dema ew wenda bu, ew ramiya "xweziya min guhdarî shîret a wê kir baya"
Wrong language (expected: ku, predicted: un) for test 237: Min nîvisand hefteya derpas buyî ji birêz Wood re u pirsî ji wî bide te kar di b
Wrong language ISO-639-3 (expected: kur, predicted: null) for test 237: Min nîvisand hefteya derpas buyî ji birêz Wood re u pirsî ji wî bide te kar di b
Wrong language (expected: ht, predicted: crs) for test 259: Ou, nou papa ki dan lesyel, Fer ou ganny rekonnet konman Bondye.
Wrong language ISO-639-3 (expected: hat, predicted: crs) for test 259: Ou, nou papa ki dan lesyel, Fer ou ganny rekonnet konman Bondye.
Wrong language (expected: bs, predicted: hr) for test 283: Mi se približimo, pokušavati razumjeti jedni drugog, ali samo povrijedimo jedni 
Wrong language ISO-639-3 (expected: bos, predicted: hrv) for test 283: Mi se približimo, pokušavati razumjeti jedni drugog, ali samo povrijedimo jedni 
Wrong language (expected: ku, predicted: un) for test 351: Eger tu bawer dike ku bi risvakirinê tu ê problem a masîvanan chareser bike, bas
Wrong language ISO-639-3 (expected: kur, predicted: null) for test 351: Eger tu bawer dike ku bi risvakirinê tu ê problem a masîvanan chareser bike, bas
Wrong language (expected: ku, predicted: un) for test 363: Va ye sirra min, ew bi xwe gelekî sade ye: Mirov tenê bi dilê xwe dikare baş bib
Wrong language ISO-639-3 (expected: kur, predicted: null) for test 363: Va ye sirra min, ew bi xwe gelekî sade ye: Mirov tenê bi dilê xwe dikare baş bib
Wrong language (expected: ku, predicted: un) for test 385: Hemu zimanan de vegotin, frase, îdîom, u vebêj he ne ku bi tîpîtî nayên wergeran
Wrong language ISO-639-3 (expected: kur, predicted: null) for test 385: Hemu zimanan de vegotin, frase, îdîom, u vebêj he ne ku bi tîpîtî nayên wergeran

html:

Wrong language (expected: gl, predicted: pt) for test 2: Os gobernos, para ensinar como desfrutar dos móbiles sen sermos dominados por el
Wrong language ISO-639-3 (expected: glg, predicted: por) for test 2: Os gobernos, para ensinar como desfrutar dos móbiles sen sermos dominados por el
Wrong language (expected: ku, predicted: un) for test 144: Dema ew wenda bu, ew ramiya "xweziya min guhdarî shîret a wê kir baya"
Wrong language ISO-639-3 (expected: kur, predicted: null) for test 144: Dema ew wenda bu, ew ramiya "xweziya min guhdarî shîret a wê kir baya"
Wrong language (expected: ku, predicted: un) for test 237: Min nîvisand hefteya derpas buyî ji birêz Wood re u pirsî ji wî bide te kar di b
Wrong language ISO-639-3 (expected: kur, predicted: null) for test 237: Min nîvisand hefteya derpas buyî ji birêz Wood re u pirsî ji wî bide te kar di b
Wrong language (expected: ht, predicted: crs) for test 259: Ou, nou papa ki dan lesyel, Fer ou ganny rekonnet konman Bondye.
Wrong language ISO-639-3 (expected: hat, predicted: crs) for test 259: Ou, nou papa ki dan lesyel, Fer ou ganny rekonnet konman Bondye.
Wrong language (expected: bs, predicted: hr) for test 283: Mi se približimo, pokušavati razumjeti jedni drugog, ali samo povrijedimo jedni 
Wrong language ISO-639-3 (expected: bos, predicted: hrv) for test 283: Mi se približimo, pokušavati razumjeti jedni drugog, ali samo povrijedimo jedni 
Wrong language (expected: it, predicted: eo) for test 289: Esempio: per esprimere la direzione, aggiungiamo alla parola la lettera finale "
Wrong language ISO-639-3 (expected: ita, predicted: epo) for test 289: Esempio: per esprimere la direzione, aggiungiamo alla parola la lettera finale "
Wrong language (expected: ht, predicted: un) for test 313: Burj Khalifa aktyelman se gratsyél ki pi wo nan mond.
Wrong language ISO-639-3 (expected: hat, predicted: null) for test 313: Burj Khalifa aktyelman se gratsyél ki pi wo nan mond.
Wrong language (expected: ht, predicted: un) for test 315: Aktyelman Burj Khalifa se gratsyél ki pi wo nan mond lan.
Wrong language ISO-639-3 (expected: hat, predicted: null) for test 315: Aktyelman Burj Khalifa se gratsyél ki pi wo nan mond lan.
Wrong language (expected: ku, predicted: un) for test 351: Eger tu bawer dike ku bi risvakirinê tu ê problem a masîvanan chareser bike, bas
Wrong language ISO-639-3 (expected: kur, predicted: null) for test 351: Eger tu bawer dike ku bi risvakirinê tu ê problem a masîvanan chareser bike, bas
Wrong language (expected: ku, predicted: un) for test 363: Va ye sirra min, ew bi xwe gelekî sade ye: Mirov tenê bi dilê xwe dikare baş bib
Wrong language ISO-639-3 (expected: kur, predicted: null) for test 363: Va ye sirra min, ew bi xwe gelekî sade ye: Mirov tenê bi dilê xwe dikare baş bib
Wrong language (expected: ku, predicted: un) for test 385: Hemu zimanan de vegotin, frase, îdîom, u vebêj he ne ku bi tîpîtî nayên wergeran
Wrong language ISO-639-3 (expected: kur, predicted: null) for test 385: Hemu zimanan de vegotin, frase, îdîom, u vebêj he ne ku bi tîpîtî nayên wergeran

@pjox
Member

pjox commented Feb 26, 2026

So for the kur examples, I think CLD2 is just not that good at Kurdish (which it seems to have support for), so it's just predicting unknown language (probably the same with null). Croatian hr and Bosnian bs are just two very similar languages (I think some linguists would say they are the same language), so this does not look out of the ordinary. Portuguese pt and Galician glg are again two very close languages, same for Italian it and Esperanto eo. ht or hat is just Haitian Creole, which I don't think CLD2 supports.

In any case, none of these examples seem that problematic to me: these are all either two very close languages, or a case of CLD2 not properly supporting the language or not supporting it at all. So everything seems normal.
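
As an alternative to removing the mismatching test cases, the tests could tolerate known confusions between closely related languages. The following is only a sketch under assumptions: the class name, the method name and the allow-list are hypothetical, and the two pairs are taken from the mismatches listed in this thread.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical test helper: accepts predictions for closely related languages. */
final class LanguageExpectation {

    // Assumed allow-list: expected language -> predictions tolerated instead of it.
    static final Map<String, Set<String>> CONFUSABLE = Map.of(
        "gl", Set.of("pt"),   // Galician predicted as Portuguese
        "bs", Set.of("hr"));  // Bosnian predicted as Croatian

    private LanguageExpectation() {}

    /** True if the prediction matches exactly or is a tolerated near-miss. */
    static boolean accepts(String expected, String predicted) {
        if (expected.equals(predicted)) {
            return true;
        }
        return CONFUSABLE.getOrDefault(expected, Set.of()).contains(predicted);
    }
}
```

With this, a Galician sentence predicted as pt would pass, while a Kurdish sentence predicted as un would still be reported.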

I'm also tagging @laurieburchell in case they have some additional insights.

@laurieburchell
Member

laurieburchell commented Feb 26, 2026

I can at least shed some light on the Kurdish: depending on the microlanguage, Kurdish (a macrolanguage) can be written in both Latin and Arabic script. CLD2 is trained on Kurdish in Arabic script, hence the confusion. Where are you getting your test cases from?

EDIT: you might find this table of results from the CLD2 repo useful: https://github.com/CLD2Owners/cld2/blob/master/docs/evaluate_cld2_large_20140122.txt. It lets you look up what language/script combinations CLD2 covers, as well as giving you an (optimistic) upper bound for performance.

@lfoppiano
Contributor Author

lfoppiano commented Feb 26, 2026

OK, on this point I noticed that Kurdish is not included in the 80 languages supported by the standard library https://github.com/CLD2Owners/cld2?tab=readme-ov-file#supported-languages so I don't understand why the test using that model passes and not the test using the full version of the model (rephrased for @pjox and @laurieburchell - sorry it was not clear before - maybe even now).

FYI the two "versions" of the library are described here: https://github.com/commoncrawl/language-detection-cld2/tree/master?tab=readme-ov-file#using-the-cld2-full-version-160-languages

@lfoppiano
Contributor Author

@sebastian-nagel Finally, this works, with some caveats:

  1. I did not set up the submodule yet, but it's a small change to do so (at the moment we check out the specific fork/branch)
  2. I've patched the CLD2 library in my fork github.com/lfoppiano/CLD2 (I can transfer it to commoncrawl once approved)
  3. there are three modes:
    • by default, e.g. mvn clean verify, the build uses the library provided by the system (the current use case; works on Linux, not tested on Mac)
    • standard library, e.g. mvn clean verify -Pstandard, uses the standard library (works on both Linux and Mac)
    • full library, e.g. mvn clean verify -Pfull (works only on Linux; on Mac the override using DYLD_INSERT_LIBRARIES does not seem to work, and I did not manage to get it working)
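
Since the environment-variable override is the part that fails on Mac, one conceivable alternative is to choose the library name on the Java side. This is a rough sketch only and not what the branch does: the class name, the mode strings and the library base names cld2 and cld2_full are assumptions for illustration.

```java
import java.util.Locale;

/** Hypothetical helper: maps a build/test mode to a native library base name. */
final class Cld2Variant {

    private Cld2Variant() {}

    /**
     * Returns the assumed base name (without "lib" prefix or file extension).
     * A null mode is treated like the default "system" mode.
     */
    static String libraryName(String mode) {
        String m = (mode == null) ? "system" : mode.toLowerCase(Locale.ROOT);
        switch (m) {
            case "system":
            case "standard":
                return "cld2";
            case "full":
                // Assumption: the full variant is shipped as a separate library.
                return "cld2_full";
            default:
                throw new IllegalArgumentException("Unknown CLD2 mode: " + mode);
        }
    }
}
```

The mode could then be passed in from a Maven profile as a system property, which would avoid relying on DYLD_INSERT_LIBRARIES; whether that fits the existing bindings would need to be checked.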

Development

Successfully merging this pull request may close these issues.

Cannot build on Apple M2

5 participants