
Conversation


@Qubitium commented Oct 2, 2025

Summary

  • Fix Python 3.14 compat: TypeError: Pickler._batch_setitems() takes 2 positional arguments but 3 were given with HF Datasets
  • Make tests pytest-compatible

Notable changes:

A new test, dill/tests/test_pickle_batch_setitems.py, covers the HF Datasets crash fix triggered when loading any dataset.

Traceback (most recent call last):
  File "/root/GPTQModel/tests/test_dataset_loading.py", line 15, in <module>
    test_dataset_loader()
    ~~~~~~~~~~~~~~~~~~~^^
  File "/root/GPTQModel/tests/test_dataset_loading.py", line 5, in test_dataset_loader
    dataset = load_dataset("imdb", split="train[:1%]")  # load only 1% to keep it small
  File "/root/vm314t/lib/python3.14t/site-packages/datasets/load.py", line 2062, in load_dataset
    builder_instance = load_dataset_builder(
        path=path,
    ...<12 lines>...
        **config_kwargs,
    )
  File "/root/vm314t/lib/python3.14t/site-packages/datasets/load.py", line 1833, in load_dataset_builder
    builder_instance._use_legacy_cache_dir_if_possible(dataset_module)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/root/vm314t/lib/python3.14t/site-packages/datasets/builder.py", line 643, in _use_legacy_cache_dir_if_possible
    self._check_legacy_cache2(dataset_module) or self._check_legacy_cache() or None
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/root/vm314t/lib/python3.14t/site-packages/datasets/builder.py", line 487, in _check_legacy_cache2
    config_id = self.config.name + "-" + Hasher.hash({"data_files": self.config.data_files})
                                         ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/vm314t/lib/python3.14t/site-packages/datasets/fingerprint.py", line 188, in hash
    return cls.hash_bytes(dumps(value))
                          ~~~~~^^^^^^^
  File "/root/vm314t/lib/python3.14t/site-packages/datasets/utils/_dill.py", line 109, in dumps
    dump(obj, file)
    ~~~~^^^^^^^^^^^
  File "/root/vm314t/lib/python3.14t/site-packages/datasets/utils/_dill.py", line 103, in dump
    Pickler(file, recurse=True).dump(obj)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^
  File "/root/vm314t/lib/python3.14t/site-packages/dill/_dill.py", line 420, in dump
    StockPickler.dump(self, obj)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/usr/lib/python3.14/pickle.py", line 498, in dump
    self.save(obj)
    ~~~~~~~~~^^^^^
  File "/root/vm314t/lib/python3.14t/site-packages/datasets/utils/_dill.py", line 70, in save
    dill.Pickler.save(self, obj, save_persistent_id=save_persistent_id)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/vm314t/lib/python3.14t/site-packages/dill/_dill.py", line 414, in save
    StockPickler.save(self, obj, save_persistent_id)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.14/pickle.py", line 572, in save
    f(self, obj)  # Call unbound method with explicit self
    ~^^^^^^^^^^^
  File "/root/vm314t/lib/python3.14t/site-packages/dill/_dill.py", line 1217, in save_module_dict
    StockPickler.save_dict(pickler, obj)
    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/usr/lib/python3.14/pickle.py", line 1064, in save_dict
    self._batch_setitems(obj.items(), obj)
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
TypeError: Pickler._batch_setitems() takes 2 positional arguments but 3 were given
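The root cause is a signature change in CPython 3.14: pickle's save_dict now calls self._batch_setitems(obj.items(), obj) (visible in the last frame above), passing the dict being pickled as a second argument, whereas earlier versions pass only the items. A compatibility shim that tolerates both call shapes might look like this — an illustrative sketch built on the pure-Python pickle._Pickler, not the actual patch in this PR:

```python
import io
import pickle
import sys

class CompatPickler(pickle._Pickler):
    """Hypothetical pickler whose _batch_setitems accepts both call shapes."""

    def _batch_setitems(self, items, obj=None):
        # Python 3.14+ passes the dict being pickled as a second argument;
        # older versions pass only the items iterator.
        if sys.version_info >= (3, 14):
            return super()._batch_setitems(items, obj)
        return super()._batch_setitems(items)

buf = io.BytesIO()
CompatPickler(buf).dump({"a": 1, "b": 2})
print(pickle.loads(buf.getvalue()))  # {'a': 1, 'b': 2}
```

Because obj defaults to None, the method works whether the standard library invokes it with one positional argument (pre-3.14) or two (3.14+).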

Script to run under dill + Python 3.14 + datasets to reproduce the traceback:

from datasets import load_dataset

def test_dataset_loader():
    # Load a small split of a dataset from Hugging Face
    dataset = load_dataset("imdb", split="train[:1%]")  # load only 1% to keep it small
    
    # Print dataset info
    print(dataset)
    
    # Print the first row
    first_row = dataset[0]
    print("First row:", first_row)

if __name__ == "__main__":
    test_dataset_loader()

For test_threads.py:

assert t.is_alive() == t_.is_alive()

This check was removed because it never passed on Python 3.14, with or without PYTHON_GIL=0.

Checklist

Documentation and Tests

  • Added relevant tests that run with python tests/__main__.py, and pass.

Release Management

  • Added "Fixes #NNN" in the PR body, referencing the issue (#NNN) it closes.

Finally

@mmckerns The fix and tests were generated by Codex, but I reviewed every delta as carefully as possible. I am not exactly sure why the thread is_alive() check fails on Python 3.14 (check removed). Please double-check the fixes, especially the Pickler._batch_setitems() fix. Thanks.

The pytest conversion is for usability: its output and stacktrace helpers make it easier to pinpoint errors. However, pytest injects some wrappers around object types, which the updated test code had to skip or work around.

@Qubitium Qubitium changed the title Fix Python 3.14 compat and HF Datasets Fix Python 3.14 compat with HF Datasets Oct 2, 2025
@Qubitium Qubitium marked this pull request as ready for review October 3, 2025 08:31
@mmckerns (Member)

Can you explain this PR a bit more? Specifically, why does this belong in dill as opposed to Hugging Face datasets?

@Qubitium (Author)

@mmckerns Good question! I will double-check. After re-examining the stack, it does appear to be a method-signature change in Python 3.14's pickler, not specifically anything that dill is doing. It may well be fixable via a Python version check in datasets.

@Qubitium Qubitium marked this pull request as draft October 13, 2025 04:22
@mmckerns (Member)

@Qubitium: thanks for the follow-up. So, we'll have to look at the broader potential impact of the change in Python.


sghng commented Oct 14, 2025

@Qubitium @mmckerns thanks for the PR. However, I think this should be a downstream fix. The error is caused by the datasets package rather than dill; dill already handles the new signature introduced in Python 3.14.

I put some inspection code in datasets and confirmed this:

calling save_dict from pickle.py
Method: <bound method Pickler._batch_setitems of <datasets.utils._dill.Pickler object at 0x3609d7250>>
Defined in: /Users/sghuang/dev/datasets/src/datasets/utils/_dill.py
Line number: 72
Full path: <module 'datasets.utils._dill' from '/Users/sghuang/dev/datasets/src/datasets/utils/_dill.py'>

joblib did a similar fix as well: joblib/joblib#1658.

For more information, see my PR for datasets: huggingface/datasets#7817
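The downstream pattern (used by the joblib fix, and presumably similar in the datasets PR — this is a hedged sketch, not either project's exact patch) is to define the override conditionally, with whichever signature the running interpreter expects. SortingPickler here is a hypothetical subclass that, like datasets' pickler, normalizes dict item order:

```python
import io
import pickle
import sys

class SortingPickler(pickle._Pickler):
    # Hypothetical subclass: sorts dict items before pickling, defining
    # _batch_setitems with the signature the interpreter expects.
    if sys.version_info >= (3, 14):
        def _batch_setitems(self, items, obj):
            # Python 3.14+ passes the dict being pickled as `obj`
            super()._batch_setitems(sorted(items), obj)
    else:
        def _batch_setitems(self, items):
            super()._batch_setitems(sorted(items))

buf = io.BytesIO()
SortingPickler(buf).dump({"b": 2, "a": 1})
print(pickle.loads(buf.getvalue()))  # {'a': 1, 'b': 2}
```

Defining the method inside a version check keeps each branch's signature exactly aligned with what pickle's save_dict will call, so no default-argument juggling is needed.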

@Qubitium Qubitium closed this Oct 15, 2025
@Qubitium (Author)

> @Qubitium @mmckerns thanks for the PR. However I think this should be a downstream fix. [...] For more information, see my PR for datasets: huggingface/datasets#7817

Thanks for the core fix!

@Qubitium Qubitium reopened this Oct 15, 2025

sghng commented Oct 15, 2025

@Qubitium @mmckerns After more experimentation, I realized that there might still be a need for an upstream fix. That is, dill could provide a compatibility implementation of dill._dill.Pickler._batch_setitems, which takes care of the presence or absence of the obj argument based on the Python version. This may or may not be out of scope, since no such implementation existed before.

Take datasets, for example: it uses the pickler from dill with a monkey-patched _batch_setitems method. Since this method is not overridden in dill._dill.Pickler, the call resolves to pickle.Pickler._batch_setitems, which has a breaking change.

I'm not sure whether this is desirable. If Python decided to make this change in the standard library, downstream packages should be encouraged to address it proactively, instead of being stuck on the old API simply because it still "works" thanks to the tricks in dill.

@mmckerns (Member)

The typical policy for dill is that backward-incompatible changes in the standard library are handled in dill.shims -- creating an alternate load function for pickles created with the old version... while all newly dumped pickles use the new version. If standard library changes break pickling of once-picklable objects, then dill will attempt to implement the pickling. However, if third-party code breaks pickling, it should be "fixed" in the third-party library... unless there is some component that can be added to dill that facilitates pickling "in general" (i.e. more broadly than a specific object or objects from a single third-party package). Can you point me to the _batch_setitems you are interested in? It would also help to understand what kind of impact the change would have on dill.


sghng commented Oct 16, 2025

@mmckerns Perhaps something like this:

import sys
from pickle import _Pickler as StockPickler  # as imported in dill._dill

_MISSING = object()

class Pickler(StockPickler):
    def _batch_setitems(self, items, obj=_MISSING):
        if sys.version_info >= (3, 14):
            if obj is _MISSING:
                raise TypeError("breaking change in Py 3.14...")
            return super()._batch_setitems(items, obj)
        return super()._batch_setitems(items)
