OverflowError: value too large to convert to uint16_t #369

Open
hillenbrkl opened this issue Jan 14, 2025 · 21 comments

@hillenbrkl

Details

  • Slurm Version: 24.11.0
  • Python Version: 3.6
  • Cython Version: 0.29.32
  • PySlurm Branch: main
  • Linux Distribution: Rocky 8

Issue

After upgrading from 24.05 to 24.11, I get the following error in previously working code:

File "./bin/slurm2json", line 907, in
partitions = pyslurm.partition().get()
File "pyslurm/deprecated.pyx", line 1082, in pyslurm.deprecated.partition.get
File "pyslurm/deprecated.pyx", line 6378, in pyslurm.deprecated.get_partition_mode
OverflowError: value too large to convert to uint16_t

@tazend
Member

tazend commented Jan 14, 2025

Hi,

The pyslurm.partition class is deprecated and will be removed in the future; fixes to this class will likely not be made.
Please use the new classes pyslurm.Partition and pyslurm.Partitions instead.

See here for the documentation: https://pyslurm.github.io/24.11/reference/partition/
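
A minimal sketch of what the migration could look like (I'm assuming dict-like access on the returned collection here; see the linked docs for the actual attributes):

import pyslurm

# New API: replaces pyslurm.partition().get()
partitions = pyslurm.Partitions.load()

for name, part in partitions.items():
    # "state" is just one example attribute; check the docs for the full list
    print(name, part.state)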

@hillenbrkl
Author

Hi,

Thanks for pointing me to the new documentation. I am now using Partitions/Nodes/Jobs.load() instead, and after changing a lot of code, most things are working.

There is one issue I cannot resolve. Before the update, every job had a list of the nodes and the CPUs it used on each of those nodes. With the new API I can only see the nodes a job is allocating, but not how many CPUs it uses there. Any hint?

Markus

@tazend
Member

tazend commented Jan 16, 2025

Hi,

I think you mean the cpus_allocated and/or cpus_alloc_layout fields of the job dict?
Yes, there is this function you can call on every Job instance:

https://pyslurm.github.io/24.11/reference/job/#pyslurm.Job.get_resource_layout_per_node

It will look something like this:

{
    'node015': {
        'cpu_ids': '0',
        'gres': {
            'gpu:tesla-k80': {
                'count': 1,
                'indexes': '0'
            }
        },
        'memory': 4096
    },

    'node016': { ... }
}

cpu_ids is just a string, which I believe can also contain ranges when the allocated CPU IDs are contiguous, for example:

0,2-10,12

I think in the old API that string was already expanded to a list. You can use this function to expand it yourself:

https://pyslurm.github.io/24.11/reference/utilities/#pyslurm.utils.expand_range_str
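
For illustration, a hand-rolled equivalent of that expansion would look roughly like this (a minimal sketch only; expand_range_str is the supported helper):

def expand_range_str_sketch(range_str):
    # Expand a Slurm-style range string like "0,2-10,12"
    # into a list of individual integer IDs.
    ids = []
    for part in range_str.split(","):
        if "-" in part:
            start, end = part.split("-")
            ids.extend(range(int(start), int(end) + 1))
        else:
            ids.append(int(part))
    return ids

# expand_range_str_sketch("0,2-10,12") -> [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]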

So basically:

import pyslurm

# ... other stuff ...

resources_by_node = job_instance.get_resource_layout_per_node()

for node, resources in resources_by_node.items():
    cpu_layout = pyslurm.utils.expand_range_str(resources["cpu_ids"])
    cpu_count = len(cpu_layout)

    print(node, cpu_count)

That should give you the CPU count per node. In the old API, starting with 24.11, the dict fields I mentioned at the start are also dysfunctional now, because the Slurm devs removed some API functions that were used to get this data.
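
And if you want a dict shaped like the old cpus_allocated field (node name -> allocated CPU count), something along these lines should work (same assumptions as in the snippet above):

layout = job_instance.get_resource_layout_per_node()

cpus_allocated = {
    node: len(pyslurm.utils.expand_range_str(res["cpu_ids"]))
    for node, res in layout.items()
}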

So yeah, by using the get_resource_layout_per_node function you should get everything you need. As it says in the docs, the return type is still subject to change in the future (I haven't gotten much feedback on this one, so I just stuck with a nested dict). Maybe I'll make the inner dict a separate class, with the memory, cpu_ids and gres keys being attributes instead (so accessing them by dot notation is also nicer). I could also provide an attribute/method that calculates the CPU count, for convenience. What do you think, would you prefer that?
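
For example, the inner class could look roughly like this (just a sketch of the idea; nothing is final):

import pyslurm

class NodeResourceLayout:
    def __init__(self, cpu_ids="", memory=0, gres=None):
        self.cpu_ids = cpu_ids
        self.memory = memory
        self.gres = gres or {}

    @property
    def cpu_count(self):
        # Convenience: number of CPUs allocated on this node
        return len(pyslurm.utils.expand_range_str(self.cpu_ids))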

@hillenbrkl
Author

Hi,

Great, that does the trick. There is one drawback, though: if the job runs on only one node, get_resource_layout_per_node() does not return anything. Is this intended?

A separate inner class for accessing memory, cpu_ids and gres directly as attributes would indeed be much nicer. And then another attribute cpus could sum up the CPU count.

Markus

@tazend
Member

tazend commented Jan 17, 2025

Hi,

That nothing is returned when the job runs on only a single node is definitely a bug. I will debug this and find a fix. But it works when the job is running on two nodes? That would be weird indeed. I'll check it out.

Alright, I will also go ahead and make the inner dict a class instead, for more convenience.

@hillenbrkl
Author

hillenbrkl commented Jan 17, 2025

Hi,

It seems that the return value is only empty for some one-node jobs; please see the attachment.

The code I wrote for this:

for j in jobs:
    data = jobs[j]
    if data.state != "RUNNING":
        continue
    print("%d node job (%d cpus): %s -> %s" % (data.num_nodes, data.cpus, data.id, data.get_resource_layout_per_node()))

Markus

running_jobs.txt

@tazend
Member

tazend commented Jan 17, 2025

Hi,

alright, interesting. Could you check what scontrol show job <JOB> -d shows for a JobID that produces no output from get_resource_layout_per_node via pyslurm?

With -d, scontrol will also show the resources per node. It would be interesting to see whether it works correctly via scontrol. If it does for a job that produces no return value in pyslurm, could you show me the scontrol output?

@hillenbrkl
Author

Hi,

here is one example (I replaced user data with XXX):

JobId=14171146 JobName='g16 21J_SP.gjf'
   UserId=XXX GroupId=XXX MCS_label=N/A
   Priority=4788 Nice=0 Account=default QOS=u-proj
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=1-23:31:20 TimeLimit=30-00:00:00 TimeMin=N/A
   SubmitTime=2024-12-12T13:23:37 EligibleTime=2024-12-12T13:23:37
   AccrueTime=2024-12-12T13:23:37
   StartTime=2025-01-15T11:01:17 EndTime=2025-02-14T11:01:19 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-01-15T11:01:17 Scheduler=Backfill
   Partition=skylake-384 AllocNode:Sid=head4:2286380
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node855
   BatchHost=node855
   NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=24,mem=114G,node=1,billing=24
   AllocTRES=cpu=24,mem=114G,node=1,billing=24
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=(null)
     Nodes=node855 CPU_IDs=0-23 Mem=116736 GRES=
   MinCPUsNode=1 MinMemoryNode=114G MinTmpDiskNode=0
   Features=JT:1 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/tmp/2286731
   WorkDir=XXX
   Comment=jHPC::g16 
   StdErr=XXX
   StdIn=/dev/null
   StdOut=XXX
 

@tazend
Member

tazend commented Jan 17, 2025

Hi,

mhhh, I cannot reproduce this at the moment on my HPC cluster, running 23.11 with Python 3.11.
Maybe something changed in 24.11, but I will need to see. It is weird that there is sometimes output and sometimes not. I'll also need to check whether it's Python 3.6 being the problem, but I don't think so.

If everything else fails, I would probably give you a patch that inserts debug output in the get_resource_layout_per_node function, so you could rebuild with it applied and we can see where it fails.

@hillenbrkl
Author

Great.

@tazend
Member

tazend commented Jan 17, 2025

Hi,

I attached a patch with some debug print statements. Can you apply it on the main branch with:

git apply -v pyslurm-resource-layout-debug.txt

Then recompile and run your previous code again, and send me the output?

pyslurm-resource-layout-debug.txt

@hillenbrkl
Author

Here we go ...

test_result.txt

@tazend
Member

tazend commented Jan 17, 2025

Hi,

alright, thanks for the output. I also tried it with python 3.6.15 in my test setup, but I still cannot reproduce it.

You are using the stock 3.6.8 Python on Rocky 8, right? How did you compile/install pyslurm?
Can you check again which Cython version is used during the installation? It cannot be 0.29.32 without modifications to setup.py, because setup.py will decline anything below 0.29.37.

Really wondering what is going on here. I can't make sense of the fact that sometimes it reports that there are 0 hosts, and other times there are like 600.

@hillenbrkl
Author

hillenbrkl commented Jan 17, 2025

Hi,

Yes, I am using Python 3.6.8 from the Rocky 8.10 installation:

hillenbr@head0 [pyslurm] python3
Python 3.6.8 (default, Dec  4 2024, 12:35:02) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-22)] on linux
Type "help", "copyright", "credits" or "license" for more information.

I compile and install PySlurm this way:

git clone https://github.com/PySlurm/pyslurm.git --branch main
cd pyslurm
python3 setup.py build
python3 setup.py install --prefix=/cluster/slurm

Cython:

# cython --version
Cython version 0.29.37

I have re-installed Cython with pip3, deleted /cluster/slurm/lib64/python3.6, and reinstalled PySlurm ... it changed nothing.

@tazend
Member

tazend commented Jan 17, 2025

Hi,

alright... seems good. I might've found what is potentially causing the issue... not 100% sure yet, but I will create a separate branch with a fix and let you know about it, so you can try.

@tazend
Member

tazend commented Jan 20, 2025

Hi,

could you please try and install pyslurm from this branch: https://github.com/tazend/pyslurm/tree/fix-job-resource-layout

And see if this fixes the problems?

@tazend tazend self-assigned this Jan 20, 2025
@hillenbrkl
Author

hillenbrkl commented Jan 21, 2025

Hi,

I get the following errors now (after git checkout -b fix-job-resource-layout):

[11/29] Cythonizing pyslurm/core/partition.pyx

Error compiling Cython file:
------------------------------------------------------------
...
# cython: c_string_type=unicode, c_string_encoding=default
# cython: language_level=3

from .config cimport Config
^
------------------------------------------------------------

pyslurm/core/slurmctld/__init__.pxd:4:0: 'pyslurm/core/config.pxd' not found

Error compiling Cython file:
------------------------------------------------------------
...
# cython: c_string_type=unicode, c_string_encoding=default
# cython: language_level=3

from .config cimport Config
^
------------------------------------------------------------

pyslurm/core/slurmctld/__init__.pxd:4:0: 'pyslurm/core/config/Config.pxd' not found

Error compiling Cython file:
------------------------------------------------------------
...
            Whether nodes power down on idle after running jobs
    """
    cdef:
        partition_info_t *ptr
        int power_save_enabled
        slurmctld.Config slurm_conf
       ^
------------------------------------------------------------

pyslurm/core/partition.pxd:221:8: 'Config' is not a type identifier

Error compiling Cython file:
------------------------------------------------------------
...
        raise ValueError(f"Invalid Preempt mode: {mode}")

    return pmode


def _preempt_mode_int_to_str(mode, slurmctld.Config slurm_conf):
                                  ^
------------------------------------------------------------

pyslurm/core/partition.pyx:804:35: 'Config' is not a type identifier

Error compiling Cython file:
------------------------------------------------------------
...
        """
        cdef:
            Partitions partitions = Partitions()
            int flags = slurm.SHOW_ALL
            Partition partition
            slurmctld.Config slurm_conf
           ^
------------------------------------------------------------

pyslurm/core/partition.pyx:81:12: 'Config' is not a type identifier

Error compiling Cython file:
------------------------------------------------------------
...
            Partition partition
            slurmctld.Config slurm_conf
            int power_save_enabled = 0

        verify_rpc(slurm_load_partitions(0, &partitions.info, flags))
        slurm_conf = slurmctld.Config.load()
                                    ^
------------------------------------------------------------

pyslurm/core/partition.pyx:85:37: cimported module has no attribute 'load'
Traceback (most recent call last):
  File "/home/hillenbr/.local/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 1262, in cythonize_one_helper
    return cythonize_one(*m)
  File "/home/hillenbr/.local/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 1238, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: pyslurm/core/partition.pyx
[12/29] Cythonizing pyslurm/core/slurmctld/base.pyx
[13/29] Cythonizing pyslurm/core/slurmctld/config.pyx
[14/29] Cythonizing pyslurm/core/slurmctld/enums.pyx
[15/29] Cythonizing pyslurm/db/assoc.pyx
[16/29] Cythonizing pyslurm/db/connection.pyx
[17/29] Cythonizing pyslurm/db/job.pyx
[18/29] Cythonizing pyslurm/db/qos.pyx
[19/29] Cythonizing pyslurm/db/stats.pyx
[20/29] Cythonizing pyslurm/db/step.pyx
[21/29] Cythonizing pyslurm/db/tres.pyx
[22/29] Cythonizing pyslurm/db/util.pyx
[23/29] Cythonizing pyslurm/deprecated.pyx
[24/29] Cythonizing pyslurm/settings.pyx
[25/29] Cythonizing pyslurm/utils/cstr.pyx
[26/29] Cythonizing pyslurm/utils/ctime.pyx
[27/29] Cythonizing pyslurm/utils/helpers.pyx
[28/29] Cythonizing pyslurm/utils/uint.pyx
[29/29] Cythonizing pyslurm/xcollections.pyx
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/hillenbr/.local/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 1262, in cythonize_one_helper
    return cythonize_one(*m)
  File "/home/hillenbr/.local/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 1238, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: pyslurm/core/partition.pyx
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "setup.py", line 279, in <module>
    setup_package()
  File "setup.py", line 268, in setup_package
    cythongen()
  File "setup.py", line 239, in cythongen
    metadata["ext_modules"] = cythonize(get_extensions(), nthreads=int(nthreads))
  File "/home/hillenbr/.local/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 1106, in cythonize
    result.get(99999)  # seconds
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
Cython.Compiler.Errors.CompileError: pyslurm/core/partition.pyx

@tazend
Member

tazend commented Jan 21, 2025

Hi,

could you please try the latest Cython 3.0.11 and see if it works there?

@hillenbrkl
Author

Hi,

Yes, with Cython 3.0.11 it compiled successfully.

Unfortunately, it produces the same output as before; there is no change in the result of get_resource_layout_per_node().

In addition, I now get an error message when loading the partitions:

Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    partitions = pyslurm.Partitions.load()
  File "pyslurm/core/partition.pyx", line 85, in pyslurm.core.partition.Partitions.load
  File "pyslurm/core/slurmctld/config.pyx", line 209, in pyslurm.core.slurmctld.config.Config.load
  File "pyslurm/core/slurmctld/config.pyx", line 112, in pyslurm.core.slurmctld.config.CgroupConfig.from_ptr
ValueError: invalid literal for int() with base 10: '1000 ms'

@tazend
Member

tazend commented Jan 22, 2025

Hi,

alright. I am really not yet sure why the results from get_resource_layout_per_node() are wrong. Could you send me the debug output again when running from the fix-job-resource-layout branch? I just need to see if at least anything is different in the debug output. Perhaps you could also run it on a newer Python version as an additional test? (Although I can't really imagine Python 3.6 being the problem.)

The other error you are receiving when loading Partitions: thanks for making me aware of it. It comes from some new code I added a few days ago; I'll fix it soon. (It only affects the main branch at the moment.)
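
Presumably the parser just needs to tolerate a unit suffix on that value; roughly something like this (a hypothetical sketch with a made-up helper name, not the actual fix):

def parse_time_value(raw):
    # Accept both "1000" and "1000 ms" style config values
    return int(raw.strip().split()[0])

parse_time_value("1000 ms")  # -> 1000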

@hillenbrkl
Author

Hi,

Enclosed please find the debug output.

debug.log

I have also compiled PySlurm and executed the test code with Python 3.13.1, but there is no change in behaviour.
