OverflowError: value too large to convert to uint16_t #369

Open
hillenbrkl opened this issue Jan 14, 2025 · 21 comments

@hillenbrkl

Details

  • Slurm Version: 24.11.0
  • Python Version: 3.6
  • Cython Version: 0.29.32
  • PySlurm Branch: main
  • Linux Distribution: Rocky 8

Issue

After upgrading from 24.05 to 24.11, I get the following error in previously working code:

File "./bin/slurm2json", line 907, in
partitions = pyslurm.partition().get()
File "pyslurm/deprecated.pyx", line 1082, in pyslurm.deprecated.partition.get
File "pyslurm/deprecated.pyx", line 6378, in pyslurm.deprecated.get_partition_mode
OverflowError: value too large to convert to uint16_t

@tazend
Member

tazend commented Jan 14, 2025

Hi,

The pyslurm.partition class is deprecated and will be removed in the future; fixes to this class will likely not be made.
Please use the new classes pyslurm.Partition and pyslurm.Partitions instead.

See here for the documentation: https://pyslurm.github.io/24.11/reference/partition/
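
A minimal sketch of what the migration could look like (I'm assuming dict-like access on the returned collection here; see the linked docs for the actual attributes):

import pyslurm

# New API: replaces pyslurm.partition().get()
partitions = pyslurm.Partitions.load()

for name, part in partitions.items():
    # "state" is just one example attribute; check the docs for the full list
    print(name, part.state)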

@hillenbrkl
Author

Hi,

Thanks for pointing me to the new documentation. I am now using Partitions/Nodes/Jobs.load() instead, and after changing a lot of code, most things are working.

There is one issue I cannot resolve. Before the update, every job had a list of the nodes and the CPUs it used on each of those nodes. With the new API I can only see the nodes a job is allocating, but not how many CPUs it uses there. Any hint?

Markus

@tazend
Member

tazend commented Jan 16, 2025

Hi,

I think you mean the cpus_allocated and/or cpus_alloc_layout fields of the job dict?
Yes, there is this function you can call on every Job instance:

https://pyslurm.github.io/24.11/reference/job/#pyslurm.Job.get_resource_layout_per_node

It will look something like this:

{
    'node015': {
        'cpu_ids': '0',
        'gres': {
            'gpu:tesla-k80': {
                'count': 1,
                'indexes': '0'
            }
        },
        'memory': 4096
    },

    'node016': { ... }
}

cpu_ids is just a string, which I believe can also contain ranges when the allocated CPU IDs are contiguous, for example:

0,2-10,12

I think in the old API that string was already expanded to a list. You can use this function to expand it yourself:

https://pyslurm.github.io/24.11/reference/utilities/#pyslurm.utils.expand_range_str
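
For illustration, a hand-rolled equivalent of that expansion would look roughly like this (a minimal sketch only; expand_range_str is the supported helper):

def expand_range_str_sketch(range_str):
    # Expand a Slurm-style range string like "0,2-10,12"
    # into a list of individual integer IDs.
    ids = []
    for part in range_str.split(","):
        if "-" in part:
            start, end = part.split("-")
            ids.extend(range(int(start), int(end) + 1))
        else:
            ids.append(int(part))
    return ids

# expand_range_str_sketch("0,2-10,12") -> [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]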

So basically:

import pyslurm

# ... other stuff ...

resources_by_node = job_instance.get_resource_layout_per_node()

for node, resources in resources_by_node.items():
    cpu_layout = pyslurm.utils.expand_range_str(resources["cpu_ids"])
    cpu_count = len(cpu_layout)

    print(node, cpu_count)

That should give you the CPU count per node. In the old API, starting with 24.11, the dict fields I mentioned at the start are also dysfunctional now, because the Slurm devs removed some API functions that were used to get this data.
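
And if you want a dict shaped like the old cpus_allocated field (node name -> allocated CPU count), something along these lines should work (same assumptions as in the snippet above):

layout = job_instance.get_resource_layout_per_node()

cpus_allocated = {
    node: len(pyslurm.utils.expand_range_str(res["cpu_ids"]))
    for node, res in layout.items()
}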

So yeah, by using the get_resource_layout_per_node function you should get everything you need. As it says in the docs, the return type is still subject to change in the future (I haven't gotten much feedback on this one, so I just stuck with a nested dict). Maybe I'll make the inner dict a separate class, with the memory, cpu_ids and gres keys being attributes instead (so accessing them by dot notation is also nicer). I could also provide an attribute/method that calculates the CPU count, for convenience. What do you think, would you prefer that?
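
For example, the inner class could look roughly like this (just a sketch of the idea; nothing is final):

import pyslurm

class NodeResourceLayout:
    def __init__(self, cpu_ids="", memory=0, gres=None):
        self.cpu_ids = cpu_ids
        self.memory = memory
        self.gres = gres or {}

    @property
    def cpu_count(self):
        # Convenience: number of CPUs allocated on this node
        return len(pyslurm.utils.expand_range_str(self.cpu_ids))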

@hillenbrkl
Author

Hi,

Great, that does the trick. There is one drawback, though: if the job runs on only one node, get_resource_layout_per_node() does not return anything. Is this intended?

A separate inner class for accessing memory, cpu_ids and gres directly as attributes would indeed be much nicer. And then another attribute cpus could sum up the CPU count.

Markus

@tazend
Member

tazend commented Jan 17, 2025

Hi,

That nothing is returned when the job runs on only a single node is definitely a bug. I will debug this and find a fix. But it works when the job is running on two nodes? That would be weird indeed. I'll check it out.

Alright, I will also go ahead and make the inner dict a class instead, for more convenience.

@hillenbrkl
Author

hillenbrkl commented Jan 17, 2025

Hi,

It seems that the return value is only empty for some one-node jobs; please see the attachment.

The code I wrote for this:

for j in jobs:
    data = jobs[j]
    if data.state != "RUNNING":
        continue
    print("%d node job (%d cpus): %s -> %s" % (data.num_nodes, data.cpus, data.id, data.get_resource_layout_per_node()))

Markus

running_jobs.txt

@tazend
Member

tazend commented Jan 17, 2025

Hi,

alright, interesting. Could you check what scontrol show job <JOB> -d shows for a JobID that produces no output from get_resource_layout_per_node via pyslurm?

With -d, scontrol will also show the resources per node. It would be interesting to see whether it works correctly via scontrol. If it does for a job that produces no return value in pyslurm, could you show me the scontrol output?

@hillenbrkl
Author

Hi,

here is one example (I replaced user data with XXX):

JobId=14171146 JobName='g16 21J_SP.gjf'
   UserId=XXX GroupId=XXX MCS_label=N/A
   Priority=4788 Nice=0 Account=default QOS=u-proj
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=1-23:31:20 TimeLimit=30-00:00:00 TimeMin=N/A
   SubmitTime=2024-12-12T13:23:37 EligibleTime=2024-12-12T13:23:37
   AccrueTime=2024-12-12T13:23:37
   StartTime=2025-01-15T11:01:17 EndTime=2025-02-14T11:01:19 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-01-15T11:01:17 Scheduler=Backfill
   Partition=skylake-384 AllocNode:Sid=head4:2286380
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node855
   BatchHost=node855
   NumNodes=1 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=24,mem=114G,node=1,billing=24
   AllocTRES=cpu=24,mem=114G,node=1,billing=24
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=(null)
     Nodes=node855 CPU_IDs=0-23 Mem=116736 GRES=
   MinCPUsNode=1 MinMemoryNode=114G MinTmpDiskNode=0
   Features=JT:1 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/tmp/2286731
   WorkDir=XXX
   Comment=jHPC::g16 
   StdErr=XXX
   StdIn=/dev/null
   StdOut=XXX
 

@tazend
Member

tazend commented Jan 17, 2025

Hi,

mhhh, I cannot reproduce this at the moment on my HPC cluster, running 23.11 with Python 3.11.
Maybe something changed in 24.11, but I will need to see. It is weird that there is sometimes output and sometimes not. I'll also need to check whether it's Python 3.6 being the problem, but I don't think so.

If everything else fails, I would probably give you a patch that inserts debug output in the get_resource_layout_per_node function, so you could rebuild with it applied and we can see where it fails.

@hillenbrkl
Author

Great.

@tazend
Member

tazend commented Jan 17, 2025

Hi,

I attached a patch with some debug print statements. Can you apply it on the main branch with:

git apply -v pyslurm-resource-layout-debug.txt

Then recompile and run your previous code again, and send me the output?

pyslurm-resource-layout-debug.txt

@hillenbrkl
Author

Here we go ...

test_result.txt

@tazend
Member

tazend commented Jan 17, 2025

Hi,

alright, thanks for the output. I also tried it with python 3.6.15 in my test setup, but I still cannot reproduce it.

You are using the stock 3.6.8 Python on Rocky 8, right? How did you compile/install pyslurm?
Can you check again which Cython version is used during the installation? It cannot be 0.29.32 without modifications to setup.py, because setup.py will decline anything below 0.29.37.

Really wondering what is going on here. I can't make sense of the fact that sometimes it reports that there are 0 hosts, and other times there are like 600.

@hillenbrkl
Author

hillenbrkl commented Jan 17, 2025

Hi,

Yes, I am using Python 3.6.8 from the Rocky 8.10 installation:

hillenbr@head0 [pyslurm] python3
Python 3.6.8 (default, Dec  4 2024, 12:35:02) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-22)] on linux
Type "help", "copyright", "credits" or "license" for more information.

I compile and install PySlurm this way:

git clone https://github.com/PySlurm/pyslurm.git --branch main
cd pyslurm
python3 setup.py build
python3 setup.py install --prefix=/cluster/slurm

Cython:

# cython --version
Cython version 0.29.37

I have re-installed Cython with pip3, deleted /cluster/slurm/lib64/python3.6, and reinstalled PySlurm ... it changed nothing.

@tazend
Member

tazend commented Jan 17, 2025

Hi,

alright... seems good. I might've found what is potentially causing the issue... not 100% sure yet, but I will create a separate branch with a fix and let you know about it, so you can try.

@tazend
Member

tazend commented Jan 20, 2025

Hi,

could you please try and install pyslurm from this branch: https://github.com/tazend/pyslurm/tree/fix-job-resource-layout

And see if this fixes the problems?

@tazend tazend self-assigned this Jan 20, 2025
@hillenbrkl
Author

hillenbrkl commented Jan 21, 2025

Hi,

I get the following errors now (after git checkout -b fix-job-resource-layout):

[11/29] Cythonizing pyslurm/core/partition.pyx

Error compiling Cython file:
------------------------------------------------------------
...
# cython: c_string_type=unicode, c_string_encoding=default
# cython: language_level=3

from .config cimport Config
^
------------------------------------------------------------

pyslurm/core/slurmctld/__init__.pxd:4:0: 'pyslurm/core/config.pxd' not found

Error compiling Cython file:
------------------------------------------------------------
...
# cython: c_string_type=unicode, c_string_encoding=default
# cython: language_level=3

from .config cimport Config
^
------------------------------------------------------------

pyslurm/core/slurmctld/__init__.pxd:4:0: 'pyslurm/core/config/Config.pxd' not found

Error compiling Cython file:
------------------------------------------------------------
...
            Whether nodes power down on idle after running jobs
    """
    cdef:
        partition_info_t *ptr
        int power_save_enabled
        slurmctld.Config slurm_conf
       ^
------------------------------------------------------------

pyslurm/core/partition.pxd:221:8: 'Config' is not a type identifier

Error compiling Cython file:
------------------------------------------------------------
...
        raise ValueError(f"Invalid Preempt mode: {mode}")

    return pmode


def _preempt_mode_int_to_str(mode, slurmctld.Config slurm_conf):
                                  ^
------------------------------------------------------------

pyslurm/core/partition.pyx:804:35: 'Config' is not a type identifier

Error compiling Cython file:
------------------------------------------------------------
...
        """
        cdef:
            Partitions partitions = Partitions()
            int flags = slurm.SHOW_ALL
            Partition partition
            slurmctld.Config slurm_conf
           ^
------------------------------------------------------------

pyslurm/core/partition.pyx:81:12: 'Config' is not a type identifier

Error compiling Cython file:
------------------------------------------------------------
...
            Partition partition
            slurmctld.Config slurm_conf
            int power_save_enabled = 0

        verify_rpc(slurm_load_partitions(0, &partitions.info, flags))
        slurm_conf = slurmctld.Config.load()
                                    ^
------------------------------------------------------------

pyslurm/core/partition.pyx:85:37: cimported module has no attribute 'load'
Traceback (most recent call last):
  File "/home/hillenbr/.local/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 1262, in cythonize_one_helper
    return cythonize_one(*m)
  File "/home/hillenbr/.local/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 1238, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: pyslurm/core/partition.pyx
[12/29] Cythonizing pyslurm/core/slurmctld/base.pyx
[13/29] Cythonizing pyslurm/core/slurmctld/config.pyx
[14/29] Cythonizing pyslurm/core/slurmctld/enums.pyx
[15/29] Cythonizing pyslurm/db/assoc.pyx
[16/29] Cythonizing pyslurm/db/connection.pyx
[17/29] Cythonizing pyslurm/db/job.pyx
[18/29] Cythonizing pyslurm/db/qos.pyx
[19/29] Cythonizing pyslurm/db/stats.pyx
[20/29] Cythonizing pyslurm/db/step.pyx
[21/29] Cythonizing pyslurm/db/tres.pyx
[22/29] Cythonizing pyslurm/db/util.pyx
[23/29] Cythonizing pyslurm/deprecated.pyx
[24/29] Cythonizing pyslurm/settings.pyx
[25/29] Cythonizing pyslurm/utils/cstr.pyx
[26/29] Cythonizing pyslurm/utils/ctime.pyx
[27/29] Cythonizing pyslurm/utils/helpers.pyx
[28/29] Cythonizing pyslurm/utils/uint.pyx
[29/29] Cythonizing pyslurm/xcollections.pyx
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/hillenbr/.local/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 1262, in cythonize_one_helper
    return cythonize_one(*m)
  File "/home/hillenbr/.local/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 1238, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: pyslurm/core/partition.pyx
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "setup.py", line 279, in <module>
    setup_package()
  File "setup.py", line 268, in setup_package
    cythongen()
  File "setup.py", line 239, in cythongen
    metadata["ext_modules"] = cythonize(get_extensions(), nthreads=int(nthreads))
  File "/home/hillenbr/.local/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 1106, in cythonize
    result.get(99999)  # seconds
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
Cython.Compiler.Errors.CompileError: pyslurm/core/partition.pyx

@tazend
Member

tazend commented Jan 21, 2025

Hi,

could you please try the latest Cython 3.0.11 and see if it works there?

@hillenbrkl
Author

Hi,

Yes, with Cython 3.0.11 it compiled successfully.

Unfortunately, it produces the same output as before; there is no change in the result of get_resource_layout_per_node().

In addition, I now get an error message when loading the partitions:

Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    partitions = pyslurm.Partitions.load()
  File "pyslurm/core/partition.pyx", line 85, in pyslurm.core.partition.Partitions.load
  File "pyslurm/core/slurmctld/config.pyx", line 209, in pyslurm.core.slurmctld.config.Config.load
  File "pyslurm/core/slurmctld/config.pyx", line 112, in pyslurm.core.slurmctld.config.CgroupConfig.from_ptr
ValueError: invalid literal for int() with base 10: '1000 ms'

@tazend
Member

tazend commented Jan 22, 2025

Hi,

alright. I am really not yet sure why the results from get_resource_layout_per_node() are wrong. Could you send me the debug output again when running from the fix-job-resource-layout branch? I just need to see if at least anything is different in the debug output. Perhaps you could also run it on a newer Python version as an additional test? (Although I can't really imagine Python 3.6 being the problem.)

The other error you are receiving when loading Partitions: thanks for making me aware of it. It comes from some new code I added a few days ago; I'll fix it soon. (It only affects the main branch at the moment.)
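
Presumably the parser just needs to tolerate a unit suffix on that value; roughly something like this (a hypothetical sketch with a made-up helper name, not the actual fix):

def parse_time_value(raw):
    # Accept both "1000" and "1000 ms" style config values
    return int(raw.strip().split()[0])

parse_time_value("1000 ms")  # -> 1000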

@hillenbrkl
Author

Hi,

Enclosed please find the debug output.

debug.log

I have also compiled PySlurm and executed the test code with Python 3.13.1, but there is no change in behaviour.
