OverflowError: value too large to convert to uint16_t #369
Comments
Hi, The See here for the documentation: https://pyslurm.github.io/24.11/reference/partition/
Hi, Thanks for giving the hint to the new documentation. I am now using Partitions/Nodes/Jobs.load() instead and after changing a lot of code, most things are working. There is one issue I cannot resolve: before the update, every job had a list with the nodes and the CPUs it used on those nodes. With the new API I cannot find a way to get this information; I can only see the nodes a job is allocating, but not how many CPUs it uses there. Any hint? Markus
Hi, I think you mean the https://pyslurm.github.io/24.11/reference/job/#pyslurm.Job.get_resource_layout_per_node It will look something like this: {
'node015':
{
'cpu_ids': '0',
'gres':
{
'gpu:tesla-k80':
{
'count': 1,
'indexes': '0'
}
},
'memory': 4096
},
'node016': { ... }
}
I think in the old API, that string was already expanded to a list. You can use this function though to expand it: https://pyslurm.github.io/24.11/reference/utilities/#pyslurm.utils.expand_range_str So basically: import pyslurm
... other stuff ...
resources_by_node = job_instance.get_resource_layout_per_node()
for node, resources in resources_by_node.items():
cpu_layout = pyslurm.utils.expand_range_str(resources["cpu_ids"])
cpu_count = len(cpu_layout)
print(node, cpu_count) That should give you the CPU count per node. In the old API, starting with 24.11, the dict fields I mentioned at the start are now also dysfunctional, because the Slurm devs removed some API functions that were used to get this. So yeah, by using the get_resource_layout_per_node function you should get everything needed. As it says in the docs, the return type is still subject to change in the future (haven't really gotten much feedback on this one, so I just stuck with a nested dict). Maybe I'll make the inner dict a separate class with the
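For readers who want to see what the range expansion does without installing pyslurm, here is a plain-Python stand-in. This is a hedged sketch of the behaviour suggested by the docs for pyslurm.utils.expand_range_str, not the actual pyslurm implementation:

```python
# Hedged sketch: expand a Slurm-style range string such as "0-3,7" into a
# list of ints, mimicking what pyslurm.utils.expand_range_str appears to do.
# Not the real pyslurm code.
def expand_range_str(range_str):
    """Expand a comma-separated range string into a list of ints."""
    ids = []
    for part in range_str.split(","):
        if "-" in part:
            start, end = part.split("-")
            ids.extend(range(int(start), int(end) + 1))
        else:
            ids.append(int(part))
    return ids

print(expand_range_str("0-3,7"))       # [0, 1, 2, 3, 7]
print(len(expand_range_str("0-3,7")))  # 5 -> the CPU count on that node
```

Counting the expanded IDs gives the per-node CPU count, which is exactly what the loop above does with the real utility function.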
Hi, Great, that does the trick. There is one drawback, though: if the job runs on only one node, get_resource_layout_per_node() does not return anything. Is this intended? A separate inner class for accessing Markus
Hi, that nothing is returned when the job runs on only a single node is definitely a bug. I will debug this and find a fix. But it works when the job is running on two nodes? That would be weird indeed. I'll check it out. Alright, I will also go ahead and make the inner dict a class instead, for more convenience.
Hi, It seems that the return value is sometimes empty only for 1-node jobs, please see the attachment. The code I wrote for this:
Markus |
Hi, alright, interesting. Could you check what With
Hi, here is one example (I replaced user data with XXX):
Hi, mhhh, I cannot reproduce this at the moment on my HPC cluster, running 23.11 with Python 3.11. If all else fails, I would probably give you a patch that inserts debug output in the get_resource_layout_per_node function, so you could rebuild with it applied and we can see where it fails.
Great.
Hi, I attached a patch with some debug print statements. Can you apply it on the main branch with: git apply -v pyslurm-resource-layout-debug.txt Then recompile and run your previous code again, and send me the output?
Here we go ...
Hi, alright, thanks for the output. I also tried it with python 3.6.15 in my test setup, but I still cannot reproduce it. You are using the stock 3.6.8 python on Rocky 8, right? How did you compile/install pyslurm? Really wondering what is going on here. I can't make sense of the fact that sometimes it reports that there are 0 hosts, and other times there are like 600.
Hi, Yes, I am using Python 3.6.8 from the Rocky 8.10 installation
I compile and install PySlurm this way:
Cython:
I have re-installed cython with pip3, deleted /cluster/slurm/lib64/python3.6 and reinstalled PySlurm ... it changed nothing.
Hi, alright... seems good. I might've found what is potentially causing the issue... not 100% sure yet, but I will create a separate branch with a fix and will let you know about it, so you can try.
Hi, could you please try and install pyslurm from this branch: https://github.com/tazend/pyslurm/tree/fix-job-resource-layout And see if this fixes the problems?
Hi, I get the following errors now (after git checkout -b fix-job-resource-layout):
Hi, could you please try the latest Cython 3.0.11 and see if it works there?
Hi, Yes, with Cython 3.0.11 it compiled successfully. Unfortunately, it produces the same output as before; there is no change in the result of get_resource_layout_per_node(). In addition, I now get an error message when loading the partitions:
Hi, alright. I am really not yet sure why the results are wrong in The other error you are receiving when loading Partitions: thanks for making me aware of it. It comes from some new code I added a few days ago. I'll fix it soon. (it affects only the main branch at the moment)
Hi, Enclosed please find the debug output. I have also compiled PySlurm and executed the test code with python 3.13.1, but there is no change in behaviour. |
Issue
After upgrading from 24.05 to 24.11 I get the following message in previously working code:
File "./bin/slurm2json", line 907, in
partitions = pyslurm.partition().get()
File "pyslurm/deprecated.pyx", line 1082, in pyslurm.deprecated.partition.get
File "pyslurm/deprecated.pyx", line 6378, in pyslurm.deprecated.get_partition_mode
OverflowError: value too large to convert to uint16_t
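For background on what this exception means (a hedged illustration; the thread does not identify which Slurm field actually overflows here): a C uint16_t holds only 0..65535, while Slurm stores sentinel values such as 0xFFFFFFFF (INFINITE) in some 32-bit fields, so an unchecked conversion of such a value into a 16-bit field must fail. A pure-Python sketch using struct, which raises struct.error where Cython raises OverflowError for the analogous C cast:

```python
# Illustration only (not pyslurm code): packing a Python int as uint16_t.
# Values above 65535 cannot be represented and trigger an error, analogous
# to Cython's OverflowError on a uint16_t conversion.
import struct

def to_uint16(value):
    """Pack value as an unsigned 16-bit integer (little-endian)."""
    return struct.pack("<H", value)

print(to_uint16(65535))  # b'\xff\xff' -- the largest value that fits
try:
    to_uint16(0xFFFFFFFF)  # e.g. a 32-bit sentinel like Slurm's INFINITE
except struct.error as exc:
    print("too large for uint16_t:", exc)
```

This matches the traceback: deprecated.get_partition_mode receives a value larger than 65535 and the conversion to uint16_t fails.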