
[BUG] Schedule jobs deleted or cleaned up on highstate run - v3006.9 #67809

Open
ns-ksahoo opened this issue Mar 12, 2025 · 1 comment
Labels
Bug broken, incorrect, or confusing behavior needs-triage

ns-ksahoo commented Mar 12, 2025

Description
The Salt Minion scheduler (salt-minion v3006.x) on Ubuntu 20.04 and 24.04 stops executing scheduled jobs after running them once or twice. The schedule initially appears correctly after restarting the service (systemctl restart salt-minion) but then silently disappears from the runtime scheduler (salt-call schedule.list becomes empty).

Affected Versions and Environment:

Salt Minion version: 3006.X
Operating System: Ubuntu 24.04, 20.04
Python version: 3.10, 3.8

Setup
On Ubuntu 20.04 and 24.04, the schedule is configured in /etc/salt/minion.d/_schedule.conf:

$ cat /etc/salt/minion.d/_schedule.conf
schedule:
  __master_alive_MASTER1_IP:
    enabled: true
    function: status.master
    jid_include: true
    kwargs: {connected: true, master: MASTER1_IP}
    maxrunning: 1
    return_job: false
    seconds: 60
  __mine_interval: {enabled: true, function: mine.update, jid_include: true, maxrunning: 2,
    minutes: 60, return_job: false, run_on_start: true}
  __update_grains:
    args:
    - {}
    - grains_refresh
    function: event.fire
    jid_include: true
    maxrunning: 1
    minutes: 15
    name: __update_grains
    run: true
    splay: null
  highstate: {enabled: true, function: state.highstate, jid_include: true, maxrunning: 1,
    name: highstate, run: true, seconds: 600, splay: 49}
$ cat /etc/salt/minion.d/minion.conf
# Managed by Salt, do not edit directly
# salt://salt/templates/minion.jinja

master:
  - MASTER1_IP
  - MASTER2_IP
master_alive_interval: 60
master_failback: True
master_failback_interval: 120
master_type: failover
retry_dns: 0
startup_states: highstate
keysize: 4096
pillar_raise_on_missing: True

acceptance_wait_time: 10
acceptance_wait_time_max: 90
auth_timeout: 90
enable_zip_modules: True
grains_refresh_every: 15
hash_type: sha256
log_level: warning
log_level_logfile: debug
random_reauth_delay: 90
random_startup_delay: 10
recon_default: 250
recon_max: 90000
recon_randomize: True
state_verbose: True
state_output: changes

Steps to Reproduce the behavior

Restart the Salt Minion:
systemctl restart salt-minion

Verify the schedule is initially listed:

salt-call schedule.list show_all=True

Wait for one or two scheduled executions. After that, check again:

salt-call schedule.list show_all=True
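
For example, a simple polling loop makes it easy to catch the exact point at which the schedule disappears (a minimal sketch; the 60-second interval is arbitrary):

$ while true; do date; salt-call schedule.list show_all=True; sleep 60; done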

Expected behavior

Schedules remain persistent and the scheduled jobs run continuously at defined intervals.
Salt Minion should terminate cleanly, including all child processes, preserving scheduler integrity across service restarts.

Actual Behavior:

Scheduled jobs silently vanish from the runtime scheduler after running once or twice, stopping any further execution until the minion is restarted.
Child processes become defunct (zombies), causing the scheduler to lose its state and stop executing scheduled tasks.
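
A quick way to confirm the defunct children (a rough sketch, not Salt-specific; pstree needs the psmisc package):

$ ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'      # any zombie process shows state Z
$ pstree -p "$(pgrep -of salt-minion)"            # child tree of the oldest salt-minion process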

$ sudo salt-call schedule.list
local:
    schedule:
      highstate:
        enabled: true
        function: state.highstate
        jid_include: true
        maxrunning: 1
        name: highstate
        saved: true
        seconds: 600
        splay: 49
$ ll /etc/salt.lastcontact ; date
-rw-r--r-- 1 root root 14 Mar 12 03:36 /etc/salt.lastcontact
Wed Mar 12 03:38:54 UTC 2025
$ ll /etc/salt.lastcontact ; date
-rw-r--r-- 1 root root 14 Mar 12 03:46 /etc/salt.lastcontact
Wed Mar 12 03:48:23 UTC 2025

Here, highstate ran twice and then the schedule vanished:

$ sudo salt-call schedule.list
local:
    schedule: {}

Logs and Observations:

Observed Logs:

systemd[1]: salt-minion.service: State 'stop-sigterm' timed out. Killing.
systemd[1]: salt-minion.service: Killing process 1341396 (python3.10) with signal SIGKILL.
systemd[1]: salt-minion.service: Main process exited, code=killed, status=9/KILL
systemd[1]: salt-minion.service: Failed with result 'timeout'.
systemd[1]: salt-minion.service: Unit process 1341414 (/opt/saltstack/) remains running after unit stopped.
systemd[1]: Stopped salt-minion.service - The Salt Minion.
systemd[1]: salt-minion.service: Consumed 1min 6.565s CPU time, 329.1M memory peak, 0B memory swap peak.
systemd[1]: salt-minion.service: Found left-over process 1341414 (/opt/saltstack/) in control group while starting unit. Ignoring.
systemd[1]: salt-minion.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
systemd[1]: Starting salt-minion.service - The Salt Minion...
systemd[1]: Started salt-minion.service - The Salt Minion.
salt-minion[1897212]: The Salt Minion is shutdown.
salt-minion[1341414]: Minion process encountered exception: [Errno 3] No such process
systemd[1]: salt-minion.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: salt-minion.service: Failed with result 'exit-code'.
systemd[1]: salt-minion.service: Consumed 1.030s CPU time.
systemd[1]: salt-minion.service: Scheduled restart job, restart counter is at 1.
systemd[1]: Starting salt-minion.service - The Salt Minion...
systemd[1]: Started salt-minion.service - The Salt Minion.

The scheduler initially loads correctly (salt-call --local config.get schedule shows the correct schedules).
Logs show the scheduler evaluating once or twice, running the scheduled states, and then clearing or losing the schedule internally without any obvious error logging.
Scheduler-related commands (schedule.reload, schedule.enable) don't restore the vanished schedules.
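
For reference, these are the two views being compared above; the first reads the schedule straight from the config files without contacting the running minion, the second asks the running scheduler for its current state:

$ salt-call --local config.get schedule        # schedule as defined in /etc/salt/minion.d/_schedule.conf
$ salt-call schedule.list show_all=True        # schedule as held by the running minion's scheduler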

The exact symptoms I see:

Schedule config is read at startup (config.get schedule is good).

Schedule briefly appears and even runs once or twice.

The scheduler then silently clears or forgets the schedule (schedule.list becomes empty).

Minion logs report no meaningful errors—it’s a silent runtime failure in the scheduler process.

Workarounds Attempted:

  • Renamed schedules to avoid special characters in the schedule name (previously it was core|salt|highstate).

  • Increased seconds from 600 to 1800 and splay to 500.

  • Added a startup_splay of 30 seconds.

  • Cleared the minion cache.

  • Increased file descriptor limits.

  • Explicitly reloaded schedules (schedule.reload).

  • Adjusted the upstream systemd service file /lib/systemd/system/salt-minion.service to reflect these improved configurations:

[Service]
KillMode=mixed
TimeoutStopSec=900
Restart=on-failure
RestartSec=30
Type=simple

KillMode=mixed: Sends SIGTERM to the main process first and, once the stop timeout expires, SIGKILL to any remaining processes in the unit's control group.
TimeoutStopSec=900: Provides sufficient time (15 minutes) for an in-flight highstate to finish before the unit is killed.
Restart=on-failure: Enables automatic recovery after unexpected exits.
RestartSec=30: Waits 30 seconds before restarting the service, avoiding a tight restart loop.
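
Rather than patching the packaged unit under /lib/systemd/system, the same [Service] block can be kept in a drop-in override so it survives package upgrades (a sketch using standard systemd tooling):

$ sudo systemctl edit salt-minion.service      # paste the [Service] block above into the override file
$ sudo systemctl restart salt-minion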

However, none of these workarounds permanently resolves the issue.

Unstable Environments (Issue Occurs):

Ubuntu 24.04 with Salt Minion 3006.x
Ubuntu 20.04 with Salt Minion 3006.x

Versions Report

$ sudo salt-call --versions-report
Salt Version:
          Salt: 3006.9

Python Version:
        Python: 3.10.14 (main, Jun 26 2024, 11:44:37) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.17.1
      cherrypy: 18.6.1
  cryptography: 42.0.5
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.4
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.17.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: ubuntu 24.04.1 noble
        locale: utf-8
       machine: x86_64
       release: 6.8.0-48-generic
        system: Linux
       version: Ubuntu 24.04.1 noble

$ lsb_release -a; uname -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 24.04.1 LTS
Release:	24.04
Codename:	noble
Linux salt-master02 6.8.0-48-generic #48-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 27 14:04:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Severity:
Critical: causes silent failure and configuration drift in production environments.

Additional Notes:
The issue strongly suggests a compatibility problem related to system libraries or runtime conditions on Ubuntu 20.04 and 24.04.

Request:
Please prioritize investigating compatibility and scheduling behavior in Salt Minion v3006.x.

Let me know if additional logs or tests are required.
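
If it helps, the scheduler-related entries around the time the schedule vanishes can be pulled from the minion debug log (log_level_logfile is already set to debug above; this assumes the default log path):

$ sudo grep -i schedule /var/log/salt/minion | tail -n 100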

@ns-ksahoo ns-ksahoo added Bug broken, incorrect, or confusing behavior needs-triage labels Mar 12, 2025

welcome bot commented Mar 12, 2025

Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
Please be sure to review our Code of Conduct. Also, check out some of our community resources including:

There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
If you have additional questions, email us at [email protected]. We’re glad you’ve joined our community and look forward to doing awesome things with you!
