[Documentation:System] Repair Services Cron Job Documentation #679

Open
wants to merge 7 commits into
base: main
2 changes: 2 additions & 0 deletions _docs/developer/development_instructions/automated_grading.md
@@ -124,6 +124,8 @@ To debug new features for autograding, it can be helpful to run
`submitty_autograding_shipper.py` and `submitty_autograding_worker.py`
interactively and inspect the output.

_NOTE: A cron job runs hourly to detect autograding shipper/worker outages on both local and remote machines. Because it can interfere with debugging (for example, by restarting daemons you have stopped), disable this job before proceeding. See [Capture Cron Error Messages](/sysadmin/installation/system_customization#capture-cron-error-messages) for instructions on disabling the script._

To do this:

1. Stop the daemons (on each server, as appropriate)
21 changes: 18 additions & 3 deletions _docs/sysadmin/installation/system_customization.md
@@ -28,10 +28,25 @@ You may want to back up more of `/var/local/submitty` to save configurations and

## Capture cron error messages

The `submitty_daemon` user runs the [sbin/send_email.py](https://github.com/Submitty/Submitty/blob/master/sbin/send_email.py)
script. Console output from this script can be emailed to a sysadmin to help ensure that errors can be reported and addressed.
To keep Submitty services such as the WebSocket server reliable, the [sbin/repair_services.sh](https://github.com/Submitty/Submitty/blob/master/sbin/repair_services.sh) script, run hourly by the `submitty_daemon` user, checks their health and restarts any that are inactive. The script uses `systemctl` together with several health-check utility scripts to verify that each service is active, and triggers a restart when it detects an inactive state.
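
The check-and-restart pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of `repair_services.sh`: the function name `repair_if_inactive` is hypothetical, and the `SYSTEMCTL` variable exists only so the command can be replaced with a stub for dry runs. It relies on `systemctl is-active --quiet`, which exits with status 0 only when the unit is active.

```bash
# Hypothetical sketch: restart a systemd unit only if it is not active.
# SYSTEMCTL can be overridden (e.g. with a stub function) for dry runs.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

repair_if_inactive() {
    local service="$1"
    # `is-active --quiet` prints nothing and exits 0 only when the unit is active.
    if "$SYSTEMCTL" is-active --quiet "$service"; then
        echo "$service: active"
    else
        echo "$service: inactive, restarting"
        "$SYSTEMCTL" restart "$service"
    fi
}
```

Calling `repair_if_inactive example.service` would leave a running unit untouched and restart a stopped one; `example.service` is a placeholder, not an actual Submitty unit name.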

The first line should be set as `MAILTO=` with a valid email address. For example:
Service failures can occur for various reasons, including unhandled exceptions, memory leaks, port binding issues, or OS-level disruptions such as resource exhaustion. Each failure is logged with its timestamp, source, and last output to a per-day file named `YYYYMMDD.txt` in the `/var/log/services` directory.
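
As a small sketch of the naming scheme above, the helper below builds the log path for a given day. The function name `service_log_path` is hypothetical and not part of Submitty; with no argument it assumes today's date as reported by `date +%Y%m%d`.

```bash
# Hypothetical helper: build the path of the service outage log for one day.
# Pass a date as YYYYMMDD, or omit it to use today's date.
service_log_path() {
    local day="${1:-$(date +%Y%m%d)}"
    echo "/var/log/services/${day}.txt"
}
```

For example, `tail -n 50 "$(service_log_path)"` would show the most recent entries for today, assuming a log file exists for the current day.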

To disable this auto-repair mechanism, comment out the relevant line in the source `.setup/submitty_crontab` file within your repository. Since the crontab is auto-generated during installation, any changes must be followed by a re-run of `submitty_install` to persist them.

```bash
# In .setup/submitty_crontab, comment out the repair_services.sh line:
# 0 * * * * submitty_daemon sudo /usr/local/submitty/sbin/repair_services.sh

# Then re-apply the configuration:
submitty_install
```

_Note: This mechanism should only be disabled with caution in production environments._
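
If you prefer to script the edit rather than make it by hand, something like the following could work. It is a sketch only: the function name `disable_repair_job` is hypothetical, it matches any line mentioning `repair_services.sh`, and it assumes the file path is passed as the first argument. Lines that are already commented out are left unchanged, so running it twice is harmless.

```bash
# Hypothetical helper: comment out active crontab lines that mention
# repair_services.sh in the file given as the first argument.
disable_repair_job() {
    local crontab_file="$1"
    local tmp
    tmp="$(mktemp)" || return 1
    # Skip lines that already start with "#"; prefix "# " to matching lines.
    if sed '/^[[:space:]]*#/!{/repair_services\.sh/s/^/# /;}' "$crontab_file" > "$tmp"; then
        # Write back in place so the original file's permissions are kept.
        cat "$tmp" > "$crontab_file"
    fi
    rm -f "$tmp"
}
```

After running `disable_repair_job .setup/submitty_crontab` from the repository root, `submitty_install` still needs to be re-run for the change to take effect.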

The `submitty_daemon` user also runs other scripts, such as [sbin/send_email.py](https://github.com/Submitty/Submitty/blob/master/sbin/send_email.py), which sends pending emails every minute. Console output from these scripts can be emailed to a sysadmin so that errors are reported and can be addressed.

The first line of the crontab should be set to `MAILTO=` followed by a valid email address, as shown below.
```
MAILTO=<valid email address>
* * * * * python3 /usr/local/submitty/sbin/send_email.py
6 changes: 6 additions & 0 deletions _docs/sysadmin/troubleshooting/system_debugging.md
@@ -62,6 +62,12 @@ redirect_from:
/var/log/nginx/error.log
```

* Look for errors in the daily service outage log

```
/var/log/services/YYYYMMDD.txt
```

* Check the SSL keys / certificates for apache & nginx.
Look for ssl key & certificate files specified in the enabled
`.conf` files for apache & nginx: