When we first set up our group’s HPC cluster here at CU Boulder, we ran into a recurring problem: processes would hang on the compute nodes. The jobs were no longer in the queue, but their processes were still running on the nodes, which significantly slowed down the queued jobs that were legitimately scheduled there.
Because this happened relatively frequently, we needed a script that could be run from cron to automatically kill the non-queued processes on the compute nodes.
Download the script here.
As a starting point, I found a Python script developed by David Black-Schaffer for use on a ROCKS Linux cluster at Stanford (David’s Website). That script was written for the SGE queueing environment, whereas our group had decided on the Torque PBS queue system.
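At a high level the logic is simple: ask the queue which users currently have jobs on a given node, then kill any processes on that node owned by users who have no jobs there and who are not on the priority list. The sketch below illustrates that idea only; it is not the actual script, and the qstat flags, output parsing, and function names are my assumptions for a Torque setup.

```python
import re
import subprocess

# Users whose processes are never killed. In practice this should
# also include system accounts (daemon, postfix, ...), or their
# processes will be killed too.
PRIORITY_USERS = {"root"}

def queued_users_on_node(node):
    """Return the set of users with Torque jobs running on `node`.

    Parses `qstat -rn1` (running jobs, one line per job, with the
    exec-host list appended). The column layout varies across Torque
    versions, so treat this parsing as illustrative.
    """
    out = subprocess.run(["qstat", "-rn1"],
                         capture_output=True, text=True).stdout
    users = set()
    for line in out.splitlines():
        fields = line.split()
        # Job lines list the owner in the second column and the exec
        # hosts, e.g. "node12/0+node12/1", at the end of the line.
        if len(fields) > 2 and re.search(r"\b%s/" % re.escape(node), line):
            users.add(fields[1])
    return users

def kill_strays(node):
    """Kill processes on `node` owned by users with no queued jobs there."""
    allowed = queued_users_on_node(node) | PRIORITY_USERS
    ps = subprocess.run(["ssh", node, "ps", "-eo", "user:32,pid"],
                        capture_output=True, text=True).stdout
    for line in ps.splitlines()[1:]:  # skip the ps header line
        fields = line.split()
        if len(fields) < 2:
            continue
        user, pid = fields[0], fields[1]
        if user not in allowed:
            subprocess.run(["ssh", node, "kill", "-9", pid])
```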
Adapting the script took quite a few edits. This was actually my first exposure to the Python language, and it took me much longer than it should have. To use the script yourself, you will need to make a number of edits that depend on your system setup. To get you started, here are a few of the changes that might be needed (illustrated after the list):
- Define the list of priority users. These are users who are exempt from having their processes killed.
- On line 65, modify the re.match command to match the node names for your cluster.
- On line 77, modify the re.match command to match the output of the gstat command on your cluster.
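Concretely, those edits amount to something like the following. The node-name pattern assumes a ROCKS-style naming scheme (compute-0-0, compute-0-1, and so on); the variable names and the gstat pattern are my guesses, so check them against your own copy of the script and your gstat output.

```python
import re

# Users whose processes are never killed; include system accounts
# (daemon, postfix, ...) as well as exempt human users.
priority_users = ["root", "daemon", "youradmin"]  # "youradmin" is a placeholder

# Line 65: match your cluster's node names. ROCKS-style clusters
# typically name nodes compute-<rack>-<slot>.
node_pattern = re.compile(r"compute-\d+-\d+")

# Line 77: match the per-node lines in gstat output. The format
# differs between Ganglia versions; run `gstat -a` and adjust.
gstat_pattern = re.compile(r"^\s*(compute-\d+-\d+)\s")
```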
- As root, run “crontab -e”
- Add a cron entry to run the kill script, for example:
- 0 12 * * * /root/scripts/kill_non_queued_processes.py
- Restart cron
- /etc/init.d/crond restart
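If the script doesn’t appear to fire at the scheduled time, two quick checks help. The log path below is an assumption based on a Red Hat-style system (which ROCKS is); adjust it for your distribution.

```
crontab -l                              # confirm the entry was saved for root
grep kill_non_queued /var/log/cron      # check whether cron actually ran it
```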