SLURM 20.02 Cluster Setup

1. Prerequisites:

1. Add host entries in /etc/hosts on all nodes:

192.168.1.6 slurmmaster.unixadmin.in slurmmaster

192.168.1.7 cnode01.unixadmin.in cnode01

192.168.1.8 cnode02.unixadmin.in cnode02

2. Disable SELinux in /etc/selinux/config:

    SELINUX=disabled

3. Stop and disable the firewall service:

systemctl stop firewalld.service

systemctl disable firewalld.service

4. Start NTP Service:

systemctl start chronyd

systemctl enable chronyd
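
MUNGE and Slurm are sensitive to clock skew, so it is worth confirming that chrony is actually synchronising time on every node, for example:

chronyc sources -v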

5. Set up SSH key-based (passwordless) login from the master node to all compute nodes (a minimal sketch follows).
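
A minimal way to set this up, assuming the root account is used (as it is for the scp commands later in this guide):

ssh-keygen -t rsa        # accept the defaults; leave the passphrase empty

ssh-copy-id root@cnode01

ssh-copy-id root@cnode02

Repeat the ssh-copy-id step for any additional nodes (for example slurmdbd and loginnode) that the master will copy files to later.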

2. Installation steps

  1. MUNGE INSTALLATION:

MUNGE is an authentication service that allows processes to authenticate within a group of hosts sharing common users and groups, based on UID and GID. It is secured by an authentication key shared among all the nodes.

Prerequisites for MUNGE installation:

Create users and groups for MUNGE and Slurm on all the nodes:

  1. Create the munge user with UID 1001:

export MUNGEUSER=1001

  2. Create a munge group and add the munge user to it:

groupadd -g $MUNGEUSER munge

useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge

  3. Create the slurm user with UID 1002:

export SLURMUSER=1002

  4. Create a slurm group and add the slurm user to it:

groupadd -g $SLURMUSER slurm

useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm

MUNGE installation steps:

On the master node:

  1. Install EPEL in order to install MUNGE:

yum install epel-release.noarch 

Then, install MUNGE:

yum install munge munge-libs munge-devel  

  2. Create a MUNGE authentication key:

/usr/sbin/create-munge-key 

  3. Copy the MUNGE authentication key to the /home directory:

cp /etc/munge/munge.key /home  

  4. Then, share the copied key with the other nodes (cnode01, cnode02, slurmdbd, and loginnode):

scp /home/munge.key root@cnode01:/etc/munge 

scp /home/munge.key root@cnode02:/etc/munge 

scp /home/munge.key root@slurmdbd:/etc/munge

scp /home/munge.key root@loginnode:/etc/munge

On the Slurm master, slurmdbd, and compute nodes:

  1. Set the ownership and permissions on the MUNGE directories:

chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/

 chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/

  2. Enable and start the munge service:

 systemctl enable munge

 systemctl start munge

  3. Test MUNGE on the master node:

munge -n | unmunge
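
To confirm the shared key also works across nodes, a credential generated on the master can be decoded remotely (this relies on the passwordless SSH set up in the prerequisites):

munge -n | ssh cnode01 unmunge

munge -n | ssh cnode02 unmunge

A STATUS of Success in the output shows that both nodes share the same key and their clocks are in sync.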

  2. Slurm Installation steps:

  1. MariaDB Installation on the Slurmdbd Node:
  1. Install MariaDB:

yum install mariadb-server mariadb-devel

  2. Enable and start the MariaDB service:

systemctl start mariadb

systemctl enable mariadb

  3. Set up the root password and secure MariaDB using the following command:

mysql_secure_installation 

  4. Log in to the database with the password just created, then create and configure the slurm_acct_db database:

mysql -u root -p

Enter password:

MariaDB [(none)]> grant all on slurm_acct_db.* TO 'slurm'@'slurmdbd.unixadmin.in' identified by 'root1234' with grant option;

MariaDB [(none)]> SHOW VARIABLES LIKE 'have_innodb';

MariaDB [(none)]> create database slurm_acct_db;

MariaDB [(none)]> quit;
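
As an optional sanity check, the database and the grant can be verified from the shell before moving on (assuming the slurm account created above):

mysql -u root -p -e "SHOW DATABASES LIKE 'slurm_acct_db';"

mysql -u root -p -e "SHOW GRANTS FOR 'slurm'@'slurmdbd.unixadmin.in';"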

  5. Modify the InnoDB configuration:

Set innodb_lock_wait_timeout, innodb_log_file_size, and innodb_buffer_pool_size to larger values:

  vim /etc/my.cnf.d/innodb.cnf

[mysqld]

 innodb_buffer_pool_size=1024M

 innodb_log_file_size=64M

 innodb_lock_wait_timeout=900

To apply this change, stop the database, move or remove the existing InnoDB log files, and then restart the database:

 systemctl stop mariadb

 mv /var/lib/mysql/ib_logfile? /tmp/

 systemctl start mariadb
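
A quick check that the new InnoDB settings took effect (1024M should be reported as 1073741824 bytes):

mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"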

2. On Slurm Master:

  1. In order to build and install Slurm, install the following prerequisites:

yum install openssl openssl-devel pam-devel rpm-build numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad

yum install python3

wget http://mirror.centos.org/centos/7/os/x86_64/Packages/perl-ExtUtils-MakeMaker-6.68-3.el7.noarch.rpm

yum localinstall perl-ExtUtils-MakeMaker-6.68-3.el7.noarch.rpm

Note: for the slurm-slurmdbd and MySQL accounting packages to be built in the next step, the MariaDB development files (mariadb-devel) should also be present on the build host.

  2. Download the Slurm source tarball:

wget https://download.schedmd.com/slurm/slurm-20.02.5.tar.bz2

  3. Build the RPMs:

rpmbuild -ta slurm-20.02.5.tar.bz2 

The RPM packages will typically be in /root/rpmbuild/RPMS/x86_64/ and should be installed on all relevant nodes.

  4. Share the built RPMs with all the nodes via NFS:

yum install nfs-utils libnfsidmap

systemctl enable rpcbind

systemctl start rpcbind

systemctl enable nfs-server

systemctl start nfs-server

systemctl start rpc-statd

systemctl enable nfs-idmapd

Create a directory under /source, copy all the RPMs into it, and export it so that it is accessible to all nodes (see the copy command below).

mkdir /source/slurm_20.02

chmod 777 /source/slurm_20.02
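
Copy the freshly built RPMs into the exported directory (this assumes the default rpmbuild output location mentioned above):

cp /root/rpmbuild/RPMS/x86_64/slurm-*.rpm /source/slurm_20.02/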

vi /etc/exports

/source 192.168.1.7(rw,sync,no_root_squash)

/source 192.168.1.8(rw,sync,no_root_squash)

exportfs -r

  showmount -e 

Export list for slurmmaster.unixadmin.in:

/source 192.168.1.8,192.168.1.7

  5. Now install the relevant RPMs on the master node:

 yum localinstall slurm-20.02.5-1.el7.x86_64.rpm

 yum localinstall slurm-perlapi-20.02.5-1.el7.x86_64.rpm

 yum localinstall slurm-slurmctld-20.02.5-1.el7.x86_64.rpm

 yum localinstall slurm-example-configs-20.02.5-1.el7.x86_64.rpm

 yum localinstall slurm-torque-20.02.5-1.el7.x86_64.rpm

On the Compute Nodes and the Slurmdbd node:

  1. Install the NFS client utilities, enable and start rpcbind, and create a directory to mount the RPM share exported by the master node:

yum install nfs-utils libnfsidmap

systemctl enable rpcbind

systemctl start rpcbind

  showmount -e 192.168.1.6

Create the directory /mnt/source/slurm_20.02 and mount it 

mkdir -p /mnt/source/slurm_20.02

mount 192.168.1.6:/source/slurm_20.02 /mnt/source/slurm_20.02
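
Optionally, the mount can be made persistent across reboots with an /etc/fstab entry along these lines (a sketch, assuming the export and mount point above):

echo "192.168.1.6:/source/slurm_20.02  /mnt/source/slurm_20.02  nfs  defaults,_netdev  0 0" >> /etc/fstab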

  2. Now install the relevant RPMs on the Slurmdbd node and the compute nodes (an example install command is given after the package lists).

On the Slurmdbd node:

slurm-20.02.5-1.el7.x86_64.rpm

slurm-slurmdbd-20.02.5-1.el7.x86_64.rpm

slurm-devel-20.02.5-1.el7.x86_64.rpm

On Compute nodes:

slurm-20.02.5-1.el7.x86_64.rpm

slurm-perlapi-20.02.5-1.el7.x86_64.rpm

slurm-pam_slurm-20.02.5-1.el7.x86_64.rpm

slurm-libpmi-20.02.5-1.el7.x86_64.rpm

slurm-slurmd-20.02.5-1.el7.x86_64.rpm

slurm-devel-20.02.5-1.el7.x86_64.rpm

slurm-example-configs-20.02.5-1.el7.x86_64.rpm

slurm-torque-20.02.5-1.el7.x86_64.rpm 
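
On each compute node the packages listed above can be installed from the NFS mount in one pass, for example:

cd /mnt/source/slurm_20.02

yum localinstall slurm-20.02.5-1.el7.x86_64.rpm slurm-perlapi-20.02.5-1.el7.x86_64.rpm slurm-pam_slurm-20.02.5-1.el7.x86_64.rpm slurm-libpmi-20.02.5-1.el7.x86_64.rpm slurm-slurmd-20.02.5-1.el7.x86_64.rpm slurm-devel-20.02.5-1.el7.x86_64.rpm slurm-example-configs-20.02.5-1.el7.x86_64.rpm slurm-torque-20.02.5-1.el7.x86_64.rpm

The same pattern works on the Slurmdbd node with its shorter package list.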

  3. SLURM Configuration:
  1. Configure the slurmdbd configuration file:

On the master node, edit the following configuration files as required:

vim /etc/slurm/slurmdbd.conf

#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=192.168.1.9
DbdHost=slurmdbd.unixadmin.in
DbdPort=6819
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=verbose
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
StorageHost=slurmdbd.unixadmin.in
#StoragePort=6819
StoragePass=root1234
StorageUser=slurm
StorageLoc=slurm_acct_db
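
Since version 20.02, slurmdbd refuses to start unless slurmdbd.conf is owned by the SlurmUser and is not readable by other users, so restrict the file on every node that holds a copy:

chown slurm:slurm /etc/slurm/slurmdbd.conf

chmod 600 /etc/slurm/slurmdbd.conf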

Then enable and start the slurmdbd service on the node where the slurm-slurmdbd package was installed:

systemctl start slurmdbd

systemctl enable slurmdbd

systemctl status slurmdbd

  2. Modify the following slurm.conf parameters according to the cluster:

vim /etc/slurm/slurm.conf

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
# slurm.conf file generated by configurator.html.
# See the slurm.conf man page for more information.
#
ClusterName=HPC_Cluster
ControlMachine=slurmmaster
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd
#AccountingStorageLoc=
#AccountingStoragePass=root1234
AccountingStorageUser=slurm
#
# COMPUTE NODES
NodeName=cnode[01-02] Procs=1 State=UNKNOWN
PartitionName=Dev Nodes=ALL Default=YES MaxTime=INFINITE State=UP

  3. Once the slurmdbd.conf and slurm.conf parameters are filled in correctly, distribute these configuration files to the slurmdbd node, the login node, and all the compute nodes:

First, copy both configuration files to the /home directory using the following commands:

cp /etc/slurm/slurm.conf /home

cp /etc/slurm/slurmdbd.conf /home

Then, send them to the /etc/slurm directory of the other nodes using the following commands:

scp /home/slurm.conf root@cnode01:/etc/slurm

scp /home/slurmdbd.conf root@cnode01:/etc/slurm

scp /home/slurmdbd.conf root@slurmdbd:/etc/slurm

scp /home/slurm.conf root@slurmdbd:/etc/slurm

scp /home/slurm.conf root@loginnode:/etc/slurm

scp /home/slurmdbd.conf root@loginnode:/etc/slurm

  4. Create the folders that hold the state and log files and assign the correct permissions to them:

On the Master Node:

 mkdir /var/spool/slurmctld

 chown slurm:slurm /var/spool/slurmctld

 chmod 755 /var/spool/slurmctld

 mkdir  /var/log/slurm

 touch /var/log/slurm/slurmctld.log

 touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log

 chown -R slurm:slurm /var/log/slurm/

On login node and Compute Nodes:

mkdir /var/spool/slurmd

chown slurm: /var/spool/slurmd

chmod 755 /var/spool/slurmd

mkdir /var/log/slurm/

touch /var/log/slurm/slurmd.log

chown -R slurm:slurm /var/log/slurm/slurmd.log

  5. Test the node configuration with the following command (run on a compute node):

slurmd -C

The output will look similar to:

NodeName=cnode01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3770

UpTime=0-00:59:45

  6. Activate the services:

Start the slurmd service on the compute nodes:

systemctl enable slurmd.service

systemctl start slurmd.service

systemctl status slurmd.service

Start the slurmctld service on the master node:

systemctl enable slurmctld.service

systemctl start slurmctld.service

systemctl status slurmctld.service
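
With slurmdbd, slurmctld, and slurmd all running, a quick functional check can be done from the master (or login) node, assuming the Dev partition and the two compute nodes defined in slurm.conf above:

sinfo

srun -N2 hostname

sinfo should show the Dev partition with cnode01 and cnode02 in the idle state, and srun should print both hostnames, confirming that scheduling and MUNGE authentication work end to end.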
