h1. User instructions for Dione cluster

University of Turku
Åbo Akademi
Jussi Salmi (jussi.salmi@utu.fi)

h2. 1. Resources

h3. 1.1. Computation nodes

<pre>
PARTITION NODES NODELIST  MEMORY
normal    36    di[1-36]  192GB
gpu       6     di[37-42] 384GB
</pre>

Dione has 6 GPU nodes on which users can run computations that benefit from very fast, highly parallel number crunching, for example training neural networks. The other 36 nodes are general-purpose compute nodes. Please use the 'normal' or 'gpu' partition when starting jobs so that the GPU resources are kept separate. The nodes are connected by a fast Infiniband network, which enables MPI (Message Passing Interface) jobs in the cluster (see the sketch below). In addition, the cluster is connected to the EGI grid (European Grid Infrastructure) and NORDUGRID, which are allowed to use a part of the computational resources.

The website

https://p55cc.utu.fi/

contains information on the cluster, hosts a cluster monitor, and provides instructions on getting access to and using the cluster.
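
For illustration, a minimal MPI batch job could look like the following sketch. The module name (openmpi) and the executable are placeholders, not something documented for Dione; check 'module avail' for the MPI modules actually installed.

<pre>
#!/bin/bash
#SBATCH --job-name=mpi-test
#SBATCH --partition=normal
#SBATCH --nodes=2              # two compute nodes
#SBATCH --ntasks-per-node=4    # four MPI ranks per node
#SBATCH -t 10:00

module purge
module load openmpi            # placeholder module name

srun ./my_mpi_program          # srun launches the MPI ranks on the allocated nodes
</pre>
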
h3. 1.2. Disk space

The system has an NFS4 file system with 100 TB of capacity on the home partition. The file system is not backed up, so users must take care of their own backups.
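
As an illustration, important data can be copied off the cluster with rsync. The hostname and paths below are placeholders; run the command on your own machine.

<pre>
# Copy a results directory from Dione to a local backup directory
rsync -avz <username>@<dione login node>:~/results/ ~/dione-backup/results/
</pre>
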
h3. 1.3. Software

The system uses the SLURM (Simple Linux Utility for Resource Management) workload manager for scheduling jobs.

The cluster uses the module system for loading software; different versions of a package are available as separate modules (see section 3).

h2. 2. Executing jobs in the cluster

Users may not execute jobs on the login node. All jobs must be dispatched to the cluster using SLURM commands. Normally a script is used to define the job and its parameters for SLURM. A large number of parameters and environment variables can be used to control how jobs are executed; see the SLURM manual for a complete list.

A typical job script (here named batch-submit.job) can look as follows:

<pre>
#!/bin/bash
#SBATCH --job-name=test
#SBATCH -o result.txt
#SBATCH --workdir=<Workdir path>
#SBATCH -c 1
#SBATCH -t 10:00
#SBATCH --mem=10M
#SBATCH --partition=normal

module purge # Purge modules for a clean start
module load <desired modules if needed> # You can either inherit the module environment or load modules here

srun <executable>
srun sleep 60
</pre>

The script is run with

sbatch batch-submit.job

The script defines several parameters that will be used for the job.

<pre>
--job-name          defines the job name
-o result.txt       redirects the standard output to result.txt
--workdir           defines the working directory
-c 1                sets the number of CPUs per task to 1
-t 10:00            sets the time limit of the task to 10 minutes; after that the job is stopped
--mem=10M           sets the memory required for the task to 10 MB
--partition=normal  use the 'normal' partition. Please use the 'normal' or 'gpu' partition to keep the GPU resources separate ('all' uses all partitions).
</pre>

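
A GPU job is submitted to the 'gpu' partition. The sketch below is illustrative only: whether GPUs on Dione must be requested explicitly with --gres, and which CUDA module is available, is not documented here and should be checked locally.

<pre>
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1      # assumption: GPUs are allocated via GRES
#SBATCH -t 10:00
#SBATCH --mem=4G

module purge
module load cuda          # placeholder module name; check 'module avail'

srun ./my_gpu_program     # placeholder executable
</pre>
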
srun starts a task. When a task is started, SLURM gives it a job ID, which can be used to track its execution with e.g. the squeue command.

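
For example, a submitted job can be tracked and, if necessary, cancelled like this (the job ID 12345 is only illustrative):

<pre>
$ sbatch batch-submit.job
Submitted batch job 12345
$ squeue -j 12345      # show the status of this job
$ scancel 12345        # cancel the job if needed
</pre>
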
h2. 3. The module system

Many of the software packages on Dione require you to load the corresponding environment module before using the software. Different versions of a package can be selected with the module command.

<pre>
module avail               Show available modules

module list                Show loaded modules

module unload <module>     Unload a module

module load <module>       Load a module

module load <module>/10.0  Load version 10.0 of <module>

module purge               Unload all modules
</pre>

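
For example, a session could look like the following sketch; the module names and version numbers are illustrative, not a list of what is installed on Dione.

<pre>
$ module avail           # see what is installed
$ module load gcc        # load the default version of a module
$ module load cuda/10.0  # or pin a specific version
$ module list            # verify what is loaded
</pre>
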
h2. 4. Useful commands in SLURM

<pre>
sinfo                       Show the current status of the cluster
sinfo -p gpu                Show the status of the GPU partition
sinfo -O all                Show a comprehensive status report, node by node

sstat <job id>              Show information on your job

squeue                      Show the status of the job queue
squeue -u <username>        Show only your jobs

srun <command>              Dispatch a job to the scheduler
sbatch <script>             Run a script defining the jobs to be run

scontrol                    Control your jobs in many aspects
scontrol show job <job id>  Show details about a job
scontrol -u <username>      Show only a certain user's jobs

scancel <job id>            Cancel a job
scancel -u <username>       Cancel all your jobs
</pre>

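
As a short illustrative workflow combining the commands above (the script name is a placeholder):

<pre>
$ sinfo -p gpu                        # check the state of the GPU nodes
$ sbatch --partition=gpu my_job.job   # submit a job to the gpu partition
$ squeue -u $USER                     # follow your own jobs
$ scontrol show job <job id>          # inspect a single job in detail
</pre>
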
h2. 5. Further information

Further information can be requested from the administrators (fgi-admins@lists.utu.fi).