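What is TACC launcher?

It is a simple utility that lets users submit many single-threaded or multi-threaded tasks from within a single batch script.

For a detailed introduction, see the official site: link

Download: link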

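How do you use TACC launcher?

I strongly recommend reading the usage instructions on the official site, which are very detailed, so I won't repeat them here. Readers who are less comfortable with English can run the page through a web translation tool.

In short:

  1. Download the tool
  2. Unpack it
  3. No compilation needed!
  4. Set the environment variables
  5. Write a joblist file that lists every task to run
  6. Submit with the launcher command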
TACC launcher + slurm example

Preparing the test case

We prepare a joblist file, myjoblist, containing the tasks to run. To keep things simple, the test uses 12 hello-world lines:

echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
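For longer job lists, writing the file by hand gets tedious. Here is a minimal sketch of generating it with a loop, assuming the same file name myjoblist and the same test command. JOBFILE and NJOBS are illustrative variable names, not Launcher settings.

```shell
#!/bin/bash
# Generate a job file with one command per line.
# JOBFILE and NJOBS are illustrative names, not Launcher settings.
JOBFILE=myjoblist
NJOBS=12

: > "$JOBFILE"                       # start from an empty file
for i in $(seq 1 "$NJOBS"); do
    echo 'echo "hello, world"' >> "$JOBFILE"
done

wc -l < "$JOBFILE"                   # number of jobs written
```

In a real run the loop body would presumably write a different command on each line, for example one per input file.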

Writing the submission script

Next we write a submission script, sub.sh, containing the launcher settings and command:

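#!/bin/bash
# Job list: one command per line (change to the real path)
export LAUNCHER_JOB_FILE=/path/to/myjoblist
# Launcher install directory (change to the real path)
export LAUNCHER_DIR=$HOME/launcher/launcher-3.1.1
export PATH=$LAUNCHER_DIR:$PATH
export LAUNCHER_PLUGIN_DIR=$LAUNCHER_DIR/plugins
# Resource manager plugin and scheduling method
export LAUNCHER_RMI=SLURM
export LAUNCHER_SCHED=interleaved
# Run the jobs from the submission directory
export LAUNCHER_WORKDIR=$(pwd)
# Start the launcher
"$LAUNCHER_DIR/paramrun"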

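Notes:

  1. LAUNCHER_JOB_FILE is the path to myjoblist; change it to the actual path.
  2. LAUNCHER_DIR is the launcher install path; change it to the actual path.
  3. The other variables do not need to be changed for now.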
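Since the two paths above are the settings most likely to be wrong, a quick check before submitting can save a queue wait. The following is only a sketch: check_launcher_env is a hypothetical helper, not part of Launcher, and it assumes paramrun sits directly in LAUNCHER_DIR as in the script above.

```shell
#!/bin/bash
# Check the Launcher environment before submitting.
# check_launcher_env is a hypothetical helper, not part of Launcher.
check_launcher_env() {
    if [ ! -x "$LAUNCHER_DIR/paramrun" ]; then
        echo "paramrun not found or not executable in LAUNCHER_DIR=$LAUNCHER_DIR"
        return 1
    fi
    if [ ! -r "$LAUNCHER_JOB_FILE" ]; then
        echo "job file not readable: LAUNCHER_JOB_FILE=$LAUNCHER_JOB_FILE"
        return 1
    fi
    # grep -c '' counts the lines in the file, i.e. the number of jobs
    echo "ok: $(grep -c '' "$LAUNCHER_JOB_FILE") lines in $LAUNCHER_JOB_FILE"
}
```

One way to use it would be to put `check_launcher_env || exit 1` after the export lines in sub.sh, so a bad path stops the script before paramrun is reached.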

Submitting the script

yhbatch -N 2 -n 6 -p debug sub.sh

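Notes:

  1. -N 2 requests 2 nodes.
  2. -n 6 requests 6 CPU cores (6 in total, not 6 per node; also note that n must be divisible by N, otherwise an error is reported).
  3. -p debug selects the debug partition.

yhbatch is this cluster's counterpart of the sbatch command; on a stock slurm installation the same options would be passed to sbatch.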
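The divisibility rule is easy to check before submitting. Here is a minimal sketch, with procs_per_host as a hypothetical helper name; it only does the arithmetic and does not call slurm or Launcher.

```shell
#!/bin/bash
# Compute processes per node from -N (nodes) and -n (total tasks),
# refusing combinations where n is not a multiple of N.
# procs_per_host is a hypothetical helper for illustration.
procs_per_host() {
    local nodes=$1 tasks=$2
    if [ $((tasks % nodes)) -ne 0 ]; then
        echo "error: -n $tasks is not divisible by -N $nodes" >&2
        return 1
    fi
    echo $((tasks / nodes))
}

procs_per_host 2 6    # the values used above; prints 3
```

With -N 2 -n 5, the function prints an error and returns a non-zero status instead.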

Checking the results

Jobs submitted through the slurm scheduler write to a default output file, slurm-<jobid>.out. Here is its content:

Launcher: Setup complete.

------------- SUMMARY ---------------
   Number of hosts:    2
   Working directory:  $HOME/workdir/test
   Processes per host: 3
   Total processes:    6
   Total jobs:         12
   Scheduling method:  interleaved

-------------------------------------
Launcher: Starting parallel tasks...
Launcher: Task 1 running job 2 on cn95 (echo "hello, world")
Launcher: Task 0 running job 1 on cn95 (echo "hello, world")
hello, world
hello, world
Launcher: Task 2 running job 3 on cn95 (echo "hello, world")
hello, world
Launcher: Job 1 completed in 0 seconds.
Launcher: Task 5 running job 6 on cn96 (echo "hello, world")
Launcher: Task 4 running job 5 on cn96 (echo "hello, world")
hello, world
hello, world
Launcher: Task 3 running job 4 on cn96 (echo "hello, world")
Launcher: Job 3 completed in 0 seconds.
hello, world
Launcher: Job 2 completed in 0 seconds.
Launcher: Job 6 completed in 0 seconds.
Launcher: Job 5 completed in 0 seconds.
Launcher: Job 4 completed in 0 seconds.
Launcher: Task 0 running job 7 on cn95 (echo "hello, world")
hello, world
Launcher: Task 2 running job 9 on cn95 (echo "hello, world")
hello, world
Launcher: Task 1 running job 8 on cn95 (echo "hello, world")
hello, world
Launcher: Task 5 running job 12 on cn96 (echo "hello, world")
hello, world
Launcher: Task 3 running job 10 on cn96 (echo "hello, world")
hello, world
Launcher: Task 4 running job 11 on cn96 (echo "hello, world")
hello, world
Launcher: Job 7 completed in 0 seconds.
Launcher: Job 9 completed in 0 seconds.
Launcher: Job 8 completed in 0 seconds.
Launcher: Job 12 completed in 0 seconds.
Launcher: Job 10 completed in 0 seconds.
Launcher: Job 11 completed in 0 seconds.
Launcher: Task 0 done. Exiting.
Launcher: Task 2 done. Exiting.
Launcher: Task 1 done. Exiting.
Launcher: Task 5 done. Exiting.
Launcher: Task 3 done. Exiting.
Launcher: Task 4 done. Exiting.
Launcher: Done. Job exited without errors

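Notes:

Parameter            Value                Explanation
Number of hosts      2                    Set by -N 2, so 2 nodes
Working directory    $HOME/workdir/test   The directory the job was actually submitted from
Processes per host   3                    Processes per node, computed as 6/2 = 3, so make sure it divides evenly!
Total processes      6                    Set by -n 6, so 6 processes in total
Total jobs           12                   myjoblist contains 12 lines, so 12 jobs
Scheduling method    interleaved          The scheduling method; there are 3 options, see the official site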
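The log above also hints at how the interleaved method hands out work: task 0 ran jobs 1 and 7, task 1 ran jobs 2 and 8, and so on up to task 5 with jobs 6 and 12. The sketch below reproduces that round-robin pattern for 6 processes and 12 jobs. The mapping is inferred from this one log, not taken from the Launcher source, and interleaved_map is a hypothetical name.

```shell
#!/bin/bash
# Print which jobs each task would run under round-robin assignment:
# task t (0-based) gets jobs t+1, t+1+ntasks, t+1+2*ntasks, ...
# interleaved_map is a hypothetical name; the pattern is inferred from the log.
interleaved_map() {
    local ntasks=$1 njobs=$2 t j line
    for ((t = 0; t < ntasks; t++)); do
        line="Task $t:"
        for ((j = t + 1; j <= njobs; j += ntasks)); do
            line="$line $j"
        done
        echo "$line"
    done
}

interleaved_map 6 12
```

For this run the first line printed is "Task 0: 1 7" and the last is "Task 5: 6 12", matching the task and job numbers in the log.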

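Observations

  1. In testing, the default LAUNCHER_SCHED=dynamic kept running and never finished, so it is set aside for now.
  2. Support for OpenMP programs? What about MPI programs? Still to be tested (when time permits).
  3. If a library is reported missing, add the directory containing it to LD_LIBRARY_PATH.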
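For the missing-library case, the sketch below shows one way to extend LD_LIBRARY_PATH. Both add_lib_dir and /path/to/missing/lib are placeholders; the real directory depends on which library is reported missing.

```shell
#!/bin/bash
# Prepend a directory to LD_LIBRARY_PATH without leaving a trailing ':'
# when the variable was previously empty or unset.
# add_lib_dir and /path/to/missing/lib are placeholders.
add_lib_dir() {
    local dir=$1
    export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

add_lib_dir /path/to/missing/lib
echo "$LD_LIBRARY_PATH"
```

Running ldd on the program and looking for "not found" entries shows which libraries still cannot be resolved. The export would presumably go into sub.sh so that it also applies on the compute nodes.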