Cluster System Usage Tips
Share any tips that you have discovered on using the Cluster System. You can edit this article after logging in.
Log of the information sharing mailing list for research users
The log of the information sharing mailing list for research users is available at: http://lists.imc.tut.ac.jp/pipermail/research-users/
Measuring resource usage (e.g. memory usage)
Resources (e.g. memory) used by a process can be measured with the GNU time command (/usr/bin/time). The -v option displays detailed resource usage for the process, as follows (see the time command man page: http://linuxjm.osdn.jp/html/LDP_man-pages/man1/time.1.html).
-bash-4.1$ /usr/bin/time --version
GNU time 1.7
-bash-4.1$ /usr/bin/time -v whoami
my016
        Command being timed: "whoami"
        User time (seconds): 0.00
        System time (seconds): 0.00
        Percent of CPU this job got: 0%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3088
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 243
        Voluntary context switches: 3
        Involuntary context switches: 1
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
Maximum resident set size shows the memory usage. Note that GNU time version 1.7 has a known bug that reports this figure as 4 times the actual memory usage, so the actual usage is one quarter of the displayed value (see "Massive bug in GNU time" (Japanese): http://qiita.com/guicho271828/items/2ad3df13e915ecbb9cac).
The GNU time command can also be used on a calculation node. Note, however, that its results are written to standard error.
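Because the report goes to standard error, it can be saved with a 2> redirection and post-processed. The following is a minimal sketch, assuming GNU time 1.7 and the factor-of-4 bug described above; the function name corrected_max_rss_kb and the file name time.log are made up for illustration.

```shell
#!/bin/bash
# Extract "Maximum resident set size" from saved `/usr/bin/time -v` output
# and divide it by 4.
# Assumption: the log was produced by GNU time 1.7, which over-reports
# this value by a factor of 4. Do not apply this to fixed versions.
corrected_max_rss_kb() {
    local log_file=$1
    awk -F': ' '/Maximum resident set size/ { print int($2 / 4) }' "$log_file"
}

# Example usage (time.log is an arbitrary file name):
#   /usr/bin/time -v whoami 2> time.log
#   corrected_max_rss_kb time.log
```

For the whoami example above, the reported 3088 kbytes corresponds to 772 kbytes after correction.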
Batch Deletion of Submitted Jobs
Regular users are not permitted to use the -all option of the qdel command, and qdel -t does not work correctly. To delete multiple submitted jobs at once, use a script such as the following.
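#!/bin/bash

usage="Usage: bash $0 [ 'all' | firstId lastId | lastId ]"

if [ $# -ne 1 ] && [ $# -ne 2 ]; then
    echo "$usage"
    exit 1
fi

# Is the first argument a numerical value?
if expr "$1" : '[0-9][0-9]*$' > /dev/null ; then
    # One number: IDs 1..lastId. Two numbers: IDs firstId..lastId.
    ids=$(seq "$1" $2)
elif [ "$1" = "all" ]; then
    # Job IDs of all submitted jobs, skipping the two qstat header lines.
    ids=$(qstat | cut -d . -f 1 | tail -n +3)
else
    # Anything else is rejected rather than treated as 'all'.
    echo "$usage"
    exit 1
fi

echo qdel $ids
qdel $ids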
Specifying all as the first argument deletes all submitted jobs, regardless of whether they are running. It is also possible to delete a range of jobs by giving the first job ID as the first argument and the last job ID as the second argument. If the script is saved as qdel.sh, run it as follows:
-bash-4.1$ bash qdel.sh all
-bash-4.1$ bash qdel.sh 100 200
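To preview which job IDs would be collected for all without deleting anything, the extraction pipeline can be run on its own. This is a sketch, assuming qstat prints two header lines followed by one line per job whose first field has the form ID.server, which is the layout the script relies on; the function name list_job_ids is hypothetical.

```shell
#!/bin/bash
# Read qstat-style output on standard input and print one job ID per line.
# Assumption: the first two lines are headers and each job line starts
# with "<ID>.<server>".
list_job_ids() {
    cut -d . -f 1 | tail -n +3
}

# Example usage: show the IDs without calling qdel.
#   qstat | list_job_ids
```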
Script to Submit Specified Command as a Job
The -v option of the qsub command passes arbitrary environment variables to the job script. The following script uses this feature to run an arbitrary command as a job on a calculation node. In this script, standard error is merged into standard output (-j oe), and the date and hostname commands are included to record the execution time and the node on which the job ran.
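The script body is not reproduced in this section. A minimal sketch consistent with the description, and not the original script, might look like the following; it assumes a Torque/PBS-style qsub that honors #PBS directives and exports JOB_CMD through -v. The function name run_job_cmd is made up for illustration.

```shell
#!/bin/bash
#PBS -j oe

# Run the command held in JOB_CMD, logging the node and the start/end times.
# Assumption: JOB_CMD is supplied via "qsub -v JOB_CMD=...".
run_job_cmd() {
    echo "node:  $(hostname)"
    echo "start: $(date)"
    # eval lets JOB_CMD contain pipes and redirections.
    eval "${JOB_CMD:-}"
    echo "end:   $(date)"
}

run_job_cmd
```

Using eval is a deliberate choice here so that a JOB_CMD containing a pipe is interpreted by the shell rather than passed as literal arguments.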
If the script is saved as qsub.sh, submit a job by setting JOB_CMD to the command you want to run, as follows:
-bash-4.1$ qsub -v JOB_CMD="/usr/bin/time -v perl i_love_cats.pl" qsub.sh
-bash-4.1$ qsub -v JOB_CMD="perl catching_cats.pl | perl counting_cats.pl" qsub.sh
List of All Users' Jobs in the Queue
A list of all jobs in the queue, including those of other users, can be displayed with the /usr/local/maui/bin/showq command.
Displaying Job Status Repeatedly
The watch command is convenient for monitoring the status of submitted jobs. The examples below re-run qstat every 5 seconds, so the latest status is always displayed. The -d option highlights the differences from the previous run. Note that an alias cannot be used as the command to be repeated.
-bash-4.1$ watch -n 5 qstat -a
-bash-4.1$ watch -n 5 qstat -Q
-bash-4.1$ watch -n 5 -d qstat -Q
Limitation by ulimit -t (the CPU time)
The development node has a CPU time limit set by ulimit. This limit appears to be applied to a value different from the one shown by the time command (a hidden value). If a process is terminated for no apparent reason, this limit may be the cause. For example, synchronizing a large volume of data with rsync produces an error such as the following.
-bash-4.1$ rsync --progress -avh /tmp/source /destination/
sending incremental file list
...
rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
rsync: connection unexpectedly closed (96 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
The CPU time consumed by rsync appears to be determined mostly by the volume of data transferred, regardless of the --bwlimit setting. Because remote synchronization involves transport encryption by ssh, it tends to consume about twice as much CPU time as local synchronization.
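The current limit can be checked with ulimit -t, and the effect of hitting it can be reproduced safely in a subshell. The following is a sketch, assuming bash on Linux; the 1-second limit and the function name demo_cpu_limit are chosen only for demonstration and are unrelated to the value configured on the development node.

```shell
#!/bin/bash
# Show the CPU time limit (in seconds, or "unlimited") of the current shell.
ulimit -t

# Burn CPU in a subshell under a 1-second soft limit and report how it ended.
# The limits are set inside the subshell, so the calling shell is unaffected.
demo_cpu_limit() {
    ( ulimit -c 0; ulimit -S -t 1; while :; do :; done ) 2>/dev/null
    local status=$?
    # An exit status above 128 means the subshell was terminated by a signal.
    echo "exit status: $status"
    [ "$status" -gt 128 ]
}

demo_cpu_limit
```

The busy loop is stopped by the kernel without any error message from the loop itself, which is why such terminations can look unexplained.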
Job Scheduling Order
The Wide-Area Coordinated Cluster System for Education and Research
The job scheduler allocates jobs to the calculation nodes in descending order of host number: wsnd30, wsnd29, ..., wsnd00.