1. CESM Timing, Performance and Load Balancing Data

1.1 PE (Processor or MPI Task) Layouts

Each component is associated with a unique MPI communicator. The driver runs on the union of all processors and controls the sequencing and hardware partitioning. The processor layout of each component is set by three values: the number of MPI tasks, the number of OpenMP threads per task, and the root MPI processor number from the global set. The layout of components on processors has no impact on the science: changing processor layouts does not change intrinsic coupling lags or coupling sequencing.

The timing data below is subject to change as the model and the machine hardware and software change.

Machine   Compset  Resolution  Compiler  mpilib  Total PEs  Cost (pe-hrs/yr)  Throughput (yrs/day)
cheyenne  B1850    f19_g17     Intel     mpt     12636      2094.42           48.27
cheyenne  B1850    f19_g17     Intel     mpt     5472       1410.69           31.24
cheyenne  B1850    f09_g17     Intel     mpt     12960      3543.22           29.26
cheyenne  B1850    f19_g17     Intel     mpt     1368       1233.17           26.62

Component  Total PEs  Root PE  Tasks x Threads
cam        2700       0        900 x 3
cice       1620       360      540 x 3
cism       108        0        36 x 3
clm        1080       0        360 x 3
cpl        2700       0        900 x 3
mosart     1080       0        360 x 3
pop        1440       900      480 x 3
sesp       3          0        1 x 3
ww         72         1380     24 x 3

1.2 Example 1

In one of my CESM runs, the PE layout is as follows. It can be read and modified in env_mach_pes.xml. It cannot be modified after “./cesm_setup” has been invoked without first invoking “cesm_setup -clean”.


NTASKS is the total number of MPI tasks (processes). NTASKS must be greater than or equal to 1, even for inactive (stub) components.
NTHRDS is the number of OpenMP threads per MPI task. Setting NTHRDS to 1 generally means that threading parallelization is off for that component. The total number of hardware processors allocated to a component is NTASKS * NTHRDS.
ROOTPE is the global MPI task associated with the root task of that component. It is an MPI task index, not a hardware processor count.
PSTRID is the stride of MPI tasks across the global set of PEs (for now this is set to 1).
NINST is the number of instances of the component (these are spread evenly across NTASKS).

NTASKS and ROOTPE are relatively independent of NTHRDS, and they determine the layout of MPI tasks between components. NTHRDS is used to determine how those MPI tasks should be placed across the machine.
In this instance, the ocean component would run on 180 hardware processors with 180 MPI tasks using 2 threads per task, starting from global MPI task 468.
Note that if all components have identical NTASKS, NTHRDS, and ROOTPE set, all components will run sequentially on the same hardware processors.
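
The relationship between NTASKS, NTHRDS and ROOTPE can be checked with a few lines of Python. This is only a sketch, not part of CESM: the function names are made up for illustration, PSTRID is assumed to be 1, and the numbers are taken from the component table in section 1.1.

```python
def component_span(ntasks, nthrds, rootpe):
    """Return (hardware processors, first MPI task, last MPI task).

    Assumes PSTRID = 1, so the component occupies a contiguous
    block of global MPI tasks starting at ROOTPE.
    """
    total_pes = ntasks * nthrds          # NTASKS * NTHRDS
    first_task = rootpe
    last_task = rootpe + ntasks - 1
    return total_pes, first_task, last_task


def tasks_overlap(a, b):
    """True if two (ntasks, nthrds, rootpe) layouts share any MPI task."""
    _, a_first, a_last = component_span(*a)
    _, b_first, b_last = component_span(*b)
    return a_first <= b_last and b_first <= a_last


# Layouts from the component table: (NTASKS, NTHRDS, ROOTPE)
cam = (900, 3, 0)
clm = (360, 3, 0)
pop = (480, 3, 900)

print(component_span(*pop))     # hardware processors and MPI task range of pop
print(tasks_overlap(cam, clm))  # cam and clm share MPI tasks
print(tasks_overlap(cam, pop))  # cam and pop use disjoint MPI tasks
```

Components whose task ranges overlap take turns on the same hardware processors, while components on disjoint ranges can run concurrently.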

1.3 Example 2: ROOTPE_OCN

The atmosphere and ocean run concurrently, each on 64 processors, with the atmosphere on MPI tasks 0-15 and the ocean on MPI tasks 16-79. The first 16 tasks are each threaded 4 ways for the atmosphere. The batch submission script ($CASE.run) should automatically request 128 hardware processors; the first 16 MPI tasks will be laid out on the first 64 hardware processors with a stride of 4, and the next 64 MPI tasks will be laid out on the second set of 64 hardware processors.
If ROOTPE_OCN had been set to 64 instead, MPI tasks 16-63 (48 tasks) would be unassigned, so a total of 176 processors (128 + 48) would have been requested. The atmosphere would have been laid out on the first 64 hardware processors in 16x4 fashion, and the ocean model on hardware processors 113-176. Hardware processors 65-112 would have been allocated but completely idle.
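
The processor counts in this example can be reproduced with a simplified model. It is a sketch under stated assumptions, not how CESM itself computes the request: each assigned MPI task is assumed to take NTHRDS hardware processors, each unassigned task below the highest root is assumed to take one idle processor, and `hardware_request` is a hypothetical helper name. Under this model, the 176-processor total with hardware processors 65-112 idle corresponds to the ocean root sitting at global MPI task 64.

```python
def hardware_request(components):
    """Return (total hardware processors, idle processors).

    components: list of (ntasks, nthrds, rootpe) tuples.
    Simplified model: tasks are placed in global MPI task order,
    and a task that no component uses still holds one processor.
    """
    last = max(root + ntasks for ntasks, _, root in components)
    threads = [0] * last                 # 0 marks an unassigned task
    for ntasks, nthrds, root in components:
        for task in range(root, root + ntasks):
            threads[task] = max(threads[task], nthrds)
    idle = sum(1 for t in threads if t == 0)
    total = sum(t if t else 1 for t in threads)
    return total, idle


atm = (16, 4, 0)                         # 16 tasks x 4 threads, root 0

# Ocean directly after the atmosphere (MPI tasks 16-79)
print(hardware_request([atm, (64, 1, 16)]))

# Ocean root moved to global MPI task 64, leaving tasks 16-63 unused
print(hardware_request([atm, (64, 1, 64)]))
```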

A very important example of model timing performance under different PE layout settings is given in:
BASICS: How do I change processor counts and component layouts on processors?

Estimating Cheyenne core-hours

The computational cost of CESM experiments typically is expressed in core-hours (also known as processor element-hours or “pe-hrs”) per simulated year. The core-hours used for a job (a benchmark run) are calculated by multiplying the number of processor cores used by the wall-clock duration in hours. Cheyenne core-hour calculations should assume that all jobs will run in the regular queue and that they are charged for use of all 36 cores on each node (Exclusive nodes).

Jobs that run in the exclusive-use queues are charged for use of all of the cores on each node by this formula:
wall-clock hours × nodes used × cores per node × queue factor
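
As a quick illustration of the formula, the sketch below computes the charge for a job. The function names are made up, the default of 36 cores per node comes from the text above, and the queue factor default of 1.0 is only a placeholder assumption; the real factor for each queue has to be looked up.

```python
def hhmm_to_hours(hhmm):
    """Convert a wall-clock string such as '02:43' to hours."""
    hours, minutes = hhmm.split(":")
    return int(hours) + int(minutes) / 60.0


def core_hours(wall_clock_hours, nodes, cores_per_node=36, queue_factor=1.0):
    """wall-clock hours x nodes used x cores per node x queue factor."""
    return wall_clock_hours * nodes * cores_per_node * queue_factor


# Illustrative job: 36 nodes for 2 hours 43 minutes, placeholder queue factor
charge = core_hours(hhmm_to_hours("02:43"), nodes=36)
print(round(charge, 1))
```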

Important: how to check the model cost?
Go to ${case}/timing and check the cesm_timing file of a job.

Overall Metrics:
   Model Cost:             703.17   pe-hrs/simulated_year
   Model Throughput:        44.23   simulated_years/day
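
The two numbers can also be pulled out of the timing file with a small script. This is a minimal sketch that assumes the "Overall Metrics" lines look exactly like the excerpt above; `parse_overall_metrics` is a hypothetical helper, not a CESM tool.

```python
import re


def parse_overall_metrics(text):
    """Return (cost in pe-hrs/simulated_year, throughput in simulated_years/day)."""
    cost = re.search(r"Model Cost:\s+([0-9.]+)\s+pe-hrs/simulated_year", text)
    thru = re.search(r"Model Throughput:\s+([0-9.]+)\s+simulated_years/day", text)
    if cost is None or thru is None:
        raise ValueError("Overall Metrics lines not found")
    return float(cost.group(1)), float(thru.group(1))


sample = """
Overall Metrics:
   Model Cost:             703.17   pe-hrs/simulated_year
   Model Throughput:        44.23   simulated_years/day
"""

cost, throughput = parse_overall_metrics(sample)
years = 10                                # length of a planned run, chosen for illustration
print(cost * years)                       # pe-hrs needed for the run
print(24.0 * years / throughput)          # wall-clock hours needed for the run
```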

Check the number of nodes and cores

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3545063.chadmin yttp     economy  restT04     59078  36 129    --  08:00 R 02:43

Job ID: The identification number assigned to the job (3545063.chadmin in the listing above).
Queue: The type of queue the job is running in.
SessID: The session ID (if the job is running).
NDS: The number of nodes requested by the job.
TSK: The number of CPUs (cores) or tasks requested by the job.
Req'd Memory: The amount of memory requested by the job.
S: The job's current state:
E – Job is exiting.
H – Job is held. This may be the result of a manual hold or of a job dependency.
Q – Job is queued.
R – Job is running.
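
A job row like the one above can be split into named fields. This is a sketch only: the field names are my own labels for the eleven columns in the listing, and it assumes no column contains spaces.

```python
FIELDS = ["job_id", "username", "queue", "jobname", "sess_id",
          "nds", "tsk", "reqd_memory", "reqd_time", "state", "elap_time"]

# State codes as described above
STATES = {"E": "exiting", "H": "held", "Q": "queued", "R": "running"}


def parse_qstat_row(row):
    """Split one whitespace-separated job row into a dict."""
    values = row.split()
    if len(values) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(values)))
    job = dict(zip(FIELDS, values))
    job["nds"] = int(job["nds"])          # nodes requested
    job["tsk"] = int(job["tsk"])          # tasks requested
    job["state_text"] = STATES.get(job["state"], "unknown")
    return job


row = "3545063.chadmin yttp     economy  restT04     59078  36 129    --  08:00 R 02:43"
job = parse_qstat_row(row)
print(job["nds"], job["tsk"], job["state_text"])
```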


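Under normal conditions, once the system is powered on, the PC (program counter) keeps incrementing and the CPU executes instructions serially, one after another. There are two cases in which the value of the PC changes abruptly: (1) The program jumps on its own initiative, for example through a goto (jump) instruction in assembly or the longjmp function in C. In essence, the program modifies the PC itself and jumps directly to the instruction to be executed, rather than to the adjacent "next instruction". (2) An interrupt occurs, for example when the level on one of the CPU's pins goes high to signal that something has changed, and the system must respond to it. Besides driving the relevant CPU pin high, the sender of the interrupt must also supply an interrupt vector number that tells the CPU what type of interrupt has occurred. The CPU uses the interrupt vector number to find the corresponding interrupt handler and handle the interrupt.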

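Multithreading

Suppose there is only one CPU running two threads. The two threads do not actually run in parallel; at the microscopic level inside the CPU, they are still processed serially. The CPU uses interrupts to suspend one thread, switches to the other thread, runs it for a while, and then switches back, so that at the macroscopic level the two threads appear to run at the same time. Thread switching always costs some performance, so we can add more CPUs and obtain true thread parallelism at the microscopic level.
In practice, a CPU spends time waiting. For example, if four CPUs access one hard disk, they have to wait for the disk to respond because of the concurrent access. If a CPU has only one thread, the CPU sits idle while that thread waits. This is why multithreading exists: in most programs that involve synchronous waiting, having more threads than physical CPUs makes fuller use of the CPUs and improves performance. However, increasing the number of threads without limit adds thread-switching overhead, so the optimal number of threads for a service program has to be evaluated case by case.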
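
The effect of overlapping waits can be seen in a small Python experiment. This is only an illustrative sketch: `time.sleep` stands in for waiting on a disk, and the task count and wait time are arbitrary assumptions.

```python
import threading
import time


def fake_io(seconds):
    """Stand-in for a blocking disk request: the thread just waits."""
    time.sleep(seconds)


def run_sequential(n, seconds):
    """Issue n waits one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        fake_io(seconds)
    return time.perf_counter() - start


def run_threaded(n, seconds):
    """Issue n waits from n threads and return the elapsed time."""
    threads = [threading.Thread(target=fake_io, args=(seconds,)) for _ in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


sequential = run_sequential(4, 0.1)
threaded = run_threaded(4, 0.1)
print("sequential: %.2f s, threaded: %.2f s" % (sequential, threaded))
```

Because the threads spend their time waiting rather than computing, the waits overlap and the threaded version finishes sooner. For compute-bound work the gain would depend on the number of physical CPUs instead.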

