

Clarifying where to preprocess #81

Closed

Conversation

rkube
Contributor

@rkube rkube commented Apr 25, 2022

Preprocessing results in too much compute load on the Traverse head node.

@buildbot-princeton
Collaborator

Can one of the admins verify this patch?

@rkube
Contributor Author

rkube commented Apr 25, 2022

To preprocess the dataset on Traverse, I need to limit the number of threads used for preprocessing:
#82
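A minimal sketch of the kind of cap being discussed. The names (`preprocess_shot`, `shot_list`, `MAX_CPUS`) are illustrative placeholders, not from the FRNN codebase: the point is simply to bound the worker count rather than spawn one worker per logical CPU.

```python
# Illustrative sketch (names are hypothetical, not from FRNN): cap the
# number of preprocessing workers instead of using every logical CPU.
import os
from concurrent.futures import ThreadPoolExecutor

MAX_CPUS = 32  # stay well below the 128 logical CPUs on a Traverse node

def preprocess_shot(shot_id):
    # stand-in for the real per-shot preprocessing work
    return shot_id * 2

def preprocess_all(shot_list, max_cpus=MAX_CPUS):
    # never request more workers than the machine actually has
    n_workers = min(max_cpus, os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(preprocess_shot, shot_list))

print(preprocess_all(range(4)))  # -> [0, 2, 4, 6]
```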

@felker
Member

felker commented Apr 25, 2022

There are 44 cores on a node of Traverse, right? Any reason why we can only spawn 32 threads?

@felker
Member

felker commented Apr 25, 2022

Also I am in favor of not changing the default conf.yaml to make it specific to Princeton-based systems. So:

fs_path: '/Users/'
...
max_cpus: -1

(/Users/ isn't an ideal default, but it is generic enough. Maybe it should be set to $HOME; would need to check the parsing logic.)
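One way the parsing logic could handle those generic defaults, sketched here as an assumption (these helper names do not exist in the codebase): treat `max_cpus: -1` as "use all available cores", and expand `$HOME`/`~` in `fs_path` so nothing Princeton-specific needs to be hard-coded.

```python
# Hypothetical parsing helpers for the generic conf.yaml defaults
# discussed above -- not the actual FRNN parsing code.
import os

def resolve_max_cpus(max_cpus):
    # -1 (or any negative value) means "use every available core"
    n_avail = os.cpu_count() or 1
    if max_cpus is None or max_cpus < 0:
        return n_avail
    return min(max_cpus, n_avail)

def resolve_fs_path(fs_path):
    # expand $HOME and ~ so the default need not name a real system path
    return os.path.expandvars(os.path.expanduser(fs_path))

print(resolve_max_cpus(-1))   # all cores on this machine
print(resolve_fs_path("~/"))  # the user's home directory
```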

@rkube
Contributor Author

rkube commented Apr 25, 2022

Each Traverse node has 2 processors, 16 cores per processor, and 4 threads per core. When I run preprocessing with 126 threads it starts off well but throws errors after a while. Maybe it is running into memory limits?

@felker
Member

felker commented Apr 25, 2022

Ah, I had assumed that the CPU model was the same as on Summit. What do you get when you run lscpu and cat /proc/cpuinfo on a Traverse compute node (just curious)?

But this problem is likely because of the 4-way SMT, which wasn't present on the Tiger cluster for which the code was originally written.

@rkube
Contributor Author

rkube commented Apr 25, 2022

Summit and Traverse are very similar, but not 100% identical.

(frnn) [rkube@traverse examples]$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        6
Model:               2.3 (pvr 004e 1203)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node252 CPU(s): 
NUMA node253 CPU(s): 
NUMA node254 CPU(s): 
NUMA node255 CPU(s): 
(frnn) [rkube@traverse examples]$ cat /proc/cpuinfo 
processor       : 0
cpu             : POWER9, altivec supported
clock           : 3683.000000MHz
revision        : 2.3 (pvr 004e 1203)

processor       : 1
cpu             : POWER9, altivec supported
clock           : 3683.000000MHz
revision        : 2.3 (pvr 004e 1203)

processor       : 2
cpu             : POWER9, altivec supported
clock           : 3683.000000MHz
revision        : 2.3 (pvr 004e 1203)
...
processor       : 127
cpu             : POWER9, altivec supported
clock           : 3533.000000MHz
revision        : 2.3 (pvr 004e 1203)

timebase        : 512000000
platform        : PowerNV
model           : 8335-GTH
machine         : PowerNV 8335-GTH
firmware        : OPAL
MMU             : Radix
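The lscpu figures above reconcile the two thread counts in this thread: 2 sockets × 16 cores/socket gives 32 physical cores, and SMT4 multiplies that to 128 logical CPUs. A short sanity check:

```python
# Sanity check on the lscpu output above: physical cores vs. SMT threads.
sockets, cores_per_socket, threads_per_core = 2, 16, 4

physical_cores = sockets * cores_per_socket        # 32 real cores
logical_cpus = physical_cores * threads_per_core   # 128 hardware threads

print(physical_cores, logical_cpus)  # -> 32 128
```

So a 32-thread cap corresponds to one worker per physical core, while anything approaching 128 relies on SMT and shares core resources (and memory) four ways.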
