Releases · aws/aws-parallelcluster

19 May 08:33

ddeidda

v2.7.0

c3bec08

AWS ParallelCluster v2.7.0

We're excited to announce the release of AWS ParallelCluster 2.7.0.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

ENHANCEMENTS

sqswatcher: The daemon is now compatible with VPC Endpoints so that SQS messages can be passed without traversing the public internet.

CHANGES

Upgrade NICE DCV to version 2020.0-8428.
Upgrade Intel MPI to version U7.
Upgrade NVIDIA driver to version 440.64.00.
Upgrade EFA installer to version 1.8.4:
- Kernel module: efa-1.5.1 (no change)
- RDMA core: rdma-core-25.0 (no change)
- Libfabric: libfabric-aws-1.9.0amzn1.1 (no change)
- Open MPI: openmpi40-aws-4.0.3 (updated from openmpi40-aws-4.0.2)
Upgrade CentOS 7 AMI to version 7.8
Configuration: the base_os and scheduler parameters are now mandatory and no longer have default values.
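
Because base_os and scheduler no longer have defaults, a config file that omits either one needs to be updated before upgrading. The following is a minimal pre-flight sketch, not part of the pcluster CLI. It assumes the INI-style config layout with a "[cluster default]" section; the section name, the sample values, and the helper missing_required are illustrative assumptions.

```python
import configparser

# Keys that are mandatory as of 2.7.0, per the change note above.
REQUIRED = ("base_os", "scheduler")

# Hypothetical sample config; section name and values are assumptions.
SAMPLE_CONFIG = """
[cluster default]
key_name = my-key
base_os = alinux2
scheduler = slurm
"""


def missing_required(config_text, section="cluster default"):
    """Return the required keys that are absent from the cluster section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(section):
        return list(REQUIRED)
    return [key for key in REQUIRED if not parser.has_option(section, key)]


print(missing_required(SAMPLE_CONFIG))
```

An empty list means both parameters are set explicitly; any returned names are the ones to add before running the upgraded CLI.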

BUG FIXES

Fix recipes installation at runtime by adding the bootstrapped file at the end of the last chef run.
Fix installation of FSx Lustre client on CentOS 7
FSx Lustre: Exit with error when failing to retrieve FSx mountpoint
Fix sanity_check behavior when max_queue_size > 1000

Support

Need help / have a feature request?
AWS Support: https://console.aws.amazon.com/support/home
ParallelCluster Issues tracker on GitHub: https://github.com/aws/aws-parallelcluster
The HPC Forum on the AWS Forums page: https://forums.aws.amazon.com/forum.jspa?forumID=192

09 Apr 22:58

tilne

v2.6.1

f80bf48

AWS ParallelCluster v2.6.1

We're excited to announce the release of AWS ParallelCluster 2.6.1.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

ENHANCEMENTS

Improved management of S3 bucket that gets created when awsbatch scheduler is selected.
Add validation for supported OSes when using FSx Lustre.
Change ProctrackType from proctrack/pgid to proctrack/cgroup in Slurm in order to better handle termination of stray processes when running MPI applications. This also includes the creation of a cgroup Slurm configuration in order to enable the cgroup plugin.
Skip execution, at node bootstrap time, of all those install recipes that are already applied at AMI creation time.
Start CloudWatch agent earlier in the node bootstrapping phase so that cookbook execution failures are correctly uploaded and are available for troubleshooting.
Improved the management of SQS messages and retries to speed up recovery times when failures occur.

CHANGES

FSx Lustre: remove x-systemd.requires=lnet.service from mount options in order to rely on default lnet setup provided by Lustre.
Enforce Packer version to be >= 1.4.0 when building an AMI. This is also required for customers using pcluster createami command.
Do not launch a replacement for an unhealthy or unresponsive node until it is terminated. This makes the cluster slower at provisioning new nodes when failures occur but prevents any temporary over-scaling with respect to the expected capacity.
Increase parallelism when starting slurmd on compute nodes that join the cluster from 10 to 30.
Reduce the verbosity of messages logged by the node daemons.
Do not dump logs to /home/logs when nodewatcher encounters a failure and terminates the node. CloudWatch can be used to debug such failures.
Reduce the number of retries for failed REMOVE events in sqswatcher.
Omit cfn-init-cmd and cfn-wire from the files stored in CloudWatch logs.
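
The Packer requirement above is a numeric version comparison, which is easy to get wrong with plain string comparison ("1.10.0" sorts before "1.4.0" as text). The sketch below shows one way to check an installed version string against the 1.4.0 minimum. The helper names are hypothetical, the code does not invoke Packer itself, and it assumes plain dotted numeric versions.

```python
def version_tuple(version):
    """Convert a dotted version string such as '1.4.0' or 'v1.4.0' to a tuple of ints."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def packer_is_supported(installed, minimum="1.4.0"):
    """Return True when the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


for candidate in ("1.3.5", "1.4.0", "1.10.0"):
    print(candidate, packer_is_supported(candidate))
```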

BUG FIXES

Configure proxy during cloud-init boothook in order for the proxy to be configured for all bootstrap actions.
Fix installation of Intel Parallel Studio XE Runtime that requires yum4 since version 2019.5.
Fix compilation of Torque scheduler on Ubuntu 18.04.
Fixed a bug in the ordering and retrying of SQS messages that was causing, under certain circumstances of heavy load, the scheduler configuration to be left in an inconsistent state.
Delete from queue the REMOVE events that are discarded due to hostname collision with another event fetched as part of the same sqswatcher iteration.


26 Feb 20:46

lukeseawalker

v2.6.0

a2b72c7

AWS ParallelCluster v2.6.0

We're excited to announce the release of AWS ParallelCluster 2.6.0.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

Enhancements

Add support for Amazon Linux 2
Add support for NICE DCV on Ubuntu 18.04
Add support for FSx Lustre on Ubuntu 18.04 and Ubuntu 16.04
New CloudWatch logging capability to collect cluster and job scheduler logs to CloudWatch for cluster monitoring and inspection
- Add --keep-logs flag to pcluster delete command to preserve logs at cluster deletion
Install and setup Amazon Time Sync on all OSs
Enable accounting plugin in Slurm for all OSes. Note: accounting is neither enabled nor configured by default
Add retry on throttling from CloudFormation API, happening when several compute nodes are being bootstrapped concurrently
Display detailed substack failures when pcluster create fails due to a substack error
Create additional EFS mount target in the AZ of compute subnet, if needed
Add validator for FSx Lustre Weekly Maintenance Start Time parameter
Add validator to the KMS key provided for EBS, FSx, and EFS
Add validator for S3 external resource
Support two new FSx Lustre features, Scratch 2 and Persistent filesystems
- Add two new parameters deployment_type and per_unit_storage_throughput to the fsx section
- Add new storage sizes for storage_capacity: 1,200 GiB, 2,400 GiB, and multiples of 2,400 GiB are supported with SCRATCH_2
- In transit encryption is available via fsx_kms_key_id parameter when deployment_type = PERSISTENT_1
- New parameter per_unit_storage_throughput is available when deployment_type = PERSISTENT_1
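
The SCRATCH_2 storage sizes listed above (1,200 GiB, 2,400 GiB, and multiples of 2,400 GiB) can be expressed as a small check. This is a sketch of that rule only, under the assumption that capacities are given as whole GiB; the function name is hypothetical and the validator shipped with ParallelCluster may differ.

```python
def is_valid_scratch2_capacity(gib):
    """Return True if gib is 1200 or a positive multiple of 2400 (SCRATCH_2 rule above)."""
    if gib == 1200:
        return True
    return gib > 0 and gib % 2400 == 0


for size in (1200, 2400, 3600, 4800):
    print(size, is_valid_scratch2_capacity(size))
```

Note that 3,600 GiB is rejected under this rule because it is neither 1,200 nor a multiple of 2,400.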

Changes

Upgrade Slurm to version 19.05.5
Upgrade Intel MPI to version U6
Upgrade EFA installer to version 1.8.3:
- Kernel module: efa-1.5.1 (updated from efa-1.4.1)
- RDMA core: rdma-core-25.0 (distributed only) (no change)
- Libfabric: libfabric-aws-1.9.0amzn1.1 (updated from libfabric-aws-1.8.1amzn1.3)
- Open MPI: openmpi40-aws-4.0.2 (no change)
Install Python 2.7.17 on CentOS 6 and set it as default through pyenv
Install Ganglia from repository on Amazon Linux, Amazon Linux 2, CentOS 6 and CentOS 7
Disable StrictHostKeyChecking for SSH client when target host is inside cluster VPC for all OSs except CentOS 6
Pin Intel Python 2 and Intel Python 3 to version 2019.4
Automatically disable ptrace protection on Ubuntu 18.04 and Ubuntu 16.04 compute nodes when EFA is enabled. This is required in order to use local memory for interprocess communications in the Libfabric provider, as mentioned here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-ptrace
Packer version >= 1.4.0 is required for AMI creation
Use PyYAML version 5.2 for Python 3 versions 3.4 or earlier.

Bug Fixes

Fix issue with slurmd daemon not being restarted correctly when a compute node is rebooted
Fix errors that prevented Torque from locating jobs by setting server_name to the FQDN on the master node
Fix Torque issue that was limiting the max number of running jobs to the max size of the cluster
Fix OS validation depending on the configured scheduler


13 Dec 16:35

demartinofra

v2.5.1

3f4e2c3

AWS ParallelCluster v2.5.1

We're excited to announce the release of AWS ParallelCluster 2.5.1.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

Enhancements

Add --show-url flag to pcluster dcv connect command in order to generate a one-time URL that can be used to start a DCV session. This unblocks the usage of DCV when the browser cannot be launched automatically.

Changes

Upgrade CUDA library to version 10.2.
Using a Placement Group is no longer required, but is still highly recommended, when enabling EFA.
Increase default root volume size in Centos 6 AMI to 25GB.
Increase the retention of CloudWatch logs produced when generating AWS Batch Docker images from 1 to 14 days.
Increase the total time allowed to build Docker images from 20 minutes to 30 minutes. This is done to better deal with slow networking in China regions.
Upgrade EFA installer to version 1.7.1:
- Kernel module: efa-1.4.1
- RDMA core: rdma-core-25.0
- Libfabric: libfabric-aws-1.8.1amzn1.3
- Open MPI: openmpi40-aws-4.0.2

Bug Fixes

Fix installation of NVIDIA drivers on Ubuntu 18.
Fix installation of CUDA toolkit on Centos 6.
Fix invalid default value for spot_price.
Fix issue that was preventing the cluster from being created in VPCs configured with multiple CIDR blocks.
Correctly handle failures when retrieving ASG in pcluster instances command.
Fix the default mount dir when a single EBS volume is specified through a dedicated ebs configuration section.
Correctly handle failures when there is an invalid parameter in the aws config section.
Fix a bug in pcluster delete that was causing the cli to exit with error when the cluster is successfully deleted.
Exit with status code 1 if pcluster create fails to create a stack.
Better handle the case of multiple or no network interfaces on FSX filesystems.
Fix pcluster configure to retain default values from old config file.
Fix bug in sqswatcher that was causing the daemon to fail when more than 100 DynamoDB tables are present in the cluster region.
Fix installation of Munge on Amazon Linux, Centos 6, Centos 7 and Ubuntu 16.


15 Nov 22:37

rexcsn

v2.5.0

c4eab44

AWS ParallelCluster v2.5.0

We're excited to announce the release of AWS ParallelCluster 2.5.0.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

Enhancements

Add support for new OS: Ubuntu 18.04
Add support for AWS Batch scheduler in China partition and in eu-north-1.
Revamped pcluster configure command which now supports automated networking configuration.
Add support for NICE DCV on Centos 7 to setup a graphical remote desktop session on the Master node.
Add support for new EFA supported instances: c5n.metal, m5dn.24xlarge, m5n.24xlarge, r5dn.24xlarge, r5n.24xlarge
Add support for scheduling with GPU options in Slurm. Currently supports the following GPU-related options: -G/--gpus, --gpus-per-task, --gpus-per-node, --gres=gpu, --cpus-per-gpu.
Integrated GPU requirements into scaling logic, cluster will scale automatically to satisfy GPU/CPU requirements for pending jobs. When submitting GPU jobs, CPU/node/task information is not required but preferred in order to avoid ambiguity. If only GPU requirements are specified, cluster will scale up to the minimum number of nodes required to satisfy all GPU requirements.
Add new cluster configuration option to automatically disable Hyperthreading (disable_hyperthreading = true)
Install Intel Parallel Studio 2019.5 Runtime in Centos 7 when enable_intel_hpc_platform = true and share /opt/intel over NFS
Additional EC2 IAM Policies can now be added to the role ParallelCluster automatically creates for cluster nodes by simply specifying additional_iam_policies in the cluster config.
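
The GPU-aware scaling note above says that when only GPU requirements are specified, the cluster scales up to the minimum number of nodes that satisfies them. The arithmetic can be sketched as a ceiling division. This is an illustration, not the scaling daemon's code: it assumes homogeneous compute nodes with a fixed GPU count, assumes a job's GPUs may be spread across nodes, and uses a hypothetical function name.

```python
import math


def min_nodes_for_gpus(pending_gpu_requests, gpus_per_node):
    """Lower bound on nodes needed to cover the GPUs requested by pending jobs."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    total_gpus = sum(pending_gpu_requests)
    return math.ceil(total_gpus / gpus_per_node)


# Three pending jobs asking for 3, 3 and 2 GPUs on nodes with 4 GPUs each.
print(min_nodes_for_gpus([3, 3, 2], gpus_per_node=4))
```

Supplying CPU, node, or task information alongside the GPU options, as the note recommends, removes the ambiguity that this simple bound cannot resolve.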

Changes

Ubuntu 14.04 is no longer supported
Upgrade Intel MPI to version U5.
Upgrade EFA Installer to version 1.7.0; this also upgrades Open MPI to 4.0.2.
Upgrade NVIDIA driver to Tesla version 418.87.
Upgrade CUDA library to version 10.1.
Upgrade Slurm to version 19.05.3-2.
Install EFA in China AMIs.
Increase default EBS volume size from 17GB to 25GB
FSx Lustre now supports new storage_capacity options 1,200 and 2,400 GiB
Enable flock user_xattr noatime Lustre mount options by default everywhere and x-systemd.automount x-systemd.requires=lnet.service for systemd based systems.
Increase the number of hosts that can be processed by scaling daemons in a single batch from 50 to 200. This improves the scaling time especially with increased ASG launch rates.
Change default sshd config in order to disable X11 forwarding and update the list of supported ciphers.
Increase faulty node termination timeout from 1 minute to 5 in order to give some additional time to the scheduler to recover when under heavy load.
Extended pcluster createami command to specify the VPC and network settings when building the AMI.
Support inline comments in config file
Support Python 3.8 in pcluster CLI.
Deprecate Python 2.6 support
Add ClusterName tag to EC2 instances.
Search for new available version only at pcluster create action.
Enable sanity_check by default.

Bug Fixes

Fix sanity check for custom ec2 role. Fixes #1241.
Fix bug when using same subnet for both master and compute.
Fix bug so that Ganglia URLs are shown when Ganglia is enabled. Fixes #1322.
Fix bug with awsbatch scheduler that prevented Multi-node jobs from running.
Fix jobwatcher behaviour that was marking nodes locked by the nodewatcher as busy even if they had been removed already from the ASG Desired count. This was causing, in rare circumstances, a cluster overscaling.
Fix bug that was causing failures in sqswatcher when ADD and REMOVE event for the same host are fetched together.
Fix bug that was preventing nodes from mounting partitioned EBS volumes.
Implement paginated calls in pcluster list.
Fix bug when creating awsbatch cluster with name longer than 31 chars
Fix a bug that led to ssh not working after ssh'ing into a compute node by IP address.


29 Jul 10:42

demartinofra

v2.4.1

8f5359f

AWS ParallelCluster v2.4.1

We're excited to announce the release of AWS ParallelCluster 2.4.1.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

Docs

New docs are available here: https://docs.aws.amazon.com/parallelcluster/latest/ug/

Enhancements

Add support for ap-east-1 region (Hong Kong)
Add possibility to specify instance type to use when building custom AMIs with pcluster createami
Speed up cluster creation by having compute nodes starting together with master node
Enable ASG CloudWatch metrics for the ASG managing compute nodes
Install Intel MPI 2019u4 on Amazon Linux, Centos 7 and Ubuntu 1604
Upgrade Elastic Fabric Adapter (EFA) to version 1.4.1 that supports Intel MPI
Run all node daemons and cookbook recipes in isolated Python virtualenvs. This allows our code to always run with the required Python dependencies and solves all conflicts and runtime failures that were being caused by user packages installed in the system Python
Torque:
- Process nodes added to or removed from the cluster in batches in order to speed up cluster scaling
- Scale up only if required CPU/nodes can be satisfied
- Scale down if pending jobs have unsatisfiable CPU/nodes requirements
- Add support for jobs in hold/suspended state (this includes job dependencies)
- Automatically terminate and replace faulty or unresponsive compute nodes
- Add retries in case of failures when adding or removing nodes
- Add support for ncpus reservation and multi nodes resource allocation (e.g. -l nodes=2:ppn=3+3:ppn=6)
- Optimized Torque global configuration to react faster to dynamic cluster scaling
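
To make the multi-node resource request from the Torque list above concrete, the sketch below breaks a "-l nodes=" value such as 2:ppn=3+3:ppn=6 into total nodes and total CPUs. It is an illustrative parser, not the code used by the ParallelCluster daemons. It assumes each "+"-separated chunk starts with a numeric node count (hostnames are not handled) and that ppn defaults to 1 when omitted; the function name is hypothetical.

```python
def parse_nodes_spec(spec):
    """Parse a Torque nodes spec like '2:ppn=3+3:ppn=6' into (total_nodes, total_cpus)."""
    total_nodes = 0
    total_cpus = 0
    for chunk in spec.split("+"):
        parts = chunk.split(":")
        count = int(parts[0])  # assumption: a numeric node count, not a hostname
        ppn = 1  # processors per node when ppn is not given
        for prop in parts[1:]:
            if prop.startswith("ppn="):
                ppn = int(prop[len("ppn="):])
        total_nodes += count
        total_cpus += count * ppn
    return total_nodes, total_cpus


print(parse_nodes_spec("2:ppn=3+3:ppn=6"))  # (5, 24)
```

For the example from the release note, two nodes with 3 processors each plus three nodes with 6 processors each add up to 5 nodes and 24 CPUs.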

Changes

Update EFA installer to a new version. Note that this changes the location of mpicc and mpirun. To avoid breaking existing code, we recommend using the modulefile (module load openmpi) and running which mpicc for anything that requires the full path
Eliminate Launch Configuration and use Launch Templates in all the regions
Torque: upgrade to version 6.1.2
Run all ParallelCluster daemons with Python 3.6 in a virtualenv. Daemons code now supports Python >= 3.5

Bug Fixes

Fix issue with sanity check at creation time that was preventing clusters from being created in private subnets
Fix pcluster configure when relative config path is used
Make FSx Substack depend on ComputeSecurityGroupIngress to keep FSx from trying to create prior to the SG allowing traffic within itself
Restore correct value for filehandle_limit that was getting reset when setting memory_limit for EFA
Torque: fix compute nodes locking mechanism to prevent job scheduling on nodes being terminated
Restore logic that was automatically adding compute nodes identity to SSH known_hosts file
Slurm: fix issue that was causing the ParallelCluster daemons to fail when the cluster is stopped and an empty compute nodes file is imported in Slurm config


11 Jun 15:26

lukeseawalker

v2.4.0

1c53ad5

AWS ParallelCluster v2.4.0

We're excited to announce the release of AWS ParallelCluster 2.4.0.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

Docs

New docs are available here: https://docs.aws.amazon.com/parallelcluster/latest/ug/

Enhancements

Add support for EFA on Centos 7, Amazon Linux and Ubuntu 1604
Add support for Ubuntu in China region cn-northwest-1
SGE:
- process nodes added to or removed from the cluster in batches in order to speed up cluster scaling.
- scale up only if required slots/nodes can be satisfied
- scale down if pending jobs have unsatisfiable CPU/nodes requirements
- add support for jobs in hold/suspended state (this includes job dependencies)
- automatically terminate and replace faulty or unresponsive compute nodes
- add retries in case of failures when adding or removing nodes
- configure scheduler to handle rescheduling and cancellation of jobs running on failing or terminated nodes
Slurm:
- scale up only if required slots/nodes can be satisfied
- scale down if pending jobs have unsatisfiable CPU/nodes requirements
- automatically terminate and replace faulty or unresponsive compute nodes
- decrease SlurmdTimeout to 120 seconds to speed up replacement of faulty nodes
Automatically replace compute instances that fail initialization and dump logs to shared home directory.
Dynamically fetch compute instance type and cluster size in order to support updates in scaling daemons
Always use full master FQDN when mounting NFS on compute nodes. This solves some issues occurring with some networking setups and custom DNS configurations
List the version and status during pcluster list
Remove double quoting of the post_install args
awsbsub: use override option to set the number of nodes rather than creating multiple JobDefinitions
Add support for AWS_PCLUSTER_CONFIG_FILE env variable to specify pcluster config file
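
The AWS_PCLUSTER_CONFIG_FILE variable gives a way to select a config file without passing a path on every invocation. The sketch below shows one plausible resolution order. The precedence (explicit path, then environment variable, then a default), the default location ~/.parallelcluster/config, and the function name are assumptions made for illustration, not taken from the CLI source.

```python
import os

# Assumed default location of the pcluster config file.
DEFAULT_CONFIG = os.path.expanduser("~/.parallelcluster/config")


def resolve_config_path(cli_path=None, environ=None):
    """Pick the config file: explicit path, then AWS_PCLUSTER_CONFIG_FILE, then default."""
    environ = os.environ if environ is None else environ
    if cli_path:
        return cli_path
    return environ.get("AWS_PCLUSTER_CONFIG_FILE") or DEFAULT_CONFIG


print(resolve_config_path(environ={"AWS_PCLUSTER_CONFIG_FILE": "/tmp/test-cluster.ini"}))
```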

Changes

Update Open MPI library to version 3.1.4 on Centos 7, Amazon Linux and Ubuntu 1604. This also changes the default openmpi path to /opt/amazon/efa/bin/ and the openmpi module name to openmpi/3.1.4
Set soft and hard ulimit on open files to 10000 for all supported OSs
For a better security posture, we're removing AWS credentials from the parallelcluster config file. Credentials can now be set up following the canonical procedure used for the aws cli
When using FSx or EFS do not enforce in sanity check that the compute security group is open to 0.0.0.0/0
When updating an existing cluster, the same template version is now used, no matter the pcluster cli version
SQS messages that fail to be processed in sqswatcher are now re-queued only 3 times and not forever
Reset nodewatcher idletime to 0 when the host becomes essential for the cluster (because of min size of ASG or because there are pending jobs in the scheduler queue)
SGE: a node is considered as busy when in one of the following states "u", "C", "s", "d", "D", "E", "P", "o". This allows a quick replacement of the node without waiting for the nodewatcher to terminate it.
Do not update DynamoDB table on cluster updates in order to avoid hitting strict API limits (1 update per day).
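
The SGE busy-state rule in the list above can be read as a membership test on the node's state flags. The sketch below encodes the eight states named in the note. It assumes the state is reported as a string of single-letter flags (for example "au") and that one matching flag is enough to treat the node as busy; the function name is hypothetical and the nodewatcher's actual check may differ.

```python
# States listed in the change note: "u", "C", "s", "d", "D", "E", "P", "o".
SGE_BUSY_STATES = frozenset("uCsdDEPo")


def is_node_busy(state):
    """Return True if any flag in the SGE state string is one of the busy states."""
    return any(flag in SGE_BUSY_STATES for flag in state)


for state in ("", "a", "au", "E"):
    print(repr(state), is_node_busy(state))
```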

Bug Fixes

Fix issue that was preventing Torque from being used on Centos 7
Start node daemons at the end of instance initialization. The time spent for post-install script and node initialization is not counted as part of node idletime anymore.
Fix issue which was causing an additional and invalid EBS mount point to be added in case of multiple EBS
Install Slurm libpmpi/libpmpi2 that is distributed in a separate package since Slurm 17
pcluster ssh command now works for clusters with use_public_ips = false
Slurm: add "BeginTime", "NodeDown", "Priority" and "ReqNodeNotAvail" to the pending reasons that trigger a cluster scaling
Add a timeout on remote commands execution so that the daemons are not stuck if the compute node is unresponsive
Fix an edge case that was causing the nodewatcher to hang forever in case the node had become essential to the cluster during a call to self_terminate.
Fix pcluster start/stop commands when used with an awsbatch cluster


03 Apr 08:54

enrico-usai

v2.3.1

47b8751

AWS ParallelCluster 2.3.1

We're excited to announce the release of AWS ParallelCluster 2.3.1.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

Enhancements

Add support for FSx Lustre with Amazon Linux. With a custom AMI, the kernel must be >= 4.14.104-78.84.amzn1.x86_64
Slurm
- set compute nodes to DRAIN state before removing them from cluster. This prevents the scheduler from submitting a job to a node that is being terminated.
- dynamically adjust max cluster size based on ASG settings
- dynamically change the number of configured FUTURE nodes based on the actual nodes that join the cluster. The max size of the cluster seen by the scheduler always matches the max capacity of the ASG.
- process nodes added to or removed from the cluster in batches. This speeds up cluster scaling which is able to react with a delay of less than 1 minute to variations in the ASG capacity.
- add support for job dependencies and pending reasons. The cluster won't scale up if the job cannot start due to an unsatisfied dependency.
- set ReturnToService=1 in scheduler config in order to recover instances that were initially marked as down due to a transient issue.
Validate FSx parameters. Fixes #896.

Changes

Slurm - Upgrade version to 18.08.6.2
NVIDIA - update drivers to version 418.56
CUDA - update toolkit to version 10.0
Increase default EBS volume size from 15GB to 17GB
Disabled updates to FSx file systems, because updating most parameters would cause the file system, and all its data, to be deleted

Bug Fixes

Cookbook wasn't fetched when custom_ami parameter specified in the config
Cfn-init is now fetched from us-east-1. This bug affected non-alinux custom AMIs in regions other than us-east-1.
Account limit check not done for SPOT or AWS Batch Clusters
Account limit check falls back to master subnet. Fixes #910.
Boto3 upperbound removed


28 Feb 11:41

demartinofra

v2.2.1

9e76b10

AWS ParallelCluster 2.2.1

We're excited to announce the release of AWS ParallelCluster 2.2.1.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

Features

Support for FSx Lustre with Centos 7
Check AWS EC2 account limits before starting cluster creation
Allow users to force job deletion with SGE scheduler

Changes

Set default value to compute for placement_group option
pcluster ssh: use private IP when the public one is not available
pcluster ssh: now works also when stack is not completed as long as the master IP is available

Bugfixes

awsbsub: fix file upload with absolute path
pcluster ssh: fix issue that was preventing the command from working correctly when stack status is UPDATE_ROLLBACK_COMPLETE
Fix block device conversion to correctly attach EBS nvme volumes
Wait for Torque scheduler initialization before completing master node setup
pcluster version: now works also when no ParallelCluster config is present
Improve nodewatcher daemon logic to detect if a SGE compute node has running jobs


08 Jan 14:04

sean-smith

v2.1.1

1e0fbc6

AWS ParallelCluster 2.1.1

We're excited to announce the release of AWS ParallelCluster 2.1.1.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

Features

Support for AWS Beijing Region (cn-north-1) and Ningxia Region (cn-northwest-1)

Bugfixes

No longer schedule jobs on compute nodes that are terminating

