Add a vcluster-icon aimed at running mch_icon-ch1
## Context
Adds a vcluster-icon geared towards running a validation of mch_icon-ch1. This vCluster defines the hardware and services needed to run the GPU version of ICON.
Here is a summary of the infrastructure deployed by this vCluster definition:

- 2 Compute instances
  - Machine type: `a2-ultragpu-4g`
- 1 login instance
  - Machine type: `n2-standard-2`
- Shared storage
  - 1 TiB
- vServices
  - Slurm
  - enroot
  - pyxis
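To sanity-check that the deployed infrastructure matches this summary, you can list the provisioned instances; a minimal sketch, assuming `gcloud` is authenticated against the hosting project and that instance names contain the vcluster name (the naming scheme is explained in the test steps below):

```bash
# List the vcluster's instances together with their machine types.
# "german-icon" is an example name; adjust to your workspace and vcluster.
VCLUSTER_NAME=german-icon
gcloud compute instances list \
  --filter="name~${VCLUSTER_NAME}" \
  --format="table(name, machineType.basename(), status)"
```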
## Impact
Validation that we have portable vcluster technology and can run ICON on GPUs. This is the final goal of SwissTwins WP4.
## Test(s)
- Deploy the shared services:

  ```bash
  cd <path-to-sc-shared-services>
  terraform apply
  ```
- Follow the instructions in the `README.md` of vs-slurm-simple to upload all necessary packages and images to the artifact repository.
- Deploy a suitable vCluster. Make sure the compute instances are of the correct type for the version of ICON you plan to run (`n2-standard-128` for CPU, `a2-ultragpu-4g` for GPU). The vclusters repository contains the `icon` vcluster definition, which is prepared to run the GPU version of ICON. You can deploy this definition like this:

  ```bash
  terraform apply -var 'vclusters=["icon"]'
  ```

  The name of the deployed vcluster will follow the format `<terraform_workspace>-<vcluster>`, e.g. if I deploy the `icon` vcluster in the workspace `german`, the vcluster name will be `german-icon`. Take note of this name, as you will need it for the following steps.

  NOTE: If you are deploying a GPU vcluster, the machine will reboot shortly after it is created to finalize the installation of the GPU drivers.
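  If you are unsure which Terraform workspace you are currently in, a quick way to derive the expected name (a sketch, assuming you run it from the directory where you ran `terraform apply`):

  ```bash
  # Prints the expected vcluster name, e.g. "german-icon" when the current workspace is "german".
  echo "$(terraform workspace show)-icon"
  ```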
- Add a new user to LDAP using the `adduser.sh` script, which can be found in the `vc-shared-services` repo. You will use this user to queue the Slurm job. This user must be added to a group with the same name as the vCluster, e.g.:

  ```bash
  VCLUSTER_NAME=german-icon
  SLURM_USER=germanslurm
  SSH_PUBLIC_KEY=$HOME/.ssh/id_ed25519.pub
  scripts/ldap/adduser.sh -u "$SLURM_USER" \
    -g "$VCLUSTER_NAME" \
    -f test \
    -l test \
    -m "$SLURM_USER"@epfl.ch \
    -s "$SSH_PUBLIC_KEY"
  ```
- Log into one of your compute nodes using your GCP credentials:

  ```bash
  gcloud compute ssh <node_name> --zone=<node_zone>
  ```
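  If you do not know the node names and zones, you can look them up first; a minimal sketch, assuming the instance names contain the vcluster name:

  ```bash
  # List the instances of the vcluster together with the zone needed for --zone=.
  VCLUSTER_NAME=german-icon
  gcloud compute instances list \
    --filter="name~${VCLUSTER_NAME}" \
    --format="table(name, zone.basename(), status)"
  ```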
- Create a new Slurm account with the name of your vCluster. Add your previously created LDAP user to that account:

  ```bash
  SLURM_USER=germanslurm
  VCLUSTER_NAME=german-icon
  sudo sacctmgr -i create account account="$VCLUSTER_NAME"; \
  sudo sacctmgr -i create user name="$SLURM_USER" account="$VCLUSTER_NAME"
  ```
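  To confirm that the LDAP user resolves on the node and that the Slurm association was created, something like this should work:

  ```bash
  # The user should resolve via LDAP and belong to the vcluster group,
  # and it should now be associated with the vcluster account in Slurm.
  id "$SLURM_USER"
  sudo sacctmgr show associations where user="$SLURM_USER" format=Account,User
  ```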
- Create local and shared folders for the use of your LDAP user:

  ```bash
  SLURM_USER=germanslurm
  BASE_DIR=/custom-mount-point-directory/"$SLURM_USER"
  sudo mkdir -p /home/"$SLURM_USER"; \
  sudo mkdir "$BASE_DIR"; \
  sudo chown "$SLURM_USER" /home/"$SLURM_USER"; \
  sudo chown "$SLURM_USER" "$BASE_DIR"
  ```
- Log in as your LDAP user:

  ```bash
  sudo -u $SLURM_USER bash
  ```
- Install `gsutil` to copy the input data from `gs://vc-demo-data` to the shared storage:

  ```bash
  curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
  tar -xf google-cloud-cli-linux-x86_64.tar.gz
  ./google-cloud-sdk/install.sh

  export BASE_DIR=/custom-mount-point-directory/"$SLURM_USER"
  gsutil -m cp -r gs://vc-demo-data "$BASE_DIR"/testing-input-data/
  mv "$BASE_DIR"/testing-input-data/vc-demo-data/c2sm/ "$BASE_DIR"/testing-input-data/
  ```
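  After the copy and move, the `c2sm` input data should sit directly under the testing-input-data directory:

  ```bash
  # Quick check that the input data is in place.
  ls "$BASE_DIR"/testing-input-data/c2sm/ | head
  ```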
- Finally, clone this repository into the base directory:

  ```bash
  cd $BASE_DIR
  git clone git@gitlab.epfl.ch:swisstwins/icon.git
  ```
- Build ICON:

  ```bash
  cd icon
  ./build.sh -c gcp_a2 -t gpu -d $BASE_DIR -m gs://vc-software-stack
  ```

  Where `-c gcp_a2` loads the spack and slurm definitions for A2 machines in GCP, and `-t gpu` ensures spack installs NVIDIA-compatible packages. If the cluster has access to the `gs://vc-software-stack` bucket, you can use the `-m` option to use it as a build cache.
- Upon a successful build, the `build.sh` script will tell you how to run a test ICON job. This usually boils down to:

  ```bash
  sbatch "build/icon-model/run/exp.mch_icon-ch1.run"
  ```
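  Once submitted, you can follow the job with standard Slurm tooling, for example:

  ```bash
  # Check the queue while the job runs, then its accounting record afterwards.
  squeue -u "$SLURM_USER"
  sacct -u "$SLURM_USER" --format=JobID,JobName%30,State,Elapsed
  ```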
These instructions are taken from the SwissTwins icon repo; see the GCP Setup section of its README.