Add a vcluster-icon aimed at running mch_icon-ch1
## Context
Adds a vcluster-icon geared towards running a validation of mch_icon-ch1. This vCluster defines the hardware and services needed to run the GPU version of ICON.
Here is a summary of the infrastructure deployed by this vCluster definition:

- 2 Compute instances
  - Machine type: `a2-ultragpu-4g`
- 1 login instance
  - Machine type: `n2-standard-2`
- Shared storage
  - 1 TiB
- vServices
  - Slurm
  - enroot
  - pyxis
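To sanity-check that the deployed infrastructure matches this summary, you can list the provisioned instances; a minimal sketch, assuming `gcloud` is authenticated against the hosting project and that instance names contain the vcluster name (the naming scheme is explained in the test steps below):

```bash
# List the vcluster's instances together with their machine types.
# "german-icon" is an example name; adjust to your workspace and vcluster.
VCLUSTER_NAME=german-icon
gcloud compute instances list \
  --filter="name~${VCLUSTER_NAME}" \
  --format="table(name, machineType.basename(), status)"
```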
## Impact
Validation that we have portable vcluster technology and can run ICON on GPUs. This is the final goal of SwissTwins WP4.
## Test(s)
- Deploy the shared services:

  ```bash
  cd <path-to-sc-shared-services>
  terraform apply
  ```
- Follow the instructions in the `README.md` of vs-slurm-simple to upload all necessary packages and images to the artifact repository.
- Deploy a suitable vCluster. Make sure the compute instances are of the correct type for the version of ICON you plan to run (`n2-standard-128` for CPU, `a2-ultragpu-4g` for GPU). The vclusters repository contains the `icon` vcluster definition, which is prepared to run the GPU version of ICON. You can deploy this definition like this:

  ```bash
  terraform apply -var 'vclusters=["icon"]'
  ```

  The name of the deployed vcluster will follow the format `<terraform_workspace>-<vcluster>`, e.g. if I deploy the `icon` vcluster in the workspace `german`, the vcluster name will be `german-icon`. Take note of this name, as you will need it for the following steps.

  NOTE: If you are deploying a GPU vcluster, the machine will reboot shortly after it is created to finalize the installation of the GPU drivers.
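  If you are unsure which Terraform workspace you are currently in, a quick way to derive the expected name (a sketch, assuming you run it from the directory where you ran `terraform apply`):

  ```bash
  # Prints the expected vcluster name, e.g. "german-icon" when the current workspace is "german".
  echo "$(terraform workspace show)-icon"
  ```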
- Add a new user to LDAP using the `adduser.sh` script, which can be found in the `vc-shared-services` repo. You will use this user to queue the Slurm job. This user must be added to a group with the same name as the vCluster, e.g.:

  ```bash
  VCLUSTER_NAME=german-icon
  SLURM_USER=germanslurm
  SSH_PUBLIC_KEY=$HOME/.ssh/id_ed25519.pub
  scripts/ldap/adduser.sh -u "$SLURM_USER" \
    -g "$VCLUSTER_NAME" \
    -f test \
    -l test \
    -m "$SLURM_USER"@epfl.ch \
    -s "$SSH_PUBLIC_KEY"
  ```
- Log into one of your compute nodes using your GCP credentials:

  ```bash
  gcloud compute ssh <node_name> --zone=<node_zone>
  ```
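  If you do not know the node names and zones, you can look them up first; a minimal sketch, assuming the instance names contain the vcluster name:

  ```bash
  # List the instances of the vcluster together with the zone needed for --zone=.
  VCLUSTER_NAME=german-icon
  gcloud compute instances list \
    --filter="name~${VCLUSTER_NAME}" \
    --format="table(name, zone.basename(), status)"
  ```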
- Create a new Slurm account with the name of your vCluster. Add your previously created LDAP user to that account:

  ```bash
  SLURM_USER=germanslurm
  VCLUSTER_NAME=german-icon
  sudo sacctmgr -i create account account="$VCLUSTER_NAME"; \
  sudo sacctmgr -i create user name="$SLURM_USER" account="$VCLUSTER_NAME"
  ```
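  To confirm that the LDAP user resolves on the node and that the Slurm association was created, something like this should work:

  ```bash
  # The user should resolve via LDAP and belong to the vcluster group,
  # and it should now be associated with the vcluster account in Slurm.
  id "$SLURM_USER"
  sudo sacctmgr show associations where user="$SLURM_USER" format=Account,User
  ```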
- Create local and shared folders for the use of your LDAP user:

  ```bash
  SLURM_USER=germanslurm
  BASE_DIR=/custom-mount-point-directory/"$SLURM_USER"
  sudo mkdir -p /home/"$SLURM_USER"; \
  sudo mkdir "$BASE_DIR"; \
  sudo chown "$SLURM_USER" /home/"$SLURM_USER"; \
  sudo chown "$SLURM_USER" "$BASE_DIR"
  ```
- Log in as your LDAP user:

  ```bash
  sudo -u $SLURM_USER bash
  ```
- Install `gsutil` to copy the input data from `gs://vc-demo-data` to the shared storage:

  ```bash
  curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
  tar -xf google-cloud-cli-linux-x86_64.tar.gz
  ./google-cloud-sdk/install.sh

  export BASE_DIR=/custom-mount-point-directory/"$SLURM_USER"
  gsutil -m cp -r gs://vc-demo-data "$BASE_DIR"/testing-input-data/
  mv "$BASE_DIR"/testing-input-data/vc-demo-data/c2sm/ "$BASE_DIR"/testing-input-data/
  ```
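  After the copy and move, the `c2sm` input data should sit directly under the testing-input-data directory:

  ```bash
  # Quick check that the input data is in place.
  ls "$BASE_DIR"/testing-input-data/c2sm/ | head
  ```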
- Finally, clone this repository into the base directory:

  ```bash
  cd $BASE_DIR
  git clone git@gitlab.epfl.ch:swisstwins/icon.git
  ```
- Build ICON:

  ```bash
  cd icon
  ./build.sh -c gcp_a2 -t gpu -d $BASE_DIR -m gs://vc-software-stack
  ```

  Where `-c gcp_a2` loads the spack and slurm definitions for A2 machines in GCP, and `-t gpu` ensures spack installs NVIDIA-compatible packages. If the cluster has access to the `gs://vc-software-stack` bucket, you can use the `-m` option to use it as a build cache.
- Upon a successful build, the `build.sh` script will tell you how to run a test ICON job. This usually boils down to:

  ```bash
  sbatch "build/icon-model/run/exp.mch_icon-ch1.run"
  ```
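  Once submitted, you can follow the job with standard Slurm tooling, for example:

  ```bash
  # Check the queue while the job runs, then its accounting record afterwards.
  squeue -u "$SLURM_USER"
  sacct -u "$SLURM_USER" --format=JobID,JobName%30,State,Elapsed
  ```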
These instructions are taken from the SwissTwins icon repo; see the GCP Setup section of its README.