
Add a vcluster-icon aimed at running mch_icon-ch1

Context

Adds a vCluster definition, icon, geared towards running a validation of mch_icon-ch1. This vCluster defines the hardware and services needed to run the GPU version of ICON.

Here is a summary of the infrastructure deployed by this vCluster definition:

  • 2 compute instances
    • Machine type: a2-ultragpu-4g
  • 1 login instance
    • Machine type: n2-standard-2
  • Shared storage
    • 1 TiB
  • vServices
    • Slurm
    • enroot
    • pyxis
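
If you want to confirm what was actually deployed, and assuming you have the gcloud CLI configured for the target project, a minimal sketch for listing the instances and their machine types could look like this (the name filter is only an example):

    # List the compute and login instances of the deployed vCluster
    gcloud compute instances list \
      --filter="name~'icon'" \
      --format="table(name, machineType, status)"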

Impact

Validation that we have portable vcluster technology and can run ICON on GPUs. This is the final goal of SwissTwins WP4.

Test(s)

  1. Deploy the shared services.

    cd <path-to-sc-shared-services>
    terraform apply
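
    If you prefer to review the changes before applying, the standard Terraform workflow of initializing and planning first works here as well (nothing beyond a working Terraform setup is assumed):

    terraform init
    terraform plan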
  2. Follow the instructions in the README.md of vs-slurm-simple to upload all necessary packages and images to the artifact repository.

  3. Deploy a suitable vCluster. Make sure the compute instances are of the correct type for the version of ICON you plan to run (n2-standard-128 for CPU, a2-ultragpu-4g for GPU). The vclusters repository contains the icon vCluster definition, which is prepared to run the GPU version of ICON. You can deploy this definition like this:

    terraform apply -var 'vclusters=["icon"]'

    The name of the deployed vCluster will follow the format <terraform_workspace>-<vcluster>. For example, if you deploy the icon vCluster in the workspace german, the vCluster name will be german-icon. Take note of this name, as you will need it for the following steps.

    NOTE: If you are deploying a GPU vCluster, the machines will reboot shortly after they are created to finalize the installation of the GPU drivers.
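
    If you have not created a Terraform workspace yet, a minimal sketch of setting one up before deploying (the workspace name german is only the example used above) is:

    # The workspace name becomes the prefix of the vCluster name
    terraform workspace new german    # or: terraform workspace select german
    terraform apply -var 'vclusters=["icon"]'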

  4. Add a new user to LDAP using the adduser.sh script, which can be found in the vc-shared-services repo. You will use this user to submit the Slurm job. This user must be added to a group with the same name as the vCluster. For example:

    VCLUSTER_NAME=german-icon
    SLURM_USER=germanslurm
    SSH_PUBLIC_KEY=$HOME/.ssh/id_ed25519.pub
    scripts/ldap/adduser.sh -u "$SLURM_USER" \
      -g "$VCLUSTER_NAME" \
      -f test \
      -l test \
      -m "$SLURM_USER"@epfl.ch \
      -s "$SSH_PUBLIC_KEY"
  5. Log into one of your compute nodes using your GCP credentials:

    gcloud compute ssh <node_name> --zone=<node_zone>
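
    Once on the node, a couple of quick checks can save time later. Assuming the node resolves users through the vCluster's LDAP and, on GPU instances, that the post-reboot driver installation has finished, something like this should work (user and group names are the examples from step 4):

    # Confirm the LDAP user and group are visible on the node
    getent passwd germanslurm
    getent group german-icon
    # On GPU instances, confirm the drivers finished installing
    nvidia-smi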
  6. Create a new slurm account with the name of your vCluster. Add your previously created LDAP user to that account:

    SLURM_USER=germanslurm
    VCLUSTER_NAME=german-icon
    sudo sacctmgr -i create account account="$VCLUSTER_NAME"; \
    sudo sacctmgr -i create user name="$SLURM_USER" account="$VCLUSTER_NAME"
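
    To double-check that the association was created, a hedged example using a standard sacctmgr query (same variables as above) is:

    # Show the user/account association Slurm will use for the job
    sudo sacctmgr show associations where user="$SLURM_USER" account="$VCLUSTER_NAME"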
  7. Create local and shared folders for your LDAP user:

    SLURM_USER=germanslurm
    BASE_DIR=/custom-mount-point-directory/"$SLURM_USER"
    
    sudo mkdir -p /home/"$SLURM_USER"; \
    sudo mkdir "$BASE_DIR"; \
    sudo chown "$SLURM_USER" /home/"$SLURM_USER"; \
    sudo chown "$SLURM_USER" "$BASE_DIR"
  8. Log in as your LDAP user:

    sudo -u "$SLURM_USER" bash
  9. Install gsutil to copy the input data from gs://vc-demo-data to the shared storage:

    curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
    tar -xf google-cloud-cli-linux-x86_64.tar.gz
    ./google-cloud-sdk/install.sh
    export BASE_DIR=/custom-mount-point-directory/"$SLURM_USER"
    gsutil -m cp -r gs://vc-demo-data "$BASE_DIR"/testing-input-data/
    mv "$BASE_DIR"/testing-input-data/vc-demo-data/c2sm/ "$BASE_DIR"/testing-input-data/
  10. Clone this repository into the base directory:

    cd $BASE_DIR
    git clone git@gitlab.epfl.ch:swisstwins/icon.git
  11. Build ICON

    cd icon
    ./build.sh -c gcp_a2 -t gpu -d $BASE_DIR -m gs://vc-software-stack

    Where -c gcp_a2 loads the Spack and Slurm definitions for A2 machines in GCP, and -t gpu ensures Spack installs NVIDIA-compatible packages. On a cluster with access to the gs://vc-software-stack bucket, you can use the -m option to use that bucket as a build cache.
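
    Once the build finishes, you can check that the run script submitted in the next step was actually generated; this only assumes the paths already used in these instructions:

    # The generated experiment run script used in the next step
    ls -l build/icon-model/run/exp.mch_icon-ch1.run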

  12. Upon a successful build, the build.sh script will tell you how to run a test ICON job. This usually boils down to:

    sbatch "build/icon-model/run/exp.mch_icon-ch1.run"

These instructions are taken from the SwissTwins icon repo. Check the GCP Setup section of its README.

