Add a vcluster-icon aimed at running mch_icon-ch1

Context

Adds a vcluster-icon geared towards running a validation of mch_icon-ch1. This vcluster defines the hardware and services to run the GPU version of ICON.

Here is a summary of the infrastructure deployed by this vCluster definition:

  • 2 Compute instances
    • Machine type: a2-ultragpu-4g
  • 1 login instance
    • Machine type: n2-standard-2
  • Shared storage
    • 1 TiB
  • vServices
    • Slurm
    • enroot
    • pyxis

Impact

This validates that the vcluster technology is portable and that ICON can run on GPUs, which is the final goal of SwissTwins WP4.

Test(s)

  1. Deploy the shared services.

    cd <path-to-sc-shared-services>
    terraform apply
  2. Follow the instructions in the README.md of vs-slurm-simple to upload all necessary packages and images to the artifact repository.

  3. Deploy the icon vcluster definition, which is prepared to run the GPU version of ICON. You can deploy this definition like this:

    terraform apply -var 'vclusters=["icon"]'

    The name of the deployed vcluster follows the format <terraform_workspace>-<vcluster>. For example, if you deploy the icon vcluster in the workspace german, the vcluster name will be german-icon. Take note of this name, as you will need it in the following steps.

    NOTE: If you are deploying a GPU vcluster, the machine will reboot shortly after it is created to finalize the installation of the GPU drivers.

  4. Add a new user to LDAP using the adduser.sh script, which can be found in the vc-shared-services repo. You will use this user to queue the Slurm job. This user must be added to a group with the same name as the vCluster. For example:

    VCLUSTER_NAME=german-icon
    SLURM_USER=germanslurm
    SSH_PUBLIC_KEY=$HOME/.ssh/id_ed25519.pub
    scripts/ldap/adduser.sh -u "$SLURM_USER" \
      -g "$VCLUSTER_NAME" \
      -f test \
      -l test \
      -m "$SLURM_USER"@epfl.ch \
      -s "$SSH_PUBLIC_KEY"
  5. Log into one of your compute nodes using your GCP credentials:

    gcloud compute ssh <node_name> --zone=<node_zone>
  6. Create a new slurm account with the name of your vCluster. Add your previously created LDAP user to that account:

    SLURM_USER=germanslurm
    VCLUSTER_NAME=german-icon
    sudo sacctmgr -i create account account="$VCLUSTER_NAME"; \
    sudo sacctmgr -i create user name="$SLURM_USER" account="$VCLUSTER_NAME"
  7. Create local and shared folders for the use of your LDAP user:

    SLURM_USER=germanslurm
    BASE_DIR=/custom-mount-point-directory/"$SLURM_USER"
    
    sudo mkdir -p /home/"$SLURM_USER"; \
    sudo mkdir "$BASE_DIR"; \
    sudo chown "$SLURM_USER" /home/"$SLURM_USER"; \
    sudo chown "$SLURM_USER" "$BASE_DIR"
  8. Log in as your LDAP user:

    sudo -u "$SLURM_USER" bash
  9. Install gsutil and use it to copy the input data from gs://vc-demo-data to the shared storage:

    curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
    tar -xf google-cloud-cli-linux-x86_64.tar.gz
    ./google-cloud-sdk/install.sh
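    # sudo may not preserve SLURM_USER in the new shell, so fall back to the current user
    SLURM_USER="${SLURM_USER:-$(whoami)}"
    export BASE_DIR=/custom-mount-point-directory/"$SLURM_USER"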
    gsutil -m cp -r gs://vc-demo-data "$BASE_DIR"/testing-input-data/
    mv "$BASE_DIR"/testing-input-data/vc-demo-data/c2sm/ "$BASE_DIR"/testing-input-data/
  10. Clone the Swisstwins icon repository into your base directory:

    cd "$BASE_DIR"
    git clone git@gitlab.epfl.ch:swisstwins/icon.git
  11. Build ICON

    cd icon
    ./build.sh -c gcp_a2 -t gpu -d "$BASE_DIR" -m gs://vc-software-stack

    Here, -c gcp_a2 loads the spack and slurm definitions for A2 machines in GCP, and -t gpu ensures spack installs NVIDIA-compatible packages. If your cluster has access to the gs://vc-software-stack bucket, you can pass it with the -m option to use it as a build cache.

  12. Upon a successful build, the build.sh script will tell you how to run a test ICON job. This usually boils down to:

    sbatch "build/icon-model/run/exp.mch_icon-ch1.run"

These instructions are taken from the Swisstwins icon repo. Check the GCP Setup section of the README.
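
Several of the steps above repeat the same three values (the vCluster name, the Slurm user, and the base directory). As a convenience, they could be derived once at the top of a session. The snippet below is only a sketch: the function name vcluster_name is hypothetical and not part of any repo mentioned here, and the values german, icon, and germanslurm are just the example values used in the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repos above): builds the vCluster
# name following the <terraform_workspace>-<vcluster> format from step 3.
vcluster_name() {
  local workspace="$1"
  local vcluster="$2"
  printf '%s-%s\n' "$workspace" "$vcluster"
}

# Example values from the steps above; replace them with your own.
VCLUSTER_NAME="$(vcluster_name german icon)"
SLURM_USER=germanslurm
BASE_DIR=/custom-mount-point-directory/"$SLURM_USER"

echo "vCluster: $VCLUSTER_NAME"
echo "Slurm user: $SLURM_USER"
echo "Base directory: $BASE_DIR"
```

If you run this from the directory where the vcluster was deployed, the workspace can presumably be read with terraform workspace show instead of being typed by hand, e.g. vcluster_name "$(terraform workspace show)" icon.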

Edited by German Felipe Giraldo Villa
