Commit 262a1a75 authored by Emmanuel Jaep

debug

parent 534b6685
ansible-role-nvidia-driver @ 8fa78e47
Subproject commit 8fa78e47b2e974e1a6dde961ea3622f47c2647e5
Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# ansible-role-nvidia-driver
An Ansible role to install the NVIDIA driver from the NVIDIA CUDA repositories.
## Requirements
In the process of installing the NVIDIA driver, this role will reboot the nodes where it runs.
Because of this, we strongly recommend that you run `ansible-playbook` from a node other than the GPU nodes where you are installing the driver.
If you attempt to run Ansible on the same node where you are installing the driver, this role will either:
* Refuse to proceed with an error like `Running reboot with local connection would reboot the control node` (if running with the `local` connection)
* Reboot the node you're running on, interrupting the playbook execution! (if running with an `ssh` connection against localhost)
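
If an in-play reboot is not acceptable, the `nvidia_driver_skip_reboot` variable listed under Role variables can be set. The following is a minimal sketch, assuming an inventory group named `gpu_nodes` and a control node that is not a member of it. With the reboot skipped, rebooting the nodes afterwards is left to you.

```yaml
# Sketch only: run this from a control node that is NOT in gpu_nodes.
- hosts: gpu_nodes            # assumed inventory group name
  become: yes
  roles:
    - role: nvidia.nvidia_driver
      vars:
        # Skip the role's reboot steps; reboot the nodes yourself later.
        nvidia_driver_skip_reboot: yes
```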
## Installing
This role can be installed using [Ansible Galaxy](https://galaxy.ansible.com/nvidia/nvidia_driver):
```
$ ansible-galaxy install nvidia.nvidia_driver
```
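
For repeatable installs, the role can also be listed in a requirements file. This is a sketch that assumes a file named `requirements.yml` (the name is a convention, not something this role mandates), installed with `ansible-galaxy install -r requirements.yml`.

```yaml
# requirements.yml (assumed file name)
# Plain list format, which older Ansible releases also accept.
- src: nvidia.nvidia_driver
```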
## Role variables
| Variable | Default value | Description |
| -------- | ------------- | ----------- |
| `nvidia_driver_package_state` | `"present"` | Package state for NVIDIA driver packages |
| `nvidia_driver_package_version` | `""` | Package version to install. Note that this should match the actual version of the deb or RPM package to be installed. |
| `nvidia_driver_persistence_mode_on` | `yes` | Whether to enable persistence mode (boolean) |
| `nvidia_driver_skip_reboot` | `no` | Whether to skip rebooting the node during the install |
| `nvidia_driver_module_file` | `"/etc/modprobe.d/nvidia.conf"` | Filename to use for NVIDIA driver parameters |
| `nvidia_driver_module_params` | `""` | Parameters to pass to the NVIDIA driver |
| `nvidia_driver_branch` | `"510"` | Default driver branch to install |
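
As an illustration of overriding these defaults, the sketch below selects a different driver branch and turns persistence mode off. The group name `gpu_nodes` and the branch value `"470"` are assumptions for the example; the branch must actually be available in the repositories you install from. The value is quoted because the role concatenates it into package names as a string.

```yaml
- hosts: gpu_nodes                          # assumed inventory group name
  become: yes
  roles:
    - role: nvidia.nvidia_driver
      vars:
        nvidia_driver_branch: "470"         # illustrative value, keep it quoted
        nvidia_driver_persistence_mode_on: no
```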
### Red Hat specific variables
| Variable | Default value | Description |
| -------- | ------------- | ----------- |
| `epel_package` | `"https://dl.fedoraproject.org/pub/epel/epel-release-latest-{{ ansible_distribution_major_version }}.noarch.rpm"` | Package to install to enable EPEL |
| `nvidia_driver_rhel_cuda_repo_baseurl` | `"https://developer.download.nvidia.com/compute/cuda/repos/{{ _rhel_repo_dir }}/"` | Base URL to use for CUDA repo |
| `nvidia_driver_rhel_cuda_repo_gpgkey` | `"https://developer.download.nvidia.com/compute/cuda/repos/{{ _rhel_repo_dir }}/D42D0685.pub"` | GPG key for the CUDA repo |
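
Sites that mirror the CUDA repository internally can point the role at the mirror by overriding both URLs. This is a sketch under stated assumptions: the group name `rhel_gpu_nodes` and the host `mirror.example.com` are hypothetical, and the mirror is assumed to keep the same directory layout and key file name as the role's defaults. `_rhel_repo_dir` is an internal role variable, so the values are only resolved when the role's tasks run.

```yaml
# group_vars/rhel_gpu_nodes.yml (hypothetical group and mirror host)
nvidia_driver_rhel_cuda_repo_baseurl: "https://mirror.example.com/cuda/repos/{{ _rhel_repo_dir }}/"
nvidia_driver_rhel_cuda_repo_gpgkey: "https://mirror.example.com/cuda/repos/{{ _rhel_repo_dir }}/D42D0685.pub"
```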
### Ubuntu specific variables
For Ubuntu installs, you can choose between installing from the Canonical repositories and installing from the NVIDIA CUDA repositories.
By default, the Canonical repositories will be used, and the driver installed will be the headless server driver.
| Variable | Default value | Description |
| -------- | ------------- | ----------- |
| `nvidia_driver_ubuntu_install_from_cuda_repo` | `no` | Flag whether to use the CUDA repo |
| `nvidia_driver_ubuntu_cuda_repo_baseurl` | `"https://developer.download.nvidia.com/compute/cuda/repos/{{ _ubuntu_repo_dir }}"` | Base URL to use for CUDA repo |
| `nvidia_driver_ubuntu_cuda_package` | `"cuda-drivers-{{ nvidia_driver_ubuntu_branch }}"` | Package name to install from CUDA repo |
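
To switch an Ubuntu host group from the Canonical packages to the CUDA repository, only the flag needs to change. A minimal sketch, assuming a hypothetical group named `ubuntu_gpu_nodes`; the branch line simply restates the default for clarity.

```yaml
# group_vars/ubuntu_gpu_nodes.yml (hypothetical group name)
nvidia_driver_ubuntu_install_from_cuda_repo: yes
nvidia_driver_branch: "510"   # same as the role default, shown for clarity
```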
## Example playbook
```
- hosts: gpu_nodes
  roles:
    - nvidia.nvidia_driver
```
## Supported distributions
Currently, this role supports the following Linux distributions:
* NVIDIA DGX OS 4
* NVIDIA DGX OS 5
* Ubuntu 18.04 LTS
* Ubuntu 20.04 LTS
* CentOS 7
* Red Hat Enterprise Linux 7
* CentOS 8
* Red Hat Enterprise Linux 8
nvidia_driver_package_state: present
nvidia_driver_package_version: ''
nvidia_driver_persistence_mode_on: yes
nvidia_driver_skip_reboot: no
nvidia_driver_module_file: /etc/modprobe.d/nvidia.conf
nvidia_driver_module_params: ''
nvidia_driver_add_repos: yes
nvidia_driver_branch: "510"
##############################################################################
# RedHat family #
##############################################################################
epel_package: "https://dl.fedoraproject.org/pub/epel/epel-release-latest-{{ ansible_distribution_major_version }}.noarch.rpm"
epel_repo_key: "https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-{{ ansible_distribution_major_version }}"
nvidia_driver_rhel_cuda_repo_baseurl: "https://developer.download.nvidia.com/compute/cuda/repos/{{ _rhel_repo_dir }}/"
nvidia_driver_rhel_cuda_repo_gpgkey: "https://developer.download.nvidia.com/compute/cuda/repos/{{ _rhel_repo_dir }}/D42D0685.pub"
nvidia_driver_rhel_branch: "{{ nvidia_driver_branch }}"
##############################################################################
# Ubuntu #
##############################################################################
# Driver branch to install with Ubuntu
nvidia_driver_ubuntu_branch: "{{ nvidia_driver_branch }}"
# Determine if we should install from CUDA repo instead of Canonical repos
nvidia_driver_ubuntu_install_from_cuda_repo: no
# Installing with Canonical repositories
nvidia_driver_ubuntu_packages:
- "nvidia-headless-{{ nvidia_driver_ubuntu_branch }}-server"
- "nvidia-utils-{{ nvidia_driver_ubuntu_branch }}-server"
- "nvidia-headless-no-dkms-{{ nvidia_driver_ubuntu_branch }}-server"
- "nvidia-kernel-source-{{ nvidia_driver_ubuntu_branch }}-server"
# Installing with CUDA repositories
old_nvidia_driver_ubuntu_cuda_repo_gpgkey_id: "7fa2af80"
nvidia_driver_ubuntu_cuda_repo_baseurl: "https://developer.download.nvidia.com/compute/cuda/repos/{{ _ubuntu_repo_dir }}"
nvidia_driver_ubuntu_cuda_keyring_package: "cuda-keyring_1.0-1_all.deb"
nvidia_driver_ubuntu_cuda_keyring_url: "{{ nvidia_driver_ubuntu_cuda_repo_baseurl }}/{{ nvidia_driver_ubuntu_cuda_keyring_package }}"
nvidia_driver_ubuntu_cuda_package: "cuda-drivers-{{ nvidia_driver_ubuntu_branch }}"
[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-persistenced --user root --persistence-mode --verbose
galaxy_info:
  namespace: nvidia
  role_name: nvidia_driver
  author: Luke Yeager
  company: NVIDIA
  description: Install the NVIDIA driver
  license: 3-Clause BSD
  min_ansible_version: 2.7
  platforms:
    - name: Ubuntu
      versions:
        - 'xenial'
        - 'bionic'
        - 'focal'
    - name: EL
      versions:
        - '7'
        - '8'
  galaxy_tags:
    - 'nvidia'
    - 'cuda'
    - 'driver'
---
- name: Converge
  hosts: all
  tasks:
    - name: "Include ansible-role-nvidia-driver"
      include_role:
        name: "ansible-role-nvidia-driver"
---
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: ubuntu-1804-canonical
    image: geerlingguy/docker-ubuntu1804-ansible
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
    command: /sbin/init
    pre_build_image: true
    privileged: true
    groups:
      - canonical_repo
      - ubuntu
  - name: ubuntu-1804-cuda
    image: geerlingguy/docker-ubuntu1804-ansible
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
    command: /sbin/init
    pre_build_image: true
    privileged: true
    groups:
      - cuda_repo
      - ubuntu
  - name: ubuntu-2004-canonical
    image: geerlingguy/docker-ubuntu2004-ansible
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
    command: /sbin/init
    pre_build_image: true
    privileged: true
    groups:
      - canonical_repo
      - ubuntu
  - name: ubuntu-2004-cuda
    image: geerlingguy/docker-ubuntu2004-ansible
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
    command: /sbin/init
    pre_build_image: true
    privileged: true
    groups:
      - cuda_repo
      - ubuntu
  - name: centos-7
    image: geerlingguy/docker-centos7-ansible
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
    command: /sbin/init
    pre_build_image: true
    privileged: true
  # - name: centos-8
  #   image: geerlingguy/docker-centos8-ansible
  #   volumes:
  #     - /sys/fs/cgroup:/sys/fs/cgroup:ro
  #   command: /sbin/init
  #   pre_build_image: true
  #   privileged: true
provisioner:
  name: ansible
  ansible_args:
    - -vv
  inventory:
    group_vars:
      all:
        nvidia_driver_skip_reboot: true
      canonical_repo:
        nvidia_driver_ubuntu_install_from_cuda_repo: false
      cuda_repo:
        nvidia_driver_ubuntu_install_from_cuda_repo: true
verifier:
  name: ansible
---
- hosts: ubuntu
  become: yes
  tasks:
    - name: update apt cache and install gpg-agent
      apt:
        update_cache: yes
        name: gpg-agent
        state: present
---
# This is an example playbook to execute Ansible tests.
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Example assertion
      assert:
        that: true
---
# We have to do this because the CentOS mirrors don't keep kernel-headers, etc
# for older kernels.
- name: ensure we have kernel-headers installed for the current kernel
  block:
    - name: attempt to install kernel support packages for current version
      yum:
        name:
          - "kernel-headers-{{ ansible_kernel }}"
          - "kernel-tools-{{ ansible_kernel }}"
          - "kernel-tools-libs-{{ ansible_kernel }}"
          - "kernel-devel-{{ ansible_kernel }}"
          - "kernel-debug-devel-{{ ansible_kernel }}"
        state: present
      environment: "{{proxy_env if proxy_env is defined else {}}}"
  rescue:
    - name: update the kernel to latest version so we have a supported version
      yum:
        name:
          - "kernel"
          - "kernel-headers"
          - "kernel-tools"
          - "kernel-tools-libs"
          - "kernel-devel"
          - "kernel-debug-devel"
        state: latest
      environment: "{{proxy_env if proxy_env is defined else {}}}"
    - name: reboot to pick up the new kernel
      reboot:
      when: not nvidia_driver_skip_reboot
- name: add epel repo gpg key
  rpm_key:
    key: "{{ epel_repo_key }}"
    state: present
  when: nvidia_driver_add_repos | bool
- name: add epel repo
  become: true
  yum:
    name:
      - "{{ epel_package }}"
    state: latest
  environment: "{{proxy_env if proxy_env is defined else {}}}"
  when: nvidia_driver_add_repos | bool
- name: install dependencies
  yum:
    name: dkms
    state: present
- name: blacklist nouveau
  kernel_blacklist:
    name: nouveau
    state: present
- name: add repo
  yum_repository:
    name: cuda
    description: NVIDIA CUDA YUM Repo
    baseurl: "{{ nvidia_driver_rhel_cuda_repo_baseurl }}"
    gpgkey: "{{ nvidia_driver_rhel_cuda_repo_gpgkey }}"
  environment: "{{proxy_env if proxy_env is defined else {}}}"
  when: nvidia_driver_add_repos | bool
- name: install driver packages RHEL/CentOS 7 and older
  yum:
    name: "{{ nvidia_driver_package_version | ternary('nvidia-driver-latest-dkms-'+nvidia_driver_package_version, 'nvidia-driver-branch-'+nvidia_driver_rhel_branch) }}"
    state: "{{ nvidia_driver_package_state }}"
    autoremove: "{{ nvidia_driver_package_state == 'absent' }}"
  register: install_driver_rhel7
  environment: "{{proxy_env if proxy_env is defined else {}}}"
  # Compare as integers: as strings, '10' would sort before '8'.
  when: ansible_distribution_major_version | int < 8
- name: install driver packages RHEL/CentOS 8 and newer
  dnf:
    name: "{{ nvidia_driver_package_version | ternary('@nvidia-driver:'+nvidia_driver_package_version, '@nvidia-driver:'+nvidia_driver_rhel_branch+'-dkms') }}"
    state: "{{ nvidia_driver_package_state }}"
    autoremove: "{{ nvidia_driver_package_state == 'absent' }}"
  register: install_driver_rhel8
  environment: "{{proxy_env if proxy_env is defined else {}}}"
  # Compare as integers: as strings, '10' would sort before '7'.
  when: ansible_distribution_major_version | int > 7
# Expose a single install_driver result so later tasks can check install_driver.changed.
- name: Set install_driver.changed var for RHEL 7/8
  debug:
    msg: Driver installed for RHEL
  when: install_driver_rhel7.changed or install_driver_rhel8.changed
  register: install_driver
  changed_when: install_driver_rhel7.changed or install_driver_rhel8.changed
---
- name: remove ppa
  apt_repository:
    repo: ppa:graphics-drivers/ppa
    state: absent
- name: remove old signing key
  apt_key:
    id: "{{ old_nvidia_driver_ubuntu_cuda_repo_gpgkey_id }}"
    state: absent
  environment: "{{proxy_env if proxy_env is defined else {}}}"
  when: nvidia_driver_add_repos | bool
- name: add CUDA keyring
  apt:
    deb: "{{ nvidia_driver_ubuntu_cuda_keyring_url }}"
    state: "present"
  environment: "{{proxy_env if proxy_env is defined else {}}}"
  when: nvidia_driver_add_repos | bool
- name: force an apt update
  apt:
    update_cache: true
  changed_when: false
- name: ensure kmod is installed
  apt:
    name: "kmod"
    state: "present"
- name: blacklist nouveau
  kernel_blacklist:
    name: nouveau
    state: present
- name: install driver packages
  apt:
    name: "{{ nvidia_driver_package_version | ternary(nvidia_driver_ubuntu_cuda_package+'='+nvidia_driver_package_version, nvidia_driver_ubuntu_cuda_package) }}"
    state: "{{ nvidia_driver_package_state }}"
    autoremove: "{{ nvidia_driver_package_state == 'absent' }}"
    purge: "{{ nvidia_driver_package_state == 'absent' }}"
  register: install_driver
  environment: "{{proxy_env if proxy_env is defined else {}}}"
---
- name: remove ppa
  apt_repository:
    repo: ppa:graphics-drivers/ppa
    state: absent
- name: install driver packages
  apt:
    name: "{{ nvidia_driver_package_version | ternary(item+'='+nvidia_driver_package_version, item) }}"
    state: "{{ nvidia_driver_package_state }}"
    autoremove: "{{ nvidia_driver_package_state == 'absent' }}"
    purge: "{{ nvidia_driver_package_state == 'absent' }}"
  with_items: "{{ nvidia_driver_ubuntu_packages }}"
  register: install_driver
  environment: "{{proxy_env if proxy_env is defined else {}}}"
---
- name: unload nouveau
  modprobe:
    name: nouveau
    state: absent
  ignore_errors: true
- name: ubuntu install tasks (canonical repos)
  include_tasks: install-ubuntu.yml
  when: ansible_distribution == 'Ubuntu' and (not nvidia_driver_ubuntu_install_from_cuda_repo)
- name: ubuntu install tasks (CUDA repo)
  include_tasks: install-ubuntu-cuda-repo.yml
  when: ansible_distribution == 'Ubuntu' and nvidia_driver_ubuntu_install_from_cuda_repo
- name: redhat family install tasks
  include_tasks: install-redhat.yml
  when: ansible_os_family == 'RedHat'
- name: create persistenced override dir
  file:
    path: /etc/systemd/system/nvidia-persistenced.service.d/
    state: directory
    recurse: yes
- name: configure persistenced service to turn on persistence mode
  copy:
    src: nvidia-persistenced-override.conf
    dest: /etc/systemd/system/nvidia-persistenced.service.d/override.conf
  when: nvidia_driver_persistence_mode_on
- name: remove persistenced service override
  file:
    path: /etc/systemd/system/nvidia-persistenced.service.d/override.conf
    state: absent
  when: not nvidia_driver_persistence_mode_on
- name: enable persistenced
  systemd:
    name: nvidia-persistenced
    enabled: yes
  when: nvidia_driver_package_state != 'absent'
- name: set module parameters
  template:
    src: nvidia.conf.j2
    dest: "{{ nvidia_driver_module_file }}"
    mode: '0644'
- name: reboot after driver install
  reboot:
  when: install_driver.changed and not nvidia_driver_skip_reboot
Test like this:
```
ansible-playbook --inventory tests/inventory.yml tests/playbook.yml
```
By default, the test inventory targets localhost.
Because the role reboots the node it runs against, you may want to point it at a remote host instead.
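
A remote target could be described with an inventory along these lines. This is a sketch only: the host alias, address, and login user are placeholders that do not come from this repository.

```yaml
# tests/inventory.yml, sketch with a hypothetical remote GPU node
all:
  hosts:
    gpu-node-01:                   # placeholder host alias
      ansible_host: 192.0.2.10     # documentation-range address, replace it
      ansible_user: ubuntu         # assumed login user
```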
---
- hosts: all
  become: yes
  roles:
    - ../..
_ubuntu_repo_dir: "{{ ansible_distribution | lower }}{{ ansible_distribution_version | replace('.', '') }}/{{ ansible_architecture }}"
_rhel_repo_dir: "rhel{{ ansible_distribution_major_version }}/{{ ansible_architecture }}"