Skip to content
Snippets Groups Projects
README.md 4.33 KiB
Newer Older
Mika Senghaas's avatar
Mika Senghaas committed
# DiLoCo-SWARM

This repository contains the full experiment code and results for validating DiLoCo-SWARM, a distributed training method that combines the pipeline and data parallelism of [SWARM](https://arxiv.org/abs/2301.11913) with [DiLoCo](https://arxiv.org/abs/2311.08105)'s reduced-frequency gradient synchronization.

Mika Senghaas's avatar
Mika Senghaas committed
![DiLoCo-SWARM](/presentation/figures/experiment1-1.png)

Mika Senghaas's avatar
Mika Senghaas committed
## 🔗 Shortcuts

- 📄 Read the [report](report.pdf)
Mika Senghaas's avatar
Mika Senghaas committed
- 🗣️ See my presentation [slides](https://docs.google.com/presentation/d/1QVSvN9tnBtOj6ur6uI9QgId63T3R2TkK1fA7dqz8z00/edit?usp=sharing)
- 📊 View experiment results in [W&B](https://wandb.ai/mikasenghaas/diloco-swarm) and this  [notebook](notebooks/3.0-results.ipynb)
- 🧠 Understand DP, PP, DiLoCo, SWARM, and their interplay in this pure PyTorch single-file [training script]() - distilled from this codebase
Mika Senghaas's avatar
Mika Senghaas committed
- 💻 Replicate the experiments following the [setup](#setup) and [replicating experiments](#replicating-experiments) sections below.
Mika Senghaas's avatar
Mika Senghaas committed

Mika Senghaas's avatar
Mika Senghaas committed
This project should run on any machine with GPU support, and PyTorch installed. To be safe, follow my setup, which is described below. Get yourself access to some GPUs via [Prime Compute](https://www.app.primeintellect.com/).

**1. Clone the Repository**

```bash
Mika Senghaas's avatar
Mika Senghaas committed
gh repo clone mikasenghaas/diloco-swarm && cd diloco-swarm
```

**2. Configure Environment**

Mika Senghaas's avatar
Mika Senghaas committed
Configure environment variables for W&B, HF, Git, and SSH. To make this simpler, copy the example files I prepared, and fill in your info.

```bash
cp .env.example .env
cp .gitconfig.example .gitconfig
cp .sshconfig.example .sshconfig
```

**3. Run the Setup Script**

Mika Senghaas's avatar
Mika Senghaas committed
Execute the `setup.sh` script to set up the environment directly from your machine. You will need to know the user, host name, path to the persistent directory, and optionally a port number. All of these will be provided by Prime Compute once you have a running instance. If you have problems, check the [docs](https://docs.primeintellect.ai/quickstart) for more information.
Mika Senghaas's avatar
Mika Senghaas committed
bash scripts/setup.sh <USER> <HOST> <PERSISTENT_DIR> [<PORT>]
Mika Senghaas's avatar
Mika Senghaas committed
The script will transfer the necessary files to the remote server and execute a setup script. It will clone the repository, create and activate a virtual environment, install dependencies, and finally log in to W&B and HF. For details, check the [setup script](scripts/setup.sh) and [remote setup script](scripts/setup_remote.sh).

**4. Connecting to Server**

After the setup script completes, you can connect to the prepared server using:
ssh -p <port> <user>@<host>
```

Once connected, navigate to the project directory and activate the virtual environment:
Mika Senghaas's avatar
Mika Senghaas committed
cd diloco-swarm && conda activate diloco-swarm
You are now ready to start training models! 🚀

Mika Senghaas's avatar
Mika Senghaas committed
**5. Verify Setup**

To verify that the setup was successful, you can try running any of the experiments in the [experiments](experiments) directory, as outlined in this this [section](#replicating-experiments). Otherwise, you can run part of the test suite. This requires `pytest` to be installed.

```bash
cd diloco-swarm && pip install -r requirements.dev.txt
```

Mika Senghaas's avatar
Mika Senghaas committed
Once completed, you can run a simple test that checks whether the setup was successful.
Mika Senghaas's avatar
Mika Senghaas committed

```bash
Mika Senghaas's avatar
Mika Senghaas committed
pytest tests/test_setup.py
Mika Senghaas's avatar
Mika Senghaas committed
```


**6. Cleaning Up (Optional)**

If you need to remove the setup, you can use the cleanup scripts from your local machine or the remote server. From your local machine, run:
mikasenghaas's avatar
mikasenghaas committed
```bash
bash scripts/cleanup.sh "<SSH_STRING>"
```

Or, from within the Prime instance, run:

```bash
bash ~/cleanup_remote.sh
```

Mika Senghaas's avatar
Mika Senghaas committed
These will remove the Miniconda installation, the project directory, and other setup files from the server.

## 🚀 Replicating Experiments

All experiments are bundled in simple bash scripts in the
[experiments](experiments) directory:

1. [Experiment 1](): Main experiment testing DiLoCo-SWARM against Single-GPU and SWARM baselines.
2. [Experiment 2](): Communication frequency ablation testing the impact of reduced-frequency gradient synchronization.
3. [Experiment 3](): Model size ablation testing the impact of model size 

To run any experiment, simply execute the corresponding script. For example, to
replicate the main experiment, execute:

```bash
bash experiments/experiment1.sh
```

*NB: Some of these experiments run for multiple hours, or even days. Make sure to check the output of the experiment script to ensure it is running as expected.*