# Training a Model on Multiple GPUs
# Overview
Two ways to distribute training:
- Model Parallelism
- Data Parallelism
# Model Parallelism
Splitting a model's layers (or parts of layers) across multiple GPUs. The problem is that this setup is usually not efficient because of communication lag between devices: downstream GPUs sit idle while waiting for the outputs of upstream ones, and activations have to be copied between devices at every step.
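A rough sketch (not from the source) of what naive model parallelism looks like in TensorFlow: two layers pinned to two assumed devices, "/GPU:0" and "/GPU:1", so activations must hop between GPUs on every forward pass.

```python
import tensorflow as tf

# Hypothetical two-GPU split: each layer is pinned to a different device,
# so activations must be copied from GPU 0 to GPU 1 on every step.
dense1 = tf.keras.layers.Dense(512, activation="relu")
dense2 = tf.keras.layers.Dense(10)

def forward(x):
    with tf.device("/GPU:0"):
        h = dense1(x)        # layer 1 (and its weights) live on GPU 0
    with tf.device("/GPU:1"):
        return dense2(h)     # h is transferred to GPU 1 before layer 2 runs
```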
# Data Parallelism
Completely mirror all the model parameters across all the GPUs and always apply the exact same parameter updates on every GPU.
The issue is that we need to compute the mean of the gradients from all replicas and then propagate the averaged gradient back to every GPU.
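A minimal illustration of that averaging step, using two made-up per-replica gradients:

```python
import tensorflow as tf

# Two replicas computed gradients on different data shards (made-up values).
grads_gpu0 = [tf.constant([0.2, 0.4]), tf.constant([1.0])]
grads_gpu1 = [tf.constant([0.4, 0.0]), tf.constant([3.0])]

# Average per variable; every GPU then applies these identical gradients.
mean_grads = [(g0 + g1) / 2.0 for g0, g1 in zip(grads_gpu0, grads_gpu1)]
# -> [[0.3, 0.2], [2.0]]
```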
One approach: instead of all GPUs finishing and waiting on each other, each GPU can send its gradients to a parameter server. The parameter server is responsible for computing the mean and then propagating the update back to the replicas.
The parameter server has two strategies:

Synchronous:
- waits until all gradients are available before it computes the average gradient and passes it to the optimizer.
- instead of waiting for 100% of the replicas to come back, you can require only the first n replicas and treat the others as spare replicas.

Asynchronous:
- replicas send their gradients as soon as they finish, with no synchronization.
- there is no synced reduction of gradients, so updates may be applied at different times.
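A toy, non-TensorFlow sketch of the difference (the names `params`, `lr`, and the two step functions are invented for illustration):

```python
lr = 0.1
params = [0.0, 0.0]

def sync_step(all_replica_grads):
    """Synchronous: wait for every replica, average, apply one update."""
    n = len(all_replica_grads)
    mean = [sum(g[i] for g in all_replica_grads) / n for i in range(len(params))]
    for i, g in enumerate(mean):
        params[i] -= lr * g

def async_step(replica_grads):
    """Asynchronous: apply a replica's gradients as soon as they arrive."""
    for i, g in enumerate(replica_grads):
        params[i] -= lr * g
```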
According to a 2016 Google Brain paper, the mean of all the gradients from all the GPUs can be computed efficiently and the result distributed back across all the GPUs. This is what `tf.distribute.MirroredStrategy` does, using an all-reduce operation across the devices:
```bash
export CUDA_VISIBLE_DEVICES=0   # expose only GPU 0 to the process
```

```python
import tensorflow as tf

tf.config.list_physical_devices("GPU")           # list the GPUs TensorFlow sees
distribution = tf.distribute.MirroredStrategy()  # mirror the model on every visible GPU
```
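A hedged end-to-end sketch of how the strategy is typically used with Keras (the model shape and dataset are placeholders, not from the source): variables created inside `distribution.scope()` are mirrored on every GPU, and `model.fit` splits each batch across the replicas.

```python
import tensorflow as tf

distribution = tf.distribute.MirroredStrategy()
print("Replicas in sync:", distribution.num_replicas_in_sync)

with distribution.scope():
    # Variables created here are mirrored on every GPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# model.fit(train_dataset, epochs=10)  # each batch is split across the GPUs
```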