# Large Scale Training

(Figure: HDFS)

# Basic Models

# Tree-Based Models

The trees in a Random Forest can be trained independently of one another, so training parallelizes naturally.
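As a sketch, Spark's `RandomForestClassifier` fits each tree on its own bootstrap sample, so the trees can be built in parallel across the cluster. The `train_data` DataFrame with `features` and `label` columns is assumed to be prepared as in the logistic regression example later in this section.

```python
from pyspark.ml.classification import RandomForestClassifier

# Assumes a DataFrame `train_data` with a "features" vector column and a "label" column.
rf = RandomForestClassifier(numTrees=100, featuresCol="features", labelCol="label")
rf_model = rf.fit(train_data)   # the 100 trees are trained independently of each other
```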

XGBoost builds its trees iteratively (each new tree corrects the errors of the current ensemble), so the trees themselves cannot be built in parallel. What can be parallelized is deciding which feature to split on: the candidate splits for a node can be evaluated in parallel.
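For intuition, a minimal (hypothetical) example with the xgboost Python package: the boosting rounds run one after another, while `n_jobs` parallelizes the split search inside each round. The training arrays here are placeholders just to make the snippet runnable.

```python
import numpy as np
import xgboost as xgb

# Placeholder data for illustration only.
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=1000)

model = xgb.XGBClassifier(
    n_estimators=100,    # trees are added sequentially, one boosting round at a time
    tree_method="hist",  # histogram-based split finding
    n_jobs=8,            # split evaluation within each round is parallelized across 8 threads
)
model.fit(X_train, y_train)
```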

# Matrix Factorization

Alternating Least Squares (ALS)

Alternate between re-computing the user factors and the item factors; each step is guaranteed to lower the value of the cost function.
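A minimal sketch with Spark's ALS implementation; the `ratings` DataFrame and its `userId` / `itemId` / `rating` column names are assumptions for illustration.

```python
from pyspark.ml.recommendation import ALS

# Assumes a `ratings` DataFrame with userId, itemId, and rating columns.
als = ALS(
    rank=10,        # number of latent factors per user / item
    maxIter=10,     # number of alternations between user and item factors
    regParam=0.1,
    userCol="userId",
    itemCol="itemId",
    ratingCol="rating",
)
model = als.fit(ratings)

user_factors = model.userFactors   # learned user-factor matrix
item_factors = model.itemFactors   # learned item-factor matrix
```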

# Logistic Regression

Each worker node computes gradients on a mini-batch of its own partition of the data; the master node then averages the gradient updates. The PySpark example below trains a logistic regression model on a distributed DataFrame:

```python
import pyspark.sql.functions as F
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

df = spark.read.parquet("cancellation.parquet")

# Assemble the raw interaction counts into a single feature vector column.
assembler = VectorAssembler(
    inputCols=[
        "month_interaction_count",
        "week_interaction_count",
        "day_interaction_count",
    ],
    outputCol="features",
)
df = assembler.transform(df)

# Use the cancellation flag as the binary label.
df = df.withColumn("label", F.col("cancelled_within_week").cast("double"))

# Cache the prepared DataFrame; it is reused across training iterations.
df = df.cache()

train_data, test_data = df.randomSplit([0.6, 0.4])

lr = LogisticRegression(maxIter=10)
lrModel = lr.fit(train_data)

predictions = lrModel.transform(test_data)
```

# Deep Learning Models

# GPU

  • GPUs have stream processors, each containing a large number of cores, which makes them well suited to highly parallel workloads.

(Figure: GPU stream processors)
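For example (assuming a machine with at least one visible GPU), TensorFlow can list the available devices and place a large matrix multiplication, which maps well onto those many cores, on the GPU:

```python
import tensorflow as tf

# List the GPUs TensorFlow can see (assumes at least one is installed).
print(tf.config.list_physical_devices("GPU"))

# A large matrix multiplication keeps the GPU's many cores busy.
with tf.device("/GPU:0"):
    a = tf.random.normal([4096, 4096])
    b = tf.random.normal([4096, 4096])
    c = tf.matmul(a, b)
```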

What if the data can't fit in memory?

We can have one parameter server and multiple GPUs, each training on a portion of the data. Each GPU sends its weight gradients to the master parameter server, which aggregates them and sends the updated weights back.
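A toy NumPy sketch of this idea (a hypothetical linear model with squared-error loss; in practice the workers run in parallel on separate GPUs and communicate over the network):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)

w = np.zeros(5)                               # weights held by the parameter server
shards = np.array_split(np.arange(1000), 4)   # each "GPU" owns one shard of the data

for step in range(100):
    grads = []
    for shard in shards:                      # in reality these run in parallel on the workers
        Xs, ys = X[shard], y[shard]
        grad = 2 * Xs.T @ (Xs @ w - ys) / len(shard)   # gradient of the squared error
        grads.append(grad)                    # worker sends its gradient to the server
    w -= 0.01 * np.mean(grads, axis=0)        # server averages the gradients and updates the weights
```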

What if the model can't fit in memory? You can split portions of the network across multiple machines. One option is to split within the layers, so each machine holds a part of every layer.

(Figure: partial models, split between layers)

Another option is to split by layer, so each layer lives on its own machine.

(Figure: partial models, split among layers)
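A toy TensorFlow sketch of the layer-per-device idea, assuming two GPUs are visible (`/GPU:0` and `/GPU:1`); each "layer" here is just a matrix multiplication placed on a different device, and the activations are copied between devices.

```python
import tensorflow as tf

with tf.device("/GPU:0"):
    w1 = tf.Variable(tf.random.normal([784, 512]))   # first layer's weights live on GPU 0
with tf.device("/GPU:1"):
    w2 = tf.Variable(tf.random.normal([512, 10]))    # second layer's weights live on GPU 1

x = tf.random.normal([32, 784])

with tf.device("/GPU:0"):                    # first layer runs on the first GPU
    h = tf.nn.relu(tf.matmul(x, w1))
with tf.device("/GPU:1"):                    # second layer runs on the second GPU;
    logits = tf.matmul(h, w2)                # the activations h are transferred between devices
```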

NVLink is a wire-based communications protocol for near-range semiconductor communication developed by Nvidia that can be used for data and control code transfers in processor systems between CPUs and GPUs, and solely between GPUs.

InfiniBand is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers.

# Ring All-Reduce

Two Phases:

  1. Share-Reduce Phase
  • each process p sends data to process (p + 1) % N, where N is the number of processes; so process A sends to process B, process B to process C, and so on.
  • the array of data of length n is divided into N chunks; each chunk is indexed by i going forward.

(Figures: ring all-reduce, share-reduce phase, steps 1–4)

  2. Share-Only Phase
  • shares data around the ring exactly as in phase one, but without applying the reduction.

(Figure: ring all-reduce, share-only phase)


Uber's Horovod implements ring all-reduce.
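For intuition, here is a small NumPy simulation of the two phases on a single machine; process-to-process sends are modeled as list indexing, and N is the number of simulated processes.

```python
import numpy as np

def ring_all_reduce(buffers):
    """Simulate ring all-reduce (sum) over a list of per-process arrays."""
    N = len(buffers)
    # Split each process's array into N chunks (assumes the length is divisible by N).
    chunks = [np.split(b.astype(float).copy(), N) for b in buffers]

    # Phase 1: share-reduce. In step s, process p sends chunk (p - s) % N to
    # process (p + 1) % N, which adds it into its own copy of that chunk.
    for s in range(N - 1):
        sends = [(p, (p - s) % N, chunks[p][(p - s) % N].copy()) for p in range(N)]
        for p, idx, data in sends:
            chunks[(p + 1) % N][idx] += data

    # Phase 2: share-only. Same ring pattern, but the receiver overwrites its chunk
    # instead of reducing. Process p starts by sending chunk (p + 1) % N, which it
    # fully reduced during phase 1.
    for s in range(N - 1):
        sends = [(p, (p + 1 - s) % N, chunks[p][(p + 1 - s) % N].copy()) for p in range(N)]
        for p, idx, data in sends:
            chunks[(p + 1) % N][idx] = data

    return [np.concatenate(c) for c in chunks]

# Usage: 4 simulated processes, arrays of length 8.
arrays = [np.arange(8) * (p + 1) for p in range(4)]
reduced = ring_all_reduce(arrays)
assert all(np.allclose(r, sum(arrays)) for r in reduced)   # every process ends with the element-wise sum
```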

# Code

# Using Multiple GPUs

Here is a code snippet showing how to use multiple GPUs in TensorFlow with `tf.distribute.MirroredStrategy`.
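A minimal sketch, assuming TensorFlow 2.x with the Keras API; the model architecture and the random training data are placeholders, not taken from the original snippet.

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model on every local GPU (or the CPU if none is found).
strategy = tf.distribute.MirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored on every replica.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# Placeholder data; each batch is split across the replicas and the
# per-replica gradients are all-reduced before the weights are updated.
X = np.random.rand(1024, 10).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1))
model.fit(X, y, batch_size=64, epochs=2)
```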