Pipelined Parallelism
🏗️ Infrastructure
🔴 Advanced
👁 2 views
📖 Quick Definition
Pipelined parallelism splits model layers across devices, allowing simultaneous processing of different micro-batches to improve training efficiency.
## What is Pipelined Parallelism?
Pipelined parallelism is a distributed training strategy designed to handle machine learning models that are too large to fit into the memory of a single GPU. Instead of replicating the entire model across multiple devices (as in Data Parallelism) or splitting individual operations across devices (as in Tensor Parallelism), pipelined parallelism divides the model’s layers sequentially across a chain of GPUs. Each device holds a specific subset of the model’s layers, creating a physical pipeline where data flows from one stage to the next.
Imagine an assembly line in a car factory. In traditional synchronous training, every worker waits for the previous step to finish completely before starting their task, leading to significant idle time. Pipelined parallelism changes this dynamic. As soon as the first layer finishes processing a small chunk of data (a micro-batch), it passes that data to the second layer. While the second layer works on that chunk, the first layer immediately begins processing the *next* chunk. This overlapping of computation allows the hardware to remain busy, significantly improving throughput and reducing the overall time required to train massive models.
This technique is particularly crucial in the era of Large Language Models (LLMs), where parameter counts reach into the hundreds of billions. Without pipelining, the memory constraints of current hardware would make training such models impossible or prohibitively expensive. By distributing the memory load and keeping compute units active, engineers can scale training to sizes that were previously unattainable.
## How Does It Work?
Technically, the model is partitioned into $N$ stages, where each stage resides on a different GPU (or group of GPUs). The training process involves forward and backward passes. To maximize efficiency, the input batch is split into smaller **micro-batches**.
1. **Forward Pass**: Micro-batch 1 enters Stage 1. Once Stage 1 computes its output, it sends it to Stage 2. Meanwhile, Stage 1 immediately starts computing on Micro-batch 2. This continues until all micro-batches have passed through all stages.
2. **Backward Pass**: After the loss is calculated, gradients flow backward. Stage $N$ computes gradients for its parameters and sends them back to Stage $N-1$. Again, this happens concurrently with other stages processing earlier micro-batches.
A common challenge here is the "bubble" problem—idle times when stages wait for dependencies. Advanced scheduling algorithms like **1F1B** (One Forward, One Backward) are used to minimize these bubbles by interleaving forward and backward computations more efficiently.
```python
# Conceptual pseudocode for pipelined execution
for micro_batch in micro_batches:
# Stage 1 processes current batch while Stage 2 processes previous
output_stage_1 = stage_1.forward(micro_batch)
send_to_next_stage(output_stage_1)
# Simultaneously, receive input for Stage 2
input_stage_2 = receive_from_prev_stage()
output_stage_2 = stage_2.forward(input_stage_2)
```
## Real-World Applications
* **Training LLMs**: Used extensively in training models like GPT-3 and BLOOM, where single-GPU memory is insufficient.
* **Recommendation Systems**: Handling massive embedding tables that exceed single-device capacity by splitting embeddings across devices.
* **Multi-Modal Models**: Distributing vision and language encoders across different stages to balance computational loads.
## Key Takeaways
* **Memory Efficiency**: Allows training of models larger than single-GPU memory by splitting layers vertically.
* **Throughput vs. Latency**: Improves overall training speed (throughput) but may increase latency per individual sample due to pipeline overhead.
* **Complexity**: Requires careful management of communication between devices and sophisticated scheduling to avoid idle time.
* **Hybrid Approach**: Often combined with Data Parallelism (splitting data across pipeline groups) for maximum scalability.
## 🔥 Gogo's Insight
**Why It Matters**: As AI models grow exponentially, memory bandwidth and capacity become the primary bottlenecks. Pipelined parallelism is the key infrastructure enabler that allows us to continue scaling model size without being strictly limited by physical hardware constraints. It transforms a memory-bound problem into a communication-bound one, which is often easier to optimize with high-speed interconnects like NVLink.
**Common Misconceptions**: Many believe pipelining simply makes training faster linearly. In reality, it introduces communication overhead and synchronization complexity. If not tuned correctly (e.g., poor micro-batch sizing), it can actually be slower than simpler methods due to pipeline bubbles.
**Related Terms**:
1. **Data Parallelism**: The most common baseline method; often combined with pipelining.
2. **Tensor Parallelism**: Splits individual operations/weights; often used within a single pipeline stage.
3. **Gradient Accumulation**: A technique often used alongside pipelining to simulate larger batch sizes.