Splitwise: Efficient Generative LLM Inference Using Phase Splitting

Splitwise optimizes LLM inference by splitting the prompt computation and token generation phases onto separate machines, achieving higher throughput at lower cost and power consumption.

1. Introduction

Generative large language models (LLMs) have revolutionized natural language processing, but their computational demands pose significant challenges for efficient inference. The Splitwise approach addresses these challenges by recognizing and exploiting the distinct computational characteristics of the two main phases in LLM inference.

2. Background and Motivation

2.1 LLM Inference Phases

LLM inference consists of two distinct phases (see the sketch following this list):

  • Prompt Computation Phase: Computationally intensive parallel processing of all input tokens
  • Token Generation Phase: Memory-intensive sequential generation of output tokens
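
The sketch below illustrates the two phases for a single request. It assumes a hypothetical decoder model object whose forward() call returns logits and an updated KV cache; the names are illustrative, not the paper's API.

def generate(model, prompt_tokens, max_new_tokens):
    # Prompt (prefill) phase: all input tokens are processed in one parallel
    # forward pass; this step is compute-bound and builds the KV cache.
    logits, kv_cache = model.forward(prompt_tokens, kv_cache=None)
    next_token = int(logits[-1].argmax())

    # Token generation (decode) phase: tokens are produced one at a time;
    # every step re-reads the whole KV cache, so it is memory-bandwidth-bound.
    output = [next_token]
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        next_token = int(logits[-1].argmax())
        output.append(next_token)
    return output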

2.2 Hardware Limitations

GPU specification comparison (A100 vs. H100): moving to the H100 delivers a 3.43× increase in compute but only a 1.64× improvement in memory bandwidth.

Modern GPUs show disproportionate scaling between computational power and memory capabilities, creating inefficiencies in LLM inference.

3. Splitwise Design

3.1 Architecture Overview

Splitwise deploys prompt computation and token generation on separate machines optimized for each phase's requirements.

3.2 Phase-Specific Resource Management

Splitwise assigns high-compute GPUs (e.g., H100) to the prompt phase and cost-effective GPUs to the token generation phase, so each pool's hardware matches the bottleneck of the phase it serves.
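
As a minimal sketch of this mapping (the pool structure and the choice of the A100 as the cost-effective token-generation GPU are assumptions for illustration, not the paper's provisioning logic):

# Hypothetical phase-to-pool mapping; it only illustrates the idea of
# matching each phase to hardware that fits its bottleneck.
PHASE_POOLS = {
    "prompt": {"gpu": "H100", "bottleneck": "compute (parallel prefill)"},
    "token": {"gpu": "A100", "bottleneck": "memory bandwidth (sequential decode)"},
}

def gpu_for_phase(phase):
    return PHASE_POOLS[phase]["gpu"]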

4. Technical Implementation

4.1 Mathematical Foundation

The attention mechanism in transformers can be represented as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Where $Q$, $K$, $V$ represent queries, keys, and values respectively, and $d_k$ is the dimension of keys.
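
As a concrete illustration, here is a minimal single-head NumPy version of the formula above (no masking, batching, or multi-head logic); it is a sketch for checking shapes, not an optimized kernel.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_q, seq_k) scaled dot products
    return softmax(scores, axis=-1) @ V   # weighted sum of values, (seq_q, d_v)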

4.2 Code Implementation

class SplitwiseScheduler:
    def schedule_request(self, request):
        # Route prompt (prefill) requests to the compute-optimized pool and
        # token-generation requests to the memory-optimized pool.
        if request.phase == "prompt":
            return self.assign_to_prompt_machine(request)
        return self.assign_to_token_machine(request)

    def transfer_state(self, prompt_output, token_machine):
        # Hand the KV cache produced during the prompt phase to the chosen
        # token machine; Splitwise performs this state transfer over RDMA.
        return token_machine.load_state(prompt_output)
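
The assign_to_* helpers are left abstract; in a full system they would pick a machine from the corresponding pool. The usage sketch below shows how one request would flow through both pools, using hypothetical machine and request objects (run_prompt, generate_tokens, and the request fields are illustrative, not the paper's interface).

# Hypothetical flow for a single request through a Splitwise-style deployment.
prompt_machine = scheduler.schedule_request(prompt_request)   # compute-optimized pool
kv_cache = prompt_machine.run_prompt(prompt_request)          # prefill pass builds the KV cache
token_machine = scheduler.schedule_request(token_request)     # memory-optimized pool
scheduler.transfer_state(kv_cache, token_machine)             # KV-cache handoff (e.g., over RDMA)
completion = token_machine.generate_tokens(token_request)     # sequential decode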

5. Experimental Results

Splitwise achieves:

  • 1.4× higher throughput at 20% lower cost
  • 2.35× more throughput under the same power and cost budgets
  • Improved latency consistency and resource utilization

6. Analysis and Discussion

Splitwise represents a significant advancement in LLM inference optimization by addressing the fundamental mismatch between computational requirements and hardware capabilities. The approach draws inspiration from distributed systems principles similar to those used in MapReduce and other parallel processing frameworks. By recognizing that the token generation phase is memory-bound rather than compute-bound, Splitwise enables more efficient resource allocation that aligns with the actual computational demands of each inference phase.

This work builds upon established principles in computer architecture, particularly the memory-wall problem identified by Wulf and McKee in 1995, which highlighted the growing disparity between processor speed and memory performance. The transformer architecture's attention mechanism, first introduced in Vaswani et al.'s 2017 paper "Attention is All You Need," inherently creates these two distinct computational phases, but previous optimization efforts focused primarily on model compression and quantization rather than architectural separation.

Compared to traditional monolithic deployment, Splitwise's phase separation approach demonstrates how specialized hardware can be more effectively utilized, similar to how Google's TPU pods are optimized for specific ML workloads. The 1.4× throughput improvement and 20% cost reduction are particularly significant given the massive scale of modern LLM deployments, where even small percentage improvements translate to substantial operational savings.

The methodology aligns with recent trends in heterogeneous computing, where systems combine different types of processors optimized for specific tasks. As LLMs continue to grow in size and complexity, approaches like Splitwise will become increasingly important for sustainable AI deployment, addressing both economic and environmental concerns associated with large-scale model inference.

7. Future Applications

Future directions include:

  • Multi-modal model inference optimization
  • Edge computing deployments
  • Real-time adaptive resource allocation
  • Integration with emerging hardware architectures

8. References

  1. Vaswani, A., et al. "Attention is All You Need." NeurIPS 2017.
  2. Brown, T., et al. "Language Models are Few-Shot Learners." NeurIPS 2020.
  3. Wulf, W. A., & McKee, S. A. "Hitting the Memory Wall: Implications of the Obvious." ACM SIGARCH Computer Architecture News, 1995.
  4. NVIDIA Corporation. "NVIDIA H100 Tensor Core GPU Architecture." 2022.
  5. Dean, J., & Ghemawat, S. "MapReduce: Simplified data processing on large clusters." OSDI 2004.