Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models

📝 Paper Summary

Dynamic Inference, Efficient LLM Deployment, Model Compression

Balcony enables efficient dynamic inference by attaching lightweight, trainable exit layers to a frozen pre-trained LLM, allowing adaptive computation without degrading the base model's performance.

Core Problem

Existing dynamic inference methods often require extensive retraining or modification of the base model. This creates conflicting gradients between intermediate-layer and final-layer objectives and degrades the performance of the full model.

Why it matters:

Real-world deployments face fluctuating computational constraints (latency, budget) that static models cannot handle efficiently
Prior depth-based methods force intermediate layers to serve dual purposes (representation for next layer vs. final output), causing performance drops in the full model
Retraining massive LLMs for dynamic inference is computationally expensive and risks catastrophic forgetting of pre-trained capabilities

Concrete Example: In methods like LayerSkip or Flextron, the base model weights are altered to support early exits. This creates a 'jack-of-all-trades' problem where the full model becomes worse than the original pre-trained checkpoint (e.g., LLaMA-2-7B accuracy drops from 46.1% to 42.1% in Flextron-Dynamic) because intermediate layers are pulled in different directions.

Key Novelty

Frozen Base Model with Balcony Exits

Instead of retraining the main LLM, Balcony keeps it completely frozen and attaches a single transformer decoder layer (the 'Balcony') at specific exit points.
These added layers are trained via self-distillation to map intermediate hidden states to the final output distribution, acting as lightweight adapters that translate partial processing into final predictions.

Architecture

Conceptual framework of Balcony showing the frozen base model with attached trainable exit layers.

Evaluation Highlights

Outperforms state-of-the-art Flextron and LayerSkip on LLaMA-2-7B and LLaMA-3-8B across 8 benchmarks while using significantly less training data (0.2% of pretraining tokens)
Maintains 100% of the original base model's performance (lossless), whereas baselines degrade the full model by up to ~4 percentage points
Achieves a ~2.8x speedup over the full model on LLaMA-3-8B, with minimal accuracy loss, by exiting at earlier layers

Breakthrough Assessment

8/10

Simple yet highly effective solution to the 'conflicting gradients' problem in dynamic inference. By freezing the base model, it guarantees no degradation of the full model—a critical advantage over prior work—while achieving superior sub-model performance.

⚙️ Technical Details

Problem Definition

Setting: Depth-based dynamic inference where a model M with N layers can exit at a subset of layers E to produce a prediction

Inputs: Input tokens X

Outputs: Next-token probability distribution P(y|X)

Pipeline Flow

Input Processing (Standard LLM Layers 1 to j)
Balcony Exit (At layer j: Intermediate State -> Balcony Layer -> Norm -> Head -> Output)
Optional Continuation (If not exiting: Standard LLM Layers j+1 to N)
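
The three pipeline steps above can be sketched as plain control flow. This is a minimal illustration, not the authors' implementation: the function name forward_with_exit, the exits mapping, and the toy layers are all hypothetical stand-ins, and real layers would be transformer blocks operating on tensors.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward_with_exit(
    x: Vector,
    base_layers: Sequence[Layer],           # frozen base layers 1..N
    exits: Dict[int, Tuple[Layer, Layer]],  # exit index j -> (balcony layer, balcony norm)
    final_norm: Layer,                      # norm used by the full model
    lm_head: Layer,                         # shared, frozen LM head
    exit_at: Optional[int] = None,
) -> Vector:
    """Run layers 1..j, then exit through a Balcony, or run all N layers."""
    h = x
    depth = len(base_layers) if exit_at is None else exit_at
    for layer in base_layers[:depth]:
        h = layer(h)
    if exit_at is None:
        # Full path: untouched base model output.
        return lm_head(final_norm(h))
    # Early exit: intermediate state -> Balcony layer -> norm -> shared head.
    balcony_layer, balcony_norm = exits[exit_at]
    return lm_head(balcony_norm(balcony_layer(h)))


# Toy stand-ins (hypothetical): each base layer adds 1, the Balcony multiplies by 10.
def add_one(h: Vector) -> Vector:
    return [v + 1.0 for v in h]


def times_ten(h: Vector) -> Vector:
    return [v * 10.0 for v in h]


def identity(h: Vector) -> Vector:
    return list(h)


base = [add_one] * 4
exits = {2: (times_ten, identity)}
full_out = forward_with_exit([0.0], base, exits, identity, identity)
early_out = forward_with_exit([0.0], base, exits, identity, identity, exit_at=2)
print(full_out, early_out)  # [4.0] [20.0]
```

Because the full path never touches a Balcony module, the output of the complete model is unchanged regardless of how the exits are trained.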

System Modules

Base LLM Layers

Process input sequentially to generate hidden states

Model or implementation: LLaMA-3-8B or LLaMA-2-7B (Frozen)

Balcony Layer (Exit Mechanism)

Refine intermediate hidden state X_j to be compatible with the final LM head

Model or implementation: Single Transformer Decoder Layer + RMSNorm

Shared LM Head (Exit Mechanism)

Project refined state to vocabulary space

Model or implementation: Linear projection (Frozen)

Novel Architectural Elements

Side-car architecture: Attaching independent, trainable transformer layers (Balconies) to intermediate points of a frozen backbone
Non-nested training: Unlike MatFormer or Flextron, sub-models do not share weights in a nested manner that compromises the base model; they share the backbone but have independent exit adapters

Modeling

Base Model: LLaMA-3-8B, LLaMA-2-7B, and a custom trained LLM-1B

Training Method: Supervised fine-tuning of Balcony modules via Self-Distillation

Objective Functions:

Purpose: Align the output distribution of the Balcony exit with the full frozen model's output.

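Formally: Minimize KL( p(· | W_1:j, W'_j) || p(· | W_1:N) ), where W_1:j are the frozen weights of base layers 1 to j, W_1:N are the frozen weights of the full model, and W'_j are the trainable Balcony parameters at exit j.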
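
As a rough illustration of this objective for a single token position, the sketch below computes the KL divergence between an exit distribution and the full-model distribution in pure Python. The argument order follows the formula as written in this summary; the logits are made-up values, and a real training loop would compute this over batches of tensors with gradients flowing only into the Balcony parameters.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical logits over a 3-token vocabulary.
exit_logits = [2.0, 1.0, 0.1]   # Balcony exit at layer j (student)
full_logits = [2.5, 0.5, 0.0]   # frozen full model (teacher)

p_exit = softmax(exit_logits)
p_full = softmax(full_logits)
loss = kl_divergence(p_exit, p_full)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero only when the exit reproduces the full model's distribution exactly, which is what training the Balcony parameters pushes toward.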

Adaptation: Balcony modules (approx 2.5% of full model params per exit)

Training Data:

Cosmopedia V2 dataset
Sampled 31.5B tokens for LLaMA models (0.2% of pretraining data)

Key Hyperparameters:

learning_rate: 5e-4 (max, cosine schedule)
batch_size: 256
training_steps: 30,000
sequence_length: 4,096 tokens
optimizer: Not explicitly named (likely AdamW given context)
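
The listed hyperparameters are consistent with the reported token count. A quick check, assuming every step consumes a full batch of full-length sequences (an assumption, since the summary does not state how sequences are packed):

```python
batch_size = 256
training_steps = 30_000
sequence_length = 4_096

# Tokens seen during training = sequences per step * steps * tokens per sequence.
tokens = batch_size * training_steps * sequence_length
print(f"{tokens:,} tokens ≈ {tokens / 1e9:.1f}B")  # 31,457,280,000 tokens ≈ 31.5B
```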

Compute: Trained on NVIDIA V100 32GB GPUs. Training is lightweight due to freezing base model.

Comparison to Prior Work

vs. Flextron: Balcony freezes the base model, preventing full-model degradation, whereas Flextron modifies the base model, causing accuracy drops. Balcony also uses far less training data.
vs. LayerSkip: Balcony adds a learnable adapter layer at exits rather than forcing intermediate layers to directly predict output. Balcony preserves base model performance.
vs. SortedLLaMA: Balcony freezes the backbone; SortedLLaMA fine-tunes the entire model using sorted loss.

Limitations

Inference memory overhead: Requires loading extra Balcony layer parameters (though small, ~2.5% per exit) alongside the base model.
Fixed exit points: Unlike some token-level adaptive methods, Balcony typically sets fixed architectural exit points (e.g., every 4 layers).
Training from scratch trade-off: When training from scratch (not freezing), Balcony performs worse than a standard baseline, though better than nested architectures.

Reproducibility

Code: https://github.com/benyaminjami/Balcony-LLaMA/tree/finetuning

Code is publicly available on GitHub. Training uses open datasets (Cosmopedia V2, FineWebEDU). Models evaluated: LLaMA-2-7B, LLaMA-3-8B. Baselines used for comparison: Flextron (not open source) and LayerSkip (open source).

📊 Experiments & Results

Evaluation Setup

Zero-shot and few-shot evaluation on standard NLP benchmarks.

Benchmarks:

MMLU (Massive Multitask Language Understanding, 5-shot)
ARC-Challenge (Reasoning, 25-shot)
HellaSwag (Commonsense Reasoning, 10-shot)
BoolQ (Question Answering)
WinoGrande (Commonsense Reasoning)
PIQA (Reasoning)

Metrics:

Accuracy (%)
Inference Speedup (Latency reduction)
Statistical methodology: Not explicitly reported in the paper

Key Results

LLaMA-2-7B: Balcony outperforms baselines on sub-models while preserving full-model accuracy.

Benchmark	Metric	Baseline	This Paper	Δ
MMLU (5-shot)	Accuracy (%)	42.1	46.6	+4.5
MMLU (5-shot)	Accuracy (%)	35.2	40.9	+5.7
MMLU (5-shot)	Accuracy (%)	27.5	40.9	+13.4

LLaMA-3-8B: the same trends hold on a stronger base model.

Benchmark	Metric	Baseline	This Paper	Δ
MMLU (5-shot)	Accuracy (%)	57.2	61.7	+4.5
Inference Latency	Seconds	122.9	44.6	-78.3
Average of 9 tasks	Accuracy (%)	35.0	40.0	+5.0
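
The latency row can be converted into the speedup figure quoted in the highlights. This small check assumes the Baseline column is the full-model latency and the This Paper column is the latency of an early Balcony exit, which the table does not state explicitly:

```python
full_latency_s = 122.9   # "Baseline" column of the Inference Latency row
exit_latency_s = 44.6    # "This Paper" column of the same row

speedup = full_latency_s / exit_latency_s
reduction = 1 - exit_latency_s / full_latency_s
print(f"speedup: {speedup:.2f}x, latency reduction: {reduction:.1%}")
```

Under that reading, the ratio is about 2.76x, in line with the ~2.8x speedup reported for LLaMA-3-8B.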

Experiment Figures

Speedup comparison between Depth Pruning and Width Pruning on LLaMA-3-8B.

Accuracy vs. Parameters trade-off curves for LLaMA-2-7B family models (Balcony vs Flextron vs LayerSkip vs Compression methods).

Main Takeaways

Freezing the base model is crucial: It prevents the 'conflicting gradients' issue where intermediate layers are pulled in multiple directions, maintaining the full model's original capabilities.
A single transformer layer is sufficient: Adding just one decoder layer at the exit point effectively bridges the gap between intermediate representations and the final output space.
Depth pruning is superior to width pruning: Empirical analysis confirms that for a fixed parameter budget, reducing layers (depth) yields better speedups on GPUs than reducing dimensions (width).
Training efficiency: Balcony requires only 0.2% of the original pretraining data to achieve state-of-the-art dynamic inference results.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (layers, attention, normalization)
Knowledge Distillation (specifically self-distillation)
Dynamic Inference / Early Exiting concepts

Key Terms

dynamic inference: The ability of a model to adjust its computational usage (e.g., number of layers processed) at runtime based on resource constraints or sample difficulty
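
One simple way to act on a runtime constraint is to pick the deepest exit whose estimated cost fits the current budget. The sketch below is only an illustration of that idea: the function pick_exit, the exit layers, and the latency numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional


def pick_exit(exit_latency_s: Dict[int, float], budget_s: float) -> Optional[int]:
    """Return the deepest exit layer whose estimated latency fits the budget, or None."""
    feasible = [layer for layer, latency in exit_latency_s.items() if latency <= budget_s]
    return max(feasible) if feasible else None


# Hypothetical per-exit latency estimates (seconds) for a 32-layer model.
estimates = {8: 1.0, 16: 2.0, 24: 3.0, 32: 4.0}

print(pick_exit(estimates, budget_s=2.5))  # 16
print(pick_exit(estimates, budget_s=0.5))  # None
```

Deeper exits are preferred here on the assumption that accuracy grows with depth, so the budget is spent on as many layers as it allows.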

depth-based inference: A type of dynamic inference where the model stops processing after a certain number of layers (early exit) rather than running the full depth

exit point: A specific layer in the neural network where computation can stop, and a prediction can be generated

Balcony module: A lightweight auxiliary module (one transformer block + norm) attached to an exit point to convert intermediate representations into final predictions

self-distillation: A training process where the model's own full-depth output serves as the target (teacher) for its shallower sub-models (students)

KL divergence: Kullback-Leibler divergence, a measure of how one probability distribution differs from another (asymmetric, so not a true distance metric), used here as a loss function to align the probability distribution of early exits with the full model's output

width-based inference: Adjusting model size by pruning neurons or attention heads (reducing width) rather than layers (reducing depth)