dynamic inference: The ability of a model to adjust its computation at runtime (e.g., the number of layers processed) based on resource constraints or sample difficulty
depth-based inference: A type of dynamic inference where the model stops processing after a certain number of layers (early exit) rather than running the full depth
exit point: A specific layer in the neural network where computation can stop, and a prediction can be generated
Balcony module: A lightweight auxiliary module (one transformer block + norm) attached to an exit point to convert intermediate representations into final predictions
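The interplay of exit points and a lightweight prediction head can be sketched with a toy stack of layers. All names here (`forward`, `exit_head`, `make_layer`) and the scalar "layers" are illustrative stand-ins, not the paper's actual architecture:

```python
# Toy sketch of depth-based early exit (hypothetical names, scalar "layers").
# An exit point truncates the layer stack; a lightweight head (standing in
# for a Balcony-style module) maps the intermediate state to a prediction.

def make_layer(weight):
    # Stand-in for a transformer block: scale-and-shift on a scalar feature.
    return lambda h: weight * h + 1.0

def exit_head(h):
    # Lightweight auxiliary head converting an intermediate state to a prediction.
    return h * 0.5

def forward(layers, x, exit_at=None):
    """Run the stack, optionally stopping early after layer index `exit_at`."""
    h = x
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        if exit_at is not None and i == exit_at:
            break  # depth-based dynamic inference: skip the remaining layers
    return exit_head(h)

layers = [make_layer(w) for w in (1.0, 2.0, 3.0)]
full = forward(layers, 1.0)              # full depth: all 3 layers
early = forward(layers, 1.0, exit_at=2)  # early exit after layer 2
```

The early-exit call touches only two of the three layers, trading prediction quality for compute, which is the core idea behind depth-based inference.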
self-distillation: A training process where the model's own full-depth output serves as the target (teacher) for its shallower sub-models (students)
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from another (asymmetric, so not a true distance metric), used here as a loss function to align the probability distribution of early exits with the full model's output
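The self-distillation objective above can be illustrated with a minimal pure-Python computation. The logits below are made-up values, not real model outputs:

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q): expected extra log-loss from using q in place of p.
    # Non-negative, and zero only when the two distributions match.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Self-distillation: the full-depth output (teacher) is the target for an
# early exit (student). Illustrative logits only.
teacher = softmax([2.0, 0.5, -1.0])  # full model's output distribution
student = softmax([1.5, 1.0, -0.5])  # early exit's output distribution
loss = kl_divergence(teacher, student)
```

Minimizing this loss during training pushes each early exit's distribution toward the full model's, which is exactly the role the KL term plays in self-distillation.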
width-based inference: Adjusting model size by pruning neurons or attention heads (reducing width) rather than layers (reducing depth)
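For contrast with depth-based exits, width reduction can be sketched as dropping individual neurons. This is a generic magnitude-pruning sketch (`prune_neurons` is a hypothetical helper, not a method from the source):

```python
def prune_neurons(neuron_weights, keep):
    """Width-based reduction sketch: keep the `keep` neurons (rows of
    weights) with the largest L2 norm, preserving their original order."""
    norms = [(sum(w * w for w in ws) ** 0.5, i)
             for i, ws in enumerate(neuron_weights)]
    # Select indices of the top-`keep` neurons by norm, then restore order.
    kept = sorted(i for _, i in sorted(norms, reverse=True)[:keep])
    return [neuron_weights[i] for i in kept]

# Three neurons with L2 norms 5.0, 1.0, and 2.0; keep the strongest two.
pruned = prune_neurons([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]], keep=2)
```

Unlike an early exit, which removes whole layers, this shrinks each layer in place, so every layer still participates in the forward pass.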