TTFT: Time-To-First-Token, the latency from the moment a user sends a query until the model emits the first token of the response
HNSW: Hierarchical Navigable Small World—a graph-based approximate nearest neighbor search algorithm known for high speed and accuracy but high memory usage
IVF: Inverted File, a search index that partitions vectors into clusters and searches only the clusters nearest the query; more memory-efficient than HNSW but often lower in recall at a comparable search speed
retrieval stride: The frequency at which the system performs a new retrieval operation during the generation process (e.g., retrieving new context every 4 tokens)
scalar quantization (SQ): A compression technique that stores each vector component at lower precision (e.g., 8-bit integer instead of 32-bit float) to save memory
product quantization (PQ): A compression technique that splits vectors into sub-vectors and quantizes them separately, offering higher compression than scalar quantization
recall: The fraction of relevant documents successfully retrieved by the system compared to the total number of relevant documents available
tail latency: The response time at a high percentile of the latency distribution (e.g., p99, the value that 99% of requests meet or beat), often much higher than the average due to system stalls or complex queries
QPS: Queries Per Second—a measure of the throughput of the retrieval system
re-ranking: A second stage in retrieval where a more accurate (but slower) model re-scores the initial set of retrieved documents to improve relevance
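
Three of the terms above (scalar quantization, recall, and tail latency) reduce to short calculations, so a minimal sketch may help make them concrete. The function names, the uniform 8-bit mapping over a fixed [lo, hi] range, and the nearest-rank percentile rule are illustrative assumptions, not the method of any particular library or of the system described here.

```python
import math

def scalar_quantize(vec, lo, hi):
    """Map floats in [lo, hi] to 8-bit codes 0..255 (assumed uniform scheme)."""
    scale = (hi - lo) / 255.0
    # Clamp so values outside [lo, hi] still fit in one byte.
    return [min(255, max(0, round((x - lo) / scale))) for x in vec]

def scalar_dequantize(codes, lo, hi):
    """Approximate the original floats from their 8-bit codes."""
    scale = (hi - lo) / 255.0
    return [lo + c * scale for c in codes]

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=99 for p99 latency."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

if __name__ == "__main__":
    vec = [-1.0, 0.25, 1.0]
    codes = scalar_quantize(vec, -1.0, 1.0)
    print(codes, scalar_dequantize(codes, -1.0, 1.0))

    print(recall_at_k(["a", "b", "c", "d"], ["a", "d", "z"], k=3))

    latencies_ms = [10, 12, 11, 250, 13, 12, 11, 10, 12, 11]
    print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))
```

In the sample latencies, one slow request is enough to pull p99 far above the median, which is the gap the tail latency entry describes. The quantization round trip loses at most about half a quantization step per component when the input lies inside [lo, hi].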