LLM-as-a-Judge: Using a strong language model (e.g., GPT-4) to evaluate another model's outputs against a specific rubric
RAGAS: A popular framework for reference-free evaluation of RAG systems using metrics like faithfulness and answer relevance
Identifier Integrity: A proposed metric measuring whether specific technical identifiers (error codes, file paths, versions) are preserved exactly without corruption
Resolution Alignment: A proposed metric assessing whether the provided steps comply with operational constraints and are likely to resolve the specific support case
Case-Aware: Evaluation that explicitly conditions on structured metadata (case subject/description) and conversation history rather than treating the query in isolation
Severity-Aware Scoring: An aggregation method where weights are assigned based on organizational risk (e.g., hallucinations are penalized more heavily than style issues)
Grounding Fidelity: The extent to which the claims in the generated answer are supported by the retrieved evidence
Context Sufficiency: A metric evaluating whether the retrieved documents actually contain the information necessary to answer the user's query
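Two of the proposed metrics above lend themselves to a deterministic sketch: Identifier Integrity can be approximated by extracting technical identifiers with regular expressions and checking that each survives verbatim in the answer, and Severity-Aware Scoring is a weighted average over per-metric scores. The patterns, weights, and metric names below are illustrative assumptions, not part of the source; a real system would tune them to its own identifier formats and risk model.

```python
import re

# Hypothetical identifier patterns (assumption, not from the source):
ID_PATTERNS = [
    r"\b[A-Z]{2,}-\d+\b",        # error codes such as ERR-1042
    r"\b\d+\.\d+(?:\.\d+)?\b",   # version strings such as 2.14.1
    r"(?:/[\w.-]+){2,}",         # file paths such as /var/log/app.log
]

def identifier_integrity(source: str, answer: str) -> float:
    """Fraction of identifiers found in the source that appear verbatim in the answer."""
    ids = {m for pattern in ID_PATTERNS for m in re.findall(pattern, source)}
    if not ids:
        return 1.0  # nothing to preserve, so nothing can be corrupted
    preserved = {i for i in ids if i in answer}
    return len(preserved) / len(ids)

# Severity-aware aggregation: weights reflect organizational risk, so a
# grounding failure (hallucination) costs more than a style issue.
# These weights and metric names are placeholders.
SEVERITY_WEIGHTS = {
    "grounding_fidelity": 3.0,
    "identifier_integrity": 2.0,
    "style": 0.5,
}

def severity_aware_score(metric_scores: dict[str, float]) -> float:
    """Weighted average of per-metric scores in [0, 1]."""
    total_weight = sum(SEVERITY_WEIGHTS[m] for m in metric_scores)
    weighted = sum(SEVERITY_WEIGHTS[m] * s for m, s in metric_scores.items())
    return weighted / total_weight

src = "Upgrading to 2.14.1 fixed ERR-1042 (see /var/log/app.log)"
ans = "Upgrade to version 2.14.1 to resolve ERR-1042."
integrity = identifier_integrity(src, ans)  # the file path is dropped
print(integrity)
print(severity_aware_score({
    "grounding_fidelity": 1.0,
    "identifier_integrity": integrity,
    "style": 0.5,
}))
```

Because the checks are exact string comparisons, a single transposed digit in an error code or version counts as a loss, which is precisely the corruption Identifier Integrity is meant to catch.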