Dynamic Context-Aware Prompt Recommendation for Domain-Specific AI Applications

📝 Paper Summary

Prompt Engineering, Domain-Specific AI Assistants

This system generates context-aware prompt suggestions for domain-specific AI by combining retrieval-augmented skill discovery with a telemetry-based ranking model to align user intent with available system capabilities.

Core Problem

Users in specialized domains (like cybersecurity) struggle to formulate precise prompts that align with complex system skills, while static recommendation lists lack context and scalability.

Why it matters:

Ineffective prompts in high-stakes environments (e.g., security operations) lead to missed insights and operational inefficiencies
Static prompt lists require manual curation and fail to scale as AI systems add new plugins and skills
Generic suggestions ignore user history and active session context, failing to leverage behavioral telemetry for personalization

Concrete Example: In a cybersecurity session analyzing a specific threat, a static system might suggest a generic 'Scan network' prompt. The proposed system, detecting the user is analyzing an 'Intune' device entity, retrieves the specific 'Intune Device Query' skill and synthesizes a precise prompt like 'List recent configuration changes for device [DeviceID]'.

Key Novelty

Context-Aware Hierarchical Prompt Synthesis

Utilizes a two-stage hierarchical retrieval process that first identifies relevant 'plugins' (groups of skills) and then specific skills, similar to schema refinement
Integrates a predictive model trained on behavioral telemetry (user clicks/history) to dynamically rank skills before generating the final natural language prompt
Synthesizes the final prompt using an LLM that acts as an interpreter mapping user context and ranked skills to executable instructions

Architecture

The end-to-end architecture of the dynamic prompt recommendation system

Evaluation Highlights

98.0% of suggested prompts were rated as 'Useful' or better by security researchers when using the full GPT-4o pipeline
75.0% of prompts generated by the full pipeline were rated 'Extremely Useful' by experts, versus 53.1% for the hybrid (Markov + GPT-4o) configuration
Achieved an average overall usefulness score of 0.884 (88.4%) across 12,432 automated evaluations of suggestions from real-world security customer sessions

Breakthrough Assessment

7/10

Strong practical application resolving a major usability bottleneck in domain-specific AI (discovery of complex skills). While components (RAG, Ranking) are known, their hierarchical integration for prompt suggestion is novel and effective.

⚙️ Technical Details

Problem Definition

Setting: Prompt recommendation in a domain-specific conversational AI system

Inputs: Natural language query, session context, conversation history, user profile

Outputs: Ranked list of executable prompt suggestions (meta-prompts)

Pipeline Flow

Contextual Query Processor
Knowledge Retrieval Engine
Hierarchical Skill Organization
Skill Ranking Engine
Information Synthesis & Prompt Generation
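
The five stages above can be read as a chain of functions that pass a shared state along. The sketch below is illustrative only: the `State` fields, the function names, the two-entry candidate list, and the entity-match ranking rule are all assumptions made for this example, and each stage is a stub standing in for the real component.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical record handed from one stage to the next."""
    query: str
    context: dict = field(default_factory=dict)
    candidates: list = field(default_factory=list)  # (plugin, skill) pairs
    by_plugin: dict = field(default_factory=dict)
    ranked: list = field(default_factory=list)
    suggestions: list = field(default_factory=list)
    trace: list = field(default_factory=list)


def contextual_query_processor(s: State) -> State:
    s.trace.append("query_processor")
    s.context["enriched"] = f"{s.query} [entity={s.context.get('entity', 'n/a')}]"
    return s


def knowledge_retrieval_engine(s: State) -> State:
    s.trace.append("retrieval")
    # Stub: a real system would query a retrieval index here.
    s.candidates = [("Network", "Scan network"), ("Intune", "Intune Device Query")]
    return s


def hierarchical_skill_organization(s: State) -> State:
    s.trace.append("organization")
    for plugin, skill in s.candidates:
        s.by_plugin.setdefault(plugin, []).append(skill)
    return s


def skill_ranking_engine(s: State) -> State:
    s.trace.append("ranking")
    entity = s.context.get("entity")
    # Stub ranking: skills whose plugin matches the active entity come first.
    pairs = [(p, sk) for p, skills in s.by_plugin.items() for sk in skills]
    s.ranked = sorted(pairs, key=lambda pair: pair[0] != entity)
    return s


def prompt_generation(s: State) -> State:
    s.trace.append("generation")
    # Stub: a real system would call an LLM here.
    s.suggestions = [f"Use '{skill}' ({plugin})" for plugin, skill in s.ranked]
    return s


PIPELINE = [
    contextual_query_processor,
    knowledge_retrieval_engine,
    hierarchical_skill_organization,
    skill_ranking_engine,
    prompt_generation,
]


def recommend(query: str, context: dict) -> State:
    state = State(query=query, context=dict(context))
    for stage in PIPELINE:
        state = stage(state)
    return state


result = recommend("What changed on this device?", {"entity": "Intune"})
```

Keeping every stage behind the same `State -> State` signature is one way to swap a single component (for example, the ranking model) without touching the others.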

System Modules

Contextual Query Processor

Enriches user query with session state, history, and user profile

Model or implementation: Not explicitly specified (likely rule-based + embedding)
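
Since the summary does not specify the implementation, the following is only a rule-based guess at what enrichment could look like: it folds session entities, the last few history turns, and a profile field into one retrieval string. The function name `enrich_query`, the field labels, and the `role` profile key are hypothetical.

```python
def enrich_query(query, session_entities, history, user_profile, max_turns=2):
    """Combine the raw query with session state, history, and profile (sketch)."""
    parts = [f"query: {query.strip()}"]

    if session_entities:
        # Sort for a deterministic string regardless of dict insertion order.
        entities = ", ".join(f"{k}={v}" for k, v in sorted(session_entities.items()))
        parts.append(f"entities: {entities}")

    # Guard max_turns <= 0, because history[-0:] would return the whole list.
    recent = history[-max_turns:] if max_turns > 0 else []
    if recent:
        parts.append("recent: " + " || ".join(recent))

    role = user_profile.get("role")
    if role:
        parts.append(f"role: {role}")

    return "\n".join(parts)


enriched = enrich_query(
    "  List changes ",
    {"device_type": "Intune"},
    ["turn a", "turn b", "turn c"],
    {"role": "analyst"},
)
```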

Knowledge Retrieval Engine (Retrieval & Selection)

Retrieves relevant plugins, skills, and domain documentation using RAG

Model or implementation: Not explicitly specified (RAG based)

Skill Ranking Engine (Retrieval & Selection)

Prioritizes candidate skills based on historical effectiveness and user behavior

Model or implementation: Predictive model trained on behavioral telemetry (or Markov model in hybrid variant)
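
For the hybrid variant, a first-order Markov model over skill invocation sequences is one plausible reading of "Markov model". The sketch below assumes that reading: it counts skill-to-skill transitions in past sessions and ranks candidates by a smoothed transition probability from the most recently used skill. The session data, skill names, and add-alpha smoothing are assumptions for illustration, not details from the paper.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Count how often each skill is followed by each other skill."""
    counts = defaultdict(Counter)
    for sequence in sessions:
        for prev, nxt in zip(sequence, sequence[1:]):
            counts[prev][nxt] += 1
    return counts


def rank_candidates(counts, last_skill, candidates, alpha=1.0):
    """Rank candidates by P(next | last_skill), normalized over the candidates.

    alpha is an add-alpha smoothing constant so unseen transitions keep a
    small non-zero probability.
    """
    row = counts.get(last_skill, Counter())
    total = sum(row[c] for c in candidates) + alpha * len(candidates)
    scored = [(c, (row[c] + alpha) / total) for c in candidates]
    # Highest probability first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical telemetry: each inner list is one session's skill sequence.
SESSIONS = [
    ["GetIncident", "ListDevices", "DeviceQuery"],
    ["GetIncident", "ListDevices"],
    ["GetIncident", "RunKQL"],
]

transitions = fit_transitions(SESSIONS)
ranking = rank_candidates(
    transitions, "GetIncident", ["RunKQL", "ListDevices", "DeviceQuery"]
)
```

Here `ListDevices` ranks first because it followed `GetIncident` in two of the three sessions; `DeviceQuery` never did, so it keeps only the smoothing mass.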

Prompt Generator

Synthesizes final prompt suggestions using templates and few-shot examples

Model or implementation: GPT-4o (in full pipeline) or GPT-4o-mini
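
To make "templates and few-shot examples" concrete, the sketch below assembles a meta-prompt string from a context description, a ranked skill list, and a few worked examples. The template wording, the function name `build_meta_prompt`, and the skill descriptions are assumptions; the Intune example reuses the one from this summary. No model call is made, since the string would be sent to whichever LLM the pipeline uses.

```python
def build_meta_prompt(context, ranked_skills, examples, n_suggestions=3):
    """Assemble a few-shot meta-prompt (illustrative template only)."""
    lines = [
        "You suggest follow-up prompts for a security analyst.",
        f"Write {n_suggestions} suggestions, one per line, each using one listed skill.",
        "Use only facts present in the context.",
        "",
        "Examples:",
    ]
    for ex in examples:
        lines += [
            f"Context: {ex['context']}",
            f"Skill: {ex['skill']}",
            f"Suggestion: {ex['suggestion']}",
            "",
        ]
    lines.append("Available skills, highest ranked first:")
    lines += [
        f"{i}. {name}: {desc}"
        for i, (name, desc) in enumerate(ranked_skills, start=1)
    ]
    lines += ["", f"Context: {context}", "Suggestions:"]
    return "\n".join(lines)


EXAMPLES = [
    {
        "context": "User is analyzing an Intune device entity.",
        "skill": "Intune Device Query",
        "suggestion": "List recent configuration changes for device [DeviceID]",
    }
]

# Hypothetical skill descriptions, already ordered by the ranking stage.
SKILLS = [
    ("Intune Device Query", "query configuration and state of a managed device"),
    ("Scan network", "scan hosts on a network segment"),
]

meta_prompt = build_meta_prompt(
    "User opened an incident involving one managed device.", SKILLS, EXAMPLES
)
```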

Novel Architectural Elements

Hierarchical two-stage retrieval: First selects relevant 'Plugins', then refines to specific 'Skills' within those plugins
Integration of a telemetry-based predictive ranking model directly into the prompt generation pipeline to filter RAG results
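
The two-stage idea can be sketched as follows: score plugins against the query first, then score only the skills inside the selected plugins. Bag-of-words cosine similarity stands in here for whatever embedding model the real system uses, and the catalog contents, descriptions, and cut-off values are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical catalog: plugin -> description and skill -> description.
CATALOG = {
    "Intune": {
        "description": "device management configuration and compliance",
        "skills": {
            "Intune Device Query": "list configuration changes for a managed device",
            "Compliance Report": "summarize device compliance status",
        },
    },
    "Network": {
        "description": "network scanning and traffic analysis",
        "skills": {"Scan network": "scan network hosts for open ports"},
    },
}


def retrieve(query, catalog, top_plugins=1, top_skills=2):
    q = embed(query)
    # Stage 1: keep only the best-matching plugins.
    plugins = sorted(
        catalog,
        key=lambda p: (-cosine(q, embed(catalog[p]["description"])), p),
    )[:top_plugins]
    # Stage 2: score skills, but only inside the selected plugins.
    scored = [
        (cosine(q, embed(desc)), plugin, skill)
        for plugin in plugins
        for skill, desc in catalog[plugin]["skills"].items()
    ]
    scored.sort(key=lambda item: (-item[0], item[2]))
    return [(plugin, skill) for _, plugin, skill in scored[:top_skills]]


hits = retrieve("recent configuration changes for device", CATALOG)
```

Restricting stage 2 to the chosen plugins means the skill-level search never considers `Scan network` for this query, which is the pruning effect the hierarchy is meant to provide.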

Modeling

Base Model: GPT-4o (Full Pipeline configuration)

Comparison to Prior Work

vs. Static Lists: Dynamic generation scales with system capabilities and adapts to context
vs. Conversational Recommender Systems (CRS)/Open-book: Focuses on recommending actions/skills (prompts) in a functional AI assistant rather than recommending items (movies/restaurants)
vs. General RAG [not cited in paper]: Explicitly models hierarchical 'skills' and 'plugins' rather than just retrieving unstructured text chunks

Limitations

Dependency on the quality of the underlying skill definitions and descriptions
Cost-latency trade-off: The full GPT-4o pipeline provides the best results but is more expensive and slower than hybrid approaches
Domain specificity: The system relies on a curated knowledge base of domain-specific skills (e.g., cybersecurity), limiting zero-shot transfer to new domains without setup

📊 Experiments & Results

Evaluation Setup

Real-world customer sessions from a commercial security AI assistant

Benchmarks:

Security Copilot Customer Sessions (Prompt Recommendation in Cybersecurity) [New]

Metrics:

Relevance
Clarity
Novelty
Grounding
Overall Usefulness

Statistical methodology: Not explicitly reported in the paper

Key Results

Automated evaluation across 12,432 suggestions shows consistent performance across model configurations.

Benchmark	Metric	Baseline	This Paper	Δ
Security Copilot Sessions (Automated)	Overall Usefulness	0.870	0.884	+0.014
Security Copilot Sessions (Automated)	Novelty	0.905	0.933	+0.028

Manual expert evaluation reveals a significant quality gap in 'Extremely Useful' suggestions between full and hybrid models.

Benchmark	Metric	Baseline	This Paper	Δ
Security Copilot Sessions (Manual)	Extremely Useful %	53.1	75.0	+21.9
Security Copilot Sessions (Manual)	Not Useful %	3.5	2.0	-1.5
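
As a quick consistency check, the Δ column can be recomputed from the Baseline and This Paper columns. The snippet only restates the numbers in the table; the variable and function names are arbitrary. Note that the automated rows are score differences on a 0 to 1 scale, while the manual rows are differences in percentage points.

```python
# (evaluation, metric, baseline, this_paper) copied from the Key Results table.
ROWS = [
    ("Automated", "Overall Usefulness", 0.870, 0.884),
    ("Automated", "Novelty", 0.905, 0.933),
    ("Manual", "Extremely Useful %", 53.1, 75.0),
    ("Manual", "Not Useful %", 3.5, 2.0),
]


def deltas(rows):
    """Return this_paper - baseline per row, rounded to 3 decimals."""
    return {
        (evaluation, metric): round(new - base, 3)
        for evaluation, metric, base, new in rows
    }


DELTAS = deltas(ROWS)
for (evaluation, metric), delta in DELTAS.items():
    print(f"{evaluation:<10} {metric:<20} {delta:+.3f}")
```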

Main Takeaways

Full GPT-4o pipeline excels at generating 'Extremely Useful' prompts compared to hybrid approaches, justifying the higher compute cost for premium experiences
Hybrid models (Markov + GPT-4o) offer a cost-effective alternative with very high general usefulness (98.9%) but fewer 'breakthrough' suggestions (53.1% extremely useful)
The hierarchical skill selection and ranking mechanism is robust across different underlying model architectures (GPT-4o vs. Mini vs. Markov)

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Large Language Models (LLMs)
Recommender Systems concepts (Collaborative Filtering/Ranking)

Key Terms

RAG: Retrieval-Augmented Generation, a technique in which a system first retrieves relevant documents or skills and then generates its answer from them

Plugins: Thematically related groups of granular skills (e.g., a 'Security' plugin containing 'Scan' and 'Report' skills)

Telemetry: Data collected from user interactions, such as click rates on suggestions and skill invocation frequencies

Meta-prompt: A higher-level prompt constructed by the system to instruct the LLM on how to generate the final user-facing prompt suggestions

Few-shot learning: Providing the model with a small number of example inputs and outputs within the prompt to guide its performance

NL2KQL: Natural Language to Kusto Query Language, a specific skill mentioned in the paper that translates natural-language text into Kusto database queries

Grounding: Ensuring the generated output is based strictly on the retrieved information and conversation history, avoiding hallucinations