Modeling Trial-and-Error Navigation With a Sequential Decision Model of Information Scent

📝 Paper Summary

Agentic AI Memory recall

This paper models human navigation as a resource-rational sequential decision process where agents balance information scent against memory decay and capacity limits to replicate trial-and-error behaviors.

Core Problem

Existing models of information scent assume that users myopically choose the best visible link, so they cannot explain why users scan only part of a page, commit prematurely to wrong links, or backtrack when cues are ambiguous.

Why it matters:

Predicting user struggles in complex information architectures (e.g., websites, menus) is crucial for automated interface optimization
Prior models like SNIF-ACT or CoLiDeS cannot simulate error recovery (backtracking) or non-greedy exploration because they lack memory dynamics and long-term planning
Understanding navigation requires modeling the cognitive costs (forgetting, time) that force users to accept 'good enough' options rather than searching exhaustively

Concrete Example: A user looking for 'Return Policy' sees a link for 'Customer Service' and selects it without reading the rest of the page (premature commitment due to time cost). On realizing it is wrong, the user must recall the earlier options, which may have decayed from memory, and is forced to backtrack. Myopic models fail to predict this sequence.

Key Novelty

Sequential Decision Model of Information Scent (POMDP formulation)

Frames navigation not as a series of isolated greedy choices, but as a Partially Observable Markov Decision Process (POMDP) where the agent plans ahead to minimize wasted time
Integrates 'Resource Rationality' by explicitly modeling memory as a constrained resource: cues fade over time (decay) and only a limited number can be retained (capacity)
Distinguishes between 'Local Panel' (current screen) and 'Global Memory' (retained cues), allowing the agent to decide, based on accumulated belief, when to stop scanning and whether to select an item or return

Evaluation Highlights

Qualitatively reproduces three key human behaviors: partial scanning of pages, backtracking after errors, and revisiting previously seen items
Replicates known empirical effects of information architecture: task difficulty adaptation, hierarchy depth effects, and positional layout biases
Demonstrates robustness of the learned policy under parameter perturbations of ±5%, ±10%, and ±25% for memory and noise values

Breakthrough Assessment

7/10

Significant theoretical advance that unifies Information Foraging Theory with POMDPs and memory constraints. It moves beyond static, myopic models to explain dynamic error recovery, though it is a simulation study evaluated on reproducing behavioral effects rather than on benchmark performance.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) for hierarchical information retrieval

Inputs: Hierarchical menu structure with latent semantic labels (information scent)

Outputs: Sequence of navigation actions: Visit (inspect item), Select (commit to item), or Return (go back)

Pipeline Flow

Environment (Menu) → Observation (Local Panel + Global Memory)
Observation → Policy Network → Action (Visit/Select/Return)
Action → State Update (Memory Decay, Position Change) → Reward
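
The flow above can be sketched as a small interaction loop. This is a minimal illustration under stated assumptions, not the paper's environment: `ToyMenu`, `threshold_policy`, the single-panel layout, the step cost of 0.05, and the 0.8 selection threshold are all hypothetical, and memory decay is left out to keep the loop readable.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    VISIT = "visit"    # inspect one item on the current panel
    SELECT = "select"  # commit to an item
    RETURN = "return"  # go back without committing


@dataclass
class ToyMenu:
    """Hypothetical single-panel environment (not the paper's simulator)."""
    scents: list             # scent per item, in [0, 1]
    target: int              # index of the correct item, hidden from the agent
    step_cost: float = 0.05  # assumed cost charged on every action
    seen: dict = field(default_factory=dict)

    def observe(self):
        # The agent only knows the scents of items it has already visited.
        return dict(self.seen)

    def step(self, action, item=None):
        """Apply an action and return (reward, done)."""
        if action is Action.VISIT:
            self.seen[item] = self.scents[item]
            return -self.step_cost, False
        if action is Action.SELECT:
            success = item == self.target
            return (1.0 if success else 0.0) - self.step_cost, True
        return -self.step_cost, True  # RETURN ends the toy episode


def threshold_policy(obs, n_items, threshold=0.8):
    """Placeholder for the learned policy: select once a seen scent is high enough."""
    if obs:
        best = max(obs, key=obs.get)
        if obs[best] >= threshold:
            return Action.SELECT, best
    unseen = [i for i in range(n_items) if i not in obs]
    if unseen:
        return Action.VISIT, unseen[0]
    return Action.RETURN, None


def run_episode(env, policy):
    """Observation -> policy -> action -> state update -> reward, until done."""
    total, done, trace = 0.0, False, []
    while not done:
        action, item = policy(env.observe(), len(env.scents))
        reward, done = env.step(action, item)
        total += reward
        trace.append((action, item))
    return total, trace


env = ToyMenu(scents=[0.2, 0.9, 0.4], target=1)
total, trace = run_episode(env, threshold_policy)
```

In this toy run the agent visits the first two items and selects the second, so the third item is never inspected. That is the partial-scanning pattern the model is meant to capture, here produced by a hand-written rule rather than a trained policy.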

System Modules

Scent Encoder

Compute semantic similarity between link labels and goal

Model or implementation: paraphrase-multilingual-MiniLM-L12-v2
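
As a rough illustration of the scent computation, the sketch below scores link labels against a goal with cosine similarity. Cosine similarity is an assumption here (the summary only says "semantic similarity"), and the 3-dimensional vectors are toy placeholders for the 384-dimensional sentence embeddings the model would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def scent_scores(goal_vec, label_vecs):
    """Map each link label to its similarity with the goal embedding."""
    return {label: cosine_similarity(goal_vec, vec)
            for label, vec in label_vecs.items()}


# Toy vectors standing in for real sentence embeddings (hypothetical values).
goal = [1.0, 0.0, 1.0]
labels = {
    "Customer Service": [1.0, 0.0, 0.0],
    "Careers": [0.0, 1.0, 0.0],
}
scores = scent_scores(goal, labels)
```

With these placeholder vectors, 'Customer Service' scores about 0.71 and 'Careers' scores 0, so the first label carries the stronger scent for this goal.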

Memory Gate

Filter observed cues based on activation threshold and capacity

Model or implementation: Activation-based decay function (Eq. 5)
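
A minimal sketch of such a gate follows. The exponential decay form, the function names, and all numeric values are assumptions for illustration; the paper's Eq. 5 is not reproduced in this summary and may differ.

```python
import math


def activation(strength, age, decay_rate):
    """Assumed exponential decay; the paper's Eq. 5 may use a different form."""
    return strength * math.exp(-decay_rate * age)


def memory_gate(cues, now, decay_rate=0.1, threshold=0.3, capacity=3):
    """Keep cues whose decayed activation clears the threshold, up to `capacity`.

    `cues` maps a label to (scent, time_last_seen).
    """
    active = {}
    for label, (scent, seen_at) in cues.items():
        a = activation(scent, now - seen_at, decay_rate)
        if a >= threshold:
            active[label] = a
    # Capacity limit: retain only the most strongly activated cues.
    top = sorted(active.items(), key=lambda kv: kv[1], reverse=True)[:capacity]
    return dict(top)


# Hypothetical cues: (scent, time step at which the cue was last seen).
cues = {
    "Customer Service": (0.9, 0),
    "Shipping": (0.6, 8),
    "Careers": (0.2, 9),
    "Orders": (0.7, 9),
    "Help": (0.8, 5),
}
kept = memory_gate(cues, now=10, capacity=2)
```

In this example the oldest cue, 'Customer Service', has the highest raw scent but is displaced by the fresher 'Orders' and 'Shipping' cues, which is the kind of forgetting that can force a backtrack.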

Policy Network

Select navigation action based on bounded observation

Model or implementation: Reinforcement Learning Agent (Policy Network)

Novel Architectural Elements

Integration of a global-memory panel that retains only the Top-K most diagnostic cues across the session, decoupled from the current screen view
Resource-rational reward structure penalizing every step, forcing the agent to learn efficiency trade-offs rather than exhaustive search

Modeling

Base Model: paraphrase-multilingual-MiniLM-L12-v2 (for embeddings)

Training Method: Reinforcement Learning (a policy-gradient method is implied; the specific algorithm is not stated)

Objective Functions:

Purpose: Maximize expected utility under cognitive costs.

Formally: $\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t} \gamma^{t} r_t\right]$, where $r_t$ combines a terminal success reward with negative per-step costs.
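
To make the trade-off in this objective concrete, the sketch below computes the discounted return of a single episode. The step cost of 0.05, the success reward of 1.0, and the discount of 0.95 are hypothetical values, not the paper's settings.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def episode_rewards(n_steps, success, step_cost=0.05, success_reward=1.0):
    """A cost on every step, plus a terminal reward on the last step if successful.

    Assumes n_steps >= 1.
    """
    rewards = [-step_cost] * n_steps
    if success:
        rewards[-1] += success_reward
    return rewards


# Same outcome (success), different path lengths.
short = discounted_return(episode_rewards(3, success=True), gamma=0.95)
long = discounted_return(episode_rewards(10, success=True), gamma=0.95)
failed = discounted_return(episode_rewards(2, success=False), gamma=0.95)
```

Under these assumed values the 3-step success is worth more than the 10-step success, and a failed episode has negative return, so a return-maximizing agent is pushed toward short paths and accepts some risk of premature selection.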

Training Data:

Simulated hierarchical menu environments
Target locations hidden from agent

Key Hyperparameters:

N_max: 12 (max items per local panel)
embedding_dim: 384
perturbation_levels: ±5%, ±10%, ±25%
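
One way to set up the perturbation levels is sketched below. Varying one parameter at a time is an assumption about the procedure, and the baseline names and values (`decay_rate`, `obs_noise`) are hypothetical, since the paper's actual settings are not given in this summary.

```python
def perturbation_grid(baseline, levels=(0.05, 0.10, 0.25)):
    """Yield (param, signed_level, config) with one parameter scaled at a time."""
    for name, value in baseline.items():
        for level in levels:
            for sign in (-1, 1):
                config = dict(baseline)  # copy so other parameters stay at baseline
                config[name] = value * (1 + sign * level)
                yield name, sign * level, config


# Hypothetical baseline values for a memory parameter and a noise parameter.
baseline = {"decay_rate": 0.1, "obs_noise": 0.2}
grid = list(perturbation_grid(baseline))
```

Two parameters at three levels in both directions give 12 configurations; each would then be evaluated with the trained policy to check whether the behavioral effects persist.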

Compute: Not reported in the paper

Comparison to Prior Work

vs. CoLiDeS: Captures partial inspection and backtracking via sequential POMDP vs. deterministic myopic choice
vs. SNIF-ACT: Incorporates global probabilistic planning and explicit memory decay vs. local satisficing heuristics
vs. DeepNav [not cited in paper]: Uses cognitively plausible memory constraints (decay/capacity) rather than generic LSTM/Transformer memory for navigation

Limitations

No direct comparison to recent deep learning based navigation agents (e.g., purely neural approaches)
Relies on sentence-transformer embedding similarity as a proxy for human semantic judgment
The specific reinforcement learning algorithm (e.g., PPO, DQN) is not explicitly detailed in the text
Evaluation is based on reproducing behavioral effects rather than raw performance metrics on a standard benchmark dataset

Reproducibility

No replication artifacts mentioned in the paper. Code, training scripts, and specific parameter values for the learned policy are not provided. The embedding model is open source.

📊 Experiments & Results

Evaluation Setup

Simulated navigation in hierarchical information structures (menus)

Benchmarks:

Simulated Hierarchies (Target search in deep vs. broad menus) [New]

Metrics:

Behavioral patterns (Backtracking, Revisits, Partial Scanning)
Effect reproduction (Difficulty, Hierarchy depth, Position)
Statistical methodology: Sensitivity analysis via parameter perturbation (±5% to ±25%)

Main Takeaways

The POMDP model successfully reproduces partial scanning: agents stop inspecting items when the cost of time outweighs the expected gain of finding a better scent.
Backtracking emerges naturally as an optimal strategy under uncertainty: when a selected path yields low scent (revealed after inspection), the agent returns to explore previous high-scent options stored in global memory.
The model replicates the 'position effect' (users favor top/left items) and 'hierarchy effect' (depth increases difficulty non-linearly) observed in human data.
Resource rationality explains errors: premature selection is not a bug but a feature of optimizing for time under noisy perception.
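
The stopping logic behind partial scanning can be illustrated with a one-step lookahead rule. This is a simplified heuristic, not the paper's learned POMDP policy: it assumes unseen scents are drawn from Uniform(0, 1), for which the expected improvement over the current best scent b is (1 - b)^2 / 2, and the inspection cost is a hypothetical value.

```python
def expected_gain(best_so_far):
    """E[max(0, X - best_so_far)] for X ~ Uniform(0, 1)."""
    b = min(max(best_so_far, 0.0), 1.0)
    return (1.0 - b) ** 2 / 2.0


def scan_until_not_worth_it(scents, inspect_cost):
    """Inspect items in order; stop when one more look is not expected to pay off."""
    best, inspected = 0.0, 0
    for s in scents:
        # Always inspect the first item; afterwards compare gain against cost.
        if inspected > 0 and expected_gain(best) < inspect_cost:
            break
        best = max(best, s)
        inspected += 1
    return inspected, best


scents = [0.3, 0.85, 0.6, 0.95]
costly = scan_until_not_worth_it(scents, inspect_cost=0.05)
cheap = scan_until_not_worth_it(scents, inspect_cost=0.001)
```

With the higher cost, scanning stops after two items and settles for the 0.85 item, missing the 0.95 item further down. With the lower cost, all four items are inspected. The premature choice follows directly from weighing time against expected gain.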

📚 Prerequisite Knowledge

Prerequisites

Information Foraging Theory (Information Scent)
Reinforcement Learning (POMDPs)
Cognitive Psychology (Working Memory, Decay)

Key Terms

Information Scent: The user's imperfect perception of the value or relevance of a link based on proximal cues like labels or icons

POMDP: Partially Observable Markov Decision Process, a framework for decision-making in which the agent cannot see the full state of the world (e.g., hidden target location) and must act based on probabilistic beliefs

Resource Rationality: The theory that humans optimize their behavior not for perfect accuracy, but to maximize utility given their limited cognitive resources (time, memory, attention)

Myopic: A decision-making strategy that considers only the immediate payoff of the next step without planning for future consequences

Diagnosticity: How useful a specific cue is for identifying the correct path to the target

Memory Decay: The process by which memory traces fade over time unless reinforced by repeated attention (visits/clicks)

IFT: Information Foraging Theory, a framework that explains how users navigate information environments by analogy with animals foraging for food