Evaluation Setup
Zero-shot or few-shot prompting on domain-specific datasets in Finance, Law, and STEM.
Benchmarks:
- CAIL2018: legal judgment prediction (Chinese)
- FinNA: financial news analysis (Chinese)
- MATH: mathematics problems (English)
- GaoKao: college entrance exam questions (Chinese)
- MMLU: multi-task language understanding (English)
Metrics:
- Accuracy (Exact Match or equivalent)
- F1 score
- Rouge-L
- Statistical methodology: Not explicitly reported in the paper
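The metrics above can be made concrete with a minimal sketch. This is illustrative only, not the paper's actual scoring code: exact-match accuracy over prediction/reference pairs, and a basic ROUGE-L F1 computed from the longest common subsequence of whitespace tokens.

```python
# Illustrative metric sketch (assumed implementation, not from the paper):
# exact-match accuracy and a minimal ROUGE-L F1 via longest common subsequence.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference string."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

def _lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction, reference):
    """ROUGE-L F-score (beta = 1) over whitespace-split tokens."""
    p, r = prediction.split(), reference.split()
    lcs = _lcs_length(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Note that production evaluations typically use tokenizers suited to the language (e.g. character-level for Chinese) rather than whitespace splitting.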
Key Results
| Benchmark         | Metric         | Baseline | This Paper | Δ      |
|-------------------|----------------|----------|------------|--------|
| Legal (CAIL2018)  | Accuracy/Score | 34.00    | 49.30      | +15.30 |
| Legal (CAIL2018)  | Accuracy/Score | 30.60    | 38.10      | +7.50  |
| Legal             | Accuracy       | 45.00    | 50.10      | +5.10  |
| Legal             | Accuracy       | 56.10    | 61.30      | +5.20  |

Consistent improvements observed across different model scales on the Legal dataset.
Main Takeaways
- Re-TASK consistently outperforms standard CoT and other prompting baselines across diverse domains (Law, Finance, STEM).
- The framework effectively scales, providing performance benefits to both smaller (8B) and larger (110B) models.
- Improvements are particularly notable in domain-specific tasks where specialized knowledge and specific procedural skills are required, validating the hypothesis that CoT fails due to capability gaps.