Sabiá-4 Technical Report

📝 Paper Summary

Brazilian Portuguese LLMs, Legal domain adaptation, Agentic capabilities

Sabiá-4 improves Brazilian Portuguese performance through a four-stage pipeline that includes legal-domain continued pre-training, context extension to 128k tokens, and agent-focused alignment.

Core Problem

Generalist language models often lack the specific cultural, linguistic, and legislative knowledge required for high-stakes Brazilian legal tasks and complex agentic workflows in Portuguese.

Why it matters:

Generic models struggle with the nuances of Brazilian federal legislation (over 50,000 acts), leading to inaccuracies in drafting legal documents or judicial decisions
Previous generations of Portuguese models degraded in multi-turn dialogues and could not follow zero-shot instructions effectively
High-performance models are often cost-prohibitive; there is a need for specialized models that offer a better cost-performance trade-off for production retrieval-augmented generation (RAG) workflows

Concrete Example: In legal drafting, a generic model might fail to correctly identify the specific law corresponding to a legislative excerpt among 50,000+ norms, or struggle to maintain format constraints across a 3-turn instruction sequence (Multi-IF benchmark).

Key Novelty

Four-Stage Domain Specialization Pipeline

Uses 'Continued Pre-training' on a massive Portuguese and Brazilian legal corpus to specialize a generalist base model before fine-tuning
Implements a dedicated 'Long-context extension' phase to reach 128k tokens using naturally long documents, intended to avoid the 'lost-in-the-middle' degradation common in naive extensions
Combines supervised fine-tuning (SFT) on synthetic agentic data with preference alignment to strictly enforce formatting for tool use and legal writing

Architecture

The four-stage training pipeline developed for Sabiá-4.

Evaluation Highlights

Achieves >98% accuracy on the Needle in a Haystack (NIAH) benchmark, saturating the metric and prompting the use of harder tests like MRCR
Demonstrates a favorable cost-performance trade-off, sitting in the upper-left region (lower price, higher accuracy) of pricing-accuracy charts relative to state-of-the-art models
Shows qualitative improvements in drafting civil and criminal judgments over the previous Sabiá-3 generation

Breakthrough Assessment

7/10

Strong engineering report demonstrating how domain-specific continued pre-training effectively specializes LLMs. While architectural novelty is low, the pipeline's effectiveness for regional/legal adaptation is significant.

⚙️ Technical Details

Problem Definition

Setting: Generative language modeling optimized for Brazilian Portuguese, specifically legal drafting and agentic tool use

Inputs: Natural language prompts (chat, legal queries) or agentic trajectories (up to 128k tokens)

Outputs: Text completions, structured legal documents, or function calls

Pipeline Flow

Input Processing (Prompt/Context)
Sabiá-4 Inference (Transformer)
Output Generation (Text/Tool Call)

System Modules

Sabiá-4 / Sabiazinho-4

Core language model performing generation and reasoning

Model or implementation: Transformer-based LLM (Base architecture unspecified in text, likely adapted from open weights)

Modeling

Base Model: General-purpose base model (specific architecture not disclosed in text)

Training Method: 4-Stage Pipeline: Continued Pre-training, Long-Context Training, SFT, Preference Alignment

Trainable Parameters: Full model adaptation implied (not LoRA)

Training Data:

Large-scale Portuguese corpus combined with Brazilian legal corpus (>50,000 normative acts)
Naturally long documents for context extension
Synthetic data pipeline for function calling examples
Multi-turn conversation data for SFT

Compute: Google Cloud TPUs v5p and v6e using JAX (distributed training)
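
The stage ordering listed under Training Method can be sketched as a chain in which each stage starts from the checkpoint produced by the previous one. This is a minimal illustration only: the `Stage` class, the `run_pipeline` helper, and the `fake_train` stand-in are hypothetical names, and no hyperparameters are shown because the summary does not report any.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    name: str
    data: str  # short description of the data, paraphrased from the summary


# Stage order follows the summary; the data descriptions are paraphrases.
STAGES = [
    Stage("continued_pretraining", "Portuguese corpus + Brazilian legal corpus"),
    Stage("long_context_extension", "naturally long documents, up to 128k tokens"),
    Stage("sft", "multi-turn conversations + synthetic function-calling examples"),
    Stage("preference_alignment", "preferences on formatting, tool use, legal writing"),
]


def run_pipeline(base_checkpoint: str, train_stage: Callable[[str, Stage], str]) -> str:
    """Run every stage in order, feeding each stage the previous checkpoint."""
    checkpoint = base_checkpoint
    for stage in STAGES:
        checkpoint = train_stage(checkpoint, stage)
    return checkpoint


def fake_train(checkpoint: str, stage: Stage) -> str:
    """Hypothetical stand-in for a real training job; only records the lineage."""
    return f"{checkpoint}->{stage.name}"


final_checkpoint = run_pipeline("base", fake_train)
print(final_checkpoint)
```

Passing the trainer as a callable keeps the ordering logic separate from whatever framework actually runs each stage.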

Comparison to Prior Work

vs. Sabiá-3: Sabiá-4 adds specific long-context training (128k) and agentic synthetic data pipeline
vs. GPT-4o: Sabiá-4 targets specific Brazilian legal nuances via continued pre-training rather than just prompt engineering
vs. Generic Open Models (e.g., Llama-3): Sabiá-4 learns the Brazilian legal corpus during continued pre-training, not only during fine-tuning

Limitations

Benchmarks for long context (NIAH) are saturated (>98%), making it hard to distinguish fine-grained improvements without harder tests like MRCR
Specific model architecture and parameter counts are not disclosed in the text
No direct comparison of training compute costs vs. performance gains provided in the text

Reproducibility

Low reproducibility. No code URL is provided, and model weights are not explicitly linked. The benchmarks (Magis-Bench, Brazilian Federal Laws, MARCA) are described as to be published soon. Specific numeric results are in tables not included in the provided text.

📊 Experiments & Results

Evaluation Setup

Evaluation across 6 categories including conversational, legal, and agentic tasks using both automated metrics and pairwise comparison.
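
A win rate from pairwise comparison can be computed as in the sketch below. The label names (`win`, `tie`, `loss`) and the choice to count a tie as half a win are assumptions for illustration; the summary does not say how the paper treats ties.

```python
from collections import Counter

VALID_LABELS = {"win", "tie", "loss"}


def win_rate(judgements: list[str]) -> float:
    """Fraction of pairwise comparisons won, counting ties as half a win (assumption)."""
    counts = Counter(judgements)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgements provided")
    return (counts["win"] + 0.5 * counts["tie"]) / total


# Each label is the judged outcome for the evaluated model against the reference model.
example = ["win", "win", "tie", "loss"]
print(win_rate(example))
```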

Benchmarks:

BRACEval: Brazilian conversational evaluation (150 multi-turn samples)
OAB-Bench: Legal writing and reasoning (Bar Exam)
Magis-Bench: Judicial decision drafting (judge exams) [New]
Brazilian Federal Laws: Knowledge retrieval over 50,000+ acts [New]
MRCR: Multi-Round Co-reference Resolution (long context)
Ticket-Bench / Pix-Bench: Agentic tool use

Metrics:

Win rate (vs GPT-4o)
Pass^k (probability of success in k runs)
Success@1 (accuracy on first try)
Score 0-10 (Legal drafting)
Statistical methodology: Not explicitly reported in the paper
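
The Pass^k and Success@1 metrics listed above can be sketched as follows. This assumes the common convention (as in τ-bench) that Pass^k counts a task as solved only when all k independent runs succeed; the summary does not spell out the paper's exact definition. The function names are illustrative.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the probability that k independent runs of one task all succeed.

    Uses the combinatorial estimator C(c, k) / C(n, k), where n is the number
    of recorded trials and c is the number of successful ones.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie in [0, num_trials]")
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: list[list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks; results[i] holds task i's trials."""
    per_task = [pass_hat_k(len(trials), sum(trials), k) for trials in results]
    return sum(per_task) / len(per_task)


def success_at_1(results: list[list[bool]]) -> float:
    """Fraction of tasks solved on the first recorded attempt."""
    return sum(trials[0] for trials in results) / len(results)


# Two tasks with four trials each: one always solved, one solved half the time.
runs = [[True, True, True, True], [True, False, True, False]]
print(benchmark_pass_hat_k(runs, k=1), benchmark_pass_hat_k(runs, k=2), success_at_1(runs))
```

Because every one of the k runs must succeed, the score can only stay flat or fall as k grows, which is why the metric is used to probe reliability rather than best-case ability.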

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Needle in a Haystack (NIAH)	Accuracy	Not reported	>98%	Not reported
Specific numeric scores for the legal and agentic benchmarks are in Tables 2 and 3 of the paper, which were not provided in the input text. Only the long-context saturation result was mentioned in the text.

Experiment Figures

Cost-performance trade-off chart comparing Sabiá-4 against state-of-the-art models.

Main Takeaways

Domain Specialization Efficacy: Continued pre-training on Brazilian legal corpora enables specialized tasks (drafting judgments) that generic models struggle with, without the cost of training from scratch.
Cost-Performance Sweet Spot: The models are explicitly designed to occupy the 'upper-left' of the cost-accuracy chart, offering performance competitive with SOTA models at lower inference cost.
Agentic Readiness: The inclusion of synthetic function-calling data and formatting alignment allows the model to handle multi-step agentic workflows (Ticket-Bench, Pix-Bench) reliably.
Long Context Utility: Extending context to 128k allows for processing entire legal codes or long judicial histories, supported by robust retrieval capabilities (MRCR).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) training pipelines (Pre-training vs. SFT vs. Alignment)
Familiarity with Retrieval-Augmented Generation (RAG)
Knowledge of context window scaling techniques

Key Terms

Continued Pre-training: Training an already pre-trained model on a specific domain corpus (e.g., legal text) to adapt its internal knowledge before fine-tuning

RAG: Retrieval-Augmented Generation—systems that improve model answers by retrieving relevant documents from an external database

SFT: Supervised Fine-Tuning—training a model on labeled instruction-response pairs to teach it how to follow user commands

Needle in a Haystack (NIAH): A benchmark testing if a model can find a specific fact hidden within a very large amount of unrelated text
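
A minimal sketch of how one NIAH test case might be built and scored, assuming sentence-level insertion and case-insensitive substring matching. Both helper functions are hypothetical, not the paper's harness.

```python
def build_niah_prompt(filler: list[str], needle: str, depth: float, question: str) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    haystack = filler[:position] + [needle] + filler[position:]
    return " ".join(haystack) + "\n\nQuestion: " + question


def score_niah(answer: str, expected: str) -> bool:
    """Count the answer as correct if it contains the expected fact (case-insensitive)."""
    return expected.lower() in answer.lower()


filler_sentences = [f"Sentence {i}." for i in range(10)]
prompt = build_niah_prompt(filler_sentences, "The code is 7421.", 0.5, "What is the code?")
print(prompt)
```

Sweeping `depth` and the amount of filler yields the usual grid of context length against needle position.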

OAB: Ordem dos Advogados do Brasil, the Brazilian Bar Association; its bar exam is used here as a benchmark for legal reasoning and drafting

TPU: Tensor Processing Unit—specialized hardware by Google designed to accelerate machine learning workloads

JAX: A high-performance numerical computing library used for machine learning research, particularly on TPUs

Function Calling: The ability of an LLM to generate structured outputs (like JSON) that can be executed by external code or APIs
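
A minimal sketch of the consuming side of function calling: the model emits a JSON object, and external code validates it and dispatches to a registered tool. The `name`/`arguments` schema and the `lookup_balance` tool are assumptions for illustration, not the format used by Sabiá-4.

```python
import json
from typing import Any, Callable


def lookup_balance(account_id: str) -> dict[str, Any]:
    """Hypothetical tool; returns canned data so the sketch is runnable."""
    return {"account_id": account_id, "balance_brl": 100.0}


# Registry mapping tool names to Python callables.
TOOLS: dict[str, Callable[..., Any]] = {"lookup_balance": lookup_balance}


def dispatch_tool_call(model_output: str) -> Any:
    """Parse a JSON tool call emitted by the model and run the matching tool."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    arguments = call.get("arguments", {})
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(arguments, dict):
        raise ValueError("arguments must be a JSON object")
    return TOOLS[name](**arguments)


output = '{"name": "lookup_balance", "arguments": {"account_id": "abc"}}'
result = dispatch_tool_call(output)
print(result)
```

Rejecting malformed calls before execution is what makes strict output formatting matter: a call that does not parse cannot be run at all.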