Microsoft AI Foundry Case Study

RAG Agent – Improving Groundedness in Azure AI Search

How a Microsoft Azure AI Foundry RAG deployment began hallucinating despite a connected knowledge base — and how retrieval diagnostics, semantic ranking, and chunk strategy optimization eliminated fragmented context retrieval and stabilized grounded responses.

200 Internal Policy Documents · Azure AI Search · Semantic Ranking · Foundry Evaluation · Azure AI Foundry RAG Evaluation

The Problem

A Retrieval-Augmented Generation (RAG) agent built in Azure AI Foundry was connected to a 200-document internal policy corpus through Azure AI Search.

Although the deployment retrieved relevant policy references, the model frequently hallucinated numerical figures and procedural exceptions, and fabricated policy interpretations that did not exist anywhere in the source documents.

At first glance, the issue appeared difficult to diagnose because Azure AI Foundry Groundedness scores consistently remained between 3 and 4 out of 5 — seemingly acceptable during evaluation.

However, investigation revealed that the model was synthesizing incomplete information across fragmented retrieval chunks rather than grounding answers in a single authoritative passage.

Root Cause Analysis

Chunk Fragmentation

A manually configured 256-token chunk strategy split policy clauses mid-sentence, so retrieved chunks carried incomplete semantic context.

BM25-Only Retrieval

Semantic Ranking had been disabled, forcing Azure AI Search to rely entirely on keyword matching rather than contextual relevance scoring.

Top-K Retrieval Loss

The most contextually complete chunk frequently ranked outside the retrieval window passed to the language model.

Misleading Groundedness Scores

Groundedness metrics only validated consistency with retrieved content — not whether retrieved content itself was semantically complete.

Retrieval Pipeline — Before vs After Optimization

[Figure: RAG retrieval pipeline]

Investigation Process

Step 1 — Inspecting Retrieval Chunks

Engineers reviewed Azure AI Search Explorer output and discovered that complete policy clauses were fragmented across multiple retrieval chunks.
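
The same inspection can be scripted rather than done through the Search Explorer UI. A minimal sketch using the azure-search-documents Python SDK, assuming a hypothetical index named policy-index with a content field (substitute your own service, index, and field names):

```python
# pip install azure-search-documents
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Hypothetical endpoint, index, and key -- substitute your own.
client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="policy-index",
    credential=AzureKeyCredential("<your-query-key>"),
)

# Pull the top BM25 results for a failing query and eyeball chunk boundaries.
results = client.search(search_text="remote work expense reimbursement limit", top=5)
for doc in results:
    content = doc["content"]
    # Chunks that start or end mid-sentence are a fragmentation red flag.
    print(f"score={doc['@search.score']:.2f} | ...{content[:120]}...")
```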

Step 2 — Reviewing Chunk Configuration

The team identified that chunk size had been manually reduced from 1024 tokens to 256 tokens to lower retrieval latency.

Step 3 — Enabling Semantic Ranking

Azure AI Search Semantic Ranking was activated to improve contextual retrieval relevance instead of relying solely on BM25 keyword scoring.
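
On the query side the change is a few parameters. A minimal sketch, reusing the client from the Step 1 sketch and assuming the index defines a semantic configuration named policy-semantic-config (a hypothetical name):

```python
# Re-run the failing query through the semantic ranker instead of raw BM25 ordering.
# `client` is the SearchClient from the Step 1 sketch; the semantic configuration
# name below is an assumption -- use whatever your index actually defines.
results = client.search(
    search_text="remote work expense reimbursement limit",
    query_type="semantic",
    semantic_configuration_name="policy-semantic-config",
    top=5,
)
for doc in results:
    # The reranker score reflects contextual relevance, not keyword overlap.
    print(f"reranker={doc['@search.reranker_score']:.2f} | ...{doc['content'][:120]}...")
```

BM25 still handles initial recall; the semantic ranker re-orders the top candidates, which is what pulls the contextually complete chunk back inside the top-k window passed to the model.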

Step 4 — Re-Running Evaluation

The updated deployment was re-evaluated using Groundedness and Retrieval metrics across the same 40-query benchmark dataset.
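
A minimal sketch of the re-evaluation using the azure-ai-evaluation package, assuming the 40-query benchmark is stored as benchmark.jsonl with query, context, and response fields, and that an Azure OpenAI deployment serves as the judge model (all endpoint and deployment names below are placeholders):

```python
# pip install azure-ai-evaluation
from azure.ai.evaluation import evaluate, GroundednessEvaluator, RetrievalEvaluator

# Hypothetical judge-model config -- substitute your own deployment details.
model_config = {
    "azure_endpoint": "https://<your-aoai-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o",
}

# benchmark.jsonl holds the 40-query set, one record per line with
# "query", "context" (retrieved chunks), and "response" fields.
result = evaluate(
    data="benchmark.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),
        "retrieval": RetrievalEvaluator(model_config),
    },
)
print(result["metrics"])  # aggregate scores across the benchmark
```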

Chunk Strategy Optimization

To preserve semantic continuity, the engineering team replaced the fixed 256-token strategy with a larger overlapping chunk configuration.

The updated configuration significantly improved retrieval quality by ensuring that policy clauses remained contextually intact across chunk boundaries.

Configuration     Before      After
Chunk Size        256 tokens  1024 tokens
Chunk Overlap     None        128 tokens
Retrieval Method  BM25        Semantic Ranking
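
One straightforward way to implement this configuration is fixed-size token windows with a sliding overlap. A minimal sketch, assuming tiktoken's cl100k_base tokenizer (the tokenizer choice is an assumption, not something the team documented):

```python
# pip install tiktoken
import tiktoken

def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 128) -> list[str]:
    """Split text into token-based chunks whose boundaries overlap by `overlap` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap  # advance 896 tokens per window
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break  # final window already covers the tail; avoid tiny duplicates
    return chunks
```

The 128-token overlap means a clause that straddles one chunk boundary still appears intact in the neighboring chunk, which is what keeps policy clauses contextually whole.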

Evaluation & Validation

Azure AI Foundry evaluation workflows were used to compare retrieval quality before and after the optimization.

The team monitored both automated metrics and manual response reviews to confirm that hallucinations had been eliminated.

  • Groundedness evaluation metrics
  • Retrieval quality scoring
  • Manual low-score inspection workflows (see the sketch below)
  • Strictness threshold optimization
  • Citation enforcement validation
[Image: Azure AI Foundry Evaluation]
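
Continuing the evaluate() sketch from Step 4, the manual inspection queue can be built by filtering per-row scores. The outputs.groundedness.groundedness column name follows the evaluator naming used above, and the threshold is a team judgment call, not a fixed rule:

```python
# Route low-scoring benchmark rows to human review.
# `result` comes from the evaluate() call in the Step 4 sketch.
REVIEW_THRESHOLD = 4  # groundedness runs 1-5; anything below 4 gets a manual read

for row in result["rows"]:
    score = row["outputs.groundedness.groundedness"]
    if score < REVIEW_THRESHOLD:
        print(f"REVIEW  groundedness={score}  query={row['inputs.query'][:60]!r}")
```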

Results After Optimization

Metric                  Before Optimization  After Optimization
Groundedness Score      3.4 / 5              4.7 / 5
Hallucinated Responses  Present              Eliminated
Retrieval Quality       Fragmented           Contextually complete
Semantic Ranking        Disabled             Enabled

Key Learnings

Groundedness Alone Is Not Enough

Automated evaluation metrics can mask retrieval fragmentation issues if retrieved chunks are semantically incomplete.

Chunk Design Matters

Aggressive chunk reduction strategies can significantly damage retrieval reliability in enterprise RAG systems.

Manual Review Remains Essential

Low-scoring evaluation responses still require human inspection to reliably identify partial-context hallucinations.

Technologies Used

  • Azure AI Foundry
  • Azure AI Search
  • Semantic Ranking
  • Search Explorer
  • Foundry Model Evaluation

Final Outcome

By redesigning retrieval architecture around larger overlapping chunks, Semantic Ranking, strict retrieval filtering, and structured evaluation workflows, the engineering team successfully eliminated hallucinated policy synthesis while significantly improving grounded response reliability in Azure AI Foundry.
