Microsoft AI Foundry Case Study

RAG Agent – Improving Groundedness in Azure AI Search

How a Microsoft Azure AI Foundry RAG deployment began hallucinating despite a connected knowledge base — and how retrieval diagnostics, semantic ranking, and chunk strategy optimization eliminated fragmented context retrieval and stabilized grounded responses.

200 Internal Policy Documents

Azure AI Search

Semantic Ranking

Foundry Evaluation

The Problem

A Retrieval-Augmented Generation (RAG) agent built in Azure AI Foundry was connected to a 200-document internal policy corpus through Azure AI Search.

Although the deployment successfully retrieved relevant policy references, the model frequently hallucinated numerical figures, procedural exceptions, and fabricated policy interpretations that did not exist anywhere inside the source documents.

At first glance, the issue appeared difficult to diagnose because Azure AI Foundry Groundedness scores consistently remained between 3 and 4 out of 5 — seemingly acceptable during evaluation.

However, investigation revealed that the model was synthesizing incomplete information across fragmented retrieval chunks rather than grounding answers in a single authoritative passage.

Root Cause Analysis

Chunk Fragmentation

A manually configured 256-token chunk strategy split policy clauses mid-sentence, causing incomplete semantic context retrieval.

BM25-Only Retrieval

Semantic Ranking had been disabled, forcing Azure AI Search to rely entirely on keyword matching rather than contextual relevance scoring.

Top-K Retrieval Loss

The most contextually complete chunk frequently ranked outside the retrieval window passed to the language model.

Misleading Groundedness Scores

Groundedness metrics only validated consistency with retrieved content — not whether retrieved content itself was semantically complete.

Retrieval Pipeline — Before vs After Optimization

Investigation Process

Step 1 — Inspecting Retrieval Chunks

Engineers reviewed Azure AI Search Explorer output and discovered that complete policy clauses were fragmented across multiple retrieval chunks.

Step 2 — Reviewing Chunk Configuration

The team identified that chunk size had been manually reduced from 1024 tokens to 256 tokens to lower retrieval latency.

Step 3 — Enabling Semantic Ranking

Azure AI Search Semantic Ranking was activated to improve contextual retrieval relevance instead of relying solely on BM25 keyword scoring.

Step 4 — Re-Running Evaluation

The updated deployment was re-evaluated using Groundedness and Retrieval metrics across the same 40-query benchmark dataset.

Chunk Strategy Optimization

To preserve semantic continuity, the engineering team replaced the fixed 256-token strategy with a larger overlapping chunk configuration.

The updated configuration significantly improved retrieval quality by ensuring that policy clauses remained contextually intact across chunk boundaries.

Configuration	Before	After
Chunk Size	256 Tokens	1024 Tokens
Chunk Overlap	None	128 Tokens
Retrieval Method	BM25	Semantic Ranking

Evaluation & Validation

Azure AI Foundry evaluation workflows were used to compare retrieval quality before and after retrieval optimization.

The team monitored both automated metrics and manual response reviews to ensure hallucinations had been fully eliminated.

Groundedness evaluation metrics
Retrieval quality scoring
Manual low-score inspection workflows
Strictness threshold optimization
Citation enforcement validation

Results After Optimization

Metric	Before Optimization	After Optimization
Groundedness Score	3.4 / 5	4.7 / 5
Hallucinated Responses	Present	Eliminated
Retrieval Quality	Fragmented	Contextually Complete
Semantic Ranking	Disabled	Enabled

Key Learnings

Groundedness Alone Is Not Enough

Automated evaluation metrics can mask retrieval fragmentation issues if retrieved chunks are semantically incomplete.

Chunk Design Matters

Aggressive chunk reduction strategies can significantly damage retrieval reliability in enterprise RAG systems.

Manual Review Remains Essential

Low-scoring evaluation responses still require human inspection to reliably identify partial-context hallucinations.

Technologies Used

Azure AI Foundry

Azure AI Search

Semantic Ranking

Search Explorer

Foundry Model Evaluation

Related Azure AI Engineering Case Studies

Fine Tuning

Preventing Overfitting During Azure AI Foundry Fine-Tuning

Diagnosing catastrophic forgetting and restoring model stability using validation datasets, loss monitoring, and safer hyperparameter tuning strategies.

Content Safety

Reducing False Positives in Azure AI Content Safety

How domain-aware moderation policies and allow lists improved healthcare AI assistant accuracy without weakening safety protections.

Final Outcome

By redesigning retrieval architecture around larger overlapping chunks, Semantic Ranking, strict retrieval filtering, and structured evaluation workflows, the engineering team successfully eliminated hallucinated policy synthesis while significantly improving grounded response reliability in Azure AI Foundry.

Book a Free Consultation

Schedule a no-obligation consultation to discuss your unique needs and how Luminous can elevate your business technology.

View Solution

Data & AI

Cloud Engineering

Application Development

Web Development

Integration

Mobile / Game Development