DAG Replication Timeout & Corruption - Fix Guide 2025

Complete troubleshooting guide for Exchange Server Database Availability Group (DAG) replication timeout and corruption issues. Learn how to diagnose network problems, reseed database copies, and restore high availability in your Exchange environment.

Medha Cloud Exchange Server Team
Exchange Database Recovery Team · 15 min read

Database Availability Group (DAG) replication timeout and corruption threaten Exchange high availability by preventing passive database copies from staying synchronized with the active copy. When replication fails, failover protection is lost, and data loss becomes a real risk. This guide shows you how to diagnose replication issues and restore full DAG functionality.

Our Exchange High Availability Services team has restored hundreds of DAG environments with zero data loss. This guide provides the same diagnostic and recovery process we use.

Understanding DAG Replication

DAG provides automatic database-level failover for Exchange mailbox databases. The active copy serves client requests while passive copies on other DAG members maintain synchronized copies through continuous log shipping and replay.

DAG Replication Flow

Active Copy (writes transaction logs) → Logs → Passive Copy (copies and replays logs)

Healthy State: Copy Queue Length = 0, Replay Queue Length ≤ 10

Key Replication Metrics
# Check replication health
Get-MailboxDatabaseCopyStatus -Identity "DB01" | Format-List Name,Status,
  CopyQueueLength,ReplayQueueLength,ContentIndexState,LastInspectedLogTime

# Key metrics to monitor:
# CopyQueueLength: Logs waiting to be copied (should be 0-2)
# ReplayQueueLength: Logs waiting to be replayed (should be < 10)
# ContentIndexState: Should be "Healthy"
# Status: Should be "Mounted" (active) or "Healthy" (passive)

Replication Timeout

The passive copy cannot receive logs fast enough, so CopyQueueLength grows continuously. The usual cause is a network or disk I/O bottleneck.

Replication Corruption

Log divergence or a database inconsistency is detected, and the copy status shows "FailedAndSuspended". A reseed is required to fix it.
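
The timeout pattern described above (a copy queue that keeps growing) can be told apart from a backlog that is draining by comparing a few consecutive samples. The sketch below is a minimal illustration, assuming you have collected CopyQueueLength values yourself at a fixed interval; `Test-CopyQueueGrowing` is a hypothetical helper name, not an Exchange cmdlet.

```powershell
# Hypothetical helper: reports whether a series of CopyQueueLength samples
# (oldest first) never decreases and ends higher than it started.
function Test-CopyQueueGrowing {
    param(
        [Parameter(Mandatory)][int[]]$Samples
    )
    # A trend needs at least two samples
    if ($Samples.Count -lt 2) { return $false }
    for ($i = 1; $i -lt $Samples.Count; $i++) {
        # Any drop means the passive copy is catching up at least some of the time
        if ($Samples[$i] -lt $Samples[$i - 1]) { return $false }
    }
    return ($Samples[-1] -gt $Samples[0])
}

# Made-up samples taken one minute apart
Test-CopyQueueGrowing -Samples 12, 30, 55, 90   # growing backlog
Test-CopyQueueGrowing -Samples 40, 22, 9, 0     # backlog draining
```

A flat series returns false on purpose: a queue that stays at the same small value is not the runaway growth that signals a timeout.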

Symptoms & Business Impact

Replication Timeout Symptoms:

  • CopyQueueLength steadily increasing (10+, 50+, 100+ logs behind)
  • Get-MailboxDatabaseCopyStatus shows "Seeding" or "SeedingSource"
  • Event ID 4113: "The copy of database has fallen behind replication"
  • Cluster events about heartbeat failures between DAG members

Corruption Symptoms:

  • Status shows "FailedAndSuspended" or "Failed"
  • ContentIndexState shows "Failed" or "FailedAndSuspended"
  • Event ID 3154: "The Active Manager was unable to mount database"
  • Event ID 474/475: Database corruption detected
  • Resume-MailboxDatabaseCopy fails repeatedly

⚠️ Critical Business Impact: While replication is broken, you have NO automatic failover protection. If the active copy fails, you risk data loss equal to the copy queue length (each log = ~1MB of transactions). Treat DAG replication failures as high-priority incidents.
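
The exposure described above is simple arithmetic: logs not yet copied multiplied by the log size. A rough sketch, assuming the ~1 MB per log figure quoted in this guide; `Get-DataAtRiskMB` is a hypothetical helper name.

```powershell
# Hypothetical helper: approximate data at risk if the active copy were lost now.
# Assumes about 1 MB per transaction log, the figure quoted in this guide.
function Get-DataAtRiskMB {
    param(
        [Parameter(Mandatory)][int]$CopyQueueLength,
        [double]$LogSizeMB = 1.0
    )
    return [math]::Round($CopyQueueLength * $LogSizeMB, 1)
}

# Example: a passive copy that is 150 logs behind
Get-DataAtRiskMB -CopyQueueLength 150
```

Treat the result as an upper-bound estimate for triage, not a measurement: it only counts logs that have not reached the passive copy.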

Common Causes of DAG Replication Issues

1. Network Latency/Bandwidth (35% of cases)

Most Common Cause: Replication network saturated, high latency between DAG members (should be <1ms), or MAPI network used instead of dedicated replication network.

Identified by: Network traces show packet loss, latency >10ms, or bandwidth saturation

2. Disk I/O Bottleneck (25% of cases)

Storage Issue: Slow disk write performance on passive copy prevents log replay from keeping up. SAN congestion, RAID rebuild, or insufficient IOPS.

Identified by: High disk queue length, Event ID 1018 (database I/O errors)

3. Log File Corruption (20% of cases)

Data Integrity Issue: Transaction log file corrupted during copy or storage failure. Passive copy cannot replay corrupted log.

Identified by: Event ID 454 "Log file signature mismatch", Status "FailedAndSuspended"

4. Content Index Issues (15% of cases)

Search Catalog Problem: Content index database corrupted or out of sync. Often occurs after storage issues or Exchange service crashes.

Identified by: ContentIndexState "Failed", search not returning results

5. Cluster Communication Failure (5% of cases)

Infrastructure Issue: Windows Failover Cluster heartbeat failures, witness server unreachable, or cluster database corruption.

Identified by: Cluster events in System log, DAG members showing as "Down"
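
When triaging, it can help to keep the event IDs mentioned so far in one lookup table. The sketch below only restates the IDs and descriptions given in this guide; the variable and function names are illustrative, and the mapping should be checked against the actual event text in your own logs.

```powershell
# Event IDs and descriptions as listed in this guide; verify against your own logs.
$EventHints = @{
    4113 = 'Replication timeout: copy has fallen behind replication'
    3154 = 'Corruption: Active Manager was unable to mount database'
    474  = 'Corruption: database corruption detected'
    475  = 'Corruption: database corruption detected'
    1018 = 'Disk I/O bottleneck: database I/O errors'
    454  = 'Log file corruption: log file signature mismatch'
}

# Hypothetical helper: returns the hint for an event ID, or a fallback string.
function Get-EventHint {
    param([Parameter(Mandatory)][int]$EventId)
    if ($EventHints.ContainsKey($EventId)) { return $EventHints[$EventId] }
    return 'No hint in this guide'
}

Get-EventHint -EventId 454
```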

Quick Diagnosis: PowerShell Commands

📌 Version Compatibility: This guide applies to Exchange 2016, Exchange 2019, and Exchange Server Subscription Edition (SE). Commands may differ for other versions.

Run these commands in Exchange Management Shell (run as Administrator) to identify replication issues:

Step 1: Check All Database Copy Status
# Overview of all database copies
Get-MailboxDatabaseCopyStatus * | Sort-Object DatabaseName | Format-Table `
  DatabaseName, MailboxServer, Status, CopyQueueLength, ReplayQueueLength, ContentIndexState

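# Identify problematic copies
# Patterns are anchored so "DisconnectedAndHealthy" is not mistaken for "Healthy".
# Exchange 2019 reports ContentIndexState "NotApplicable", which is expected.
Get-MailboxDatabaseCopyStatus * | Where-Object {
    $_.Status -notmatch "^(Mounted|Healthy)$" -or
    $_.CopyQueueLength -gt 10 -or
    $_.ContentIndexState -notmatch "^(Healthy|NotApplicable)$"
}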

What to look for:

  • Status should be "Mounted" (active) or "Healthy" (passive)
  • CopyQueueLength should be 0-2 (higher = replication behind)
  • ReplayQueueLength should be <10 (higher = replay behind)
  • ContentIndexState should be "Healthy"
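
The checklist above can be applied to each copy programmatically. This is a minimal sketch rather than an official health check: `Get-CopyHealthIssue` is a hypothetical helper, the sample object is made up, and accepting "NotApplicable" for ContentIndexState is an added assumption that reflects what Exchange 2019 reports.

```powershell
# Hypothetical helper (not an Exchange cmdlet): lists threshold violations for
# one copy-status object, using the limits from the checklist above.
function Get-CopyHealthIssue {
    param([Parameter(Mandatory)]$CopyStatus)

    $issues = @()
    if ("$($CopyStatus.Status)" -notin @('Mounted', 'Healthy')) {
        $issues += "Status is $($CopyStatus.Status)"
    }
    if ($CopyStatus.CopyQueueLength -gt 2) {
        $issues += "CopyQueueLength is $($CopyStatus.CopyQueueLength)"
    }
    if ($CopyStatus.ReplayQueueLength -ge 10) {
        $issues += "ReplayQueueLength is $($CopyStatus.ReplayQueueLength)"
    }
    # Assumption: 'NotApplicable' is accepted because Exchange 2019 reports it
    if ("$($CopyStatus.ContentIndexState)" -notin @('Healthy', 'NotApplicable')) {
        $issues += "ContentIndexState is $($CopyStatus.ContentIndexState)"
    }
    $issues
}

# Made-up sample object standing in for Get-MailboxDatabaseCopyStatus output
$sample = [pscustomobject]@{
    Status = 'Healthy'; CopyQueueLength = 57
    ReplayQueueLength = 3; ContentIndexState = 'Healthy'
}
Get-CopyHealthIssue -CopyStatus $sample
```

In the Exchange Management Shell the same function could be fed real data, for example `Get-MailboxDatabaseCopyStatus * | ForEach-Object { Get-CopyHealthIssue -CopyStatus $_ }`.
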
Step 2: Run Replication Health Test
# Comprehensive replication health check
Test-ReplicationHealth | Format-Table Server, Check, Result, Error -AutoSize

# Check specific DAG member
Test-ReplicationHealth -Identity EXCH01 | Where-Object {$_.Result -ne "Passed"}
Step 3: Check Network Connectivity
# Test replication network between DAG members
Test-NetConnection -ComputerName EXCH02 -Port 64327 # Replication port

# Check network latency
$servers = (Get-DatabaseAvailabilityGroup).Servers
foreach ($server in $servers) {
    $ping = Test-Connection -ComputerName $server -Count 4
    "$server - Avg: $([math]::Round(($ping.ResponseTime | Measure-Object -Average).Average, 2))ms"
}
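
To turn the measured averages into a quick verdict, a small helper can compare them with the figures this guide uses: under 1 ms as the same-site target and over 10 ms as a problem indicator. The function name and the three labels are illustrative, not part of Exchange.

```powershell
# Hypothetical helper: classifies an average round-trip time (in ms) using the
# latency figures quoted in this guide (<1 ms target, >10 ms problem).
# Note: Test-Connection in Windows PowerShell reports whole milliseconds,
# so a sub-millisecond link shows up as an average of 0.
function Get-LatencyVerdict {
    param([Parameter(Mandatory)][double]$AverageMs)
    if ($AverageMs -lt 1)  { return 'OK' }
    if ($AverageMs -le 10) { return 'Elevated' }
    return 'Problem'
}

Get-LatencyVerdict -AverageMs 0.25
Get-LatencyVerdict -AverageMs 14
```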
Step 4: Check for Recent Errors
# Exchange replication events
Get-EventLog -LogName Application -Source "MSExchangeRepl" -Newest 50 |
    Where-Object {$_.EntryType -eq "Error"} |
    Format-Table TimeGenerated, EventID, Message -AutoSize

# Cluster events (the source is registered as Microsoft-Windows-FailoverClustering)
Get-EventLog -LogName System -Source "*FailoverClustering*" -Newest 30 |
    Where-Object {$_.EntryType -eq "Error"}

Quick Fix (15 Minutes) - Resume Suspended Copy

⚠️ Only use this if:

  • Status shows "Suspended" (not "FailedAndSuspended")
  • CopyQueueLength is manageable (<100 logs)
  • No corruption indicators in event logs
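
The three conditions above amount to a small decision rule. The sketch below encodes it under the assumption that you have already determined whether corruption events are present; `Get-RecoveryAction` and its return labels are hypothetical names for illustration only.

```powershell
# Hypothetical helper: suggests the next step from the conditions listed above.
function Get-RecoveryAction {
    param(
        [Parameter(Mandatory)][string]$Status,
        [Parameter(Mandatory)][int]$CopyQueueLength,
        [bool]$CorruptionEvents = $false
    )
    # Quick fix applies only to a plain suspension with a manageable backlog
    if (($Status -eq 'Suspended') -and ($CopyQueueLength -lt 100) -and (-not $CorruptionEvents)) {
        return 'Resume'
    }
    # FailedAndSuspended or corruption indicators call for a reseed
    if (($Status -eq 'FailedAndSuspended') -or $CorruptionEvents) {
        return 'Reseed'
    }
    # Anything else needs a closer look before acting
    return 'Investigate'
}

Get-RecoveryAction -Status 'Suspended' -CopyQueueLength 40
Get-RecoveryAction -Status 'FailedAndSuspended' -CopyQueueLength 0
```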

Solution: Resume and Monitor

Resume Suspended Database Copy
# Check current status
$dbCopy = "DB01\EXCH02"
Get-MailboxDatabaseCopyStatus -Identity $dbCopy | Format-List Status,*Queue*,LastInspectedLogTime

# Resume the copy
Resume-MailboxDatabaseCopy -Identity $dbCopy

# Monitor replication progress (run every 30 seconds)
while ($true) {
    $status = Get-MailboxDatabaseCopyStatus -Identity $dbCopy
    Write-Host "$(Get-Date -Format 'HH:mm:ss') - Copy: $($status.CopyQueueLength) | Replay: $($status.ReplayQueueLength) | Status: $($status.Status)"
    if ($status.CopyQueueLength -eq 0 -and $status.Status -eq "Healthy") {
        Write-Host "Replication caught up!" -ForegroundColor Green
        break
    }
    Start-Sleep -Seconds 30
}

✅ Expected Result:

  • CopyQueueLength decreases steadily toward 0
  • Status changes from "Suspended" to "Healthy"
  • ContentIndexState remains or becomes "Healthy"
  • No new error events in Application log

Detailed Solution: Reseed Database Copy

If resume fails or status shows "FailedAndSuspended", you need to reseed the database copy. This creates a fresh copy from another healthy source.

⚠️ Important: Reseeding copies the entire database over the network. During this time, the copy provides no failover protection. Schedule reseeding during low-usage periods if possible.
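
Because the whole database crosses the network, a rough duration estimate helps when picking a maintenance window. The sketch below is back-of-the-envelope arithmetic only: `Get-ReseedEstimateHours` is a hypothetical helper, and the 70% link efficiency default is an assumption, not a measured Exchange figure.

```powershell
# Hypothetical helper: rough reseed duration from database size and link speed.
# Assumption: only a fraction of the nominal link speed (default 70%) is usable.
function Get-ReseedEstimateHours {
    param(
        [Parameter(Mandatory)][double]$DatabaseSizeGB,
        [Parameter(Mandatory)][double]$LinkSpeedMbps,
        [double]$Efficiency = 0.7
    )
    $sizeMegabits = $DatabaseSizeGB * 1024 * 8            # GB -> megabits
    $seconds = $sizeMegabits / ($LinkSpeedMbps * $Efficiency)
    return [math]::Round($seconds / 3600, 2)
}

# Example: 500 GB database over a 1 Gbps replication network
Get-ReseedEstimateHours -DatabaseSizeGB 500 -LinkSpeedMbps 1000   # about 1.6 hours
```

Disk throughput on either side, compression, and competing traffic can all move the real figure, so use the estimate to size the window and then monitor actual progress.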

Scenario 1: Reseed from Active Copy

Reseed Database Copy
# Step 1: Suspend the problematic copy
$dbCopy = "DB01\EXCH02"
Suspend-MailboxDatabaseCopy -Identity $dbCopy -Confirm:$false

# Step 2: Remove existing database files (optional, speeds up reseed)
# WARNING: This deletes the local copy - ensure other copies exist!
# Run this on the target server (EXCH02)
$dbPath = (Get-MailboxDatabase DB01).EdbFilePath.PathName
$logPath = (Get-MailboxDatabase DB01).LogFolderPath.PathName
# Remove-Item "$dbPath" -Force
# Remove-Item "$logPath\*.log" -Force

# Step 3: Start reseed
Update-MailboxDatabaseCopy -Identity $dbCopy -DeleteExistingFiles

# Step 4: Monitor reseed progress
while ($true) {
    $status = Get-MailboxDatabaseCopyStatus -Identity $dbCopy
    Write-Host "$(Get-Date -Format 'HH:mm:ss') - Status: $($status.Status) | $($status.SeedingProgress)%"
    if ($status.Status -eq "Healthy") {
        Write-Host "Reseed complete!" -ForegroundColor Green
        break
    }
    Start-Sleep -Seconds 60
}

Scenario 2: Reseed from Specific Source

Reseed from Specific DAG Member
# Use a specific server as seed source (useful when active copy is busy)
$dbCopy = "DB01\EXCH02"
$sourceServer = "EXCH03"  # Another healthy passive copy

Update-MailboxDatabaseCopy -Identity $dbCopy -SourceServer $sourceServer -DeleteExistingFiles

# Or use a specific network for faster seeding
Update-MailboxDatabaseCopy -Identity $dbCopy -Network "DAGNetwork02" -DeleteExistingFiles

Scenario 3: Fix Content Index Only

If database replication is healthy but ContentIndexState is "Failed", you can reseed just the content index:

Reseed Content Index Catalog
# Reseed only the search catalog (much faster than full reseed)
$dbCopy = "DB01\EXCH02"

Update-MailboxDatabaseCopy -Identity $dbCopy -CatalogOnly

# Monitor content index status
Get-MailboxDatabaseCopyStatus -Identity $dbCopy | Select-Object ContentIndexState,ContentIndexErrorMessage

Scenario 4: Network Performance Fix

Configure Dedicated Replication Network
# Check current DAG network configuration
Get-DatabaseAvailabilityGroupNetwork -Identity DAG01 | Format-List Name,Subnets,ReplicationEnabled

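# Enable replication on the dedicated replication network and make sure it is not ignored
# (manual network changes require: Set-DatabaseAvailabilityGroup -Identity DAG01 -ManualDagNetworkConfiguration $true)
Set-DatabaseAvailabilityGroupNetwork -Identity "DAG01\ReplicationNetwork" -ReplicationEnabled $true -IgnoreNetwork $false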

# Enable compression for WAN replication
Set-DatabaseAvailabilityGroup -Identity DAG01 -NetworkCompression Enabled

# Enable encryption for secure replication
Set-DatabaseAvailabilityGroup -Identity DAG01 -NetworkEncryption Enabled

💡 Pro Tip: For large databases (500GB+), use the -ManualResume parameter with Update-MailboxDatabaseCopy to prevent automatic resumption after seeding. This lets you verify the seed completed successfully before enabling replication.

Verify the Fix

After reseeding or resuming, verify full replication health:

Verification Commands
# 1. Check all database copy status
Get-MailboxDatabaseCopyStatus * | Format-Table DatabaseName,MailboxServer,Status,CopyQueueLength,ReplayQueueLength,ContentIndexState

# 2. Run full replication health test
Test-ReplicationHealth | Format-Table Server, Check, Result -AutoSize

# 3. Verify no pending failures (anchored so "DisconnectedAndHealthy" is not treated as healthy)
Get-MailboxDatabaseCopyStatus * | Where-Object {$_.Status -notmatch "^(Mounted|Healthy)$"}

# 4. Test failover capability (optional - causes brief disruption)
# Move-ActiveMailboxDatabase -Identity DB01 -ActivateOnServer EXCH02 -Confirm:$false
# Then move back:
# Move-ActiveMailboxDatabase -Identity DB01 -ActivateOnServer EXCH01 -Confirm:$false

# 5. Check event logs for any new errors
Get-EventLog -LogName Application -Source "MSExchangeRepl" -Newest 20 |
    Where-Object {$_.EntryType -eq "Error"}

✅ Success Indicators:

  • All copies show Status "Mounted" or "Healthy"
  • CopyQueueLength = 0 on all passive copies
  • ReplayQueueLength < 10 on all passive copies
  • ContentIndexState = "Healthy" on all copies
  • Test-ReplicationHealth shows all checks "Passed"
  • Manual failover test succeeds (if performed)

Prevention: Maintain Healthy DAG Replication

1. Monitor Replication Metrics

DAG Replication Monitoring Script
# Set up scheduled monitoring (run every 5 minutes)
$threshold = 10  # Alert if copy queue exceeds this

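# Exchange 2019 reports ContentIndexState "NotApplicable", so it is not treated as a failure
$badCopies = Get-MailboxDatabaseCopyStatus * | Where-Object {
    $_.CopyQueueLength -gt $threshold -or
    $_.Status -notmatch "^(Mounted|Healthy)$" -or
    $_.ContentIndexState -notmatch "^(Healthy|NotApplicable)$"
}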

if ($badCopies) {
    $body = $badCopies | Format-Table DatabaseName,MailboxServer,Status,CopyQueueLength | Out-String
    Send-MailMessage -To "admin@company.com" -Subject "DAG Replication Alert" `
        -Body $body -SmtpServer "mail.company.com"
}

2. Use Dedicated Replication Network

  • Configure separate VLAN/subnet for DAG replication traffic
  • Ensure minimum 1Gbps bandwidth between DAG members
  • Keep network latency under 1ms for same-site DAG
  • Enable compression for WAN-based DAG replication

3. Proper Storage Configuration

  • Use separate physical disks for database and transaction logs
  • Ensure storage provides consistent IOPS (not burst)
  • Monitor disk queue length - should be under 20
  • Plan storage capacity for 25% growth

4. Regular Health Checks

Weekly DAG Health Audit
# Comprehensive DAG health check
Write-Host "=== DAG Health Report ===" -ForegroundColor Cyan

# Check DAG configuration
Get-DatabaseAvailabilityGroup | Format-List Name,WitnessServer,WitnessDirectory

# Check all members (the activation policy is a property of the Mailbox server object)
Get-MailboxServer | Format-Table Name,DatabaseAvailabilityGroup,DatabaseCopyAutoActivationPolicy

# Run replication health
Test-ReplicationHealth | Format-Table Server,Check,Result

# Check for long replay queues
Get-MailboxDatabaseCopyStatus * | Where-Object {$_.ReplayQueueLength -gt 100}

# Verify cluster health
Get-ClusterNode | Format-Table Name,State

DAG Issues Beyond Reseeding?

Complex DAG failures involving cluster quorum issues, split-brain scenarios, or multi-copy corruption require expert intervention to prevent data loss. Our Exchange high availability specialists can restore your DAG and implement monitoring to prevent recurrence.

Average Resolution Time: 60 Minutes

Frequently Asked Questions

What causes DAG replication timeouts?

DAG replication timeouts occur when the passive database copy cannot receive transaction logs from the active copy fast enough. Common causes include network latency or bandwidth issues, disk I/O bottlenecks, storage failures, and a replication network overloaded with client traffic.

Medha Cloud Exchange Server Team

Microsoft Exchange Specialists

Our Exchange Server specialists have 15+ years of combined experience managing enterprise email environments. We provide 24/7 support, emergency troubleshooting, and ongoing administration for businesses worldwide.

15+ Years Experience · Microsoft Certified · 99.7% Success Rate · 24/7 Support