Inundated with 1864 and 1862 replication errors after introduction of "lag-site" (longish)

Trust No One

Hi all,

I'm after assistance from you knowledgeable AD folks regarding an
issue which seems related to the ongoing upgrade of our domain
controllers worldwide (120) to Windows 2003 SP1, as well as the
introduction of a couple of "replication lag-sites".

Basically these lag-sites are used for disaster recovery and contain a
domain controller from each domain in the forest. The domain
controllers in the primary lag-site replicate once a week with the main
AD while the domain controllers in the secondary lag-site replicate
once a week with the primary lag-site. The domain controllers in the
lag-sites are configured such that they are not used by AD clients or
reachable through WINS. The idea is that we can recover from a forest
corruption or major deletion provided it has not replicated to both
lag-sites. The concept is covered further in

Now we have a hub and spoke replication topology based on a central
datacentre with around 80 worldwide sites. The "bridge all site-links"
option is switched off, and the domain controllers worldwide replicate
with "bridgehead" domain controllers in the hub datacentre. Prior to
the introduction of Windows 2003 SP1 a remote domain controller would
only log a replication error/warning if it was unable to replicate with
a _direct_ replication partner in the datacentre.

Once we introduced the 2 lag-sites we noticed Windows 2003 SP1 domain
controllers in our main forest began to log replication errors daily in
their Directory Services log. The errors are event id 1864 "The domain
controller has not recently received replication information from a
number of domain controllers..." and event id 1862 "The local domain
controller has not received replication information from a number of
domain controllers in other sites within the configured latency
interval". These messages are repeated for each naming context for all
4 domains in the forest. These messages go away briefly once the
lag-site domain controllers perform their scheduled replication.

When DCDIAG is run on the erroring Windows 2003 SP1 DCs, it lists the
culprits as the lag-site domain controllers due to their long
replication interval - it does this even though the majority of the DCs
are not direct replication partners of the "offending" lag-site DCs. I
notice DCDIAG also lists the replication latencies for all other DCs in
the forest (in other words the transitive replication partners). This
seems to be a feature introduced with Windows 2003 SP1. Windows 2000
domain controllers do not log these replication errors as they are not
direct replication partners of the lag-site domain controllers.
Unfortunately we've now upgraded 98% of our domain controller estate to
2003 SP1.
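
To illustrate what I think is going on, here is a minimal Python sketch of
the difference between checking only direct partners and checking every DC
in the forest. It is a simplified model of my understanding, not how the
directory service actually implements the check. The DC names, the
timestamps and the 24-hour threshold are all made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed latency threshold, for illustration only (not read from our forest).
LATENCY_INTERVAL = timedelta(hours=24)


def stale_sources(last_update, now, direct_partners=None,
                  interval=LATENCY_INTERVAL):
    """Return the source DCs whose newest received change is older than interval.

    last_update maps a source DC name to the time we last received changes
    originating from it, whether directly or via an intermediate DC.
    If direct_partners is given, only those DCs are checked, which models
    the behaviour we saw before SP1.
    """
    if direct_partners is None:
        names = list(last_update)
    else:
        names = [dc for dc in last_update if dc in direct_partners]
    return sorted(dc for dc in names if now - last_update[dc] > interval)


# Hypothetical view from one remote (spoke) DC; "now" is an arbitrary instant.
now = datetime(2006, 1, 8, 12, 0)
last_update = {
    "HUB-BH1": now - timedelta(minutes=15),   # direct bridgehead partner
    "SPOKE-DC7": now - timedelta(hours=3),    # reached via the hub
    "LAG1-DC1": now - timedelta(days=5),      # primary lag-site, weekly
    "LAG2-DC1": now - timedelta(days=6),      # secondary lag-site, weekly
}

all_checked = stale_sources(last_update, now)
direct_only = stale_sources(last_update, now, direct_partners={"HUB-BH1"})

print(all_checked)   # ['LAG1-DC1', 'LAG2-DC1']
print(direct_only)   # []
```

In this model the remote DC is perfectly healthy with respect to its only
direct partner, yet it still flags both lag-site DCs once every DC in the
forest is included in the check.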

So, while I understand why the errors are being logged (SP1 DCs
apparently monitor replication latencies of _all_ replication partners,
including transitive ones), these messages are causing a big problem as
they are being alerted on by our monitoring program. The first line
support folks are muttering about lynching me :)

Is there any way of suppressing these replication latency
errors/warnings other than changing the replication latency interval of
the entire forest to 1 week? I'm reluctant to do this as it would
suppress errors relating to genuine replication failures. Ideally it
would be nice if the monitoring of replication latencies for transitive
replication partners (apparently introduced with 2003 SP1) could be
disabled.

Any thoughts/suggestions on this one?


Kind Regards,
 
