Active Directory in a huge single forest

jfprieur

Hello,

I just got asked to provide a 'worst-case' report for our enterprise
active directory.

The architecture chosen was a single forest/multiple domain model. At
that time, that is what MS was recommending for enterprises. Since then
that recommendation has changed, but this is already in production and
migration has started. Win2K servers are the current infrastructure
servers (DCs, FSMO role holders, etc.). Eventually we are talking 50000+
workstations in this forest.

For reasons that I won't get into here, there are/will be 2000+ domain
controllers spread across the multiple domains, spread all over the
world.

Reading the best practices recommendations for AD recovery published by
Microsoft, it lists in its recovery steps that you must switch off
every DC. You can well see that this would be a significant impact,
with business continuity implications.

Now there are mitigating factors: Only 3 enterprise admins, very
strenuous change control and testing for the schema (Microsoft called
it one of the best implementations it has seen). MS stated that a full
forest meltdown has only occurred three times, all related to poor
planning and implementation.

I guess what I am asking is, do you see anything in Windows 2003 that
would mitigate this? A migration is planned but not in the near future.
Is there anything (high-level) that we can do right now to reduce the
(minuscule) risk even further? A cost-benefit analysis was performed on
migrating to a multiple forest model, but this would cost more than the
current NT -> 2000/XP migration that we are going through right now.

I know my questions are pretty broad, just a good discussion on this
subject would be very helpful.

Thanks,
 
Herb Martin

> Hello,
>
> I just got asked to provide a 'worst-case' report for our enterprise
> active directory.
>
> The architecture chosen was a single forest/multiple domain model. At
> that time, that is what MS was recommending for enterprises. Since then
> that recommendation has changed, but this is already in production and
> migration has started.

It is still correct in many instances.

> Win2K servers are the current infrastructure
> servers (DCs, FSMO role holders, etc.). Eventually we are talking 50000+
> workstations in this forest.

That is not "huge" -- it's on the low side of large for AD.

> For reasons that I won't get into here, there are/will be 2000+ domain
> controllers spread across the multiple domains, spread all over the
> world.
>
> Reading the best practices recommendations for AD recovery published by
> Microsoft, it lists in its recovery steps that you must switch off
> every DC. You can well see that this would be a significant impact,
> with business continuity implications.

What KB? Most people never have to do that.

> Now there are mitigating factors: Only 3 enterprise admins, very
> strenuous change control and testing for the schema (Microsoft called
> it one of the best implementations it has seen). MS stated that a full
> forest meltdown has only occurred three times, all related to poor
> planning and implementation.
>
> I guess what I am asking is, do you see anything in Windows 2003 that
> would mitigate this? A migration is planned but not in the near future.

Improved replication is one of the main improvements of
Win2003.

> Is there anything (high-level) that we can do right now to reduce the
> (minuscule) risk even further? A cost-benefit analysis was performed on
> migrating to a multiple forest model, but this would cost more than the
> current NT -> 2000/XP migration that we are going through right now.

You are likely better off the way you are IF it is currently
replicating with no significant problems (I would bet.)

> I know my questions are pretty broad, just a good discussion on this
> subject would be very helpful.

What sort of WANs?

Why so many DCs?

How many Sites?

How are your Site Links and Site Link Bridge (groups) set up?
 
Joe Richards [MVP]

Don't want to bust your bubble but 50k workstations isn't really too large. I
have run single domains larger than that. You can chase my resume if you want,
it isn't a line. My last ops position was 250,000 users in a single forest which
I had migrated from hundreds of NT4 domains over the course of several years.

Also be careful about what MS says; what they say depends entirely on who says it.
If it is MCS people, I would take the statements with a grain of salt. If you were
talking with the Dev people you might have gotten more substantial responses.
Don't get me wrong, some of the MCS people are pretty good. But you need to
balance everything you hear from them. Most of them really don't have large
scale experience.

I would tend to agree that large scale meltdowns aren't all that common. I don't
think they can state how many there have been. That isn't info that is generally
published and broadcast inside of MS or out. I would also agree that you have to
tend to do some pretty bad things to get into that situation.

I would be a bit concerned with the number of DCs. That seems excessive and
probably isn't needed. Every DC is admin overhead and the more you get the more
pain it is monitoring replication and such. For those 250k users I mentioned
above we had ~400 DCs in 11 domains, that was even too much to be honest. We
didn't need anywhere near that overhead and coverage. Every chance we got we
shut down extra DCs.

If you aren't deployed yet, I wouldn't consider deploying anything but Windows
Server 2003. There are massive changes throughout that correct issues the large
deployments such as mine ran headlong into and made MS fix.

I don't generally recommend multiple forests except for DMZ, Extranet, Test/R&D,
and if you have a centralized Exchange deployment I would seriously consider a
separate forest for Exchange.

I wouldn't be overly concerned with complete disaster failover scenarios.
Definitely keep it in mind, but don't burn the midnight oil on it. The solution
we came up with that has now become a popular solution is to use some virtual
servers to maintain some virtual DCs for each domain. Those DCs are shut down
daily and the files backed up. In the event of a disaster, you can have those
DCs up and running quickly because you don't have to worry about hardware and
doing AD recovery.

Over the course of the 5 years now that company has been running Windows 2000,
we never restored a single object. If something was deleted that shouldn't have
been, tough, recreate it. If a DC's database went bad due to disk or motherboard
failure the DC was deleted from AD and rebuilt from the ground up and
repromoted. We ran with three EAs who were also the DAs (obviously).
Delegation to other admins was minimal, and user provisioning was handled by a
provisioning system which did a ton to prevent issues in the directory. At the
point you start giving out FC on any objects to non-DAs, you have started to
lose control of the directory.

joe
 
Herb Martin

> Also be careful about what MS says; what they say depends entirely on who
> says it. If it is MCS people, I would take the statements with a grain of
> salt. If you were talking with the Dev people you might have gotten more
> substantial responses.

What Joe said with this addition -- It can go the other
way too: Sometimes the Dev people don't know the
real world (only the technical truth about the product)
and the MCS are at times people with experience like
Joe.

Sometimes what MS says, is what a 1st or 2nd level
support person wrote for the KB (they are graded on
how many they write) and sometimes that support
person nails it perfectly.

In other words, it is a big place and everyone has an
opinion. Opinions are like noses, some smell better
than others.
 
jfprieur

First of all, thanks for the great answers and I now retract my 'huge'
qualifier ;)

Herb said:

> What KB? Most people never have to do that.

http://www.microsoft.com/downloads/...FamilyID=3EDA5A79-C99B-4DF9-823C-933FEBA08CFE

I know most people never have to do that but I have to report on a full
forest meltdown.

> You are likely better off the way you are IF it is currently
> replicating with no significant problems (I would bet.)

Everything is working great (famous last words hehe) and I am confident
in the processes put in place. It is just that some Business Continuity
manager in our enterprise read this somewhere and is now trumpeting
that AD is not safe.

I have been tasked with the rebuttal. I want to present how minimal the
risk is (already have done that), that we have planned for this and I
also want to suggest some improvements.

> What sort of WANs?

Worldwide network, no site has less than 1MB connectivity, 2MB most of
the time.

> Why so many DCs?

I have to be careful but let us just say that we have one domain that
has a lot of offices (1900+) all over France. They need to be able to
work even if they lose connectivity so the architects (not me) put a DC
in each one. If it had been me, I would have put the offices in
workgroups, since they only use an intranet application and don't need
all the shared drives, SMS, and other customisations that we have done
for our standard enterprise XP product.

> How many Sites?

Well they also decided that each office above would be a site, so many
many sites.

> How are your Site Links and Site Link Bridge (groups) set up?

Haven't been technical for a while, but I believe that we don't use the
KCC per Microsoft recommendations (until we get to 2003) since we have
so many sites. Two connection documents are created for each site.
Basically we have a double-ring replication topology, in opposite
directions and they replicate differently (even hours on one and odd on
the other IIRC)
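
As a rough illustration of the double-ring layout described above, the sketch below gives every site two inbound connections, one from each ring neighbour, and puts one ring on even hours and the other on odd hours. The site names and the `build_double_ring` and `worst_case_hops` helpers are hypothetical, and treating all sites as one single ring is an assumption made only to keep the example small; it is not a description of the real deployment.

```python
from typing import Dict, List


def build_double_ring(sites: List[str]) -> List[Dict]:
    """Return two inbound connections per site, one per ring direction."""
    even_hours = list(range(0, 24, 2))  # ring 1 replicates on even hours
    odd_hours = list(range(1, 24, 2))   # ring 2 replicates on odd hours
    connections = []
    n = len(sites)
    for i, site in enumerate(sites):
        clockwise_source = sites[(i - 1) % n]  # previous neighbour in the ring
        counter_source = sites[(i + 1) % n]    # next neighbour in the ring
        connections.append({"to": site, "from": clockwise_source, "hours": even_hours})
        connections.append({"to": site, "from": counter_source, "hours": odd_hours})
    return connections


def worst_case_hops(site_count: int) -> int:
    """In a ring that replicates in both directions, the farthest
    site is at most site_count // 2 hops away."""
    return site_count // 2


# Hypothetical site names, for illustration only.
ring = build_double_ring(["Paris", "Lyon", "Lille", "Nice"])
```

The hop count is the part worth watching: in a single ring the number of hops, and therefore the end-to-end replication delay, grows linearly with the number of sites.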

From Joe Richards (MVP):

> The solution we came up with that has now become a popular solution is to
> use some virtual servers to maintain some virtual DCs for each domain.
> Those DCs are shut down daily and the files backed up. In the event of a
> disaster, you can have those DCs up and running quickly because you don't
> have to worry about hardware and doing AD recovery.

That's a neat idea. Although, in case of a forest wide crash, we would
still have to shut down every DC per the MS document I referred to
above, which is the big time waster in our case. Agreed though that it
saves on the DC build time.

Stupid question, but users that were logged on while the forest crashed
would still have access to all resources, right? Since they have the
Kerberos token (probably wrong terminology, sorry) already they would
still access their shares, correct? Just want to prove that while it
would be a giant PITA, it would not mean the end of the business as
some people here would like to believe.

Once again, thanks for the comments,
JF
 
Herb Martin

> Everything is working great (famous last words hehe) and I am confident
> in the processes put in place. It is just that some Business Continuity
> manager in our enterprise read this somewhere and is now trumpeting
> that AD is not safe.

I would (politely) ask him to provide the reference and
then you and some of us can take a look at it and understand
the context in which it was written and check it for obvious
errors.

There is at least one single domain production forest on
the order of 10 million users. It however doesn't have
your number of sites.

> I have been tasked with the rebuttal. I want to present how minimal the
> risk is (already have done that), that we have planned for this and I
> also want to suggest some improvements.

You cannot (really) rebut something that is an unspecified
second hand report.

I might say, 'How are you going to handle the FritzGana
bug problem reported by several of the computer technical
websites after the release of Win2003?' If you don't have
those reports, it is difficult to even say: There is no such
thing.

> Worldwide network, no site has less than 1MB connectivity, 2MB most of
> the time.

Available bandwidth (before adding AD traffic) is more important than
raw bandwidth.

So if you have an empty T1, it might be better than a full
45 Mbps link. (OK, that is likely an extreme.)
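
To put rough numbers on that comparison, here is a minimal sketch that computes what is left on a link for replication. The 98% utilization figure and the `available_mbps` helper are illustrative assumptions, not measurements from any real network.

```python
def available_mbps(link_mbps: float, utilization: float) -> float:
    """Bandwidth left over on a link, given its current utilization (0.0 to 1.0)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    return link_mbps * (1.0 - utilization)


T1_MBPS = 1.544   # nominal T1 rate
T3_MBPS = 45.0    # rounded as in the post above

idle_t1 = available_mbps(T1_MBPS, 0.0)    # an empty T1
busy_t3 = available_mbps(T3_MBPS, 0.98)   # assumed: link already 98% full

# Under these assumptions the empty T1 leaves more room than the busy 45 Mbps link.
print(idle_t1 > busy_t3)
```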
> I have to be careful but let us just say that we have one domain that
> has a lot of offices (1900+) all over France. They need to be able to
> work even if they lose connectivity so the architects (not me) put a DC
> in each one.

No, that is precisely the best reason for them. If they
have local resources then when access to those resources
is critical they need a local DC (GC/DNS probably).

> If it had been me, I would have put the offices in
> workgroups, since they only use an intranet application and don't need
> all the shared drives, SMS, and other customisations that we have done
> for our standard enterprise XP product.

Well, if the shares are NOT local and they would lose
access to them (over a WAN) every time they lost a
remote DC, then the local DCs may not make (as much)
sense.

> Well they also decided that each office above would be a site, so many
> many sites.

That is usually correct. But you are in the area where Site
Link Bridge (Grouping) is probably critical.

> Haven't been technical for a while, but I believe that we don't use the
> KCC per Microsoft recommendations (until we get to 2003) since we have
> so many sites.

You would likely still be following the MS recommendations
for an exceptional case (number of Sites) like yours.

You CAN use the KCC, however, if you create custom
SiteLinkBridge(Groups) as islands of transitive replication,
separate from other groups (turning off the default which
groups them all into one SiteLinkBridge-group).
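
The "islands" idea can be pictured as a small graph problem: with the bridge-everything default turned off, two sites are only transitively connected if a chain of site links inside the same bridge joins them. The sketch below is a toy model under that assumption, not a tool that reads a real directory; the site names, bridge names, and both helper functions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

SiteLink = Tuple[str, str]


def reachable_within_bridge(links: List[SiteLink], start: str) -> Set[str]:
    """Sites reachable from start using only the site links in one bridge."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = {start}
    stack = [start]
    while stack:
        site = stack.pop()
        for nxt in neighbours[site]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def can_replicate_transitively(bridges: Dict[str, List[SiteLink]], a: str, b: str) -> bool:
    """True if some single bridge connects a and b through its links."""
    return any(b in reachable_within_bridge(links, a) for links in bridges.values())


# Two hypothetical islands, each a hub with branch offices.
bridges = {
    "North": [("Paris", "Lille"), ("Paris", "Rouen")],
    "South": [("Lyon", "Nice"), ("Lyon", "Marseille")],
}

print(can_replicate_transitively(bridges, "Lille", "Rouen"))  # same island
print(can_replicate_transitively(bridges, "Lille", "Nice"))   # different islands
```

In this model the hubs would still need their own site link (for example Paris to Lyon) outside either bridge so that changes can cross between islands.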
> Two connection documents are created for each site.
> Basically we have a double-ring replication topology, in opposite
> directions and they replicate differently (even hours on one and odd on
> the other IIRC)

That's a good manual choice (likely), although I
would probably have looked at the SiteLinkBridge-group
idea -- although practically no books have explained that
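correctly and most consultants haven't a clue.

>> The solution we came up with that has now become a popular solution is to
>> use some virtual servers to maintain some virtual DCs for each domain.
>> Those DCs are shut down daily and the files backed up. In the event of a
>> disaster, you can have those DCs up and running quickly because you don't
>> have to worry about hardware and doing AD recovery.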

> That's a neat idea. Although, in case of a forest wide crash, we would
> still have to shut down every DC per the MS document I referred to
> above, which is the big time waster in our case. Agreed though that it
> saves on the DC build time.

What is this reference?

> Stupid question, but users that were logged on while the forest crashed
> would still have access to all resources, right?

All or probably SOME (unpredictable) until the tickets
expire (Kerberos ticket lifetime is configurable per domain.)
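
For the report, the window can be estimated with simple date arithmetic. The sketch below assumes the default 10-hour maximum user ticket lifetime from the domain Kerberos policy; the real value is whatever the domain is configured with, and the timestamps and helper names are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the default "Maximum lifetime for user ticket" of 10 hours.
# Check the domain's actual Kerberos policy before quoting a number.
DEFAULT_TICKET_LIFETIME = timedelta(hours=10)


def ticket_valid_until(issued_at: datetime,
                       lifetime: timedelta = DEFAULT_TICKET_LIFETIME) -> datetime:
    """Time at which a ticket issued at issued_at stops being valid."""
    return issued_at + lifetime


def still_valid(issued_at: datetime, now: datetime,
                lifetime: timedelta = DEFAULT_TICKET_LIFETIME) -> bool:
    """True if a ticket issued at issued_at has not yet expired at time now."""
    return now < ticket_valid_until(issued_at, lifetime)


# Hypothetical timeline: logon at 08:00, forest fails at 09:00.
logon = datetime(2005, 1, 10, 8, 0)
print(ticket_valid_until(logon))                       # end of the window
print(still_valid(logon, datetime(2005, 1, 10, 12, 0)))
```

This only covers tickets the user already holds. A ticket for a server the user has not contacted yet still has to be requested from a reachable DC, which is one reason the answer is "some" rather than "all".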
> Since they have the
> Kerberos token (probably wrong terminology, sorry) already they would
> still access their shares, correct?

Kerberos calls it a 'ticket' but Microsoft calls their portion
a (security access) token so the words make sense either
way in this case.

> Just want to prove that while it
> would be a giant PITA, it would not mean the end of the business as
> some people here would like to believe.
> Once again, thanks for the comments,

I think it would help us (help you) if we read that document.
 
