DHCP, dual NIC, DHCP failure

Henry Markov

With my target devices I've found there is about a 50% probability that dynamic
IP addresses are not obtained for at least 6 minutes. The devices are industry
standard Compact PCI blades (PICMG 2.16) that use an Intel dual port network
controller (82546EB). Many well known vendors including Kontron, Momentum, DTI
and others have the same blade architecture. With a basic remote boot load
containing only standard components except for the XPe NIC driver obtained from
the Intel support site, there is an apparent race that causes the target to
abandon the DHCP protocol about 50% of the time. The problem appears about
equally likely in SP1 and SP2.

I conducted many tests with a 2.0GHz Pentium-M blade supplied by DTI, a cPCI
backplane, and a Win2003 server that was both boot server and DHCP server. The
proper sequence of DHCP messages for the client to obtain an IP address is:
Client      Server
------      ------
Discover
            Offer
Request
            Ack
It appears to me that the DHCP client runs independent threads to execute this
protocol for each interface. In about half the cases the protocol is abandoned
by the client after a server offer. When the protocol is abandoned, it is
always abandoned for both interfaces. The client then restarts the protocol 5
to 6 minutes later, and it typically succeeds for both interfaces. However, I have
seen one case where it failed on the second attempt, and in this case the target
had no IP addresses for 14 minutes after boot.
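
For anyone who wants to check a capture for this pattern, the abandoned exchange can be spotted mechanically: group the DHCP messages per interface and flag any attempt where the four-message sequence does not complete. The following is only a minimal Python sketch. The event tuples (time in seconds, interface label, message type) and both function names are assumptions made for illustration; they are not an Ethereal export format, and the messages would have to be extracted from the trace by hand or by another tool.

```python
from collections import defaultdict

# The message order a successful lease acquisition is expected to follow.
EXPECTED = ["Discover", "Offer", "Request", "Ack"]


def summarize_handshakes(events):
    """Group time-ordered DHCP events into attempts per interface.

    events: iterable of (time_seconds, interface, message_type).
    Every Discover opens a new attempt; later messages on the same
    interface are appended to the most recent attempt.
    Returns {interface: [ {"start", "messages", "completed"}, ... ]}.
    """
    attempts = defaultdict(list)
    for t, iface, msg in events:
        if msg == "Discover":
            attempts[iface].append({"start": t, "messages": [msg]})
        elif attempts[iface]:
            attempts[iface][-1]["messages"].append(msg)
    for iface_attempts in attempts.values():
        for attempt in iface_attempts:
            attempt["completed"] = attempt["messages"] == EXPECTED
    return dict(attempts)


def time_without_address(iface_attempts):
    """Seconds from the first Discover to the start of the first
    completed attempt, or None if no attempt completed."""
    if not iface_attempts:
        return None
    first_start = iface_attempts[0]["start"]
    for attempt in iface_attempts:
        if attempt["completed"]:
            return attempt["start"] - first_start
    return None
```

Fed with the behaviour described above (an Offer that is never followed by a Request, then a full exchange several minutes later), the first attempt comes back with `completed` set to False and `time_without_address` returns the gap between the two Discover messages.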

Note that in a PXE boot scenario a DHCP address is obtained for an interface
twice -- once by the PXE BIOS client and once when the downloaded OS takes
control. DHCP never fails when executed under PXE; it only fails when executed
under XPe.

My client has just ordered $800,000 worth of equipment including 224 of the CPU
blades that I have tested. It is completely unacceptable that it can take 6
minutes or more for these blades to be network enabled after a boot. I really
need help on this one.

TIA,
Henry
 
KM

Henry,

I have a bad feeling that the 5-minute delay before the DHCP client retries getting an IP address after a failed initial attempt is
hardcoded.
I haven't seen any related registry flags, and Googling turned up the same opinion from a number of people.

If getting an IP address at boot time is so inconsistent on your system, have you thought of postponing the DHCP client service load until
some major part of the boot process is done? Basically, you leave the DHCP service disabled (or set to manual) and then start it with
your own application/script/command, backed up with ipconfig /renew, somewhere in the boot process (either in a service that
depends on most of the other services, or through a [HKLM\...\Run] command, etc.). Then you can add some Sleeps in your application
code before you start the DHCP service to make sure the network stack is up and running. You can experiment with the code to get the
best and most consistent result.
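
A rough sketch of what such a startup helper could look like, written in Python purely for illustration (any language that can launch a process would do). It assumes the DHCP Client service (service name Dhcp) has been set to Manual rather than Disabled, since a disabled service cannot be started. The function name, the 30-second default delay, and the injectable `run` and `sleep` parameters are my own assumptions, not anything prescribed by XPe.

```python
import subprocess
import time


def delayed_dhcp_start(delay_seconds=30, run=subprocess.call, sleep=time.sleep):
    """Wait, start the DHCP Client service, then request a lease renewal.

    delay_seconds is an arbitrary starting point to be tuned by experiment.
    run and sleep are injectable so the sequence can be dry-run off target.
    Returns a list of (command, exit_code) pairs in execution order.
    """
    commands = [
        ["net", "start", "Dhcp"],   # start the service (assumed set to Manual)
        ["ipconfig", "/renew"],     # back it up with an explicit renew
    ]
    sleep(delay_seconds)            # give the network stack time to come up
    results = []
    for cmd in commands:
        results.append((cmd, run(cmd)))
    return results
```

Calling `delayed_dhcp_start()` from a script registered under the Run key, or from a late-starting service, would apply the delay and then issue both commands. The exit codes are returned rather than checked, so the caller can decide whether to log them or retry.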

KM
 
Henry Markov

KM,
Thanks for the response but part of my responsibility to my client is to
determine if XPe is robust enough for them to make a multi-billion dollar bet
on. Their entire service, which is now counted in hundreds of millions of
dollars and rapidly growing, will depend on the robustness of a few hundred
servers. How do you think a CTO would feel if his entire business depended on
machines incapable of executing a protocol such as DHCP reliably? This isn't a
good indicator for availability that must be essentially 100%.

I don't think I've discovered a unique problem. I found a NG article posted
yesterday by someone having essentially the same problem with Win2K and Win2003
servers. There are other cases available to the interested googler. However I
haven't found any evidence that MS is interested in resolving this problem.
These are the types of issues that push one to open source alternatives where
I don't think a problem like this would be ignored for very long.

HM
 
Henry Markov

I can reproduce this problem with XP Pro; however, the timeouts are much shorter.
I get typical 1st retry timeouts of 16 seconds and 2nd retry timeouts of 33
seconds. Is it likely that different hardcoded values would be put in XP Pro
and XPe? If yes, why? I still think there is a bug to be fixed, but much
shorter timeouts in XPe would at least mean that XPe is not a non-starter for my
application, as it is beginning to appear to be.

HM
 
KM

Henry,

> I can reproduce this problem with XP Pro; however, the timeouts are much shorter.
> I get typical 1st retry timeouts of 16 seconds and 2nd retry timeouts of 33 seconds.

Are you sure about these numbers? A long time ago I observed different numbers on NT and 2K.
Are you testing this with the DHCP server disconnected? (That is how I'd test the retry interval.)

> Is it likely that different hardcoded values would be put in XP Pro
> and XPe? If yes, why?

I doubt that. We all know the XPe binaries come from XP Pro, and therefore the hardcoded values are the same.
The only difference I can imagine is in the SP level. You didn't mention which SP of XP Pro you tried, but I suspect it is either SP1
or SP2.

The reason I said that the 5 minutes is a hardcoded value is that I saw similar behaviour (the retry after an initial failed request
for an IP address happens after ~5 minutes) on Win2000. I do not recall testing this thoroughly on XP, so I cannot confirm.
Someone with access to the XP sources may help us figure out whether the value is hardcoded in the DHCP client of XP.

> I still think there is a bug to be fixed but much
> shorter timeouts in XPe would at least mean that XPe is not a non-starter for my
> application as it is beginning to appear to be.

If you see much shorter timeouts on XP Pro, you can get the same on XPe. It is probably a matter of either missing dependencies (unlikely)
or missing or different default registry settings of the DHCP client or some other components in the IP stack.

I must admit the DHCP client API is extremely limited, and there is no way to control the DHCP client behaviour except through the registry
(not much there either) :-(
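
One way to look for such differences is to export the relevant keys (for example the Dhcp and Tcpip service keys) from both the XP Pro and the XPe image and compare the exports. The Python sketch below is only an illustration under simplifying assumptions: it handles plain one-line `"name"=value` entries from a .reg export and skips default values and multi-line hex data. Both function names are hypothetical.

```python
def parse_reg_export(text):
    """Parse the text of a .reg export into {key_path: {value_name: raw_value}}.

    Simplified on purpose: only one-line '"name"=value' entries are kept;
    default values (@=...) and hex continuation lines are ignored.
    """
    settings = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]            # a new registry key section
            settings[current] = {}
        elif current is not None and line.startswith('"') and "=" in line:
            name, _, value = line.partition("=")
            settings[current][name.strip('"')] = value
    return settings


def diff_settings(reference, target):
    """List (key, name, reference_value, target_value) for every value that
    differs or is missing on one side. A missing value is reported as None."""
    diffs = []
    for key in sorted(set(reference) | set(target)):
        ref_vals = reference.get(key, {})
        tgt_vals = target.get(key, {})
        for name in sorted(set(ref_vals) | set(tgt_vals)):
            if ref_vals.get(name) != tgt_vals.get(name):
                diffs.append((key, name, ref_vals.get(name), tgt_vals.get(name)))
    return diffs
```

Exports written by regedit are typically UTF-16 encoded, so the file would need to be read with that encoding before being passed to `parse_reg_export`. Any value that appears in the diff is only a candidate to investigate; the sketch says nothing about which settings actually influence the retry interval.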
 
Roger Levy

KM,

> Are you sure about these numbers? A long time ago I observed different numbers on NT and 2K.
> Are you testing this with the DHCP server disconnected? (That is how I'd test the retry interval.)

I am testing with the DHCP server connected. Remember that I am describing a
problem that I consider to be a Windows DHCP client bug because the server is
doing everything it is supposed to do but the client is giving up on the
protocol, timing out, and restarting the protocol. About 50% of the time this
results in my client not having IP addresses for 6 minutes or more. Whatever
the timeout is when the server is not connected may be interesting but it is a
different problem. If you want to see the details of this problem as it
manifests itself in XP Pro then see my posting today in another NG:
http://groups-beta.google.com/groups?q=dhcp+offer+ignored+-+ethereal+trace
This posting demonstrates the same problem I have with XPe but the timeouts are
much shorter. I'd be glad to send the actual Ethereal files to anyone who is
skeptical or who wants to help solve this issue.

> I doubt that. We all know the XPe binaries come from XP Pro, and therefore the hardcoded values are the same.
> The only difference I can imagine is in the SP level. You didn't mention which SP of XP Pro you tried, but I suspect it is either SP1
> or SP2.

Everything I've posted about pertains to SP2 but I've verified that the same
problem exists in SP1.

Roger
 
KM

Roger,

I was actually focusing on the hardcoded value in the DHCP client that controls the retry for an IP address after the initial
attempt fails.
Although that is not exactly what you are trying to resolve, I believe it may be part of the problem.

I think only someone with access to the source code of the DHCP client can help you figure out (and fix) the actual
algorithm of the IP handshake implemented on XP.
Let us know here if you get interesting replies in other newsgroups on the issue.
 
