Very Strange Network Problem HELP!!!

Craig N.

I have a very strange problem here, been working on it for 4 months now,
hopefully someone can give me some advice.

Anyways, I have a client with 200 users running Citrix. The network
configuration is:

all main servers are HP DL380s, dual Xeon 3.0 GHz, 1 GB of RAM
2 domain controllers - Windows 2000
4 Citrix servers
1 SQL server
1 Print Server
1 EMC CLARiiON SAN
1 Lotus
1 HP-UX running BPCS
Switches:
4 Intel Express Hubs 12 Port
2 3COM hubs
2 HP 2524 Switches
2 Cisco Catalyst 2900 Switches
1 Cisco Catalyst 3500 Switch
4 Old SMC Hubs for the phone system

Anyways, the client was running old servers, so they upgraded to HP DL380
servers, and Citrix ran great, then a few days later it went to crap. Logins
were slow, opening Office documents from within Office was slow (but opening
them through Explorer was not), and typing lagged really badly. (This is
Citrix-specific.)

I came in to try to fix it, and we tried everything until we finally rebuilt
the domain controllers and Citrix farm, since the problem really seemed like
a server issue. Well, after rebuilding from scratch, we still had the same
problem, so we tested everything, and it looked good. I then got an idea: it
felt like it might be network traffic.

I took a Catalyst 2900 switch that had never been connected to the network,
and ran the servers and 6 thin clients to it. We turned it on, and the
thing cruised. It rocked, logins were instant, it was the first time it
worked in 4 months. So, I figured I would track down the problem and started
plugging in switches one at a time until the problem surfaced. I got all the
Intel hubs connected, nothing, then I connected the 3Coms, still nothing,
and then the HP switches, so far so good.

By now, it's 1:00 AM, and I decide to go home. I quickly connect the rest
of the switches at once, and BOOM, the problem shows up, so I figure I have
it down to only a few switches, let's go home and try tomorrow. I move the
network cables over to the old servers, then I connect this switch into the
network, and leave. (It was easier than moving cables everywhere, since the
switch room is separate from the server room, and I was tired.)

Next day, I'm excited, so I decide to show it off. WELL, I move my network
cables to the old servers, and unplug the switch from the rest of the
network. The problem is back!!! I reboot the switch, I reboot the servers,
nothing. I even swap network cables. This is the exact same configuration as
last night when it was running great. I don't get it. I guess maybe something
from one Cisco switch has infected this one, who knows.

I do know that the HP switches didn't cause a problem, so I empty them out,
and plug in only my servers and thin clients, nothing else, completely
separate from the network. STILL THE SAME PROBLEM!!

This is weird, and I need to fix it. It's like whenever you add a switch, it
acquires this problem, but a brand new switch runs great until it is
connected to the network for any length of time. It ran great on the other
switch for a good 5 hours, so I know it's not the servers, it's gotta be the
switch. Does anyone have any ideas?

I am resetting my Catalyst to factory default and then going to try it again.
Someone please help, I need this up by Monday.
 
Dave

i would get out the packet sniffer and see just what the network traffic is.
it could be anything from some kind of virus that is scanning the network
and as it finds more machines degrades the operation of everything, to some
kind of router/switch communication problem. a packet sniffer should be
able to point you in the right direction at least.
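
As a rough illustration of what to look for once you have a capture, the sketch below flags hosts that contact an unusually large number of distinct destinations, which is the pattern a network-scanning worm tends to produce. The (source, destination) input format, the name `find_scanners`, and the threshold value are assumptions for illustration, not the output format of any particular sniffer.

```python
from collections import defaultdict


def find_scanners(packets, threshold=50):
    """Return {source: distinct destination count} for busy sources.

    packets:   iterable of (source, destination) pairs from a capture.
    threshold: hypothetical cutoff; tune it to what is normal on your network.
    """
    seen = defaultdict(set)
    for src, dst in packets:
        seen[src].add(dst)
    # Keep only the sources that talked to at least `threshold` different hosts.
    return {src: len(dsts) for src, dsts in seen.items() if len(dsts) >= threshold}


if __name__ == "__main__":
    # Made-up sample: one PC touching 60 addresses, one talking to a single server.
    sample = [("pc-a", f"192.168.102.{i}") for i in range(1, 61)]
    sample += [("pc-b", "192.168.102.6")] * 100
    print(find_scanners(sample))
```

A server that legitimately talks to every client will also show up here, so treat the result as a list of hosts to look at, not as proof of infection.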
 
serverguy

I agree with Dave, and I would add:

- why hubs?? Any possibility you could use just switches?
- checked the speed/duplex settings of the ports and nics? What else can
you describe about the nics in the servers and cabling? teamed? fiber?
- have you definitely ruled out software issues? Are your citrix profiles
local or roaming? I would eliminate any roaming profiles b/c they can be
slow, you can do this with policies.
- any other policies that might be loading to the clients? Maybe try
disabling policies temporarily and build a fresh client. Any event log
errors on the citrix servers or DCs?
- when you said that you tested the servers separate from the network, how?
did you have a domain controller also or just citrix server(s)? I am making
many assumptions here - that you have cloned the citrix servers from
identical builds? that you tried removing servers individually from the farm?
that no other network problems exist (w/SQL or domain replication, for
example), and that it is definitely isolated to citrix only.
- what version of citrix? have you called them yet?
 
Rick Chisholm

find a client to server path that goes through the most switches - copy
or ftp a LARGE file across (50 MB minimum) and see what throughput you get.
You can also use the managed switches' OS to see if any ports are getting
considerable errors or if the buffers are filling up.

I would try the file transfer thing through various switches. You could
also ping all over the subnet, latency should not vary much from one
host to the other and packet loss should be near non-existent.
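
To put numbers on those two checks, here is a minimal sketch that turns a timed file copy into megabits per second and summarizes a list of ping round-trip times. The function names `throughput_mbps` and `summarize_pings` and the input shapes are assumptions made for illustration; you would feed them whatever your copy timing and ping output give you.

```python
def throughput_mbps(num_bytes, seconds):
    """Megabits per second for a transfer of num_bytes that took `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes * 8 / seconds / 1_000_000


def summarize_pings(rtts_ms):
    """Summarize round-trip times in milliseconds; None marks a lost reply."""
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    if not replies:
        return {"loss_pct": loss_pct, "min": None, "avg": None, "max": None}
    return {
        "loss_pct": loss_pct,
        "min": min(replies),
        "avg": sum(replies) / len(replies),
        "max": max(replies),
    }


if __name__ == "__main__":
    # Hypothetical figures: a 50 MB file copied in 8 seconds, four pings with one lost.
    print(throughput_mbps(50 * 1024 * 1024, 8.0))
    print(summarize_pings([1.0, None, 3.0, 2.0]))
```

Run the same copy across each switch path and compare the figures; a path that is far slower than the others, or a host whose loss is well above the rest, is where to look next.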

All switches spanning tree capable? Any potential loops?
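
If you can write down which device is cabled to which, a quick way to check for loops on paper is a union-find pass over the link list: any link that joins two devices that are already connected closes a cycle. This is only a sketch of that idea; the device names and the `find_loop_links` helper are made up, and a physical loop is only a problem where spanning tree is not running to block it (plain hubs do not run it).

```python
def find_loop_links(links):
    """Return the links that close a cycle in an undirected topology.

    links: iterable of (device_a, device_b) pairs, one per cable or uplink.
    """
    parent = {}

    def find(x):
        # Root of the connected group that x belongs to.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            redundant.append((a, b))  # a and b were already connected
        else:
            parent[root_a] = root_b
    return redundant


if __name__ == "__main__":
    # Hypothetical cabling: the hub is uplinked to two switches that are also linked.
    cabling = [("core", "hp1"), ("core", "hub1"), ("hp1", "hub1")]
    print(find_loop_links(cabling))
```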

If sniffing - try packetyzer - easy to use in Windows environment.
Ethereal is good too.


Rick
 
Craig N.

1. I don't know why hubs, this is just what they have, they're older. All the
workstations run into the hubs, none of the servers do, they all go to Cisco
switches.

2. Yeah, and I have a speed and duplex match on both the servers and
switches, all running 100 meg full duplex. The nics are 100/gigabit, stock
built in for the HP DL380. Using standard Cat5, no fiber except for a NetApp
Filer.

3. I have basically ruled out all software issues, just because it worked on
that switch, AND I brought up the old servers that we replaced, which had
always worked great, and they had the same problem. Although my Citrix guy
thinks it could be from loading off the HP install disks (he says it's a
different kernel, and could cause this issue), I am skeptical: if that were
it, the old servers would have worked, and it wouldn't have worked using the
new switch.

4. No roaming profiles, I have one single terminal services profile that
everyone uses, that way no one can change the desktop. The problem is also
on thin clients, so that doesn't make too much sense. We have tried roaming
and non-roaming previously, and nothing.

5. We have disabled all the policies, and there are no event log errors at all.

6. Basically, I built a brand new Domain Controller, er, 2 anyways, one for
redundancy. Then a SQL Server for licensing, and the Citrix Server. I also
have a brand new EMC SAN system. This is all separate on a different
network. When I test it, I unplug the current domain controllers, SQL, NAS,
and Citrix, and plug in this network. The servers are from scratch this
time, nothing cloned, nothing from the old network has touched this, I even
manually entered all the users into Active Directory.

7. I have removed the dc's one by one, and sql one by one, still nothing.

8. Citrix XPE Feature Release 3.0, and I have had 2 certified Citrix experts
in here already, one is working with me now, and he is baffled, other than
this kernel that HP uses. So he is going to rebuild Citrix AGAIN using the
actual Windows 2000 disk.

Also, I ran a sniffer, and during the hang, I have 0 Network connectivity on
the box, until the hang stops, then network connectivity restarts.

Hope that helps.
 
Rick Chisholm

clarify something for me - if you plug in your DC, a Citrix server and a
thin client all into the same switch (or hub) with no other
connections or uplinks - does it work?

All the "new server, old server, rebuilt and replaced" has me a little
confused. Are we at a point where no matter which combination you try
you get no connectivity or just poor performance - or does it always
start out fine then 'collapse' at some point?

your traces with the sniffer show no abnormal traffic, broadcast storms
or other net quirks?

Rick
 
Craig N.

We have tried a few things. First, a single user, with the servers on just a
Linksys 8-port, seemed to work fine, but it's actually hard to say because one
user doesn't really encounter problems. BUT, today we actually did the same
thing again, with only one user, all connected through the Linksys
switch, and we do have the problem again. This is weird, although now that I
think about it, yesterday, in between when it was running great and running
like crap, this Citrix server was connected to the current network and some
files were transferred. Virus scans don't show anything.

See, when we had it running great, it was just the servers, and 6 thin
clients connected to a brand new switch. The only two things that changed
were that the switch was on the network for a day, and this server did touch
the current environment.

I am at a point now where no matter what combination, it doesn't work (and it
is instant, not gradual). It never has, until we put up that new switch and had
it all separate, but now even that doesn't work. What we have here:
- Old generic brand servers with Citrix, which used to run great.
- Current Citrix servers on HP DL380 G3 servers.
- Brand new Citrix servers built from scratch, running on HP DL380 G3
servers.

The current servers worked fine for about 3 days, then the problem happened.
Before that, with the really old servers, we never had any problems. SO,
I hooked up the old servers, and still the same problem. I have 3
different sets of servers, all built from scratch by different people, that
all have the same problem. So it doesn't seem like a server issue, plus, it
did work on a brand new switch, until something happened that next day.

My Citrix expert thinks it could be the kernel that is on the HP install
disks, he says that causes this problem, but honestly, that doesn't make
sense since the non-HP servers have the problem, and we DID have it working
through that switch. He's rebuilding it now, so if it is a virus, it's gone,
and if his "solution" doesn't work, then I at least have a clean slate to
start on, and I did take that Cisco switch and blow the config off of it, so
I'll try all that again.

Possibly a virus though? I run Norton on every system.

The sniffer that I ran can only analyze traffic coming in and out of the box
it's run on, and what we saw was 0 connectivity during the problem. Any
recommendations for a full network sniffer?
 
Dave

'zero connectivity' and worries about 'touching' the old network seem odd.
what were you sniffing for, just normal network traffic or all protocols?
there are some worms/spyware/adware that generate ping storms, and other
non-typical traffic that can bring switches and routers to their knees
quickly. some of these are not detected by av programs, you have to scan
with things like spybot and ad-aware. if there is really 'no connectivity'
when this happens there should either be an event log message showing that
something has timed out waiting for dns, a master browser change, loss of
cable connection, or some such error... or some indication of the system
getting stuck at 100% utilization or some other odd condition so that it
stops serving the network, that should show on the normal task manager... i
would try setting up one system to ping the others constantly and then cause
the problem to occur, watching the pings may give some idea which end (or
the middle) stops responding.
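
A minimal sketch of that constant-ping idea: one helper shells out to the system `ping` once per host, and a second one takes the timestamped results and reports the windows during which each host stopped answering, so you can see which end dropped first. The names `ping_once` and `outage_windows` and the host labels are made up for illustration, and the exit code of `ping` is only a rough signal, since its meaning differs a little between platforms.

```python
import platform
import subprocess


def ping_once(host, timeout_s=2):
    """Send one echo request with the system ping; True if it exits with 0."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        done = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return done.returncode == 0


def outage_windows(results):
    """Group failed pings into (start, end) windows per host.

    results: iterable of (timestamp, host, ok) tuples in time order.
    An end of None means the host was still down at the last sample.
    """
    windows = {}
    open_start = {}
    for ts, host, ok in results:
        windows.setdefault(host, [])
        if not ok and host not in open_start:
            open_start[host] = ts  # first failure opens a window
        elif ok and host in open_start:
            windows[host].append((open_start.pop(host), ts))
    for host, start in open_start.items():
        windows[host].append((start, None))
    return windows


if __name__ == "__main__":
    # Hypothetical log: the Citrix box drops for two samples, then the DC drops.
    log = [
        (0, "dc", True), (0, "ctx", True),
        (1, "dc", True), (1, "ctx", False),
        (2, "dc", True), (2, "ctx", False),
        (3, "dc", False), (3, "ctx", True),
    ]
    print(outage_windows(log))
```

In practice you would call `ping_once` in a loop with a short sleep, append `(time.time(), host, result)` for each host, and then compare the window start times across hosts after the problem shows up.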
 
Craig N.

Here are a few things I caught using Ethereal, if anyone can tell me what it
means. Everything looks like normal traffic, except from one PC and the
Citrix boxes. I can export the file to text and e-mail it if anyone wants
some more detail, just email me at (e-mail address removed).

This is only half of it, but you get the idea: out of hundreds of PCs and
about 15 random servers, these are the only ones doing this particular
thing.

Source                  Destination      Protocol  Info

Colleen-pc.company.int  192.168.102.6    TCP       3480 > 5321 [PSH, ACK] Seq=1 Ack=0 Win=54512 Len=120
Colleen-pc.company.int  192.168.102.6    TCP       3480 > 5321 [ACK] Seq=1 Ack=0 Win=54512 Len=120
Colleen-pc.company.int  192.168.102.6    TCP       3480 > 5321 [SYN] Seq=0 Ack=0 Win=54512 Len=120 MSS=1460
Colleen-pc.company.int  192.168.102.14   TCP       3479 > 1352 [ACK] Seq=1 Ack=0 Win=64512 Len=0
Colleen-pc.company.int  192.168.102.14   TCP       3479 > 1352 [SYN] Seq=0 Ack=0 Win=64512 Len=0 MSS=1460
---------------
Then on Citrix, I have a bunch of these, on all the servers:

Cxp03.company.int       192.168.102.150  TCP       1494 > 1041 [ACK] Seq=0 Ack=0 Win=63412 Len=0
-----------------------------------
Along with a LOT of these:

Cxp03.company.int       192.168.102.150  TCP       [TCP Previous segment lost] 1494 > 1041 [PSH, ACK] Seq=121622 Ack=4049 Win=63783 Len=1459

Cxp03.company.int       192.168.102.150  TCP       [TCP Previous segment lost] 1494 > 1041 [PSH, ACK] Seq=2045131 Ack=22451 Win=63783 Len=1459
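
For anyone working from the text export, the sketch below counts the "Previous segment lost" markers per port pair. Ethereal adds that marker when it sees a gap in a connection's sequence numbers, which means packets were dropped somewhere before the capture point or the capture itself missed them. TCP 1494 is the default Citrix ICA port, so the flagged packets here are session traffic. The regular expression and the `count_lost_segments` name are assumptions based only on the lines pasted above, not on a documented export format.

```python
import re
from collections import Counter

# Matches e.g. "[TCP Previous segment lost] 1494 > 1041", even across a wrapped line.
LOST = re.compile(r"\[TCP Previous segment lost\]\s+(\d+)\s*>\s*(\d+)")


def count_lost_segments(capture_text):
    """Count 'Previous segment lost' markers per (source port, destination port)."""
    return Counter((int(src), int(dst)) for src, dst in LOST.findall(capture_text))


if __name__ == "__main__":
    sample = """
    Cxp03.company.int 192.168.102.150 TCP
    1494 > 1041 [ACK] Seq=0 Ack=0 Win=63412 Len=0
    Cxp03.company.int 192.168.102.150 TCP
    [TCP Previous segment lost] 1494 > 1041 [PSH, ACK] Seq=121622 Ack=4049
    Cxp03.company.int 192.168.102.150 TCP
    [TCP Previous segment lost] 1494 > 1041 [PSH, ACK] Seq=2045131 Ack=22451
    """
    for ports, hits in count_lost_segments(sample).most_common():
        print(ports, hits)
```

If nearly all of the hits land on one server or one client port pair, that narrows the search to the path between those two machines.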
 
Ace Fekay [MVP]

What are:
Colleen-pc.company.int
and
192.168.102.6?

Is it just a workstation?
Is 102.6 a server or another workstation?

How about
Cxp03.company.int
and
192.168.102.150?

Are they DCs, workstations, etc?

Curious, may I assume they are all virus, trojan, BHO and spyware free?


--
Regards,
Ace

Please direct all replies ONLY to the Microsoft public newsgroups
so all can benefit.

This posting is provided "AS-IS" with no warranties or guarantees
and confers no rights.

Ace Fekay, MCSE 2003 & 2000, MCSA 2003 & 2000, MCSE+I, MCT, MVP
Microsoft Windows MVP - Active Directory

HAM AND EGGS: A day's work for a chicken;
A lifetime commitment for a pig.
 
Ace Fekay [MVP]

I was going to ask the same thing, if they are spanning tree
capable/stackable or individual switches.


 
Craig N.

Problem seems to be fixed. This is very very strange. It is either the HP
Lights-Out board, or it is the kernel that HP installs if you use the HP
install disks. We disabled the board, then reinstalled the server with an
actual Win2k disk, and I can't recreate the problem. Of course, I had it
working once before, and, well, that went to crap.

Anyways, I still don't get why it was working great the other day on that one
single switch, and bad when I plugged in the others. BUT, it works now
plugged into all switches.

Thanks for your help.
 
