RAID questions


Atlas

I have a Windows 2003 server running on a 3Ware 9500S-4LP with RAID 5 on 4
hard disks; no battery backup unit

1) Even though the unit lacks a BBU, the PC depends on two cascaded UPSs.
What am I risking? Do you think I should avoid the write cache feature?

2) What if a disk breaks in a few years time? Should I buy a few disks now,
identical to the installed ones, or can I use different brands/types?

3) Is it sufficient to schedule the verification of the array once a week, for
6 hours? Currently the array has a size of about 650GB.

Thanks
 
I have a Windows 2003 server running on a 3Ware 9500S-4LP with RAID 5 on 4
hard disks; no battery backup unit

1) Even though the unit lacks a BBU, the PC depends on two cascaded UPSs.
What am I risking?

With a BBU, just about any non-raid component can fail without losing
raid data. The UPS strategy is fine as long as everything is
operating properly.

A UPS does not protect against the tech behind the rack unplugging the
wrong power cable, or power cycling the wrong switch port. Or any of
a number of variations on that theme.

Normally UPSs don't provide enough uptime for a rebuild if it begins
just before a long power outage. IF the controller has problems
stopping and resuming the rebuild a BBU is strongly indicated. You
might also consider looking at the maximum time the data resides in
cache (during worst case scenario i.e. rebuild) and the minimum time
the BBU will keep the controller alive.
Do you think I should avoid the write cache feature?

You should definitely disable the write cache on the individual disks
regardless. Frankly I'd disable the controller cache also if there
was no BBU.
2) What if a disk breaks in a few years time?

If it is 3-5 years, count yourself lucky and plan to upgrade the
entire array. The service life of individual drives does not change
when they are RAIDed.
Should I buy a few disks now,
identical to the installed ones, or can I use different brands/types?

You should have spare(s) on hand. Using only identical drives and
firmware is generally unnecessary today, although it is often a good
idea to use the same model drives.

If all your online drives are identical models from an identical lot
with identical use they should have similar life spans unless there is
a grossly defective unit, integration problem, or handling issue. You
could rotate the media with spares, but that would jeopardize the
array's data with each rebuild. You could buy the disks from
different sources, or you could simply take your chances with disks
from a single order and a spare or 2 on hand. With a 4 disk raid 5
I'd probably try the last strategy.
3) Is it sufficient to schedule the verification of the array once a week, for
6 hours? Currently the array has a size of about 650GB.

Generally yes, although you should not let more than a week pass
between checks. There is no harm in doing it more frequently unless you
are noticing a performance penalty. But I'd try throttling down the
checks before stretching the timeframe between them.
 
Oh. I just realized that's a two-part question. I'm recommending at
least one complete verification per week. To check the 6 hr window,
run a manual verify and time it.

Since the verify feature checks parity, the duration depends on your
data. You have to allow for future growth in your scheduling. If the
machine isn't accessed 24/7, it might just be easier to schedule, say, 6
hrs each night or whenever the machine isn't being used.
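As a rough cross-check of the 6-hour window before timing a real verify, the arithmetic can be sketched in a few lines of Python. This is only an estimate under stated assumptions: it treats the verify as one full pass over about 650GB, uses 1GB = 1000MB, and ignores any slowdown from normal server load. The function names are made up for illustration and are not part of any 3ware tool.

```python
def required_verify_rate_mb_s(array_size_gb: float, window_hours: float) -> float:
    """Average rate (MB/s) needed to cover the whole array within the window."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return array_size_gb * 1000 / (window_hours * 3600)


def verify_hours(array_size_gb: float, measured_rate_mb_s: float) -> float:
    """Hours one full pass takes at a rate measured during a manual verify."""
    if measured_rate_mb_s <= 0:
        raise ValueError("measured_rate_mb_s must be positive")
    return array_size_gb * 1000 / measured_rate_mb_s / 3600


if __name__ == "__main__":
    # Figures from the thread: about 650GB, one 6-hour window.
    print(f"needed: {required_verify_rate_mb_s(650, 6):.1f} MB/s")
    # Hypothetical measured rate of 20 MB/s, to see how far the window overruns.
    print(f"at 20 MB/s: {verify_hours(650, 20):.1f} h")
```

If the manual verify turns out slower than the needed rate, either the window has to grow or the check has to be spread over several nights, which is also where the allowance for future growth comes in.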
 
With a BBU, just about any non-raid component can fail without losing
raid data.

Define "any non-raid component".
The UPS strategy is fine as long as everything is operating properly.

Define operating 'properly'.
Obviously, when both UPSs run out of juice there is no 'operating'
at all anymore and when they're working there is no difference
except for the batteries providing the power rather than the grid.
A UPS does not protect against the tech behind the rack unplugging the
wrong power cable, or power cycling the wrong switch port. Or any of
a number of variations on that theme.
Normally UPSs don't provide enough uptime for a rebuild if it begins
just before a long power outage.

But he has 2 of them.
And obviously it doesn't make a difference whether you have a BBU if
the system stops during a rebuild - and it is the interrupted rebuild that is
the problem, not the actual data that supposedly got rebuilt, but wasn't,
afterwards - whether it's the UPS stopping or the grid stopping (w/o UPS).
IF the controller has problems stopping and resuming the rebuild a
BBU is strongly indicated.
Huh?

You might also consider looking at the maximum time the data resides in
cache (during worst case scenario i.e. rebuild)
and the minimum time the BBU will keep the controller alive.

The BBU doesn't keep the controller alive. It keeps the data alive.

[snip]
 
Guys, thanks for answering.

Can anyone tell me what typically happens in case of a power failure on a
RAID 5 (4 disks) system? Will the array become inaccessible, or will it be a
matter of slight to moderate NTFS corruption? If it is the latter, I'm
risking nothing more than a common power failure on a non-RAID system, and
I have faced so many of those in the past, almost 100% fixed simply with a
"chkdsk /f" pass.
 
Atlas said:
Guys, thanks for answering.

Can anyone tell me what typically happens in case of a power failure
on a RAID 5 (4 disks) system?
Will the array become inaccessible

What's the point of RAID if its resistance against failure is so
easily undone by a simple power failure?
 
Guys, thanks for answering.

Can anyone tell me what typically happens in case of a power failure on a
RAID 5 (4 disks) system? Will the array become inaccessible, or will it be a
matter of slight to moderate NTFS corruption? If it is the latter, I'm
risking nothing more than a common power failure on a non-RAID system, and
I have faced so many of those in the past, almost 100% fixed simply with a
"chkdsk /f" pass.

It's not something for chkdsk. You first need the RAID's verification
or data scrubbing utility to clean up unsynchronized or stale parity.
Just how bad things are depends on the specific event.
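To make "stale parity" concrete, here is a toy Python sketch of one RAID 5 stripe. It assumes plain XOR parity over three data blocks (as on a 4-disk array) and simulates a power cut that lands after a data block is rewritten but before the parity block is. The helper names are invented for illustration, and the final "scrub" step is an assumption about what a verify utility does, not a description of the 3ware firmware.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def verify_stripe(data_blocks, parity):
    """True when the stored parity matches the data blocks."""
    return xor_blocks(data_blocks) == parity


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulated power cut: block 0 reaches the disk, the parity update does not.
data[0] = b"ZZZZ"
parity_is_stale = not verify_stripe(data, parity)

# If disk 1 failed now, rebuilding it from the stale parity gives wrong data,
# which is something chkdsk cannot see at the file system level.
wrong_rebuild = xor_blocks([data[0], data[2], parity])

# A scrub pass (in this sketch) recomputes parity from the data blocks.
parity = xor_blocks(data)

print(parity_is_stale, wrong_rebuild != b"BBBB", verify_stripe(data, parity))
```

The point of the sketch is the ordering: the mismatch is harmless while all disks are healthy, but it has to be cleaned up before a drive fails, which is why the array check comes before any file system check.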
 
Define "any non-raid component".
self-explanatory.


Define operating 'properly'.
self-explanatory

Obviously, when both UPSs run out of juice there is no 'operating'
at all anymore

and with a BBU whatever is in the cache at that time is not lost.
and when they're working there is no difference
except for the batteries providing the power rather than the grid.

Only it is quite common for UPS systems alone to not provide many
hours of operating time.
But he has 2 of them.

So? If he has 2x 15 min runtime and a 4 hour power outage that
changes nothing. And that assumes a proper failover
setup and not a naive person daisy-chaining an ordinary UPS.
And obviously it doesn't make a difference whether you have a BBU if
the system stops during a rebuild - and it is the interrupted rebuild that is
the problem, not the actual data that supposedly got rebuilt, but wasn't,
afterwards - whether it's the UPS stopping or the grid stopping (w/o UPS).

No, a BBU does make a difference, because it will keep the cache alive.

Having a bad day Folkert?
The BBU doesn't keep the controller alive. It keeps the data alive.

It keeps the cache on the controller alive and therefore whatever data
has not yet been committed to disk.
 
What's the point of RAID if its resistance against failure is so
easily undone by a simple power failure?

That's not the case. RAID just requires a little extra engineering to
be bulletproof.
 
Folkert said:
What's the point of RAID if its resistance against failure is so
easily undone by a simple power failure?


Simple. RAID protects against the failure of a *single* drive (or two,
in RAID 6).
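The single-drive protection comes from parity, and a small Python sketch shows why exactly one missing block per stripe can be recovered. It assumes plain XOR parity over three data blocks, as on a 4-disk RAID 5. The names are hypothetical, and real controllers rotate the parity block across the disks, which this toy ignores.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """XOR of every surviving block (data and parity) yields the lost block."""
    return xor_blocks(surviving_blocks)


stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]  # three data blocks
parity = xor_blocks(stripe)                        # fourth disk in this stripe

lost_index = 1  # pretend the second disk died
survivors = [blk for i, blk in enumerate(stripe) if i != lost_index] + [parity]
recovered = rebuild_missing(survivors)

print(recovered == stripe[lost_index])
```

With two blocks missing there is one equation and two unknowns, so nothing can be rebuilt; the scheme also says nothing about what happens when the power disappears in the middle of a write.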

It does not take power failure into account. It is assumed (surely?)
that power failures would be taken care of in the form of battery backup
for the entire system.

I would have expected you to know better, Folkert.



Odie
 
So? If he has 2x 15 min runtime and a 4 hour power outage that
changes nothing. And that assumes a proper failover
setup and not a naive person daisy-chaining an ordinary UPS.

Yeah, but the UPS forces a shutdown of the server once the remaining battery
time drops to 5 minutes, so nothing should happen. The only real threat is due to
power supply failure. I could buy a redundant one but I'm not using a rack
so there's no space in the PC cabinet. And the 3ware 9500S-LP doesn't have
any BBU option at all.
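The UPS figures mentioned in the thread (2x 15 minutes of runtime, a 4 hour outage, a forced shutdown at 5 minutes of remaining battery) can be put side by side with a small Python sketch. It assumes the best case, in which the runtimes of cascaded units simply add up, which may well not hold for daisy-chained consumer UPSs. Function names are made up for illustration.

```python
def minutes_before_forced_shutdown(ups_runtimes_min, shutdown_threshold_min):
    """Best case: runtimes add up, and the host is shut down once the
    remaining battery time drops to the threshold."""
    total = sum(ups_runtimes_min)
    return max(total - shutdown_threshold_min, 0)


def stays_up_through(outage_min, ups_runtimes_min, shutdown_threshold_min):
    """True if the outage ends before the forced shutdown kicks in."""
    available = minutes_before_forced_shutdown(ups_runtimes_min, shutdown_threshold_min)
    return available >= outage_min


if __name__ == "__main__":
    print(minutes_before_forced_shutdown([15, 15], 5))  # minutes of uptime on battery
    print(stays_up_through(240, [15, 15], 5))           # 4 hour outage
```

Under these assumptions the server goes down cleanly after about 25 minutes, so any rebuild or verify that needs longer will be interrupted by the shutdown, even if the shutdown itself is graceful.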
 
Yeah, but the UPS forces a shutdown of the server once the remaining battery
time drops to 5 minutes, so nothing should happen.

As I mentioned before, when there is a graceful shutdown in the middle
of a rebuild, the controller is *supposed* to be able to restart it
without problems when you power up again. I've been resisting 3Ware
products, so cannot tell you 100% that it would never hiccup in that
situation with cache enabled. IMHO if the data is important it's a
good idea to have a good BBU anyway.
The only real threat is due to power supply failure.

Accidents happen and parts fail. The PSU may be an obvious or perhaps
significant risk to you. I'm not sure I agree it is the "only real
threat". If you factor out stupidity, defect, or accident you are not
planning for the real world.
I could buy a redundant one but I'm not using a rack
so there's no space in the PC cabinet.

There are mini redundant PSUs of similar dimensions to a regular
ATX PSU, although I concede they tend not to work in cheaper PC
cases.
And the 3ware 9500S-LP doesn't have
any BBU option at all.

The 9000 series are all BBU ready. The compatible part is
BBU-9500S-01.

Take a look at the data sheet. It shows the 9500S-4LP (which is what I
think you meant to write) with the BBU option installed:
http://www.3ware.com/products/pdf/9000_DS_012605.pdf


Bottom Line:
If your data is very important, there need to be multiple layers of
failsafes to protect it. A BBU specifically protects data in the RAID
controller cache that has not yet been committed to disk. It is not
totally redundant or superfluous when you have a UPS. Frankly, a BBU
is an insignificant additional expense and typically makes a lot of
sense for RAID 4/5/6, which rely heavily on the controller cache.
 
OK George,
it looks like the part should be compatible with my controller, so I will
probably go for it!

Nevertheless, can you explain the difference between what a PSU failure
does on the controller's cache side and what it does on the operating
system side? Windows 2003 manages its own disk cache, and a PSU failure
usually means anything from slight (if any) disk corruption to moderate
damage to Active Directory, SQL or Exchange Server data.
 
I hope you are asking this out of curiosity rather than from a perspective
that these are equivalent and *acceptable* problems. Most people expect
RAID to make for more bulletproof storage than a generic single-disk
scenario. So for the same vulnerability to persist, and to be caused
by something stupid like a power failure, is very undesirable. Unless you
are talking about something like your WSUS database, SQL databases, for
example, tend to be VERY IMPORTANT and need to be *Highly Available*.
It generally isn't a good day when SQL, Exchange, or Active Directory
is borked or even suspect.

So, from my perspective, I'm not sure any detailed comparison really
matters. Both are vulnerable even though the potential damage occurs
on different levels, damage is not 100% guaranteed, and there are
(limited) recovery tools. AFAIK the Windows cache manager and NTFS
have some rudimentary ability to recover from a power failure, albeit far
from foolproof. The RAID controller has a utility to check the
parity. However, it is not 100% foolproof either. The busier the
controller is at the time of failure (think rebuild), the bigger the
potential problem. IMHO neither situation is desirable if the data is
important & a non-disposable copy.

Now there are RAIDs that use an almost journaling-like flagging
scheme that provides such power-failure protection. I don't believe 3Ware
does that. On the entry-level side of things I think this is easier
to find in OS-level software that protects that way.
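The flagging idea can be sketched in Python as a write-intent log: a stripe is marked dirty before it is written and cleared once data and parity are both on disk, so after a power failure only the stripes still flagged need their parity resynchronized. This is a generic illustration under that assumption, not a description of any particular controller or OS implementation. The class and method names are invented, and the in-memory set stands in for flags that would have to live on disk or in non-volatile memory.

```python
class WriteIntentLog:
    """Toy write-intent log; the set stands in for persistent flags."""

    def __init__(self):
        self.dirty = set()

    def begin_write(self, stripe):
        # The flag must be persisted BEFORE data or parity are touched.
        self.dirty.add(stripe)

    def end_write(self, stripe):
        # Cleared only after both data and parity have reached the disks.
        self.dirty.discard(stripe)

    def stripes_to_resync(self):
        # After a crash, only these stripes can hold stale parity.
        return sorted(self.dirty)


log = WriteIntentLog()

log.begin_write(7)
log.end_write(7)      # completed write, flag cleared

log.begin_write(12)   # power fails here, before end_write(12)

print(log.stripes_to_resync())
```

The design trade-off is an extra flag update per write in exchange for a short, targeted resync after a crash instead of a full parity check of the whole array.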

I sense you are trying to determine the importance of n+1 PSUs. I
think the decision is more involved and based on different factors
than the BBU/controller cache protection issue we're discussing.
 
George,
thanks for spending your time on these questions.

Yes, having settled on the BBU approach and looking for other potential
points of failure, it's natural to look at the PSU.

Indeed, a CPU or mobo/chipset failure can occur, but if I assign a
probability to each potential problem, I would say that the highest
goes to a PSU failure.

Having said that, I recall you were talking about a redundant PSU that could
fit in an ordinary PC case. The server is hosted in a Dell Dimension 4400....
Can you point me to the brand/model you were talking about?

Thanks again
 
Indeed, a CPU or mobo/chipset failure can occur, but if I assign a
probability to each potential problem, I would say that the highest
goes to a PSU failure.

Depends on the PSU, its load, and environmental factors. If it is a
quality PSU with low load (compared to its abilities) run in a clean,
dry, cool computer room, chances are (if it doesn't fail prematurely)
it will last longer than you want/need it to.
Having said that, I recall you were talking about a redundant PSU that could
fit in an ordinary PC case. The server is hosted in a Dell Dimension 4400....
Can you point me to the brand/model you were talking about?

Maybe I was unclear. Mini-redundant PSUs are similarly sized but not
really drop-in compatible. I don't know of one you can simply
drop into a Dimension. But there are many server or workstation type
cases that can accept either a single ATX *or* the mini redundant. You
need the proper opening in the back and typically extra mounting &
extra depth in the case.

At some point you have to determine just how important/valuable your
data and uptime are, as well as what can/should be handled via backup
instead and what is economical in the long run. Redundancies and
improvements can go on forever. If things are really important, move
away from that desktop and migrate to, say, a full-fledged Supermicro
server. Use only drives with the best track record (you may
consider SCSI RAID or tiered storage). Closely evaluate your backups
and your on-site environmental, electrical, and access control/security
arrangements.
 
Depends on the PSU, its load, and environmental factors. If it is a
quality PSU with low load (compared to its abilities) run in a clean,
dry, cool computer room, chances are (if it doesn't fail prematurely)
it will last longer than you want/need it to.

Doesn't have to be a low load in your sense to get that.
Or that other stuff either, just as long as it's within specs.

And the last part after 'chances are' is mindlessly silly.

There are no other possibilities, stupid.
 
Keep trying. One of these days you'll get out of that wet paper bag.
 
Guys, stop fighting!

At least please answer my latest RAID question:

Rebooting while rebuilding/verifying.

What happens if the operating system forces a shutdown, with or without
a reboot?

Is data lost, or is the process just paused so that, once rebooted, it
resumes where it stopped?
 
