High Disk Read Performance

Post by dbda » Wed, 26 May 2004 04:51:35

I need to architect 2TB of disk storage with absolute maximum read
capabilities. I have an application that sequentially reads and
analyzes large flat database files, and the faster I can read them the
faster I can analyze. CPU and memory are not a factor - currently the
disk "reads" are my bottleneck.

What can I do to push the envelope?


Do I use lots of smaller disks or fewer large disks?
What RAID config?
Lots of cache or no cache?
Multiple SCSI/FC controllers?

I am open to any ideas (assuming the cost is reasonable)

My growth will be from 2TB to 4TB over the next three years.

I would like to achieve disk "reads" around 200MB/sec - 300MB/sec
(Mega Bytes not bits!)

High Disk Read Performance

Post by Malcolm We » Wed, 26 May 2004 10:21:02

Lots of fast disks. The size is probably irrelevant (to get the
performance you want you'll likely need lots of spindles, and the
capacity just comes along with that).

RAID 3 or RAID 5. Probably RAID 5, but it depends on the RAID manufacturer.

Cache won't help much, so "lots" won't help, but some will (to
decouple the back-end/front-end transfers).


2Gb Fibre Channel seems the way to go.

That's entirely manageable.

The biggest issue *might* turn out to be the filesystem!

Still, I'd advocate splitting the storage (as seen by the host) into 4
entities, each attached via 2Gb/s Fibre. Depending on your RAID
vendor, those 4 channels may go to 1, 2, or 4 RAID controllers, each
with a number of disks.

Your environment doesn't make much use of cache, so you could even
architect the thing as four separate units, of 1TB each, with each
needing to deliver 50-75MB/sec.

Which is not hard to do, at least with SGI's XFS filesystem, and
probably with most other similar class products.
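The arithmetic behind the four-way split is worth making explicit; a trivial sketch (function name is mine) checking that four independent units at 50-75MB/sec each cover the 200-300MB/sec target:

```python
def aggregate_rate(n_units, per_unit_mb_s):
    """Total sequential read rate when each unit streams independently
    over its own channel (no shared bottleneck assumed)."""
    return n_units * per_unit_mb_s

# Four 1TB units, each delivering 50-75 MB/s
low = aggregate_rate(4, 50)   # 200 MB/s - bottom of the target range
high = aggregate_rate(4, 75)  # 300 MB/s - top of the target range
```

The key assumption is that the four 2Gb/s Fibre channels really are independent, i.e. they don't funnel through one saturated PCI bus or controller.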

Or you could use two RAID systems, each with dual controllers, which
decreases the number of points of failure... each RAID system should
have two arrays, one mapped to each controller (i.e. back to the 4
channels described above).
To be comfortable, I'd suggest using 9-drive RAID-5 arrays of 15K
rpm 73GB drives, for ~580GB usable per array. Expansion would be by
adding more shelves of drives (i.e. twice as many disks) in their own
arrays.
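The ~580GB figure follows from RAID-5 keeping one drive's worth of parity across the set; a quick sketch (the function name is mine):

```python
def raid5_usable_gb(n_drives, drive_gb):
    """RAID-5 usable capacity: the parity overhead equals exactly
    one drive, regardless of how wide the array is."""
    return (n_drives - 1) * drive_gb

# A 9-drive array of 73GB disks: 8 data drives * 73GB = 584GB usable,
# matching the ~580GB figure above
usable = raid5_usable_gb(9, 73)
```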



High Disk Read Performance

Post by robertwess » Wed, 26 May 2004 12:21:20

If you have a single volume and go RAID1/0 or RAID5 (using as big a
stripe as you can), your job becomes tricking the OS into scheduling
reads far enough into the future to keep all the disks reading. You can
often come close to forcing this by always reading n*disks*stripe
ahead in a separate thread (where n is 2-5), and having enough buffer
in memory to hold that amount of data.
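The read-ahead thread idea can be sketched roughly as below. This is a minimal illustration, not a tuned implementation; the disk count, stripe size, and n are hypothetical placeholders you'd set to match your array:

```python
import threading
import queue

# Hypothetical geometry: 8 data disks, 64 KiB stripe, n = 4
N_DISKS = 8
STRIPE = 64 * 1024
N = 4
READAHEAD = N * N_DISKS * STRIPE  # bytes kept in flight ahead of the analyzer

def reader(path, chunks):
    """Read-ahead thread: pulls large sequential chunks off disk into a
    bounded queue, staying READAHEAD bytes in front of the consumer."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(READAHEAD)
            if not buf:
                break
            chunks.put(buf)   # blocks when the in-memory buffer is full
    chunks.put(None)          # end-of-file sentinel

def analyze(path):
    # A queue depth of 2 holds roughly two read-ahead windows in memory
    chunks = queue.Queue(maxsize=2)
    t = threading.Thread(target=reader, args=(path, chunks), daemon=True)
    t.start()
    total = 0
    while True:
        buf = chunks.get()
        if buf is None:
            break
        total += len(buf)     # the real analysis would process buf here
    t.join()
    return total
```

In practice the chunk size and queue depth would be tuned so the OS and controller always have outstanding sequential requests for every spindle.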

You can get pretty much the same read-only performance from RAID0,
RAID1/0 or RAID5, although the number of disks required and the level
of redundancy change. RAID5 can (with the right controller and
whatnot) match a pure striped (RAID0) approach in sequential read
performance, with minimal cost overhead, and still provide redundancy.

If you have multiple volumes, you can get the same effect strictly at
the application level by writing your files in chunks to the different
volumes/drives. Then have one or more threads reading ahead on all
the files in parallel.
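The application-level variant might look like this sketch: one thread per volume, each streaming its own chunk file, reassembled in order at the end. The paths are hypothetical (e.g. one file per mounted volume):

```python
import threading

def parallel_read(paths, chunk_size=1 << 20):
    """Read one chunk file per volume concurrently, so every spindle
    group stays busy, then reassemble the chunks in original order."""
    results = [None] * len(paths)

    def worker(i, path):
        parts = []
        with open(path, "rb") as f:
            while True:
                buf = f.read(chunk_size)
                if not buf:
                    break
                parts.append(buf)
        results[i] = b"".join(parts)

    threads = [threading.Thread(target=worker, args=(i, p))
               for i, p in enumerate(paths)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b"".join(results)  # chunks back in the order they were written
```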

You will have issues with total throughput. You'll certainly run out
of bandwidth on a single SCSI or 1 or 2Gb FC channel, and you'll
probably not be able to drive a single PCI bus to those levels unless
it's a 133x66 PCI-X slot, which just moves the question back to the
rest of the system.
It's certainly possible with most mid-sized server hardware and
multiple HBAs (on separate PCI busses), driving at least 8-10 disks
(perhaps fewer with careful tuning). You'll clearly want disks with
fast sequential read rates, so fast SCSI drives are probably
the answer, unless you can find enough ATA RAID controllers that can
sustain the parallel reads and required bus bandwidth, or manage to
get them on enough different PCI busses (I have my doubts). Another
issue with less-than-high-end SCSI/FC HBAs will be OS overhead for the
I/O load.
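The bus arithmetic behind the PCI warning can be checked quickly; peak PCI/PCI-X bandwidth is just clock times bus width (the function is my sketch, assuming a 64-bit/133MHz PCI-X slot):

```python
def pci_bandwidth_mb_s(clock_mhz, width_bits):
    """Theoretical peak PCI/PCI-X transfer rate: clock * bus width.
    Real sustained throughput is lower due to arbitration and overhead."""
    return clock_mhz * (width_bits // 8)

# Conventional 33MHz/32-bit PCI: ~132 MB/s peak - below the 200-300 MB/s target
# PCI-X at 133MHz/64-bit: ~1064 MB/s peak - enough headroom for one such slot
legacy = pci_bandwidth_mb_s(33, 32)
pcix = pci_bandwidth_mb_s(133, 64)
```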

I'm just curious, however: how much processing are you going to be able
to do at those data rates? The OS overheads alone will probably eat most
of a 3GHz P4.

High Disk Read Performance

Post by Stephane G » Wed, 26 May 2004 16:04:39

The solution could depend on your app.
What kind of data is it? Biological? Geographic? Pictures? Unstructured
text, like web pages?
You need sequential scans; are you doing data mining or information retrieval?

>> CPU and memory are not a factor - currently the disk "reads" are my bottleneck.

Use only half the capacity of each disk anyway. Each will deliver up to
50MB/s in sequential read; filling the disk will cause the throughput
to fall quickly when reading the end (the inner tracks).

You don't need much cache for sequential reads.

>> I would like to achieve disk "reads" around 200MB/sec - 300MB/sec
>> (Mega Bytes not bits!)

what about data compression?

High Disk Read Performance

Post by dave dicke » Thu, 27 May 2004 08:28:19

You'll also need more than one Fiber Channel card and some way to
distribute the I/O across the channels (e.g. powerpath, dmp, hdlm, ...)

What OS? Do you need a file system, or could your source file be a raw
device?

High Disk Read Performance

Post by Ron Reaug » Thu, 27 May 2004 09:43:06

You seem to be starting at the wrong end of the problem. The prior issue is
the processing and analysis application needed to deal with a data flow of
that magnitude.
I'm guessing that a quad CPU or larger box may be required. Can the
analysis app apply usefully a number of CPUs to your single task?

You'll need system bus[es] and PCI[-X] fast enough to support the desired
data rate. After that, there are a number of HD array technologies
capable of delivering high sequential data rates. For the flat file
(sequential read) case, [S]ATA HDs are by far the most cost-effective
choice. A German university recently set up a multi-TB array based on
ATA HDs purely as a target for backups of other systems, and that's
primarily a sequential I/O situation. Then comes the question of the
redundancy (reliability) required for your disk array.

Since you describe it as multi-TB, some version of RAID 5 seems
suitable, and probably several such arrays. Another factor in the storage
system design is backup and the real time it takes to complete a backup or
restore. Backup may be the primary determinant of your storage system
design.
Is this a standalone single user analysis situation(one pass at a time) or
might there be several unrelated analyses happening concurrently?

High Disk Read Performance

Post by Joshua Bak » Sat, 29 May 2004 23:40:31

In article <XXXX@XXXXX.COM>, DB wrote:
A lower cost option is lots of IDE disks attached to multiple 3ware
controllers. I've got two boxes set up using dual Xeons on Supermicro
MBs, 2 3ware 7500-8 boards (on separate PCI busses), and 16 180GB WD 7200
RPM drives. The 3wares are set up for hardware RAID5 w/ a hot spare,
and then I do a RAID0 stripe across the two arrays in software (Linux is
the OS). Here are some benchmarks from bonnie++:

[jlb@buckbeak tmp]$ bonnie++ -s 8192
Version 1.02c ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
buckbeak 8G 20485 76 55244 17 27481 10 27383 97 365660 81 446.0 1
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 2168 19 +++++ +++ 2710 23 1196 11 +++++ +++ 3178 32

And tiobench:

Unit information
File size = megabytes
Blk Size = bytes
Rate = megabytes per second
CPU% = percentage of CPU used during the test
Latency = milliseconds
Lat% = percent of requests that took longer than X seconds
CPU Eff = Rate divided by CPU% - throughput per cpu load

Sequential Reads
Identifier               File Blk  Num  Rate   (CPU%)  Avg     Maximum  Lat%    Lat%    CPU
                         Size Size Thr                 Latency Latency  >2s     >10s    Eff
------------------------ ---- ---- ---  ------ ------  ------- -------  ------- ------- ---
2.4.20-19.7.XFS1.3.0smp  4096 4096  1   276.27 81.68%  0.013    99.82   0.00000 0.00000 338
2.4.20-19.7.XFS1.3.0smp  4096 4096  2   275.17 102.6%  0.027   106.56   0.00000 0.00000 268
2.4.20-19.7.XFS1.3.0smp  4096 4096  4   225.11 108.5%  0.067   256.05   0.00000 0.00000 207
2.4.20-19.7.XFS1.3.0smp  4096 4096  8   221.65 111.8%  0.132   217.67   0.00000 0.00000 198

Random Reads
Identifier               File Blk  Num  Rate   (CPU%)  Avg     Maximum  Lat%    Lat%    CPU
                         Size Size Thr                 Latency Latency  >2s     >10s    Eff
------------------------ ---- ---- ---  ------ ------  ------- -------  ------- ------- ---
2.4.20-19.7.XFS1.3.0smp  4096 4096  1     1.13 2.676%  3.454    54.82   0.00000 0.00000  42
2.4.20-19.7.XFS1.3.0smp  4096 4096  2     1.93 16.33%  4.002    53.27   0.00000 0.00000  12
2.4.20-19.7.XFS1.3.0smp  4096 4096  4     2.94 20.87%  4.971    78.67   0.00000 0.00000  14
2.4.20-19.7.XFS1.3.0smp  4096 4096  8     4.61 26.52%  5.989   103.03   0.00000 0.00000  17

Sequential Writes
Identifier               File Blk  Num  Rate   (CPU%)  Avg     Maximum  Lat%    Lat%    CPU
                         Size Size Thr                 Latency Latency  >2s     >10s    Eff
---------------------------- ------ ----- --- ---

High Disk Read Performance

Post by Joshua Bak » Sat, 29 May 2004 23:41:16

Following up to myself (sigh): XFS is the filesystem.

Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

High Disk Read Performance

Post by dbda » Sat, 05 Jun 2004 03:01:56

Thank you all for your comments - they have been very useful.

I am going to do the following:

1. Purchase a quad-CPU box with three PCI-X buses. I will install an
Ultra 320 SCSI PCI-X controller with 256MB cache in each bus.
2. Each controller will connect to a 14-disk SCSI drive shelf with 14
x 73GB 15K Ultra 320 drives configured as RAID5.
3. Using some software I will create a single large partition that
spans all three RAIDs. I will initially try this using a Windows
extended volume set for three dynamic disks. I will also evaluate
Veritas Volume Manager to see if there is a big difference. (..and
the database app may be able to span the partitions itself.)
4. I will set the cache to 100% reads on each controller.

Is there anything else I can do? Any suggestions on the stripe sizes
in the RAIDS at both hardware and OS level?

Again - thanks for all your feedback.