= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
© Peter Harrison
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
The main goal of using a Redundant Array of Inexpensive Disks (RAID) is to improve disk performance, to provide data redundancy, or both.
RAID can be handled by the operating system software, or it can be implemented by a purpose-built RAID disk controller card that requires no operating system configuration at all. This chapter explains how to configure the software RAID schemes supported by RedHat Linux 9 and newer.
For the sake of simplicity, this chapter focuses on using RAID for partitions that include neither the /boot nor the root (/) filesystems.
Whether hardware or software based, RAID can be configured using a variety of standards. The most popular ones are listed below.
o In Linear mode RAID, the RAID controller views the RAID set as a “chain” of disks. Data is written to the next device in the chain only after the previous one has been filled.
o The aim of Linear RAID is to accommodate large filesystems spread over multiple devices with no data redundancy. A drive failure will corrupt your data.
o Linear mode RAID is supported by RedHat Linux.
o With RAID 0, the RAID controller tries to evenly distribute data across all disks in the RAID set. A good analogy is to think of a file as a book. In RAID 0, the system tries to place each “page” of the file sequentially on a different disk. In reality, the pages are called “chunks”, and their size is determined when you initially configure RAID. For example, in a RAID set of 3 disks, the first chunk of data in a file will be placed on disk 1, the second chunk on disk 2, the third chunk on disk 3, the fourth chunk back on disk 1, and so on. It is for this reason that RAID 0 is often called “striping”.
o Like Linear RAID, RAID 0 aims to accommodate large file systems spread over multiple devices with no data redundancy. The advantage of RAID 0 is data access speed. A file that is spread over three disks can, in the best case, be read up to three times as fast.
o RAID 0 can accommodate disks of unequal sizes. When RAID runs out of “striping space” on the smallest device, it continues the striping using the available space on the remaining drives. When this occurs, the data access speed is lower for this portion of the data, as fewer drives are available to share the work. It is for this reason that RAID 0 is best used with equal-sized drives.
o RAID 0 is supported by RedHat Linux.
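The difference between Linear mode and RAID 0 placement can be sketched in a few lines of Python. This is only an illustration of the descriptions above, not how the Linux RAID driver is implemented. The function names are made up for this example, and the RAID 0 function assumes all disks are the same size.

```python
def linear_disk_for_offset(offset, disk_sizes):
    """Linear mode: fill each disk in the chain before moving to the next.

    Returns the 1-based number of the disk that holds the given offset.
    """
    for disk_number, size in enumerate(disk_sizes, start=1):
        if offset < size:
            return disk_number
        offset -= size  # skip past this (full) disk
    raise ValueError("offset is beyond the end of the RAID set")


def raid0_disk_for_chunk(chunk_index, num_disks):
    """RAID 0: deal the chunks out to the disks in turn.

    chunk_index counts from 0; the returned disk number counts from 1.
    """
    return (chunk_index % num_disks) + 1


# The first four chunks of a file on a 3-disk RAID 0 set.
print([raid0_disk_for_chunk(i, 3) for i in range(4)])
```

With three disks this prints [1, 2, 3, 1], which matches the example in which the fourth chunk returns to disk 1.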
o With RAID 1, data is cloned on a duplicate disk. This RAID method is therefore frequently called “disk mirroring”.
o When one of the disks in the RAID set fails, the other one continues to function. When the failed disk is replaced, the data is automatically cloned to the new disk from the surviving disk. RAID 1 also offers the possibility of using a “hot standby” spare disk which will be automatically cloned in the event of a disk failure on any of the primary RAID devices.
o RAID 1 offers data redundancy, without the speed advantages of RAID 0. A disadvantage of software-based RAID 1 is that the server has to send the data twice, once to each of the mirror disks. This can saturate data buses and increase CPU utilization. With a hardware-based solution, the server CPU sends the data to the RAID disk controller once, and the disk controller then duplicates the data to the mirror disks. This fact often makes RAID-capable disk controllers the preferred solution when implementing RAID 1.
o A limitation of RAID 1 is that the total RAID size in Gigabytes is equal to that of the smallest disk in the RAID set. Unlike RAID 0, the extra space on the larger device isn’t used.
o RAID 1 is supported by RedHat Linux.
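The size rules described so far can be summarized in a short Python sketch. The function name and the level labels are invented for this illustration, and the sizes can be in any unit as long as it is used consistently. It ignores the small amount of space the RAID driver may reserve for its own bookkeeping.

```python
def usable_size(level, disk_sizes):
    """Approximate usable size of a RAID set, following the rules above."""
    if level in ("linear", "raid0"):
        # Both schemes use all the space on every device.
        return sum(disk_sizes)
    if level == "raid1":
        # Mirroring is limited by the smallest disk; extra space is unused.
        return min(disk_sizes)
    raise ValueError(f"level {level!r} is not covered by this sketch")


print(usable_size("raid0", [40, 60]))
print(usable_size("raid1", [40, 60]))
```

For a 40 GB and a 60 GB disk, this prints 100 for RAID 0 and 40 for RAID 1.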
o Let’s return to the book analogy from our description of RAID 0. RAID 4 operates similarly, but inserts a special error correcting or parity “page” on an additional disk dedicated to this purpose.
o RAID 4 requires at least three disks in the RAID set and can only survive the loss of a single drive. When this occurs, the data on the failed drive can be recreated “on the fly” with the aid of the information on the RAID set’s parity disk. When the failed disk is replaced, it is repopulated with the “lost data” with the help of the parity disk’s information.
o RAID 4 combines the goal of high speed provided by RAID 0 with the redundancy goal of RAID 1. Its major disadvantage is that the data is striped, but the parity information is not. In other words, any data written to any section of the data portion of the RAID set must be followed by an update of the parity disk. The parity disk can therefore act as a bottleneck. It is for this reason that RAID 4 isn’t used very frequently.
o RAID 4 is not supported by RedHat Linux.
o RAID 5 improves on RAID 4 by striping the parity data between all the disks in the RAID set. This avoids the parity disk bottleneck while maintaining many of the speed features of RAID 0 and the redundancy of RAID 1. Like RAID 4, RAID 5 can only survive the loss of a single disk.
o RAID 5 is supported by RedHat Linux.
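This chapter does not describe how the parity “page” is calculated. RAID 4 and RAID 5 parity is conventionally a byte-by-byte XOR of the data chunks in a stripe, and the Python sketch below assumes that scheme to show how a lost chunk can be recreated. The variable and function names are invented for this example, and real implementations work on disk blocks rather than short strings.

```python
from functools import reduce
from operator import xor


def xor_chunks(chunks):
    """XOR equal-length chunks together, one byte position at a time."""
    return bytes(reduce(xor, column) for column in zip(*chunks))


# One stripe: two data chunks plus the parity chunk computed from them.
chunk_on_disk1 = b"page one"
chunk_on_disk2 = b"page two"
parity = xor_chunks([chunk_on_disk1, chunk_on_disk2])

# Disk 2 fails: rebuild its chunk from the surviving chunk and the parity.
rebuilt = xor_chunks([chunk_on_disk1, parity])
print(rebuilt == chunk_on_disk2)
```

The script prints True. The same calculation works whichever single chunk is lost, which is why the set survives one failed drive but not two.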
There are some basic guidelines you may want to follow when setting up RAID.
Most home and SOHO systems will probably use IDE disks. They do have some limitations.
o An IDE cable can only be a few feet long, which generally limits its use to small home systems.
o IDE drives do not support “hot swapping”. You cannot replace them while your system is running.
o Only two devices can be attached per controller.
o The performance of the IDE bus can be degraded by the presence of a second device on the cable.
o The failure of one drive on an IDE bus often causes the malfunctioning of the second device. This can be fatal if you have two IDE drives of the same RAID set attached to the same cable.
It is for these reasons that it is recommended to use only one IDE drive per controller when using RAID, especially in a corporate environment. In a home or SOHO setting, IDE based software RAID may be adequate.
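A quick way to spot two RAID members on the same IDE cable is to look at the drive letters. The Python sketch below assumes the traditional Linux IDE naming, in which /dev/hda and /dev/hdb share the first cable, /dev/hdc and /dev/hdd the second, and so on. The function names are invented for this example, and it does not apply to SCSI devices.

```python
from itertools import combinations


def ide_channel(device):
    """Return the IDE channel (0, 1, 2, ...) for a name such as /dev/hde1."""
    name = device.rsplit("/", 1)[-1]  # "/dev/hde1" -> "hde1"
    if len(name) < 3 or not name.startswith("hd") or not name[2].islower():
        raise ValueError(f"{device!r} does not look like an IDE device name")
    # Two drive letters per channel: a,b -> 0; c,d -> 1; e,f -> 2; ...
    return (ord(name[2]) - ord("a")) // 2


def shared_cables(partitions):
    """Return the pairs of partitions that sit on the same IDE channel."""
    return [
        (first, second)
        for first, second in combinations(partitions, 2)
        if ide_channel(first) == ide_channel(second)
    ]


print(shared_cables(["/dev/hde1", "/dev/hdf2", "/dev/hdg1"]))
```

Under this naming assumption the script prints [('/dev/hde1', '/dev/hdf2')], flagging those two partitions as sharing a cable.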
SCSI hard disks have a number of features that make them more attractive for RAID use.
o SCSI controllers are more tolerant of disk failures. The failure of a single drive is less likely to disrupt the remaining drives on the bus.
o SCSI cables can be several meters long, making them suitable for data center applications.
o Many more than two devices may be connected to a SCSI cable bus.
o Some models of SCSI devices support “hot swapping” which allows you to replace them while the system is running.
However, SCSI drives tend to be more expensive than IDE drives, which may make them less attractive for home use.
It is generally not a good idea to mix RAID-configured partitions and non-RAID partitions on the same disk. The reason is that a disk failure could still incapacitate the system through the unprotected non-RAID partitions.
If you decide to use RAID, all the partitions on each RAID disk should be part of a RAID set. Many people simplify this problem by filling each disk of a RAID set with only one partition.
You will first need to identify two or more partitions, each on a separate disk. If you are doing RAID 0 or RAID 5, the partitions should be of approximately the same size, because in this scenario RAID limits the data area used on each partition to the size of the smallest partition in the RAID set.
In this example we’ll be configuring RAID 5 using a system with three pre-partitioned hard disks. The partitions to be used will be:
o /dev/hde1
o /dev/hdf2
o /dev/hdg1
· Each partition used for RAID must be changed to type FD (Linux raid autodetect). This can be done using fdisk. Here is an example using /dev/hde1.
[root@bigboy updates]# fdisk /dev/hde
The number of cylinders for this disk is set to 8355.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Command (m for help):
· We now use the fdisk “m” command to get some help
Command (m for help): m
Command action
a toggle a bootable flag
b edit bsd disklabel
c toggle the dos compatibility flag
d delete a partition
l list known partition types
m print this menu
n add a new partition
o create a new empty DOS partition table
p print the partition table
q quit without saving changes
s create a new empty Sun disklabel
t change a partition's system id
u change display/entry units
v verify the partition table
w write table to disk and exit
x extra functionality (experts only)
Command (m for help):
· Partition /dev/hde1 is the 1st partition on disk /dev/hde. We now modify its “type” using the “t” command and then specifying the partition number and type code. We also use the “L” command to get a full listing of ID types in case we forget.
Command (m for help): t
Partition number (1-5): 1
Hex code (type L to list codes): L
 0  Empty           1c  Hidden Win95 FA 70  DiskSecure Mult bb  Boot Wizard hid
 1  FAT12           1e  Hidden Win95 FA 75  PC/IX           be  Solaris boot
 2  XENIX root      24  NEC DOS         80  Old Minix       c1  DRDOS/sec (FAT-
 3  XENIX usr       39  Plan 9          81  Minix / old Lin c4  DRDOS/sec (FAT-
 4  FAT16 <32M      3c  PartitionMagic  82  Linux swap      c6  DRDOS/sec (FAT-
 5  Extended        40  Venix 80286     83  Linux           c7  Syrinx
 6  FAT16           41  PPC PReP Boot   84  OS/2 hidden C:  da  Non-FS data
 7  HPFS/NTFS       42  SFS             85  Linux extended  db  CP/M / CTOS / .
 8  AIX             4d  QNX4.x          86  NTFS volume set de  Dell Utility
 9  AIX bootable    4e  QNX4.x 2nd part 87  NTFS volume set df  BootIt
 a  OS/2 Boot Manag 4f  QNX4.x 3rd part 8e  Linux LVM       e1  DOS access
 b  Win95 FAT32     50  OnTrack DM      93  Amoeba          e3  DOS R/O
 c  Win95 FAT32 (LB 51  OnTrack DM6 Aux 94  Amoeba BBT      e4  SpeedStor
 e  Win95 FAT16 (LB 52  CP/M            9f  BSD/OS          eb  BeOS fs
 f  Win95 Ext'd (LB 53  OnTrack DM6 Aux a0  IBM Thinkpad hi ee  EFI GPT
10  OPUS            54  OnTrackDM6      a5  FreeBSD         ef  EFI (FAT-12/16/
11  Hidden FAT12    55  EZ-Drive        a6  OpenBSD         f0  Linux/PA-RISC b
12  Compaq diagnost 56  Golden Bow      a7  NeXTSTEP        f1  SpeedStor
14  Hidden FAT16 <3 5c  Priam Edisk     a8  Darwin UFS      f4  SpeedStor
16  Hidden FAT16    61  SpeedStor       a9  NetBSD          f2  DOS secondary
17  Hidden HPFS/NTF 63  GNU HURD or Sys ab  Darwin boot     fd  Linux raid auto
18  AST SmartSleep  64  Novell Netware  b7  BSDI fs         fe  LANstep
1b  Hidden Win95 FA 65  Novell Netware  b8  BSDI swap       ff  BBT
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)
Command (m for help):
· Use the “p” command to get the new proposed partition table
Command (m for help): p
Disk /dev/hde: 4311 MB, 4311982080 bytes
16 heads, 63 sectors/track, 8355 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes
   Device Boot    Start       End    Blocks   Id  System
/dev/hde1             1      4088   2060320+  fd  Linux raid autodetect
/dev/hde2          4089      5713    819000   83  Linux
/dev/hde3          5714      6607    450576   83  Linux
/dev/hde4          6608      8355    880992    5  Extended
/dev/hde5          6608      7500    450040+  83  Linux
/dev/hde6          7501      8355    430888+  83  Linux
Command (m for help):
· Use the “w” command to permanently save the changes to disk /dev/hde.
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table.
The new table will be used at the next reboot.
Syncing disks.
[root@bigboy updates]#
· The error above will occur if any of the other partitions on the disk is mounted.
· For the sake of brevity, I won’t show the process to do this, but the steps for changing the IDs for /dev/hdf2 and /dev/hdg1 are very similar.
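Once the type has been changed on every disk, it is worth confirming that each RAID partition really shows an Id of fd. The Python sketch below picks those partitions out of the text printed by the fdisk “p” command. It assumes the column layout shown in the listing above, with an optional “*” in the Boot column, and the function name is invented for this example.

```python
def raid_autodetect_partitions(fdisk_listing):
    """Return the devices whose Id column is fd in fdisk "p" output."""
    found = []
    for line in fdisk_listing.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("/dev/"):
            continue  # header, blank line, or "Disk ..." summary line
        if len(fields) > 1 and fields[1] == "*":
            fields.pop(1)  # drop the bootable flag so the columns line up
        # Remaining columns: device, start, end, blocks, id, system...
        if len(fields) >= 5 and fields[4].lower() == "fd":
            found.append(fields[0])
    return found


listing = """\
   Device Boot    Start       End    Blocks   Id  System
/dev/hde1             1      4088   2060320+  fd  Linux raid autodetect
/dev/hde2          4089      5713    819000   83  Linux
"""
print(raid_autodetect_partitions(listing))
```

For the two lines of sample output pasted into the script, it prints ['/dev/hde1'].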
The Linux RAID configuration file is /etc/raidtab. A good template file for /etc/raidtab can be found in the man pages; simply issue the command “man raidtab”.
· When configuring RAID 5 a “parity-algorithm” setting must be used.
· The “raid-disk” parameters for each partition in the /etc/raidtab file are numbered starting at “0”.
· The /etc/raidtab “persistent-superblock” must be set to “1” in order for the RAID autodetect feature (partition type FD) to work.
· We configure RAID 5 using each of the desired partitions on the 3 disks.
· All other RAID levels use a “persistent-superblock” setting of “0”.
· The set of 3 RAID disks will be called /dev/md0.
#
# sample raiddev configuration file
# RAID 5 array built from three partitions
#
raiddev /dev/md0
    raid-level              5
    nr-raid-disks           3
    persistent-superblock   1
    chunk-size              32
    parity-algorithm        left-symmetric
    device                  /dev/hde1
    raid-disk               0
    device                  /dev/hdf2
    raid-disk               1
    device                  /dev/hdg1
    raid-disk               2
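Before running mkraid, the rules listed above can be checked mechanically. The Python sketch below parses raidtab-style text and reports the mistakes this chapter warns about. It is not part of the raidtools package, it understands only the keywords used in the sample file, and the function names are invented for this example.

```python
def parse_raidtab(text):
    """Parse raidtab-style text into {md device: {"options", "disks"}}."""
    arrays = {}
    current = None
    pending_device = None
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected 'keyword value', got {line!r}")
        key, value = parts[0], parts[1].strip()
        if key == "raiddev":
            current = arrays.setdefault(value, {"options": {}, "disks": []})
            pending_device = None
        elif current is None:
            raise ValueError("setting found before any raiddev line")
        elif key == "device":
            pending_device = value  # its raid-disk number follows
        elif key == "raid-disk":
            current["disks"].append((pending_device, int(value)))
        else:
            current["options"][key] = value
    return arrays


def check_array(array):
    """Return a list of problems found in one parsed raiddev section."""
    problems = []
    options = array["options"]
    indexes = sorted(index for _, index in array["disks"])
    if indexes != list(range(len(indexes))):
        problems.append("raid-disk numbers must start at 0 with no gaps")
    if int(options.get("nr-raid-disks", -1)) != len(indexes):
        problems.append("nr-raid-disks does not match the raid-disk entries")
    if options.get("raid-level") == "5" and "parity-algorithm" not in options:
        problems.append("RAID 5 needs a parity-algorithm setting")
    return problems
```

Calling check_array(parse_raidtab(text)["/dev/md0"]) on the sample file returns an empty list, meaning none of these particular mistakes were found.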
The mkraid command creates the RAID set by reading the /etc/raidtab file. In this case we want to create the logical RAID device /dev/md0.
[root@bigboy root]# mkraid /dev/md0
handling MD device /dev/md0
analyzing super-block
[root@bigboy root]#
Your new RAID device will now have to be formatted. In the example below:
o We use the “-j” option to ensure that a journaling filesystem is created.
o A block size of 4KB (4096 bytes) is used, with each chunk comprising 8 blocks. It is very important that the “chunk-size” parameter in the /etc/raidtab file matches the block size multiplied by the stride value in the command below. Here, 4KB x 8 = 32KB, which matches the “chunk-size” of 32 in /etc/raidtab. Note: If the values don’t match, then you will get parity errors.
[root@bigboy root]# mke2fs -j -b 4096 -R stride=8 /dev/md0
mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
516096 inodes, 1030160 blocks
51508 blocks (5.00%) reserved for the super user
First data block=0
32 block groups
32768 blocks per group, 32768 fragments per group
16128 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736
Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 26 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
[root@bigboy root]#
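The stride value passed to mke2fs follows directly from the chunk size and the block size, so it can be worked out rather than guessed. The Python sketch below shows the arithmetic. It assumes that “chunk-size” in /etc/raidtab is given in kilobytes, and the function name is invented for this example.

```python
def stride_for(chunk_size_kb, block_size_bytes):
    """Return the number of filesystem blocks that make up one RAID chunk."""
    chunk_bytes = chunk_size_kb * 1024
    if chunk_bytes % block_size_bytes != 0:
        raise ValueError("chunk size must be a whole number of blocks")
    return chunk_bytes // block_size_bytes


# chunk-size 32 in /etc/raidtab, 4096-byte blocks in the mke2fs command
print(stride_for(32, 4096))
```

This prints 8, the stride used in the mke2fs command above.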
The next step is to make the Linux operating system fully aware of the RAID set by loading the driver for the new RAID set using the raidstart command.
[root@bigboy root]# raidstart /dev/md0
[root@bigboy root]#
The next step is to create a mount point for /dev/md0. In this case we’ll create one called /mnt/raid
[root@bigboy mnt]# mkdir /mnt/raid
We’ll now add an entry for the /dev/md0 device to the /etc/fstab file. Here is an example of a line that could be used:
/dev/md0 /mnt/raid ext3 defaults 1 2
Note: It is very important that you DO NOT use labels in the /etc/fstab file for RAID devices, just use the real device name such as “/dev/md0”. On startup, the /etc/rc.d/rc.sysinit script checks the /etc/fstab file for device entries that match RAID set names in the /etc/raidtab file. It will not automatically start the RAID set driver for the RAID set if it doesn’t find a match. Device mounting then occurs later on in the boot process. Mounting a RAID device that doesn’t have a loaded driver can corrupt your data giving the error below.
Starting up RAID devices: md0(skipped)
Checking filesystems
/raiddata: Superblock has a bad ext3 journal(inode8)
CLEARED.
***journal has been deleted - file system is now ext 2 only***
/raiddata: The filesystem size (according to the superblock) is 2688072 blocks.
The physical size of the device is 8960245 blocks.
Either the superblock or the partition table is likely to be corrupt!
/boot: clean, 41/26104 files, 12755/104391 blocks
/raiddata: UNEXPECTED INCONSISTENCY; Run fsck manually (ie without -a or -p options).
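A rough version of the check described in the note can be run by hand before rebooting. The Python sketch below takes the text of /etc/fstab and a list of RAID device names, and reports any RAID device that is not listed by its real device name, for example because a LABEL= entry was used instead. It only imitates the idea behind the startup check, it is not the rc.sysinit logic itself, and the function name is invented for this example.

```python
def raid_sets_missing_from_fstab(fstab_text, raid_devices):
    """Return the RAID devices that fstab does not list by device name."""
    listed = set()
    for line in fstab_text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # blank line or comment
        listed.add(fields[0])  # first column: device name or LABEL=...
    return [device for device in raid_devices if device not in listed]


good = "/dev/md0 /mnt/raid ext3 defaults 1 2\n"
risky = "LABEL=/mnt/raid /mnt/raid ext3 defaults 1 2\n"
print(raid_sets_missing_from_fstab(good, ["/dev/md0"]))
print(raid_sets_missing_from_fstab(risky, ["/dev/md0"]))
```

The first call prints an empty list. The second prints ['/dev/md0'], because the labelled entry hides the device name.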
Before mounting, make sure the driver for the RAID set is loaded. This is done using the raidstart command.
[root@bigboy root]# raidstart /dev/md0
The mount command can now be used to mount the RAID set.
The mount command’s “-a” flag will cause Linux to mount all the devices in the /etc/fstab file that have automounting enabled (default) and that are also not already mounted.
[root@bigboy root]# mount -a
You can also mount the device manually.
[root@bigboy root]# mount /dev/md0 /mnt/raid
The /proc/mdstat file provides the current status of all the RAID devices. When the RAID driver is stopped, the file has very little information, as seen below.
[root@bigboy root]# raidstop /dev/md0
[root@bigboy root]# cat /proc/mdstat
Personalities : [raid5]
read_ahead 1024 sectors
unused devices: <none>
[root@bigboy root]#
More information, including the partitions of the RAID set, is provided once the driver is loaded using the raidstart command.
[root@bigboy root]# raidstart /dev/md0
[root@bigboy root]# cat /proc/mdstat
Personalities : [raid5]
read_ahead 1024 sectors
md0 : active raid5 hdg1[2] hde1[1] hdf2[0]
4120448 blocks level 5, 32k chunk, algorithm 3 [3/3] [UUU]
unused devices: <none>
[root@bigboy root]#
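If you want to check the RAID set from a script, the /proc/mdstat text is simple enough to parse. The Python sketch below extracts the member partitions and the status string for each md device. It assumes the layout shown above, and it assumes that a “U” in the final brackets marks a working member and a “_” a missing one. The function name is invented for this example.

```python
import re


def parse_mdstat(text):
    """Return {md name: {"state", "level", "members", "status"}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(\w+)\s+(\w+)\s+(.*)$", line)
        if header:
            name, state, level, members = header.groups()
            current = {
                "state": state,
                "level": level,
                # "hdg1[2]" -> "hdg1"
                "members": [m.split("[")[0] for m in members.split()],
                "status": None,
            }
            arrays[name] = current
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if status and current is not None:
            current["status"] = status.group(1)  # e.g. "UUU"
    return arrays


sample = """\
Personalities : [raid5]
read_ahead 1024 sectors
md0 : active raid5 hdg1[2] hde1[1] hdf2[0]
4120448 blocks level 5, 32k chunk, algorithm 3 [3/3] [UUU]
unused devices: <none>
"""
md0 = parse_mdstat(sample)["md0"]
print(md0["members"], md0["status"])
```

For the sample text this prints ['hdg1', 'hde1', 'hdf2'] UUU. A status containing “_” would be a sign that the set is running without one of its members.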