Chapter 2

Linux Software RAID

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

In This Chapter

RAID Types

Before You Start

Configuring Software RAID

 

© Peter Harrison

 

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

The main goals of using a Redundant Array of Inexpensive Disks (RAID) are to improve disk performance, to provide data redundancy, or both.

 

RAID can be handled by the operating system software, or it can be implemented by a purpose built RAID disk controller card that requires no RAID configuration in the operating system at all. This chapter explains how to configure the software RAID schemes supported by RedHat Linux 9 and newer.

 

For the sake of simplicity, this chapter focuses on using RAID for partitions that include neither the /boot nor the root (/) filesystem.

RAID Types

Whether hardware or software based, RAID can be configured using a variety of standards. The most popular ones are listed below.

Linear Mode RAID

o        In this scheme, the RAID controller views the RAID set as a “chain” of disks. Data is only written to the next device in the chain after the previous one has been filled.

o        The aim of Linear RAID is to accommodate large filesystems spread over multiple devices with no data redundancy. A drive failure will corrupt your data.

o        Linear mode RAID is supported by RedHat Linux.
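The “chain” behavior of Linear mode can be illustrated with a small shell sketch. The function name linear_disk and the block counts are made up for this illustration and are not part of any RAID tool. Given a block offset and the size of each device in the chain, it reports which device holds that block.

```shell
# linear_disk OFFSET SIZE1 SIZE2 ...
# OFFSET is a 0-based block number within the whole RAID set.
# Prints the 1-based position in the chain of the device that holds it.
# Returns 1 if OFFSET lies beyond the end of the last device.
linear_disk() {
    offset=$1
    shift
    n=1
    for size in "$@"; do
        if [ "$offset" -lt "$size" ]; then
            echo "$n"
            return 0
        fi
        # This device is full; carry the remainder to the next one.
        offset=$((offset - size))
        n=$((n + 1))
    done
    return 1
}

# A chain of a 100 block device followed by a 200 block device.
linear_disk 50 100 200
linear_disk 150 100 200
```

The first call lands on device 1 and the second on device 2, because device 2 is only used once device 1 has been filled.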

RAID 0

o        The RAID controller tries to evenly distribute data across all disks in the RAID set. A good analogy would be to envisage a file as if it were a book. In RAID 0, the system will try to sequentially place each “page” of the file on a different disk. In reality, the pages are called “chunks” whose size is determined when you initially configure RAID. So for example, in a RAID set of 3 disks, the first “chunk” of data in a file will be placed on disk 1, the second chunk will be on disk 2, the third chunk will be on disk 3, the fourth chunk will then be placed on disk 1, etc. It is for this reason that RAID 0 is often called “striping”.

o        Like Linear RAID, RAID 0 aims to accommodate large filesystems spread over multiple devices with no data redundancy. The advantage of RAID 0 is data access speed. A file that is spread over three disks can, in the best case, be read up to three times as fast.

o        RAID 0 can accommodate disks of unequal sizes. When RAID runs out of “striping space” on the smallest device, it then continues the striping using the available space on the remaining drives. When this occurs, the data access speed is lower for this portion of data as the total number of RAID drives available is reduced. It is for this reason that RAID 0 is best used with equal sized drives.

o        RAID 0 is supported by RedHat Linux.
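The round-robin placement of chunks described above can be sketched in a few lines of shell. The function name chunk_to_disk is hypothetical, and the sketch assumes equal sized disks so that every stripe spans all of them.

```shell
# chunk_to_disk CHUNK NDISKS
# CHUNK is the 0-based index of a chunk within the RAID set.
# Prints the 1-based number of the disk that holds it when chunks are
# dealt out to the disks in strict rotation.
chunk_to_disk() {
    echo $(( $1 % $2 + 1 ))
}

# Show where the first six chunks of a file land in a 3 disk set.
for chunk in 0 1 2 3 4 5; do
    echo "chunk $((chunk + 1)) -> disk $(chunk_to_disk "$chunk" 3)"
done
```

As in the book analogy, the fourth chunk wraps around to disk 1.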

RAID 1

o        With RAID 1, data is cloned on a duplicate disk. This RAID method is therefore frequently called “disk mirroring”.

o        When one of the disks in the RAID set fails, the other one continues to function. When the failed disk is replaced, the data is automatically cloned to the new disk from the surviving disk. RAID 1 also offers the possibility of using a “hot standby” spare disk which will be automatically cloned in the event of a disk failure on any of the primary RAID devices.

o        RAID 1 offers data redundancy, without the speed advantages of RAID 0. A disadvantage of software based RAID 1 is that the server has to send data twice, once to each of the mirror disks. This can saturate data busses and increase CPU utilization. With a hardware based solution, the server CPU sends the data to the RAID disk controller once, and the disk controller then duplicates the data to the mirror disks. This often makes RAID capable disk controllers the preferred solution when implementing RAID 1.

o        A limitation of RAID 1 is that the total RAID size in Gigabytes is equal to that of the smallest disk in the RAID set. Unlike RAID 0, the extra space on the larger device isn’t used.

o        RAID 1 is supported by RedHat Linux.

RAID 4

o        Let’s return to the book analogy from our description of RAID 0. RAID 4 operates similarly, but inserts a special error correcting or parity “page” on an additional disk dedicated to this purpose.

o        RAID 4 requires at least three disks in the RAID set and can only survive the loss of a single drive. When this occurs, the lost data can be recreated “on the fly” with the aid of the information on the RAID set’s parity disk. When the failed disk is replaced, it is repopulated with the “lost data” with the help of the parity disk’s information.

o        RAID 4 combines the high speed goal of RAID 0 with the redundancy goal of RAID 1. Its major disadvantage is that the data is striped, but the parity information is not. In other words, any data written to any section of the data portion of the RAID set must be followed by an update of the parity disk. The parity disk can therefore act as a bottleneck. It is for this reason that RAID 4 isn’t used very frequently.

o        RAID 4 is not supported by RedHat Linux.
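The way a parity “page” allows lost data to be recreated can be shown with shell arithmetic. This is only a sketch: it assumes the conventional XOR parity used by RAID 4 and RAID 5, and it uses single byte values (chosen arbitrarily) in place of whole chunks.

```shell
# Two data "pages", one per data disk, reduced to a single byte each.
d1=165
d2=60

# The parity page is the XOR of the data pages.
parity=$((d1 ^ d2))

# If the disk holding d1 fails, XOR the surviving data page with the
# parity page to get the lost value back.
rebuilt_d1=$((parity ^ d2))

printf 'd1=%d d2=%d parity=%d rebuilt_d1=%d\n' "$d1" "$d2" "$parity" "$rebuilt_d1"
```

The same calculation works for any single missing page, which is why the set survives the loss of one drive but not two.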

RAID 5

o        RAID 5 improves on RAID 4 by striping the parity data between all the disks in the RAID set. This avoids the parity disk bottleneck while maintaining many of the speed features of RAID 0 and the redundancy of RAID 1. Like RAID 4, RAID 5 can only survive the loss of a single disk.

o        RAID 5 is supported by RedHat Linux.
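The space trade-offs of the RAID types above can be compared with a rough calculation. The function raid_capacity below is a hypothetical helper, not a RAID utility. It ignores the small amount of space RAID reserves for its own bookkeeping, and it assumes that RAID 4 and RAID 5 give up the equivalent of one member (sized like the smallest) to parity.

```shell
# raid_capacity LEVEL SIZE1 SIZE2 ...
# LEVEL is one of: linear 0 1 4 5. Sizes may be in any unit, as long as
# it is the same for all members. Prints the approximate usable size.
raid_capacity() {
    level=$1
    shift
    count=$#
    smallest=$1
    total=0
    for size in "$@"; do
        [ "$size" -lt "$smallest" ] && smallest=$size
        total=$((total + size))
    done
    case $level in
        linear|0) echo "$total" ;;                           # all space is used
        1)        echo "$smallest" ;;                        # every member is a copy
        4|5)      echo $(( (count - 1) * smallest )) ;;      # one member's worth of parity
        *)        return 1 ;;
    esac
}

raid_capacity 0 100 200 300
raid_capacity 1 100 200
raid_capacity 5 100 200 300
```

With these illustrative sizes, RAID 0 offers 600 units, RAID 1 offers 100 and RAID 5 offers 200, which shows how much space unequal members waste under RAID 1 and RAID 5.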

Before You Start

There are some basic guidelines you may want to follow when setting up RAID.

IDE Drives

Most home & SOHO systems will probably use IDE disks. They do have some limitations.

o        An IDE cable can only be a few feet long, which generally limits the use of IDE drives to small home systems.

o        IDE drives do not support “hot swapping”. You cannot replace them while your system is running.

o        Only two devices can be attached per controller.

o        The performance of the IDE bus can be degraded by the presence of a second device on the cable.

o        The failure of one drive on an IDE bus often causes the malfunctioning of the second device. This can be fatal if you have two IDE drives of the same RAID set attached to the same cable.

It is for these reasons that it is recommended to use only one IDE drive per controller when using RAID, especially in a corporate environment. In a home or SOHO setting, IDE based software RAID may be adequate.

SCSI Drives

SCSI hard disks have a number of features that make them more attractive for RAID use.

o        SCSI controllers are more tolerant of disk failures. The failure of a single drive is less likely to disrupt the remaining drives on the bus.

o        SCSI cables can be several meters long, making them suitable for data center applications.

o        Many more than two devices may be connected to a SCSI cable bus.

o        Some models of SCSI devices support “hot swapping” which allows you to replace them while the system is running.

However, SCSI drives tend to be more expensive than IDE drives, which may make them less attractive for home use.

 

Should I Use Software RAID On Partitions Or Entire Disks?

It is generally not a good idea to mix RAID configured partitions with non RAID partitions on the same disk. The reason is that a disk failure would still take the non RAID partitions with it and could incapacitate the system.

If you decide to use RAID, all the partitions on each RAID disk should be part of a RAID set. Many people simplify this problem by filling each disk of a RAID set with only one partition.

Configuring Software RAID

RAID Partitioning

You will first need to identify two or more partitions, each on a separate disk. If you are doing RAID 0 or RAID 5, the partitions should be of approximately the same size, because in this scenario RAID limits the area used on each partition to one no larger than the smallest partition in the RAID set.

In this example we’ll be configuring RAID 5 using a system with three pre-partitioned hard disks. The partitions to be used will be:

o        /dev/hde1

o        /dev/hdf2

o        /dev/hdg1

 

Start FDISK

·         Each partition used for RAID has to be changed to type FD (Linux raid autodetect). This can be done using fdisk. Here is an example using /dev/hde1.

 

[root@bigboy updates]# fdisk /dev/hde

 

The number of cylinders for this disk is set to 8355.

There is nothing wrong with that, but this is larger than 1024,

and could in certain setups cause problems with:

1) software that runs at boot time (e.g., old versions of LILO)

2) booting and partitioning software from other OSs

   (e.g., DOS FDISK, OS/2 FDISK)

 

Command (m for help):

Use FDISK Help

·         We now use the fdisk “m” command to get some help

 

Command (m for help): m

 

Command action

   a   toggle a bootable flag

   b   edit bsd disklabel

   c   toggle the dos compatibility flag

   d   delete a partition

   l   list known partition types

   m   print this menu

   n   add a new partition

   o   create a new empty DOS partition table

   p   print the partition table

   q   quit without saving changes

   s   create a new empty Sun disklabel

   t   change a partition's system id

   u   change display/entry units

   v   verify the partition table

   w   write table to disk and exit

   x   extra functionality (experts only)

 

Command (m for help):

Set The ID Type To FD

·         Partition /dev/hde1 is the 1st partition on disk /dev/hde. We now modify its “type” using the “t” command and then specifying the partition number and type code. We also use the “L” command to get a full listing of ID types in case we forget.

 

Command (m for help): t

Partition number (1-5): 1

Hex code (type L to list codes): L

 

 0  Empty           1c  Hidden Win95 FA 70  DiskSecure Mult bb  Boot Wizard hid

 1  FAT12           1e  Hidden Win95 FA 75  PC/IX           be  Solaris boot

 2  XENIX root      24  NEC DOS         80  Old Minix       c1  DRDOS/sec (FAT-

 3  XENIX usr       39  Plan 9          81  Minix / old Lin c4  DRDOS/sec (FAT-

 4  FAT16 <32M      3c  PartitionMagic  82  Linux swap      c6  DRDOS/sec (FAT-

 5  Extended        40  Venix 80286     83  Linux           c7  Syrinx

 6  FAT16           41  PPC PReP Boot   84  OS/2 hidden C:  da  Non-FS data

 7  HPFS/NTFS       42  SFS             85  Linux extended  db  CP/M / CTOS / .

 8  AIX             4d  QNX4.x          86  NTFS volume set de  Dell Utility

 9  AIX bootable    4e  QNX4.x 2nd part 87  NTFS volume set df  BootIt

 a  OS/2 Boot Manag 4f  QNX4.x 3rd part 8e  Linux LVM       e1  DOS access

 b  Win95 FAT32     50  OnTrack DM      93  Amoeba          e3  DOS R/O

 c  Win95 FAT32 (LB 51  OnTrack DM6 Aux 94  Amoeba BBT      e4  SpeedStor

 e  Win95 FAT16 (LB 52  CP/M            9f  BSD/OS          eb  BeOS fs

 f  Win95 Ext'd (LB 53  OnTrack DM6 Aux a0  IBM Thinkpad hi ee  EFI GPT

10  OPUS            54  OnTrackDM6      a5  FreeBSD         ef  EFI (FAT-12/16/

11  Hidden FAT12    55  EZ-Drive        a6  OpenBSD         f0  Linux/PA-RISC b

12  Compaq diagnost 56  Golden Bow      a7  NeXTSTEP        f1  SpeedStor

14  Hidden FAT16 <3 5c  Priam Edisk     a8  Darwin UFS      f4  SpeedStor

16  Hidden FAT16    61  SpeedStor       a9  NetBSD          f2  DOS secondary

17  Hidden HPFS/NTF 63  GNU HURD or Sys ab  Darwin boot     fd  Linux raid auto

18  AST SmartSleep  64  Novell Netware  b7  BSDI fs         fe  LANstep

1b  Hidden Win95 FA 65  Novell Netware  b8  BSDI swap       ff  BBT

Hex code (type L to list codes): fd

Changed system type of partition 1 to fd (Linux raid autodetect)

 

Command (m for help):

Make Sure The Change Occurred

·         Use the “p” command to get the new proposed partition table

 

Command (m for help): p

 

Disk /dev/hde: 4311 MB, 4311982080 bytes

16 heads, 63 sectors/track, 8355 cylinders

Units = cylinders of 1008 * 512 = 516096 bytes

 

   Device Boot    Start       End    Blocks   Id  System

/dev/hde1             1      4088   2060320+  fd  Linux raid autodetect

/dev/hde2          4089      5713    819000   83  Linux

/dev/hde3          5714      6607    450576   83  Linux

/dev/hde4          6608      8355    880992    5  Extended

/dev/hde5          6608      7500    450040+  83  Linux

/dev/hde6          7501      8355    430888+  83  Linux

 

Command (m for help):

Save The Changes

·         Use the “w” command to permanently save the changes to disk /dev/hde.

 

Command (m for help): w

The partition table has been altered!

 

Calling ioctl() to re-read partition table.

 

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.

The kernel still uses the old table.

The new table will be used at the next reboot.

Syncing disks.

[root@bigboy updates]#

 

·         The error above will occur if any of the other partitions on the disk is mounted.

Repeat For The Other Partitions

·         For the sake of brevity, I won’t show the process to do this, but the steps for changing the IDs for /dev/hdf2 and /dev/hdg1 are very similar.
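Once all the IDs have been changed, it can be handy to confirm them without stepping through fdisk by hand. The sketch below is a hypothetical helper named partition_id that reads a partition listing and prints the Id column for one device. It assumes the column layout shown in the “p” output earlier in this chapter. Other fdisk versions may print different columns, in which case the field numbers would need adjusting.

```shell
# partition_id DEVICE
# Reads an fdisk style partition listing on standard input and prints
# the Id column for DEVICE. A "*" in the Boot column shifts every later
# field one place to the right, so that case is handled separately.
partition_id() {
    awk -v dev="$1" '$1 == dev { if ($2 == "*") print $6; else print $5 }'
}

# Demonstration using two lines of the listing shown earlier.
partition_id /dev/hde1 <<'EOF'
   Device Boot    Start       End    Blocks   Id  System
/dev/hde1             1      4088   2060320+  fd  Linux raid autodetect
/dev/hde2          4089      5713    819000   83  Linux
EOF
```

On a live system the listing could be piped in from “fdisk -l /dev/hde” instead of a here-document. A result of “fd” means the partition is ready for RAID autodetection.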

Edit The RAID Configuration File

The Linux RAID configuration file is /etc/raidtab. A good template file for /etc/raidtab can be found in the man pages. Simply issue the command “man raidtab”.

General Guidelines

·         When configuring RAID 5 a “parity-algorithm” setting must be used.

·         The “raid-disk” parameters for each partition in the /etc/raidtab file are numbered starting at “0”.

·         The /etc/raidtab “persistent-superblock” must be set to “1” in order for the RAID autodetect feature (partition type FD) to work.


 

In our example:

·         We configure RAID 5 using each of the desired partitions on the 3 disks.

·         All other RAID levels use a “persistent-superblock” setting of “0”.

·         The set of 3 RAID disks will be called /dev/md0.

 

       #

       # sample raiddev configuration file

       # RAID 5 array built from three partitions

       #

       raiddev /dev/md0

           raid-level              5

           nr-raid-disks           3

           persistent-superblock   1

           chunk-size              32

           parity-algorithm        left-symmetric

           device                  /dev/hde1

           raid-disk               0

           device                  /dev/hdf2

           raid-disk               1

           device                  /dev/hdg1

           raid-disk               2
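Because the “raid-disk” numbering must start at 0 and “nr-raid-disks” must match the number of partitions listed, it is easy to make a slip when typing the file by hand. The sketch below generates a RAID 5 stanza with the same settings as the example above. The function name make_raidtab is made up for this illustration, and its output should be reviewed before being placed in /etc/raidtab.

```shell
# make_raidtab MD_DEVICE CHUNK_KB PARTITION...
# Prints a RAID 5 raidtab stanza for the given partitions, numbering the
# raid-disk entries from 0 in the order the partitions are listed.
make_raidtab() {
    md=$1
    chunk=$2
    shift 2
    printf 'raiddev %s\n' "$md"
    printf '    raid-level              5\n'
    printf '    nr-raid-disks           %d\n' "$#"
    printf '    persistent-superblock   1\n'
    printf '    chunk-size              %d\n' "$chunk"
    printf '    parity-algorithm        left-symmetric\n'
    n=0
    for part in "$@"; do
        printf '    device                  %s\n' "$part"
        printf '    raid-disk               %d\n' "$n"
        n=$((n + 1))
    done
}

make_raidtab /dev/md0 32 /dev/hde1 /dev/hdf2 /dev/hdg1
```

The output can be redirected to a scratch file and compared with the hand written version before anything is changed on the system.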

 

Create the RAID Set

The mkraid command creates the RAID set by reading the /etc/raidtab file. In this case we want to create the logical RAID device /dev/md0.

 

[root@bigboy root]# mkraid /dev/md0

handling MD device /dev/md0

analyzing super-block

[root@bigboy root]#

 

Format The New RAID Set

Your new RAID device will now have to be formatted. In the example below:

o        We use the “-j” qualifier to ensure that a journaling filesystem is created.

o        A block size of 4KB (4096 bytes) is used, with each chunk comprising 8 blocks. It is very important that the “chunk-size” parameter in the /etc/raidtab file (32KB in our example) match the block size multiplied by the stride value in the command below (4KB x 8 = 32KB).  Note: If the values don’t match, then you will get parity errors.

 

[root@bigboy root]# mke2fs -j -b 4096 -R stride=8 /dev/md0

mke2fs 1.32 (09-Nov-2002)

Filesystem label=

OS type: Linux

Block size=4096 (log=2)

Fragment size=4096 (log=2)

516096 inodes, 1030160 blocks

51508 blocks (5.00%) reserved for the super user

First data block=0

32 block groups

32768 blocks per group, 32768 fragments per group

16128 inodes per group

Superblock backups stored on blocks:

        32768, 98304, 163840, 229376, 294912, 819200, 884736

 

Writing inode tables: done

Creating journal (8192 blocks): done

Writing superblocks and filesystem accounting information: done

 

This filesystem will be automatically checked every 26 mounts or

180 days, whichever comes first.  Use tune2fs -c or -i to override.

[root@bigboy root]#
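The relationship between the raidtab “chunk-size” and the stride value passed to mke2fs is simple arithmetic, and can be checked with a one line shell function. The name stride_for is hypothetical. It assumes the chunk size is given in kilobytes, as in /etc/raidtab, and the block size in bytes, as on the mke2fs command line.

```shell
# stride_for CHUNK_KB BLOCK_BYTES
# Prints the number of filesystem blocks that fit in one RAID chunk.
stride_for() {
    echo $(( $1 * 1024 / $2 ))
}

# The values used in this chapter: 32KB chunks and 4096 byte blocks.
stride_for 32 4096
```

For this chapter’s values the result is 8, which is the stride used in the mke2fs command above.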

Load The RAID Driver For The New RAID Set

The next step is to make the Linux operating system fully aware of the RAID set by loading the driver for the new RAID set using the raidstart command.

 

[root@bigboy root]# raidstart /dev/md0

[root@bigboy root]#

 

Create A Mount Point For The RAID Set

The next step is to create a mount point for /dev/md0. In this case we’ll create one called /mnt/raid.

 

[root@bigboy mnt]# mkdir /mnt/raid

 

Edit The /etc/fstab File

We’ll now add an entry for the /dev/md0 device. Here is an example of a line that could be used:

 

/dev/md0      /mnt/raid     ext3    defaults    1 2

 

Note: It is very important that you DO NOT use labels in the /etc/fstab file for RAID devices. Use the real device name, such as “/dev/md0”. On startup, the /etc/rc.d/rc.sysinit script checks the /etc/fstab file for device entries that match RAID set names in the /etc/raidtab file. If it doesn’t find a match, it will not automatically start the driver for that RAID set. Device mounting then occurs later on in the boot process. Mounting a RAID device that doesn’t have a loaded driver can corrupt your data and produce errors like the ones below.

 

Starting up RAID devices: md0(skipped)

Checking filesystems

/raiddata: Superblock has a bad ext3 journal(inode8)

CLEARED.

***journal has been deleted - file system is now ext 2 only***

 

/raiddata: The filesystem size (according to the superblock) is 2688072 blocks.

The physical size of the device is 8960245 blocks.

Either the superblock or the partition table is likely to be corrupt!

/boot: clean, 41/26104 files, 12755/104391 blocks

 

/raiddata: UNEXPECTED INCONSISTENCY; Run fsck manually (ie without -a or -p options).
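To reduce the risk of this error, the two files can be compared before rebooting. The sketch below is a hypothetical check, not the logic of rc.sysinit itself. It lists each “raiddev” name in a raidtab file and reports whether an fstab file refers to it by device name in its first field. The file paths are passed as arguments so that it can be tried on copies first.

```shell
# raid_fstab_check RAIDTAB FSTAB
# For every raiddev named in RAIDTAB, report whether FSTAB has an entry
# whose first field is exactly that device name. Returns 1 if any RAID
# set has no such entry (for example because a LABEL= entry was used).
raid_fstab_check() {
    raidtab=$1
    fstab=$2
    status=0
    for md in $(awk '$1 == "raiddev" { print $2 }' "$raidtab"); do
        if awk -v dev="$md" '$1 == dev { found = 1 } END { exit !found }' "$fstab"; then
            echo "$md: listed by device name"
        else
            echo "$md: no matching entry"
            status=1
        fi
    done
    return "$status"
}

# Typical use on a system that has both files:
#   raid_fstab_check /etc/raidtab /etc/fstab
```

A “no matching entry” result suggests that the RAID set would be skipped at startup, as in the output above.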

 

Start The New RAID Set’s Driver

This is done using the raidstart command.

 

[root@bigboy root]# raidstart /dev/md0

 

Mount The New RAID Set

The mount command can now be used to mount the RAID set.

Using the automount feature

The mount command’s “-a” flag will cause Linux to mount all the devices in the /etc/fstab file that have automounting enabled (default) and that are also not already mounted.

 

[root@bigboy root]# mount -a

 

Manually

You can also mount the device manually.

 

[root@bigboy root]# mount /dev/md0 /mnt/raid

 

Check The Status Of The New RAID

The /proc/mdstat file provides the current status of all the RAID devices. When the RAID driver is stopped, the file has very little information, as seen below.

 

[root@bigboy root]# raidstop /dev/md0

[root@bigboy root]# cat /proc/mdstat

Personalities : [raid5]

read_ahead 1024 sectors

unused devices: <none>

[root@bigboy root]#

 

More information, including the partitions of the RAID set, is provided once the driver is loaded using the raidstart command.

 

[root@bigboy root]# raidstart /dev/md0

[root@bigboy root]# cat /proc/mdstat

Personalities : [raid5]

read_ahead 1024 sectors

md0 : active raid5 hdg1[2] hde1[1] hdf2[0]

      4120448 blocks level 5, 32k chunk, algorithm 3 [3/3] [UUU]

      

unused devices: <none>

[root@bigboy root]#
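The “[3/3] [UUU]” portion of the output shows how many members the RAID set expects and how many are currently active. The sketch below turns that into a one word verdict per RAID set. The function name md_health is hypothetical, and it assumes the status line has the layout shown above, with the bracketed counts on the line that follows the “mdN :” line.

```shell
# md_health
# Reads /proc/mdstat style text on standard input and prints one line
# per RAID set: "NAME ok" when the active member count equals the
# expected count, "NAME degraded" otherwise.
md_health() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        name != "" && match($0, /\[[0-9]+\/[0-9]+\]/) {
            # Strip the brackets and split "expected/active".
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            print name, (n[1] == n[2] ? "ok" : "degraded")
            name = ""
        }
    '
}

# Demonstration using the output shown above.
md_health <<'EOF'
Personalities : [raid5]
read_ahead 1024 sectors
md0 : active raid5 hdg1[2] hde1[1] hdf2[0]
      4120448 blocks level 5, 32k chunk, algorithm 3 [3/3] [UUU]

unused devices: <none>
EOF
```

On a live system the same function can read the real file with “md_health < /proc/mdstat”. A set that has lost a member would show counts such as “[3/2]” and be reported as degraded.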