RAID-1 root/boot/lilo problem

Post by M » Fri, 16 Apr 2004 20:11:26


I apologize in advance. I have Googled until my brain hurts and have not
found my problem. I have checked out the lilo docs, the README.raid1 file,
the linux+root+boot+raid howto, etc. A lot of the RAID1 stuff has been beat
to death, but I can't quite find my exact problem.
I'm fairly much a n00b in the RAID department and this is a long post, so
please be patient.

I am using mdadm, running kernel 2.6.5 on Debian 3.0 stable (woody), on a
P3-866 box.

I have two EIDE drives (WD800JB, 80GB) of identical size and partitioned
identically size-wise: /dev/hda and /dev/hdc. The /dev/hda partitions
were set to the Linux and Linux Swap partition types appropriately via
cfdisk. The /dev/hdc partitions were all set to FD, Linux RAID autodetect.

I had done the 'missing disk' installation procedure, installing Debian
on /dev/hda, mirroring the /dev/hdaN partitions to their /dev/hdcN
counterparts via tar (with the exception of swap), then used mdadm to
create 8 RAID-1 arrays with the /dev/hdaN partitions as missing and using
the appropriate /dev/hdcN partitions. So far so good. The system is
running with 6 out of the 8 RAID arrays synced.
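
For readers following along, the degraded-array step described above can be sketched roughly as follows. This is a dry run that only prints the mdadm commands rather than running them; the partition-to-array pairing is copied from the desired configuration listed in this post, and the function name print_create_cmds is made up for illustration.

```shell
#!/bin/sh
# Dry run: print (do not execute) the mdadm commands that would create
# each RAID-1 array in degraded mode, with the hda half given as "missing".
# Pairs are "partition-number:md-number", taken from the layout in the post.
PAIRS="1:0 5:1 6:2 7:3 8:4 9:5 10:6 11:7"

print_create_cmds() {
    for pair in $PAIRS; do
        part=${pair%%:*}   # partition number on hdc
        md=${pair##*:}     # md device number
        echo "mdadm --create /dev/md$md --level=1 --raid-devices=2 missing /dev/hdc$part"
    done
}

print_create_cmds
```

Printing first and pasting afterwards makes it easy to double-check each pairing before anything touches a disk.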

The problem I am having is getting the boot and root partitions working.

I will show my DESIRED configuration, then explain the problem further.

/dev/hd[ac]1 = /dev/md0 = /boot
/dev/hd[ac]5 = /dev/md1 = /usr
/dev/hd[ac]6 = /dev/md2 = /var
/dev/hd[ac]7 = /dev/md3 = /tmp
/dev/hd[ac]8 = /dev/md4 = /home
/dev/hd[ac]9 = /dev/md5 = swap
/dev/hd[ac]10 = /dev/md6 = /
/dev/hd[ac]11 = /dev/md7 = /storage

All arrays except /dev/md0 (boot) and /dev/md6 (root) are synced. All of
the arrays exist, but the boot partition is only running with /dev/hdc1 and
the root partition is only running with /dev/hdc10.

For starters, knowing that the array handling /boot might be tricky, I left
the boot partition of /dev/hda1 in /etc/lilo.conf and figured I would deal
with the /boot array later.

When I was providing mdadm with the names of the missing partitions in order
for it to start syncing, I used basically the same command, obviously
changing the RAID device and disk partition for each command. When I came
up to the root partition, I used:

mdadm /dev/md6 -a /dev/hda10

It returned the following error message:

mdadm: hot add failed for /dev/hda10: Invalid argument

Checking dmesg, I see:

md: trying to hot-add hda10 to md6 ...
md: could not lock hda10.
md: error, md_import_device() returned -16

I thought that since maybe the system was actually booting off /dev/hda1
that it was somehow locking the disk, so 'md' couldn't lock it in order to
add it to the array. Would this be a correct assumption?
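
The -16 that md_import_device() returns in the dmesg output is the kernel's EBUSY errno, so one thing worth checking is what else is holding the disk. A minimal sketch, assuming the usual /proc/swaps and /proc/mdstat layouts; the function name disk_users and its arguments are hypothetical, and it takes file paths so it can be tried on saved copies.

```shell
#!/bin/sh
# List swap areas and md arrays that reference partitions of a given disk.
# Usage: disk_users DISK SWAPS_FILE MDSTAT_FILE
#   e.g. disk_users hda /proc/swaps /proc/mdstat
disk_users() {
    disk=$1; swaps=$2; mdstat=$3
    # /proc/swaps: first column is the device path; skip the header line.
    awk -v d="/dev/$disk" 'NR > 1 && index($1, d) == 1 { print "swap: " $1 }' "$swaps"
    # /proc/mdstat: array lines look like "md1 : active raid1 hdc5[1] hda5[0]".
    awk -v d="$disk" '$1 ~ /^md/ && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if (index($i, d) == 1) { sub(/\[.*/, "", $i); print $1 ": " $i }
    }' "$mdstat"
}
```

Any line this prints names something that still has a partition of the disk open, which is a candidate for the lock failure.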

To test, I started trying to get the boot RAID going. I'm using lilo
22.2-3, but have also temporarily tried compiling 22.5.9 to no avail. My
intention was to use lilo to write boot info to /dev/md0, which only
consists of /dev/hdc1 for now, boot the system, then sync it by
adding /dev/hda1 to the array. Here is my /etc/lilo.conf:

lba32
boot=/dev/md0
raid-extra-boot=auto
root=/dev/md6
install=/boot/boot-menu.b
map=/boot/map
delay=20
vga=normal
append=""
default=Linux-2.6.5
image=/vmlinuz-2.6.5
label=Linux-2.6.5
read-only

I had unmounted /dev/hda1 from /boot, mounted /dev/md0 to /boot, and tried:

lilo -t

and got:

Warning: Int 0x13 function 8 and function 0x48 return different
head/sector geometries for BIOS drive 0x80
Warning: Int 0x13 function 8 and function 0x48 return different
head/sector geometries for BIOS drive 0x81
Warning: using BIOS device code 0x81 for RAID boot blocks
Warning: /dev/md0 is not on the first disk
Fatal: Trying to map files from unnamed device 0x0000 (NFS ?)

Now I am stuck. The fatal message is thro
 
 
 

Post by JohnInSD A » Fri, 16 Apr 2004 23:52:49

On Thu, 15 Apr 2004 05:11:26 -0600, M < XXXX@XXXXX.COM > wrote:


Error here. With only /dev/hdc1 active in the array, the installation of the
boot record will only be on hdc. You will not be able to boot from hda.

That is why you get the next-to-last message, about 0x81, which is not the
first disk.

Unless both hda1 and hdc1 are active in the md0 array, it is pretty useless to
try to install LILO. You will have no redundancy, because the installation
will be on only one disk.

Once you have md0 fully redundant, then the output from "lilo -v3" would be of
interest.

--John





 
 
 

Post by M » Sat, 17 Apr 2004 21:31:45

OK, with your hint I'm getting somewhere.

My original intention was to make lilo write to a degraded /dev/md0 to get
the needed boot sector written there, then reboot the system and complete
the addition of /dev/hda1 into the array. Obviously lilo was not enthused
about doing this, and probably rightfully so; the system may not have been
able to boot off /dev/hdc1 anyway. I just wish its fatal message would
have been a little more clueful.

So I modified /etc/fstab and made /dev/md0 mount on /boot and rebooted with
a mkbootdisk floppy. The system came up showing /dev/md0 as mounted
on /boot, so I successfully completed the md0 array by adding /dev/hda1. I
thought I had it in the bag, so I made sure my /etc/lilo.conf had
boot=/dev/md0 and did a lilo, figuring it would update the boot sectors of
the drives. Lilo succeeded, but on reboot I just got a big string of crap
on the console. Rebooted with the mkbootdisk floppy and /dev/md0 was again
degraded to only running /dev/hdc1 (not sure why).
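
One way to see at a glance which arrays are running degraded is to scan /proc/mdstat for a status such as [2/1], where fewer devices are active than configured. A rough sketch, assuming the usual 2.6-era mdstat layout; degraded_arrays is a name invented here, and it reads a file path so a saved copy can be used.

```shell
#!/bin/sh
# Print the names of md arrays whose "[total/active]" counter shows
# fewer active members than configured (e.g. "[2/1] [_U]").
# Usage: degraded_arrays MDSTAT_FILE   (normally /proc/mdstat)
degraded_arrays() {
    awk '
        $1 ~ /^md/ && $2 == ":" { name = $1; next }   # remember current array
        name != "" {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^\[[0-9]+\/[0-9]+\]$/) {
                    split(substr($i, 2, length($i) - 2), n, "/")
                    if (n[2] + 0 < n[1] + 0) print name
                    name = ""
                }
        }
    ' "$1"
}
```

Running it right after each reboot would show whether md0 dropped a member during boot or only later.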

As a second try, reading the README.raid1 from lilo-doc informed me that
using the raid-extra-boot="/dev/hdc1,/dev/hda1" would write usable boot
sectors to the MBRs of those devices. Tried that, same string of crap on
reboot.

I -am- using the plain old "lilo" command (without -t) for my tests.

Here is the output of lilo -v3 with the same lilo.conf as my previous post,
except with raid-extra-boot="/dev/hdc1,/dev/hda1". Before this command was
given, the md0 array was synced with both /dev/hda1 and /dev/hdc1. Would
the output of lilo -q -v3 help?

Thank you for the pointer. I think I'm getting close.

Mike



LILO version 22.2, Copyright (C) 1992-1998 Werner Almesberger
Development beyond version 21 Copyright (C) 1999-2001 John Coffman
Released 05-Feb-2002 and compiled at 20:57:26 on Apr 13 2002.
MAX_IMAGES = 27

RAID info: nr=2, raid=2, active=2, working=2, failed=0, spare=0
md: RAIDset device 0 = 0x0301
bios_dev: device 0301
Warning: Int 0x13 function 8 and function 0x48 return different
head/sector geometries for BIOS drive 0x80
Warning: Int 0x13 function 8 and function 0x48 return different
head/sector geometries for BIOS drive 0x81
bios_dev: masked device 0300, which is /dev/hda
bios_dev: geometry check found 0 matches
bios_dev: PT match found 1 match (0x80)
Device 0x0301: BIOS drive 0x80, 16 heads, 65535 cylinders,
63 sectors. Partition offset: 63 sectors.
RAID scan: geo_get: returns geo->device = 0x80 for device 0301
disk->start = 63 raid_offset = 0 (00000000)
Name: /dev/hda1 yields MBR: /dev/hda (without primary partition check)
md: RAIDset device 1 = 0x1601
bios_dev: device 1601
bios_dev: masked device 1600, which is /dev/hdc
bios_dev: geometry check found 0 matches
bios_dev: PT match found 1 match (0x81)
Device 0x1601: BIOS drive 0x81, 16 heads, 65535 cylinders,
63 sectors. Partition offset: 63 sectors.
RAID scan: geo_get: returns geo->device = 0x81 for device 1601
disk->start = 63 raid_offset = 0 (00000000)
Name: /dev/hdc1 yields MBR: /dev/hdc (without primary partition check)
RAID list: /dev/hdc is device 0x1600
RAID list: /dev/hda is device 0x0300
Warning: using BIOS device code 0x80 for RAID boot blocks
raid_setup returns offset = 00000000
raid flags: at bsect_open 0x02
Reading boot sector from /dev/md0
Merging with /boot/boot-menu.b
Device 0x0301: BIOS drive 0x80, 16 heads
 
 
 

Post by JohnInSD A » Sat, 17 Apr 2004 23:48:50

Before going any further, grab the latest LILO: 22.5.9.
22.2 is over 2 years out of date.

http://freshmeat.net/projects/lilo

--John


On Fri, 16 Apr 2004 06:31:45 -0600, M < XXXX@XXXXX.COM > wrote:


 
 
 

Post by M » Tue, 20 Apr 2004 01:30:07

OK, I'm going to back up a step, and also apologize for being stupid (can't
be helped with some people!).

I got the latest lilo and have been working with that now.

I actually -DID- have luck booting with the
raid-extra-boot="/dev/hdc1,/dev/hda1" statement in my lilo.conf; for some
reason I got confused when I posted and didn't think I did. However, upon
return to the box, I did try that and had a successful boot off the MBR,
however the system hung at a certain point in its startup sequence because
it was attempting to access /dev/hda to mount stuff on, but that was in use
with the RAID. It was strange. I had updated my lilo.conf to use /dev/md0
for the boot device, but when I got fed up and modified lilo.conf to boot
off /dev/hdc1, I noticed that the /etc/fstab was attempting to use /dev/hda
instead of the RAID devices! Man, I was all screwed up.
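
A quick sanity check for this kind of mix-up is to look for fstab entries that still point at raw /dev/hdX partitions rather than /dev/mdN devices. A small sketch, assuming a standard whitespace-separated /etc/fstab; raw_ide_mounts is a hypothetical helper name.

```shell
#!/bin/sh
# Print fstab entries whose device field is a raw IDE disk partition
# (/dev/hda1, /dev/hdc9, ...). Comment lines, blank lines, and whole-disk
# devices such as a CD-ROM on /dev/hdd do not match the pattern.
# Usage: raw_ide_mounts FSTAB_FILE   (normally /etc/fstab)
raw_ide_mounts() {
    awk '$1 ~ /^\/dev\/hd[a-z][0-9]+$/ { print $1 " -> " $2 }' "$1"
}
```

On a fully converted system the function would be expected to print nothing.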

Anyway. I got the system booting fine now, and the /boot partition is
RAIDed and synced. The other issue I had was with putting /dev/hda10
into /dev/md6 for the root partition. Every time I tried, I would get an
error from md saying the device couldn't be locked.

I finally got my head out of my rear and started *thinking*. Obviously
something was using /dev/hda or things should fly. So... what was
using /dev/hda? Aha! Swap! I modified /etc/fstab to only use the swap
partition off /dev/hdc, and rebooted (perhaps I could have done a swapoff
on /dev/hda9 instead). Still no dice. OK, something ELSE was obviously
using /dev/hda. *DING!* All the other partitions in the RAID! Damn, I
amaze myself at my incredible ability of oversight sometimes. I marked all
other partitions from /dev/hda as failed, then removed them from their RAID
assemblies, and *poomp* - I could add /dev/hda10 to the /dev/md6 array.
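
The sequence described above might look roughly like this. It is a dry run that only prints the mdadm commands; the partition-to-array pairs come from the layout earlier in the thread, print_readd_root_first is a made-up name, and the final re-add of the other hda partitions is an assumption, since the post does not spell that step out.

```shell
#!/bin/sh
# Dry run: print the mdadm commands for pulling the hda members out of
# the non-root arrays, adding hda10 to the root array (md6), and then
# putting the other hda members back. Nothing is executed.
OTHERS="1:0 5:1 6:2 7:3 8:4 9:5 11:7"   # "hda partition:md number"

print_readd_root_first() {
    for pair in $OTHERS; do
        echo "mdadm /dev/md${pair##*:} --fail /dev/hda${pair%%:*} --remove /dev/hda${pair%%:*}"
    done
    echo "mdadm /dev/md6 --add /dev/hda10"
    # Assumed follow-up: re-add the members that were removed above.
    for pair in $OTHERS; do
        echo "mdadm /dev/md${pair##*:} --add /dev/hda${pair%%:*}"
    done
}

print_readd_root_first
```

Each re-added member triggers a full resync, so the arrays run degraded for a while after these commands.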

I learned quite a few things with all this.

1. If I scour the net hard and find nothing closely tied to my situation,
I'm doing something really stupid.

2. Always add the root partition to the RAID first!

3. THINK more before posting to the net and wasting bandwidth and people's
time.

4. Linux rules. But I already knew that.

Thank you John for being willing to help. I hope to return the favor some
day.

Mike