
Tru64 v5.1: AdvFS file domain panic



Dear managers,

Today I had a serious AdvFS domain panic that caused the total loss of one domain. This explanation is rather long because I want to describe in detail what I did.

But first the necessary OS details:

system:    AS 1200 5/533, 2 CPUs
           (number of CPUs and memory size changed while troubleshooting)
harddisks: dsk0: RZ1DF-CB,       9 GByte
           dsk1: DRHS36V,       36 GByte
           dsk2: sgtst336704lc, 36 GByte
           dsk3: sgtst336704lc, 36 GByte
           dsk4: ST336705,      36 GByte
           dsk5: OXYGENRAID, RAID5 array, 8x160 GByte, 1 TByte net
OS:        Tru64 UNIX V5.1, patch level 5 at the time of the AdvFS panic, now patch level 6
           AdvFS license installed

The message log from the last successful boot is as follows:

Jun 6 12:44:45 muxs0et0 vmunix: Alpha boot: available memory from 0x1110000 to 0x2fffc000
Jun 6 12:44:45 muxs0et0 vmunix: Compaq Tru64 UNIX V5.1 (Rev. 732); Tue Jun 6 12:42:27 CEST 2006
Jun 6 12:44:45 muxs0et0 vmunix: physical memory = 512.00 megabytes.
Jun 6 12:44:45 muxs0et0 vmunix: available memory = 490.97 megabytes.
Jun 6 12:44:45 muxs0et0 vmunix: using 1930 buffers containing 15.07 megabytes of memory
Jun 6 12:44:45 muxs0et0 vmunix: Master cpu at slot 0
Jun 6 12:44:45 muxs0et0 vmunix: Starting secondary cpu 1
Jun 6 12:44:45 muxs0et0 vmunix: Firmware revision: 6.0
Jun 6 12:44:45 muxs0et0 vmunix: PALcode: UNIX version 1.23
Jun 6 12:44:45 muxs0et0 vmunix: AlphaServer 1200 5/533 4MB
Jun 6 12:44:45 muxs0et0 vmunix: pci1 (primary bus:1) at mcbus0 slot 5
Jun 6 12:44:45 muxs0et0 vmunix: Loading SIOP: script c0000000, reg 7feef00, data c000a000
Jun 6 12:44:45 muxs0et0 vmunix: scsi0 at psiop0 slot 0 rad 0
Jun 6 12:44:45 muxs0et0 vmunix: isp0 at pci1 slot 2
Jun 6 12:44:45 muxs0et0 vmunix: isp0: QLOGIC ISP1040B/V2


History
==========

A while back I had a system crash with the following error:

Apr 6 14:39:44 muxs0et0 vmunix:
Apr 6 14:39:45 muxs0et0 vmunix: idx_create_index_file: bmtr_put_rec failed
Apr 6 14:39:45 muxs0et0 vmunix: AdvFS Domain Panic; Domain raid_pdmn Id 0x3e3af2e6.00095d85
Apr 6 14:39:45 muxs0et0 vmunix: An AdvFS domain panic has occurred due to either a metadata write error or an internal inconsistency. This domain is being rendered inaccessible.
Apr 6 14:39:45 muxs0et0 vmunix: Please refer to guidelines in AdvFS Guide to File System Administration regarding what steps to take to recover this domain.
Apr 6 14:59:24 muxs0et0 vmunix: NFS server: stale file handle fs(2869,368282) file 2 gen 32769
Apr 6 14:59:24 muxs0et0 vmunix: RFS3_FSSTAT, client address = 141.56.22.41, errno 5
Apr 6 15:00:33 muxs0et0 vmunix: AdvFS I/O error:
Apr 6 15:00:34 muxs0et0 vmunix: A read failure occurred - the AdvFS domain is inaccessible (paniced)
Apr 6 15:00:34 muxs0et0 vmunix: Domain#Fileset: raid_pdmn#projekte
Apr 6 15:00:34 muxs0et0 vmunix: Mounted on: /Projekte
Apr 6 15:00:34 muxs0et0 vmunix: Volume: /dev/disk/dsk5d
Apr 6 15:00:34 muxs0et0 vmunix: Tag: 0x00000001.8001
Apr 6 15:00:34 muxs0et0 vmunix: Page: 50371
Apr 6 15:00:34 muxs0et0 vmunix: Block: 119461568
Apr 6 15:00:34 muxs0et0 vmunix: Block count: 16
Apr 6 15:00:34 muxs0et0 vmunix: Type of operation: Read
Apr 6 15:00:34 muxs0et0 vmunix: Error: 5
Apr 6 15:00:34 muxs0et0 vmunix: EEI: 0x300
Apr 6 15:01:43 muxs0et0 vmunix: AdvFS I/O error:
Apr 6 15:01:43 muxs0et0 vmunix: A read failure occurred - the AdvFS domain is inaccessible (paniced)
Apr 6 15:01:43 muxs0et0 vmunix: Domain#Fileset: raid_pdmn#projekte
Apr 6 15:01:43 muxs0et0 vmunix: Mounted on: /Projekte
Apr 6 15:01:43 muxs0et0 vmunix: Volume: /dev/disk/dsk5d
Apr 6 15:01:43 muxs0et0 vmunix: Tag: 0x00000004.8001
Apr 6 15:01:43 muxs0et0 vmunix: Page: 0
Apr 6 15:01:43 muxs0et0 vmunix: Block: 182107584
Apr 6 15:01:43 muxs0et0 vmunix: Block count: 16
Apr 6 15:01:43 muxs0et0 vmunix: Type of operation: Read
Apr 6 15:01:43 muxs0et0 vmunix: Error: 5
Apr 6 15:01:43 muxs0et0 vmunix: EEI: 0x300
Apr 6 15:01:43 muxs0et0 vmunix: To obtain the name of the file on which
Apr 6 15:01:43 muxs0et0 vmunix: the error occurred, type the command:
Apr 6 15:01:43 muxs0et0 vmunix: /sbin/advfs/tag2name /Projekte/.tags/4
Apr 6 15:06:18 muxs0et0 vmunix: panic (cpu 0): kernel memory fault
Apr 6 15:06:18 muxs0et0 vmunix: syncing disks... 85 device string for dump = SCSI 1 2 0 0 0 0 0.
Apr 6 15:06:18 muxs0et0 vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
Apr 6 15:06:18 muxs0et0 vmunix: device string for dump = SCSI 1 2 0 0 0 0 0.
Apr 6 15:06:18 muxs0et0 vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
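
For anyone who wants to pull the I/O error records out of a longer syslog, here is a minimal Python sketch. It assumes the record layout shown above (an "AdvFS I/O error:" line followed by "key: value" lines); the function name parse_io_errors and the FIELDS tuple are only illustrative and not part of any Tru64 tool.

```python
# Sample lines copied from the second I/O error record in the log above.
SAMPLE = """\
Apr 6 15:01:43 muxs0et0 vmunix: AdvFS I/O error:
Apr 6 15:01:43 muxs0et0 vmunix: A read failure occurred - the AdvFS domain is inaccessible (paniced)
Apr 6 15:01:43 muxs0et0 vmunix: Domain#Fileset: raid_pdmn#projekte
Apr 6 15:01:43 muxs0et0 vmunix: Mounted on: /Projekte
Apr 6 15:01:43 muxs0et0 vmunix: Volume: /dev/disk/dsk5d
Apr 6 15:01:43 muxs0et0 vmunix: Tag: 0x00000004.8001
Apr 6 15:01:43 muxs0et0 vmunix: Page: 0
Apr 6 15:01:43 muxs0et0 vmunix: Block: 182107584
Apr 6 15:01:43 muxs0et0 vmunix: Block count: 16
Apr 6 15:01:43 muxs0et0 vmunix: Type of operation: Read
Apr 6 15:01:43 muxs0et0 vmunix: Error: 5
"""

# Keys that appear in the records above (assumed to be the complete set of interest).
FIELDS = ("Domain#Fileset", "Mounted on", "Volume", "Tag", "Page",
          "Block", "Block count", "Type of operation", "Error")


def parse_io_errors(text):
    """Return one dict per 'AdvFS I/O error:' record found in the syslog text."""
    records = []
    current = None
    for line in text.splitlines():
        # Drop the syslog prefix (date, host, "vmunix: ").
        _, found, msg = line.partition("vmunix: ")
        if not found:
            continue
        msg = msg.strip()
        if msg == "AdvFS I/O error:":
            current = {}
            records.append(current)
            continue
        if current is None:
            continue
        key, sep, value = msg.partition(": ")
        if sep and key in FIELDS:
            current[key] = value.strip()
    return records


if __name__ == "__main__":
    for rec in parse_io_errors(SAMPLE):
        print(rec["Volume"], rec["Tag"], rec["Block"], rec["Error"])
```

Run against the full log, this would give one record per failed read, which makes it easier to see whether the errors cluster on one volume or block range.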



The domain resides on the RAID array, which has been running for about 2 years without any problems. The RAID array is divided into 8 partitions with the following layout (comments removed):


# /dev/rdisk/dsk5c:
type: SCSI
disk: OXYGENRA
label:
flags: dynamic_geometry
bytes/sector: 512
sectors/track: 255
tracks/cylinder: 255
sectors/cylinder: 65025
cylinders: 38955
sectors/unit: 2147483647
rpm: 5411
interleave: 1
trackskew: 14
cylinderskew: 23
headswitch: 0           # milliseconds
track-to-track seek: 0  # milliseconds
drivedata: 0

8 partitions:
#          size     offset    fstype   [fsize bsize   cpg]
  a:  335544320          0    unused        0     0
  b:  335544320  335544320     AdvFS
  c: 2147483647          0    unused        0     0
  d:  335544320  671088640     AdvFS
  e:  335544320 1006632960     AdvFS        0     0
  f:  335544320 1342177280     AdvFS        0     0
  g:  335544320 1677721600    unused        0     0
  h:  134217727 2013265920     AdvFS

The domain raid_pdmn consisted of partitions 'd', 'e' and 'f' of the RAID array (dsk5). Each partition is 160 GB, so the whole domain is 480 GB.
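
The sizes can be checked from the disklabel with a few lines of Python. This is only a sketch: it takes the sector counts and the 512 bytes/sector from the label above and treats "GB" as 2^30 bytes; the names below are illustrative.

```python
SECTOR_BYTES = 512   # "bytes/sector: 512" in the disklabel
GIB = 1024 ** 3      # assumption: "GB" here means 2**30 bytes

# Partition sizes in sectors, copied from the partition table above.
domain_partitions = {"d": 335544320, "e": 335544320, "f": 335544320}


def sectors_to_gib(sectors):
    """Convert a sector count to GiB."""
    return sectors * SECTOR_BYTES / GIB


per_partition = {name: sectors_to_gib(size) for name, size in domain_partitions.items()}
total = sum(per_partition.values())

if __name__ == "__main__":
    for name, gib in per_partition.items():
        print(f"dsk5{name}: {gib:.0f} GiB")
    print(f"raid_pdmn total: {total:.0f} GiB")
```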

I rebooted the system and everything worked without any hassle. On Friday, June 2nd, the system went down again. The syslog entries were:

Jun 2 14:38:07 muxs0et0 vmunix:
Jun 2 14:38:07 muxs0et0 vmunix: idx_create_index_file: bmtr_put_rec failed
Jun 2 14:38:07 muxs0et0 vmunix: AdvFS Domain Panic; Domain raid_pdmn Id 0x3e3af2e6.00095d85
Jun 2 14:38:07 muxs0et0 vmunix: An AdvFS domain panic has occurred due to either a metadata write error or an internal inconsistency. This domain is being rendered inaccessible.
Jun 2 14:38:07 muxs0et0 vmunix: Please refer to guidelines in AdvFS Guide to File System Administration regarding what steps to take to recover this domain.

After that I reseated every memory module, cleaned the dust out of the system, and so on. After a restart, the power-up tests failed with:

IOD0 failed power-up self test
IOD1 failed power-up self test

Removing one CPU and populating only memory bank 0 with 256 MB (yes, I used both memory cards) immediately showed CPU MEM test errors. After I installed memory that tested without errors, the system came up again but kept falling over with AdvFS errors. Running fixfdmn left the domain raid_pdmn unusable: nearly every directory in the root directory of this file domain was removed. I tried to delete the fileset, and the system fell over again! After that I had to remove the file domain by hand (removing its entry in /etc/fdmns and setting the disklabel fstype of dsk5{d,e,f} to unused). I then recreated raid_pdmn with:

mkfdmn /dev/disk/dsk5d raid_pdmn
addvol /dev/disk/dsk5e raid_pdmn
addvol /dev/disk/dsk5f raid_pdmn
mkfset raid_pdmn projekte

Fortunately I'm running TIVOLI. Once the domain was recreated, I started restoring everything from backup. But even though the backup is stored on a RAID system on another machine (no tape), the restore will take a considerable amount of time.

Right after the restore began, the system panicked again. Even the newly created domain produced errors after only a few MB had been transferred from backup. I was stumped! This domain has to come back online immediately! All our projects depend on this file domain!

OK, I had a look at the latest patch kit I had, which I downloaded in Oct 2003. It was PK-06 for v5.1. Yes, I know I should upgrade to v5.1B, but that takes more time. So I decided to install PK-06. I also changed the domain layout so that it contains only one partition, as follows:

dsk5c
8 partitions:
#          size     offset    fstype   [fsize bsize   cpg]
  a:  335544320          0    unused        0     0
  b:  335544320  335544320     AdvFS
  c: 2147483647          0    unused        0     0
  d: 1006632960  671088640     AdvFS
  e:          0          0    unused        0     0
  f:          0          0    unused        0     0
  g:  335544320 1677721600    unused        0     0
  h:  134217727 2013265920     AdvFS

raid_pdmn now consists only of dsk5d, which is now 480 GB.
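
As a sanity check on the new layout, the following Python sketch verifies that the partitions in use do not overlap and that dsk5d really comes to 480 GB. The offsets and sizes are copied from the label above; partition 'c' (the whole disk) and the empty 'e' and 'f' are left out on purpose, and the function name find_overlaps is only illustrative.

```python
SECTOR_BYTES = 512   # "bytes/sector: 512" in the disklabel
GIB = 1024 ** 3      # assumption: "GB" here means 2**30 bytes

# name -> (offset, size) in sectors, from the new partition table.
layout = {
    "a": (0, 335544320),
    "b": (335544320, 335544320),
    "d": (671088640, 1006632960),
    "g": (1677721600, 335544320),
    "h": (2013265920, 134217727),
}


def find_overlaps(parts):
    """Return pairs of neighbouring partitions (sorted by offset) that overlap."""
    ordered = sorted(parts.items(), key=lambda item: item[1][0])
    overlaps = []
    for (name1, (off1, size1)), (name2, (off2, _)) in zip(ordered, ordered[1:]):
        if off1 + size1 > off2:
            overlaps.append((name1, name2))
    return overlaps


d_offset, d_size = layout["d"]
d_gib = d_size * SECTOR_BYTES / GIB

if __name__ == "__main__":
    print("overlaps:", find_overlaps(layout))
    print(f"dsk5d: {d_gib:.0f} GiB, ends at sector {d_offset + d_size}")
```

Because the list is sorted by offset, any overlap between two partitions also shows up as an overlap between two neighbours, so checking adjacent pairs is enough to detect a problem.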

Despite these changes, I'm still very suspicious about the reliability of the domain that failed. I have no idea why the domain in question panicked, nor do I know what caused the various panics. I'm not sure whether a hardware error was involved.

Right now the system is restoring the data from backup. The restore has been running for 3 hours. I hope everything will be restored without problems or data corruption, but I'm not really sure. And I'd like to know what caused the panic. Is it a known error?

Last but not least, I have 6 memory modules lying around that may or may not be OK. How can I test the memory? System downtime can easily be arranged, but it is limited to about 2 or 3 hours, or to weekends. Where can I get new modules for not too much money? Am I right that the AS1200 uses PC-100 SDRAM with parity?

OK. Thank you to everyone who read to the end. This has become rather long. I hope I didn't forget any useful information. Don't hesitate to ask me.

I'd like to know whether the AS1200 will keep working in the future as it has for the past 6 years. I love these Alpha systems. But now I'm anxious about the stability of my AS1200.

Any hint is welcome. Many thanks in advance.

Best regards
--


Uwe Lienig ---------- fon: (+49 351) 462 2780 fax: (+49 351) 462 3476 mailto:uwe.lienig@xxxxxxxxxxxxxxxxxxxxx

Forschungsinstitut Fahrzeugtechnik
<http://www.fif.mw.htw-dresden.de>
parcels: Gutzkowstr. 22, 01069 Dresden
letters: PF 12 07 01,    01008 Dresden

Hochschule für Technik und Wirtschaft Dresden (FH)
Friedrich-List-Platz 1, 01069 Dresden