Common disk faults include insufficient disk space, bad blocks on the disk, and an unmounted disk.
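Some disk faults damage the file system, for example an unmounted disk. Cluster management runs periodic disk checks, detects this kind of fault, and stops the instance, so the affected instance shows an abnormal state when you query the cluster status. Other faults do not damage the file system, for example insufficient disk space. Cluster management cannot detect these, and a service process exits abnormally when it accesses the faulty disk. Typical symptoms are that the database fails to start, checksums do not match, page reads or writes fail, or page verification fails.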
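For faults that damage the file system, the cluster status shows the affected instance persistently in the Unknown state. Locate the fault as follows: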
Check the cm_agent log, which is stored in mpp/omm/cm/cm_agent. An error similar to "data path disc writable test failed" in the log indicates that the file system is damaged.
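The log check above can be scripted. The following is a minimal sketch, assuming the cm_agent logs are plain-text files under a directory passed as the first argument. The function name find_disk_test_failures is hypothetical, and the search string is taken from the example message quoted above.

```shell
# Hypothetical helper: list cm_agent log lines that report a failed
# data-path write test. $1 is the log directory (an assumption; pass
# the cm_agent log path used by your installation).
find_disk_test_failures() {
    log_dir=$1
    # -r: recurse into the directory, -n: show line numbers,
    # -F: match the message as a fixed string rather than a regex.
    grep -rnF "writable test failed" "$log_dir"
}
```

For example, find_disk_test_failures mpp/omm/cm/cm_agent prints each matching line prefixed with its file name and line number, and returns a non-zero status when nothing matches.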
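The file system damage may be caused by an unmounted disk. In that case, ls -l shows abnormal permissions for the directory that corresponds to the disk, as follows: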
ls -l
ls: cannot access data
total 108
drwxr-xr-x 2 root root 4096 2014-10-24 00:00 bin
drwxr-xr-x 3 root root 4096 2014-10-24 00:07 boot
d????????? ? ? ? ? ? data
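The same check can be run against a single directory. The following is a minimal sketch, assuming the mountpoint utility from util-linux is installed. The function name check_data_dir and the three result strings are hypothetical.

```shell
# Hypothetical helper: report whether a data directory is readable and
# whether a file system is currently mounted on it.
check_data_dir() {
    dir=$1
    if ! ls -d "$dir" >/dev/null 2>&1; then
        # ls cannot stat the entry, e.g. the "d?????????" case above.
        echo "inaccessible"
    elif mountpoint -q "$dir"; then
        echo "mounted"
    else
        # The directory exists but no file system is mounted on it.
        echo "not-mounted"
    fi
}
```

A result other than "mounted" for a directory that should hold a data disk suggests checking the mount configuration before looking further.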
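The damage may also be caused by bad blocks on the disk, after which the operating system protects the file system and refuses reads and writes. Use a bad-block checking tool such as badblocks to check whether the disk has bad blocks, as follows: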
badblocks /dev/sda2 -s -v
Checking blocks to 30681000
Checking for bad blocks (read-only test): 306809600674112/ 306810000000
30680964
30680973
...
done
Pass completed, 37 bad blocks found.
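When the check is run from a script, the summary line is usually the only part that matters. The following is a minimal sketch that extracts the count from output in the format shown above. The function name count_bad_blocks is hypothetical.

```shell
# Hypothetical helper: read badblocks output on stdin and print the
# number of bad blocks reported in the "Pass completed" summary line.
count_bad_blocks() {
    sed -n 's/^Pass completed, \([0-9][0-9]*\) bad blocks found.*/\1/p'
}
```

A possible invocation is badblocks -v /dev/sda2 2>&1 | count_bad_blocks. The 2>&1 is included on the assumption that the summary may be written to standard error rather than standard output.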
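For faults that do not damage the file system, the service process exits abnormally when it accesses the faulty disk. Locate the fault as follows: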
Check the CN/DN logs, which are stored under mpp/omm/pg_log. The logs contain file read/write errors such as "No space left on device" or "invalid page header in block 122838 of relation base/16385/152715".
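Because the two messages point to different causes (a full disk versus a damaged page), it can help to count them separately. The following is a minimal sketch, assuming the CN/DN logs are plain-text files under the directory passed as the first argument. The function name summarize_disk_errors is hypothetical.

```shell
# Hypothetical helper: for each of the two error messages quoted above,
# print how many log lines under $1 contain it.
summarize_disk_errors() {
    log_dir=$1
    for msg in "No space left on device" "invalid page header"; do
        # grep -c would count per file, so merge the matches (-h drops
        # file names) and count the lines with wc instead.
        n=$(grep -rhF "$msg" "$log_dir" | wc -l | tr -d ' ')
        echo "$msg: $n"
    done
}
```

For example, summarize_disk_errors mpp/omm/pg_log prints one line per message with its count.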
The read/write errors may be caused by insufficient disk space. In that case, df -h shows that disk usage has reached 100%, as follows:
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 258G 245G 100% /
devtmpfs 48G 132K 48G 1% /dev
tmpfs 48G 4.0K 48G 1% /dev/shm
/dev/sda1 1004M 45M 908M 5% /boot
/dev/sdb 14T 449G 13T 4% /mnt/slicefile1
/dev/sdc 14T 1.3T 12T 10% /mnt/slicefile2
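To pick out full file systems automatically, the POSIX output format of df (df -P) is easier to parse than df -h. The following is a minimal sketch. The function name full_filesystems and the threshold argument are hypothetical, and mount points that contain spaces are not handled.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount
# points whose usage is at or above the threshold in $1 (percent).
full_filesystems() {
    threshold=$1
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # "100%" -> "100"
        if (use + 0 >= t + 0) print $6
    }'
}
```

A possible invocation is df -P | full_filesystems 95, which lists every mount point that is at least 95% full.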
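If the disks are configured as RAID 5 and df reports the disk as full although the space has not actually been used up, more than one disk in a single RAID group may have failed. The MegaCli tool then shows more than one RAID disk in a failed state, as follows: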
./megacli64 -cfgdsply -aall
Adapter:
Product Name: PERC 5/i Integrated
Memory: 256MB
BBU: Present
Serial No: 12345
RAID Level: Primary-1, Secondary-, RAID Level Qualifier-
Size:285568MB
State: Optimal
Physical Disk:
Media Error Count:
Other Error Count:
Firmware state: Offline
Physical Disk: 1
Media Error Count:
Other Error Count:
Firmware state: Offline
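The number of offline disks can be counted from this output. The following is a minimal sketch that counts lines in the format shown above. The function name count_offline_disks is hypothetical, and the count covers all disks in the output rather than one RAID group at a time.

```shell
# Hypothetical helper: read MegaCli output on stdin and print how many
# physical disks report "Firmware state: Offline".
count_offline_disks() {
    # n + 0 prints 0 instead of an empty string when nothing matches.
    awk '/Firmware state: Offline/ { n++ } END { print n + 0 }'
}
```

A possible invocation is ./megacli64 -cfgdsply -aall | count_offline_disks. A result greater than 1 matches the multi-disk failure described above, provided the offline disks belong to the same RAID group.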
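If data pages are lost after the machine loses power unexpectedly, the cause may be that the Disk Cache Policy of the disks was not disabled. Use the following command to check whether Disk Cache Policy is disabled: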
plat1:/opt/MegaRAID/MegaCli # ./MegaCli64 -LDInfo -Lall -aAll
Adapter -- Virtual Drive Information:
Virtual Drive: (Target Id: )
Name :
RAID Level : Primary-1, Secondary-, RAID Level Qualifier-0
Size : 557.861 GB
Sector Size : 512
Mirror Data : 557.861 GB
State : Optimal
Strip Size : 256 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disabled
Encryption Type : None
PI type: No PI
Is VD Cached: No
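With several virtual drives, reading every "Disk Cache Policy" line by eye is error-prone. The following is a minimal sketch that reports any value other than Disabled, assuming output in the format shown above. The function name non_disabled_disk_cache is hypothetical.

```shell
# Hypothetical helper: read `MegaCli64 -LDInfo` output on stdin and print
# every "Disk Cache Policy" value that is not "Disabled".
# The exit status is 0 only when no such value is found.
non_disabled_disk_cache() {
    awk -F':' '/Disk Cache Policy/ {
        value = $2
        gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim surrounding whitespace
        if (value != "Disabled") { print value; bad = 1 }
    } END { exit bad }'
}
```

A possible invocation is ./MegaCli64 -LDInfo -Lall -aAll | non_disabled_disk_cache. No output and a zero exit status mean that every listed virtual drive has the disk cache disabled.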
If Disk Cache Policy is not Disabled, run the following command to disable it:
plat1:/opt/MegaRAID/MegaCli # ./MegaCli64 -LDSetProp -DisDskCache -Immediate -Lall -aAll
Set Disk Cache Policy to Disabled on Adapter , VD (target id: ) success
Set Disk Cache Policy to Disabled on Adapter , VD 1 (target id: 1) success
Set Disk Cache Policy to Disabled on Adapter , VD 2 (target id: 2) success
Set Disk Cache Policy to Disabled on Adapter , VD 3 (target id: 3) success
Set Disk Cache Policy to Disabled on Adapter , VD 4 (target id: 4) success
Set Disk Cache Policy to Disabled on Adapter , VD 5 (target id: 5) success
Set Disk Cache Policy to Disabled on Adapter , VD 6 (target id: 6) success
Exit Code: 0x00