zilongzilong

CentOS 7 (3.10.0-327.el7.x86_64) reboot problem

Category: linux

centos7 RHEL7.4 reboot kernel bug kernel BUG at mm/page_alloc.c:1389!

CentOS 7 (3.10.0-123.el7.x86_64) reboot problem	http://aperise.iteye.com/blog/2326082
CentOS 7 (3.10.0-327.el7.x86_64) reboot problem	http://aperise.iteye.com/blog/2425717


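1. Problem

The server (2U, 2 CPUs with 6 cores each, 8 × 16 GB RAM, 5 × 2 TB disks) originally ran CentOS 7 (3.10.0-123.el7.x86_64) and hit the kernel bug “kernel BUG at mm/page_alloc.c:3765!”. Red Hat's advice was to upgrade, so the machine was upgraded to CentOS 7 (3.10.0-327.el7.x86_64). Recently, however, one application server has still been rebooting by itself after running under load for some time, this time with the error “kernel BUG at mm/page_alloc.c:1389!”.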

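2. Approach

Having already dealt with a similar automatic-reboot problem, I followed the same two main steps:

(1) Enable the KDUMP service on the affected machine so that a crash log is captured when the server goes down and reboots;

(2) Analyze the crash log, find the cause, and fix it.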

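3. Installing the KDUMP service

The company has purchased Red Hat support, and our system operations engineer gave me a shell script for setting up KDUMP. Running it takes care of the whole KDUMP installation and activation. The script mainly does the following:

(1) #kexec-tools checking: checks that the kexec-tools package is installed;

(2) #add crash kernel: backs up grub.cfg and sets the crashkernel= kernel parameter, see https://access.redhat.com/site/solutions/916043;

(3) #backup kdump.conf: backs up /etc/kdump.conf and writes a new one (dump path /var/crash, makedumpfile as the core collector);

(4) #Check if the dump directory be mounted: checks whether /var/crash sits on a separate device;

(5) #enable kdump service: enables and restarts kdump.service;

(6) #kernel parameter change: backs up /etc/sysctl.conf before adding the diagnostic settings below;

(7) #server hang: sets kernel.sysrq=1 and kernel.unknown_nmi_panic=1;

(8) #softlockup: sets kernel.softlockup_panic=1;

(9) #oom: sets vm.panic_on_oom=1.

The complete script is in the attachment kdumpconfig.zip; its contents are as follows: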

#!/bin/sh
echo Kdump Helper is starting to configure kdump service

#kexec-tools checking
if ! rpm -q kexec-tools > /dev/null
then 
    echo "kexec-tools not found, please run 'yum install kexec-tools' to install it"
    exit 1
fi
mem_total=`free -g |awk 'NR==2 {print $2 }'`
echo Your total memory is $mem_total G

#add crash kernel
#https://access.redhat.com/site/solutions/916043
grub_conf=/boot/grub2/grub.cfg
grub_conf_kdumphelper=/boot/grub2/grub.cfg.kdumphelper.$(date +%y-%m-%d-%H:%M:%S)
echo backup $grub_conf to $grub_conf_kdumphelper 
cp $grub_conf $grub_conf_kdumphelper 
compute_rhel7_crash_kernel ()
{
    mem_size=$1
    if [ $mem_size -le 2 ]
    then
        reserved_memory="128M"
    else
        reserved_memory="auto"
    fi
    echo "$reserved_memory"
}
crashkernel_para=`compute_rhel7_crash_kernel $mem_total `
echo crashkernel=$crashkernel_para is set in $grub_conf
sed -i  '/^\tlinux/ s/crashkernel=\(auto\|[[:digit:]]*[mM]@[[:digit:]]*[mM]\|[[:digit:]]*[mM]\)//g' $grub_conf
sed -i ' /^\tlinux/  s/$/ crashkernel='$crashkernel_para'/g' $grub_conf

#backup kdump.conf
kdump_conf=/etc/kdump.conf
kdump_conf_kdumphelper=/etc/kdump.conf.kdumphelper.$(date +%y-%m-%d-%H:%M:%S)
echo backup $kdump_conf to $kdump_conf_kdumphelper
cp $kdump_conf $kdump_conf_kdumphelper
dump_path=/var/crash
echo path $dump_path > $kdump_conf
dump_level=1
echo core_collector makedumpfile -c --message-level 1 -d $dump_level >> $kdump_conf
echo 'default reboot' >>  $kdump_conf

#Check if the dump directory be mounted
dump_dev_name=$(mount | grep $dump_path | awk '{print $1}')
dump_dev_uuid=$(blkid `mount | grep $dump_path | awk '{print $1}'`| awk '{print $2}')
dump_fs_type=$(mount | grep $dump_path | awk '{print $5}')
mount | grep $dump_path > /dev/null
if [ $? -ne 0 ]; then
    echo "==== The dump directory is not mounted to a separate device. Your vmcore will be saved in the root filesystem ===="
else
    echo "==== The dump directory is mounted to a separate device. Your vmcore will be dumped to that device ===="
    echo "$dump_fs_type $dump_dev_uuid" >> $kdump_conf 
    cat /etc/fstab | awk '{print $1}' | grep -E "^${dump_dev_name}|^${dump_dev_uuid}" >> /dev/null
    if [ $? -ne 0 ]; then
        echo "==== You need to add an entry in the /etc/fstab to make sure the dump directory is auto-mounted after system reboot. ===="
        echo "==== Read more in https://access.redhat.com/solutions/1197493 ===="
    fi
fi

#enable kdump service
echo enable kdump service...
systemctl enable kdump.service
systemctl -a|grep kdump
systemctl restart kdump.service

#kernel parameter change
echo Starting to configure extra diagnostic options
sysctl_conf=/etc/sysctl.conf
sysctl_conf_kdumphelper=/etc/sysctl.conf.kdumphelper.$(date +%y-%m-%d-%H:%M:%S)
echo backup $sysctl_conf to $sysctl_conf_kdumphelper
cp $sysctl_conf $sysctl_conf_kdumphelper

#server hang
sed -i '/^kernel.sysrq/ s/kernel/#kernel/g ' $sysctl_conf 
echo >> $sysctl_conf
echo '#Panic on sysrq and nmi button, magic button alt+printscreen+c or nmi button could be pressed to collect a vmcore' >> $sysctl_conf
echo '#Added by kdumphelper, more information about it can be found in solution below' >> $sysctl_conf
echo '#https://access.redhat.com/site/solutions/2023' >> $sysctl_conf
echo 'kernel.sysrq=1' >> $sysctl_conf
echo 'kernel.sysrq=1 set in /etc/sysctl.conf'
echo '#https://access.redhat.com/site/solutions/125103' >> $sysctl_conf
echo 'kernel.unknown_nmi_panic=1' >> $sysctl_conf
echo 'kernel.unknown_nmi_panic=1  set in /etc/sysctl.conf'

#softlockup
sed -i '/^kernel.softlockup_panic/ s/kernel/#kernel/g ' $sysctl_conf 
echo >> $sysctl_conf
echo '#Panic on soft lockups.' >> $sysctl_conf
echo '#Added by kdumphelper, more information about it can be found in solution below' >> $sysctl_conf
echo '#https://access.redhat.com/site/solutions/19541' >> $sysctl_conf
echo 'kernel.softlockup_panic=1' >> $sysctl_conf
echo 'kernel.softlockup_panic=1 set in /etc/sysctl.conf'

#oom
# The key is vm.panic_on_oom (not kernel.panic_on_oom), so comment out any existing vm. entry
sed -i '/^vm.panic_on_oom/ s/^vm/#vm/' $sysctl_conf
echo >> $sysctl_conf
echo '#Panic on out of memory.' >> $sysctl_conf
echo '#Added by kdumphelper, more information about it can be found in solution below' >> $sysctl_conf
echo '#https://access.redhat.com/site/solutions/20985' >> $sysctl_conf
echo 'vm.panic_on_oom=1' >> $sysctl_conf
echo 'vm.panic_on_oom=1 set in /etc/sysctl.conf'

Running the script above on the server is all it takes to install and enable KDUMP. The next time the server crashes, KDUMP records the crash log and stores it under /var/crash/.
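A quick way to confirm that KDUMP is really armed after running the script is sketched below. It assumes the standard Linux locations /proc/cmdline and /sys/kernel/kexec_crash_loaded; the function names has_crashkernel and crash_kernel_loaded are made up for this example and are not part of the script above. Keep in mind that a crashkernel= value newly written to grub.cfg only takes effect after a reboot.

```shell
# Return 0 if the kernel command line (default: /proc/cmdline) contains
# a crashkernel= parameter, i.e. memory was reserved for the crash kernel.
has_crashkernel() {
    grep -qE '(^|[[:space:]])crashkernel=[^[:space:]]+' "${1:-/proc/cmdline}" 2>/dev/null
}

# Return 0 if a crash kernel is currently loaded (the sysfs file reads 1).
crash_kernel_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}" 2>/dev/null)" = "1" ]
}

if has_crashkernel; then
    echo "crashkernel= is present on the running kernel command line"
else
    echo "crashkernel= is missing; a reboot is probably still needed"
fi

if crash_kernel_loaded; then
    echo "crash kernel is loaded, KDUMP should capture the next panic"
else
    echo "crash kernel is NOT loaded, check the kdump service"
fi
```

`systemctl is-active kdump` should report active as well. For an end-to-end test, `echo c > /proc/sysrq-trigger` (as root) forces a crash immediately, so only try that on a machine that can afford to go down.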

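4. Analyzing the log

Once KDUMP is enabled, all that remains is to wait for the problem to recur and then analyze the log it leaves behind. Among the crash files generated on my server, vmcore-dmesg.txt contains the following: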

[466312.238996] ------------[ cut here ]------------
[466312.239025] kernel BUG at mm/page_alloc.c:1389!
[466312.239043] invalid opcode: 0000 [#1] SMP

Note the line “kernel BUG at mm/page_alloc.c:1389!”: it tells us this is a kernel-level bug. The next step is to look it up on the Red Hat site and see how to resolve it.
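To pull that line out without opening each file by hand, something like the following may help. It is only a sketch: it assumes the dumps live under /var/crash (the path configured in the script above) with one vmcore-dmesg.txt per crash, and list_kernel_bugs is a name made up for this example.

```shell
# Print every "kernel BUG at" line found in the vmcore-dmesg.txt files
# under a crash directory (default: /var/crash), prefixed with the file name.
list_kernel_bugs() {
    crash_dir=${1:-/var/crash}
    find "$crash_dir" -name vmcore-dmesg.txt \
        -exec grep -H 'kernel BUG at' {} + 2>/dev/null
}

list_kernel_bugs /var/crash || echo "no kernel BUG lines found under /var/crash"
```

Each output line shows which dump the message came from, which makes it easy to see whether repeated crashes hit the same source line.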

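5. Solving the problem

Searching the Red Hat site for “kernel BUG at mm/page_alloc.c:1389!” leads to the solution at https://access.redhat.com/solutions/3208581. Note that only registered users can read the full solution on the Red Hat site, and registering in turn requires a company or individual that has purchased Red Hat support, which is rather annoying. I asked our company's system operations engineer to download the page with his account; see the attachment “RHEL 7.4 server panics with message _kernel BUG at mm_page_alloc.c_1389!_.rar”.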

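From that page:

(1) Any CentOS 7 system at the RHEL 7.4 level is affected by this problem;

(2) The fix was last verified at 21:55 on May 30, 2018;

(3) The affected major release is Red Hat Enterprise Linux 7.4; within it, kernel-3.10.0-693.el7.x86_64 and later builds show the problem, until the fixed kernels listed in (5);

(4) Vendors and server models on which the problem has been seen so far: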

LENOVO System x3650 M5
LENOVO System x3550 M5
IBM Flex System x240 M5
FUJITSU PRIMERGY BX2560 M1
FUJITSU PRIMERGY RX2530 M4
Cisco UCS B200 M4

(5) Red Hat's recommendation

Red Hat Enterprise Linux 7: upgrade to kernel-3.10.0-862.el7 from Errata RHSA-2018:1062, or later
Red Hat Enterprise Linux 7.4 (EUS): upgrade to kernel-3.10.0-693.33.1.el7 from Errata RHSA-2018:1738, or later

(6) Root cause as stated by Red Hat

When a memory page is reclaimed from a freelist a whole block is considered. When the beginning of a range is not aligned with the block, kernel crashes due to uninitialized page metadata at the beginning of the block.

(7) Red Hat also provides diagnostic steps; for details see https://access.redhat.com/solutions/3208581
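To check whether a given machine already runs a kernel at or above the fixed build from item (5), a small comparison like the one below can be used. It is a rough sketch under a few assumptions: it only understands el7 release strings of the form 3.10.0-<build>.el7..., it compares against the mainline fix 862 (for a 7.4 EUS system the threshold would be 693.33.1 instead), and kernel_build and build_ge are helper names invented for this example.

```shell
# Extract the dotted build number from an el7 kernel release string,
# e.g. "3.10.0-693.33.1.el7.x86_64" -> "693.33.1".
# Strings that do not match are printed unchanged.
kernel_build() {
    printf '%s\n' "$1" | sed 's/^3\.10\.0-\([0-9.]*\)\.el7.*$/\1/'
}

# Return 0 if dotted numeric version $1 >= $2 (missing fields count as 0).
build_ge() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*}
        y=${b%%.*}
        if [ "${x:-0}" -gt "${y:-0}" ]; then return 0; fi
        if [ "${x:-0}" -lt "${y:-0}" ]; then return 1; fi
        # Drop the leading field; clear the string once no dot is left.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

fixed_build=862   # kernel-3.10.0-862.el7, from RHSA-2018:1062 as cited above
current=$(kernel_build "$(uname -r)")
case $current in
    ''|*[!0-9.]*)
        echo "$(uname -r) does not look like an el7 kernel release, nothing to compare" ;;
    *)
        if build_ge "$current" "$fixed_build"; then
            echo "kernel build $current is at or above $fixed_build"
        else
            echo "kernel build $current is older than $fixed_build, an upgrade is advisable"
        fi ;;
esac
```

Comparing the numeric fields one by one avoids relying on plain string ordering, which would rank 693.4 above 693.33.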