Troubleshooting an Unexpected Linux Reboot

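2023.07.08: a customer reported that they could not log in to the system we provide. After logging in to the server, I found that some Java processes were no longer running, and analysis showed this was caused by a system reboot. The troubleshooting steps are recorded below for future reference.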

[TOC]

Determine the reboot time

  • Check the system boot time. Conclusion: this host booted at 2023-07-08 17:47:33
[root@localhost ~]# uptime -s
2023-07-08 17:47:33

[root@localhost ~]# date -d "$(awk -F. '{print $1}' /proc/uptime) second ago" +"%Y-%m-%d %H:%M:%S"
2023-07-08 17:47:33
  • Check the host's reboot and shutdown records. Conclusion: the most recent reboot was at 2023-07-08 17:47
[root@localhost ~]# who -b
         system boot  2023-07-08 17:47

[root@localhost ~]# last reboot
reboot   system boot  3.10.0-1160.90.1 Sat Jul  8 17:47 - 09:46 (2+15:58)   # <- see here
reboot   system boot  3.10.0-1160.90.1 Sat Jul  8 16:48 - 17:14  (00:26)    
reboot   system boot  3.10.0-1160.90.1 Mon May 22 13:19 - 10:03 (46+20:44)  
reboot   system boot  3.10.0-1127.el7. Fri May 19 23:28 - 12:43 (2+13:15)   

[root@localhost ~]# last shutdown
wtmp begins Fri May 19 23:28:19 2023
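Without the -x option, last usually hides shutdown records, which would explain why last shutdown prints only the wtmp start line above. The sketch below is one way to use last -x to tell whether each boot was preceded by a recorded shutdown. The helper name classify_boots is hypothetical, and the verdicts are only a heuristic: a boot with no shutdown record before it suggests, but does not prove, a crash, power loss, or hard reset.

```shell
# Sketch: label each boot in `last -x` output (newest first) by whether a
# shutdown record precedes it. classify_boots is a hypothetical helper name.
classify_boots() {
    awk '
        $1 == "reboot" || $1 == "shutdown" {
            # This line is the event that happened just before the pending
            # (newer) boot, because the input is ordered newest first.
            if (pending != "") {
                verdict = ($1 == "shutdown") ? "clean" : "unclean"
                print verdict ": " pending
            }
            pending = ($1 == "reboot") ? $0 : ""
        }
        # The oldest boot has no earlier record to compare against.
        END { if (pending != "") print "unknown: " pending }
    '
}
```

Usage, assuming wtmp has not been rotated or wiped: last -x | classify_boots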

Check crontab for scheduled reboot jobs

  • Check root's scheduled jobs. Conclusion: there is no job that would trigger a reboot
[root@localhost ~]# crontab -l
no crontab for root
  • Check the cron log for the hour surrounding the reboot (2023-07-08 17:47:33). Conclusion: the cron log contains no reboot record
[root@localhost ~]# sed -n '/Jul  8 17:00:01/,/Jul  8 18:00:01/p' /var/log/cron
Jul  8 17:00:01 localhost CROND[33407]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul  8 17:01:01 localhost CROND[33449]: (root) CMD (run-parts /etc/cron.hourly)
Jul  8 17:01:01 localhost run-parts(/etc/cron.hourly)[33449]: starting 0anacron
Jul  8 17:01:01 localhost run-parts(/etc/cron.hourly)[33462]: finished 0anacron
Jul  8 17:01:01 localhost run-parts(/etc/cron.hourly)[33449]: starting mcelog.cron
Jul  8 17:01:01 localhost run-parts(/etc/cron.hourly)[33469]: finished mcelog.cron
Jul  8 17:10:01 localhost CROND[33595]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul  8 17:20:01 localhost CROND[33738]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul  8 17:30:01 localhost CROND[33884]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul  8 17:40:02 localhost CROND[34035]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul  8 17:50:01 localhost CROND[34183]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul  8 18:00:01 localhost CROND[34337]: (root) CMD (/usr/lib64/sa/sa1 1 1)
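crontab -l only covers the current user's crontab, so a job in a system-wide cron location would be missed. A minimal sketch for scanning those locations is shown below. The function name find_reboot_jobs and the keyword list are assumptions chosen for illustration, and the default paths are the usual ones on a CentOS 7 style system.

```shell
# Sketch: search cron files for commands that could reboot or power off the host.
# find_reboot_jobs is a hypothetical helper; pass files or directories to scan,
# or call it with no arguments to use common cron locations.
find_reboot_jobs() {
    if [ "$#" -eq 0 ]; then
        set -- /etc/crontab /etc/cron.d /etc/cron.hourly /etc/cron.daily /var/spool/cron
    fi
    # -r recurse, -n line numbers, -H always print the file name, -E extended regex.
    # The second grep drops lines whose content starts with a comment marker.
    grep -rnHE '(^|[^[:alnum:]_])(reboot|shutdown|poweroff|halt|init 6)([^[:alnum:]_]|$)' "$@" 2>/dev/null \
        | grep -vE '^[^:]+:[0-9]+:[[:space:]]*#'
}
```

No output means no match in the scanned files. It does not rule out jobs started by other schedulers such as at or systemd timers.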

Analyze system logs

  • Check the IP addresses that recently logged in to this host (the last command reads /var/log/wtmp)
[root@localhost ~]# last -n10 -f /var/log/wtmp
root     pts/1        192.168.4.250    Sat Jul  8 18:44   still logged in   
root     pts/0        192.168.4.184    Sat Jul  8 17:52 - 18:04  (00:11)    
reboot   system boot  3.10.0-1160.90.1 Sat Jul  8 17:47 - 09:43 (2+15:56)   # <- see here
reboot   system boot  3.10.0-1160.90.1 Sat Jul  8 16:48 - 17:14  (00:26)    
reboot   system boot  3.10.0-1160.90.1 Sat Jul  8 16:29 - 17:14  (00:45)    
reboot   system boot  3.10.0-1160.90.1 Sat Jul  8 16:13 - 17:14  (01:01)    
reboot   system boot  3.10.0-1160.90.1 Sat Jul  8 15:39 - 17:14  (01:35)    
reboot   system boot  3.10.0-1160.90.1 Sat Jul  8 15:19 - 17:14  (01:55)    
reboot   system boot  3.10.0-1160.90.1 Sat Jul  8 14:59 - 17:14  (02:14)    
root     pts/2        192.168.4.69     Fri Jul  7 15:11 - 18:01  (02:50)    # <- see here

[root@localhost ~]# last -a | more
root     pts/1        192.168.4.250    Sat Jul  8 18:44   still logged in  
root     pts/0        Sat Jul  8 17:52 - 18:04  (00:11)     192.168.4.184
reboot   system boot  Sat Jul  8 17:47 - 09:58 (2+16:10)    3.10.0-1160.90.1.el7.x86_64    # <- see here
reboot   system boot  Sat Jul  8 16:48 - 17:14  (00:26)     3.10.0-1160.90.1.el7.x86_64
reboot   system boot  Sat Jul  8 16:29 - 17:14  (00:45)     3.10.0-1160.90.1.el7.x86_64
reboot   system boot  Sat Jul  8 16:13 - 17:14  (01:01)     3.10.0-1160.90.1.el7.x86_64
reboot   system boot  Sat Jul  8 15:39 - 17:14  (01:35)     3.10.0-1160.90.1.el7.x86_64
reboot   system boot  Sat Jul  8 15:19 - 17:14  (01:55)     3.10.0-1160.90.1.el7.x86_64
reboot   system boot  Sat Jul  8 14:59 - 17:14  (02:14)     3.10.0-1160.90.1.el7.x86_64
root     pts/2        Fri Jul  7 15:11 - 18:01  (02:50)     192.168.4.69                   # <- see here
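  • Conclusion: before the reboot (i.e. before 2023-07-08 17:47:33), the only login was from 192.168.4.69 at 2023-07-07 15:11. That login is too far from the reboot time, so a manual reboot is ruled out
  • Root cause: the customer confirmed that several physical machines in the data center rebooted at the same time, which also rebooted every virtual machine running on those hosts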
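To put a number on how far a login is from a reboot, the gap can be computed from the two timestamps. This is a small sketch that assumes GNU date (for the -d option). The helper name minutes_between is hypothetical, and the year 2023 is taken from the incident date because the last output itself does not print a year.

```shell
# Sketch: minutes elapsed between two timestamps, using GNU date.
# minutes_between is a hypothetical helper; the body runs in a subshell
# so its variables do not leak into the calling shell.
minutes_between() (
    start=$(date -d "$1" +%s) || exit 1
    end=$(date -d "$2" +%s) || exit 1
    echo $(( (end - start) / 60 ))
)
```

For example, minutes_between "2023-07-07 18:01" "2023-07-08 17:47" measures the gap between the logout of the 192.168.4.69 session and the last boot. With no clock change in between it prints 1426, i.e. about 23 hours 46 minutes.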

How to investigate reboot causes on cloud hosts

Alibaba Cloud

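  • 2021.05.26: a host rebooted on its own at 00:27:03. The ECS console showed that the reboot was caused by a system error (Alibaba Cloud support confirmed a battery module failure on the physical machine hosting the VM).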

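Note: log in to the Alibaba Cloud ECS console - select the instance - Instance Details - Events - Unexpected O&M events: [Event type] Instance restart due to system error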

Huawei Cloud

To be added...

Appendix

Common causes of reboots

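  • Planned maintenance by an administrator, such as installing an urgent patch or replacing a component
  • Hardware failure, such as a power supply or memory fault
  • A host machine reboot that takes its virtual machines down with it
  • System software faults, such as corrupted system files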
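Some of these causes leave traces in the system log before the machine goes down. The sketch below greps a log file for a few kernel messages that commonly accompany crashes or hardware faults. The helper name scan_crash_hints and the keyword list are assumptions for illustration, not an exhaustive signature set. A sudden power loss, as in the incident above, typically leaves no such trace.

```shell
# Sketch: look for kernel messages that often precede an unplanned reboot.
# scan_crash_hints is a hypothetical helper; the keyword list is illustrative only.
scan_crash_hints() (
    logfile="${1:-/var/log/messages}"
    # -n line numbers, -i case-insensitive, -E extended regex
    grep -n -i -E 'kernel panic|out of memory|hardware error|machine check|soft lockup' "$logfile"
)
```

Usage: scan_crash_hints /var/log/messages (the default), or pass a rotated log file as the first argument.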

Other notes

  • 2021.01.13: while analyzing a cryptomining malware script, I found a section that erases its tracks
echo 0>/var/spool/mail/root
echo 0>/var/log/wtmp
echo 0>/var/log/secure
echo 0>/var/log/cron
  • View the 4 lines before and after each line that matches a keyword
more /var/log/syslog
more /var/log/kern.log
cat /var/log/messages | grep -n -i "shutting down for system reboot"
cat /var/log/messages | grep -B 4 "shutting down for system reboot"    # 4 log lines before the reboot message
cat /var/log/messages | grep -A 4 "shutting down for system reboot"    # 4 log lines after the reboot message
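The separate -B and -A calls can be combined with grep's -C option, which prints context on both sides of each match. A minimal sketch follows. The helper name reboot_context is hypothetical, and the search string is the one used above, which may not appear on every distribution or init system.

```shell
# Sketch: show N lines of context around each reboot message in a log file.
# reboot_context is a hypothetical helper; arguments: [logfile] [context lines].
reboot_context() (
    logfile="${1:-/var/log/messages}"
    lines="${2:-4}"
    # -n line numbers, -i case-insensitive, -C context before and after
    grep -n -i -C "$lines" "shutting down for system reboot" "$logfile"
)
```

Usage: reboot_context /var/log/messages 4. In the output, matching lines are prefixed with "N:" and context lines with "N-".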
Copyright © www.sqlfans.cn 2023 All Rights Reserved. Last updated: 2023-07-12 09:25:40
