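Troubleshooting an Unexpected Linux Reboot
On 2023-07-08 a customer reported that the system we supplied could not be logged into. After logging in to the server I found that some Java processes were no longer running, and analysis showed that a system reboot was the cause. The reboot investigation is recorded below for future reference.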
[TOC]
Determine the reboot time
- Check the system boot time. Result: this machine booted at 2023-07-08 17:47:33.
[root@localhost ~]# uptime -s
2023-07-08 17:47:33
[root@localhost ~]# date -d "$(awk -F. '{print $1}' /proc/uptime) second ago" +"%Y-%m-%d %H:%M:%S"
2023-07-08 17:47:33
- Check the reboot and shutdown records. Result: the most recent reboot of this machine was at 2023-07-08 17:47.
[root@localhost ~]# who -b
system boot 2023-07-08 17:47
[root@localhost ~]# last reboot
reboot system boot 3.10.0-1160.90.1 Sat Jul 8 17:47 - 09:46 (2+15:58) #.look here
reboot system boot 3.10.0-1160.90.1 Sat Jul 8 16:48 - 17:14 (00:26)
reboot system boot 3.10.0-1160.90.1 Mon May 22 13:19 - 10:03 (46+20:44)
reboot system boot 3.10.0-1127.el7. Fri May 19 23:28 - 12:43 (2+13:15)
[root@localhost ~]# last shutdown
wtmp begins Fri May 19 23:28:19 2023
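The `date -d ... second ago` command above obtains the boot time by subtracting the uptime from the current time. Below is a minimal sketch of that arithmetic as a function, so it can be checked against a fixed input. The function name `boot_epoch` and its arguments are assumptions made for illustration; they are not part of the original procedure.

```shell
#!/usr/bin/env bash
# boot_epoch NOW_EPOCH [UPTIME_FILE]
# Prints the boot time as a Unix epoch: NOW_EPOCH minus the whole seconds of uptime.
# UPTIME_FILE defaults to /proc/uptime, whose first field is the uptime in seconds.
boot_epoch() {
    local now="$1"
    local file="${2:-/proc/uptime}"
    local up
    # Drop the fractional part, as the awk -F. call in the text does.
    up=$(awk -F. '{print $1; exit}' "$file") || return 1
    echo $(( now - up ))
}

# Usage (requires GNU date for the @epoch form):
# date -d "@$(boot_epoch "$(date +%s)")" +"%Y-%m-%d %H:%M:%S"
```

If the result disagrees with `uptime -s` or `who -b` by more than a second or two, the system clock may have been adjusted after boot.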
Check crontab for scheduled reboot jobs
- List the scheduled jobs. Result: there is no abnormal job that could trigger a reboot.
[root@localhost ~]# crontab -l
no crontab for root
- Check the cron log for the hour around the reboot (2023-07-08 17:47:33). Result: the cron log contains no record of a reboot command.
[root@localhost ~]# sed -n '/Jul 8 17:00:01/,/Jul 8 18:00:01/p' /var/log/cron
Jul 8 17:00:01 localhost CROND[33407]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul 8 17:01:01 localhost CROND[33449]: (root) CMD (run-parts /etc/cron.hourly)
Jul 8 17:01:01 localhost run-parts(/etc/cron.hourly)[33449]: starting 0anacron
Jul 8 17:01:01 localhost run-parts(/etc/cron.hourly)[33462]: finished 0anacron
Jul 8 17:01:01 localhost run-parts(/etc/cron.hourly)[33449]: starting mcelog.cron
Jul 8 17:01:01 localhost run-parts(/etc/cron.hourly)[33469]: finished mcelog.cron
Jul 8 17:10:01 localhost CROND[33595]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul 8 17:20:01 localhost CROND[33738]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul 8 17:30:01 localhost CROND[33884]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul 8 17:40:02 localhost CROND[34035]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul 8 17:50:01 localhost CROND[34183]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul 8 18:00:01 localhost CROND[34337]: (root) CMD (/usr/lib64/sa/sa1 1 1)
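`crontab -l` only lists the crontab of the current user, so system-wide job files are worth scanning as well. The sketch below searches any set of files (job definitions or cron logs) for commands that can restart or power off the machine. The function name `find_reboot_jobs` and the keyword list are assumptions for illustration, and the list is not exhaustive.

```shell
#!/usr/bin/env bash
# find_reboot_jobs FILE...
# Prints "file:line:text" for every non-comment line that mentions a command
# able to restart or power off the machine.
find_reboot_jobs() {
    # First grep: keyword match on word boundaries, with file name and line number.
    # Second grep: discard lines whose text part starts with '#'.
    grep -HnE '(^|[^[:alnum:]_])(reboot|shutdown|poweroff|halt|init[[:space:]]+[06])([^[:alnum:]_]|$)' "$@" 2>/dev/null \
        | grep -vE '^[^:]+:[0-9]+:[[:space:]]*#'
}

# Usage (typical CentOS 7 locations; adjust for other distributions):
# find_reboot_jobs /etc/crontab /etc/cron.d/* /var/spool/cron/* /var/log/cron
```

No output means none of the scanned files mentions one of the keywords; it does not rule out a script that reboots the machine indirectly.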
Analyze the system logs
- Check the IP addresses that recently logged in to this machine (the last command reads /var/log/wtmp)
[root@localhost ~]# last -n10 -f /var/log/wtmp
root pts/1 192.168.4.250 Sat Jul 8 18:44 still logged in
root pts/0 192.168.4.184 Sat Jul 8 17:52 - 18:04 (00:11)
reboot system boot 3.10.0-1160.90.1 Sat Jul 8 17:47 - 09:43 (2+15:56) #.look here
reboot system boot 3.10.0-1160.90.1 Sat Jul 8 16:48 - 17:14 (00:26)
reboot system boot 3.10.0-1160.90.1 Sat Jul 8 16:29 - 17:14 (00:45)
reboot system boot 3.10.0-1160.90.1 Sat Jul 8 16:13 - 17:14 (01:01)
reboot system boot 3.10.0-1160.90.1 Sat Jul 8 15:39 - 17:14 (01:35)
reboot system boot 3.10.0-1160.90.1 Sat Jul 8 15:19 - 17:14 (01:55)
reboot system boot 3.10.0-1160.90.1 Sat Jul 8 14:59 - 17:14 (02:14)
root pts/2 192.168.4.69 Fri Jul 7 15:11 - 18:01 (02:50) #.look here
[root@localhost ~]# last -a | more
root pts/1 192.168.4.250 Sat Jul 8 18:44 still logged in
root pts/0 Sat Jul 8 17:52 - 18:04 (00:11) 192.168.4.184
reboot system boot Sat Jul 8 17:47 - 09:58 (2+16:10) 3.10.0-1160.90.1.el7.x86_64 #.look here
reboot system boot Sat Jul 8 16:48 - 17:14 (00:26) 3.10.0-1160.90.1.el7.x86_64
reboot system boot Sat Jul 8 16:29 - 17:14 (00:45) 3.10.0-1160.90.1.el7.x86_64
reboot system boot Sat Jul 8 16:13 - 17:14 (01:01) 3.10.0-1160.90.1.el7.x86_64
reboot system boot Sat Jul 8 15:39 - 17:14 (01:35) 3.10.0-1160.90.1.el7.x86_64
reboot system boot Sat Jul 8 15:19 - 17:14 (01:55) 3.10.0-1160.90.1.el7.x86_64
reboot system boot Sat Jul 8 14:59 - 17:14 (02:14) 3.10.0-1160.90.1.el7.x86_64
root pts/2 Fri Jul 7 15:11 - 18:01 (02:50) 192.168.4.69 #.look here
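- Result: before the reboot (that is, before 2023-07-08 17:47:33) the only login was from 192.168.4.69 at 2023-07-07 15:11. That session ended at 18:01 on Jul 7, almost a full day before the reboot, so a manual reboot is ruled out.
- Final cause: the customer confirmed that several physical machines in the data center rebooted at the same time, so every virtual machine on those hosts rebooted with them.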
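The reasoning above depends on picking out the sessions that started before the most recent boot. Since `last` prints the newest entries first, those are the login lines that appear below the first `reboot` record. A minimal sketch of that filter follows; the function name `logins_before_last_boot` is an assumption for illustration.

```shell
#!/usr/bin/env bash
# logins_before_last_boot
# Reads `last` output (newest entry first) on stdin and prints the login
# sessions that started before the most recent "reboot" record.
logins_before_last_boot() {
    awk '
        $1 == "reboot" { seen = 1; next }   # skip boot records, remember that one was seen
        $1 == "wtmp"   { next }             # skip the "wtmp begins ..." trailer
        seen && NF     { print }            # login sessions listed below the newest boot record
    '
}

# Usage: show the five most recent sessions that predate the latest boot.
# last -n 50 | logins_before_last_boot | head -n 5
```

The filter only narrows down candidates. The login and logout times still have to be compared with the reboot time by hand, as done above.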
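How to find the reboot cause on cloud hosts
Alibaba Cloud
- 2021-05-26: a host rebooted by itself at 00:27:03. The ECS console showed that the reboot was caused by a system error; Alibaba Cloud support confirmed that the battery module of the physical machine hosting the VM had failed.
Note: log in to the Alibaba Cloud ECS console - select the host - Instance Details - Events - Unexpected O&M events: [Event type] Instance restart due to system error
Huawei Cloud
To be added...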
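Appendix
Common causes of a reboot
- Planned maintenance by an administrator, such as installing an urgent patch or replacing a component
- Hardware failure, such as a faulty power supply or memory module
- A host machine reboot that takes its virtual machines down with it
- A system software fault, such as corrupted system files
Other information
- 2021-01-13: while analyzing a crypto-mining malware script, I found a snippet that destroys the evidence by wiping these logs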
echo 0>/var/spool/mail/root
echo 0>/var/log/wtmp
echo 0>/var/log/secure
echo 0>/var/log/cron
- Search the logs by keyword and show the 4 lines before and after each matching line
more /var/log/syslog
more /var/log/kern.log
grep -n -i "shutting down for system reboot" /var/log/messages #.matching lines with line numbers, case-insensitive
grep -B 4 "shutting down for system reboot" /var/log/messages #.4 lines of log before each match
grep -A 4 "shutting down for system reboot" /var/log/messages #.4 lines of log after each match
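The two points above can be combined into one small helper. `grep -C N` prints N lines of context on both sides of a match, which covers `-B` and `-A` in one call. Separately, in a POSIX shell `echo 0>file` is parsed as a redirection of file descriptor 0, so the target file is truncated to zero length; an existing but empty log is therefore a hint (not proof, since log rotation also creates empty files) that it was wiped. The function names `reboot_context` and `log_looks_wiped` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# reboot_context FILE [PATTERN] [N]
# Prints every line matching PATTERN (case-insensitive) with N lines of
# context on both sides, prefixed with line numbers.
reboot_context() {
    local file="$1"
    local pattern="${2:-shutting down for system reboot}"
    local n="${3:-4}"
    grep -n -i -C "$n" -- "$pattern" "$file"
}

# log_looks_wiped FILE
# Succeeds when FILE exists but is empty.
log_looks_wiped() {
    [ -e "$1" ] && [ ! -s "$1" ]
}

# Usage:
# reboot_context /var/log/messages
# for f in /var/log/wtmp /var/log/secure /var/log/cron; do
#     log_looks_wiped "$f" && echo "empty log: $f"
# done
```

If `/var/log/wtmp` turns out to be empty, the output of `last` cannot be trusted for the login analysis.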