While I was on holiday, my Intel Xeon server running Debian suddenly became unreachable. Very annoying. I received no notifications about power outages and no high-temperature warnings (those get pushed to my phone). Still, in a mild panic, I sent someone over to the server room to check that it (a) wasn't on fire and (b) hadn't been broken into.
Nothing wrong. Since this is a secondary server that mostly crunches logs and numbers from other servers, it could wait for me for a while. The news about Meltdown and Spectre did give me a different scare, but that was more paranoid than realistic for a server that suddenly switches off.
In the Grafana logs I see nothing unusual, only that it stops logging on 2 January at five to ten:

The syslog also stops abruptly, and after the boot on my return it continues with a few recovery messages. I think it's safe to say this really must have been a power outage, although other devices, such as the little clock on the desk, show no signs of one. (Logs below for completeness, excluding predictable boot messages.)
This secondary server is not connected to a UPS. I haven't experienced this before, and I'm a server-suddenly-off-possibly-due-to-a-power-outage noob.
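To make the "syslog just stops" observation a bit more concrete, here is a minimal Python sketch that looks for the longest silence between two consecutive syslog lines. It assumes the traditional `Mon D HH:MM:SS` timestamp format shown in the logs below, an English month abbreviation, and a year that you supply yourself (syslog does not record it; 2018 is taken from the Prometheus lines). The function names `parse_ts` and `largest_gap` are made up for this example.

```python
from datetime import datetime


def parse_ts(line, year):
    """Parse the leading 'Mon D HH:MM:SS' syslog timestamp; return None if absent."""
    parts = line.split()
    if len(parts) < 3:
        return None
    try:
        # The year is an assumption passed in by the caller.
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None


def largest_gap(lines, year):
    """Return (gap, last_line_before, first_line_after) for the longest silence."""
    best = None
    prev_ts = prev_line = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue  # skip continuation lines without a timestamp
        if prev_ts is not None:
            gap = ts - prev_ts
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best


# Three lines copied from the log in this post.
sample = [
    "Jan 2 09:17:01 abcserv CRON[778]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)",
    "Jan 2 09:30:26 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 28 to 29",
    "Jan 8 22:47:29 abcserv systemd-modules-load[271]: Inserted module 'ecryptfs'",
]

gap, before, after = largest_gap(sample, 2018)
print(gap)  # 6 days, 13:17:03
```

On a real system you would feed it the lines of the syslog file instead of `sample`. It only shows where the silence is, not why the machine went quiet.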
- The Enexis Storingenoverzicht (outage overview) lists no reports. Is there a public site where small hiccups in the power supply are logged as well?
- What are best practices for quickly seeing the cause after a crash/shutdown/poweroff/hack?
- Are there any recommended views/dashboards for Prometheus > Grafana that make crash reasons, hacks or kernel panics visible?
- Reboot On Poweroff (BIOS) would have been handy in this case. Does anyone have it enabled? (If the computer shuts itself down because of a rising temperature or the like, that option seems dangerous to me, but it probably doesn't exist for nothing. I don't know whether Reboot On Poweroff keeps track of the reason.)
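As a rough idea for the second question: a clean shutdown usually leaves some trace in the last lines before the next boot, while a power cut leaves none. The sketch below is only a heuristic under that assumption. The marker strings are my own guess at what a clean shutdown might log (systemd "Stopping ..." lines, rsyslogd "exiting on signal 15") and need tuning per system; `looks_clean` and `SHUTDOWN_MARKERS` are hypothetical names, and the "clean" example line is invented for illustration.

```python
# Assumed markers of an orderly shutdown; adjust to what your own logs show.
SHUTDOWN_MARKERS = ("Stopping ", "exiting on signal 15")


def looks_clean(lines_before_boot, tail=20, markers=SHUTDOWN_MARKERS):
    """Return True if any of the last `tail` lines contains a shutdown marker."""
    recent = lines_before_boot[-tail:]
    return any(marker in line for line in recent for marker in markers)


# The last two lines before the gap, copied from the log in this post.
last_lines = [
    "Jan 2 09:17:01 abcserv CRON[778]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)",
    "Jan 2 09:30:26 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 28 to 29",
]

# Hypothetical line, NOT from my logs: what an orderly stop might look like.
hypothetical_clean = "Jan 1 00:00:00 examplehost systemd[1]: Stopping System Logging Service..."

print(looks_clean(last_lines))                         # False
print(looks_clean(last_lines + [hypothetical_clean]))  # True
```

A `False` here fits a power cut, but equally a hard freeze or a kernel panic that never reached the disk, so it narrows things down rather than proving anything.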
code:
Jan 2 06:49:48 abcserv systemd[1]: Starting Daily apt upgrade and clean activities...
Jan 2 06:49:51 abcserv systemd[1]: Started Daily apt upgrade and clean activities.
Jan 2 07:00:31 abcserv smartd[695]: Device: /dev/sdb [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
Jan 2 07:00:36 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 100
Jan 2 07:00:36 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 13 to 100
Jan 2 07:09:46 abcserv dhclient[1437]: DHCPREQUEST of 192.168.1.32 on eno2 to 192.168.1.254 port 67
Jan 2 07:09:46 abcserv dhclient[1437]: DHCPACK of 192.168.1.32 from 192.168.1.254
Jan 2 07:09:46 abcserv systemd[1]: Reloading Samba SMB Daemon.
Jan 2 07:09:46 abcserv systemd[1]: Reloaded Samba SMB Daemon.
Jan 2 07:09:46 abcserv dhclient[1437]: bound to 192.168.1.32 -- renewal in 36552 seconds.
Jan 2 07:17:01 abcserv CRON[23607]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 2 07:30:26 abcserv smartd[695]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 54 to 55
Jan 2 07:30:31 abcserv smartd[695]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 74 to 75
Jan 2 08:00:26 abcserv smartd[695]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 55 to 54
Jan 2 08:00:36 abcserv smartd[695]: Device: /dev/sdc [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
Jan 2 08:00:36 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 65
Jan 2 08:00:36 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 100 to 25
Jan 2 08:17:01 abcserv CRON[29333]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 2 08:30:36 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 65 to 67
Jan 2 08:30:36 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 73
Jan 2 08:30:36 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 27
Jan 2 08:30:36 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 25 to 27
Jan 2 09:00:37 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 67 to 68
Jan 2 09:00:37 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 72
Jan 2 09:00:37 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 27 to 28
Jan 2 09:00:37 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 27 to 28
Jan 2 09:17:01 abcserv CRON[778]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 2 09:30:26 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 68 to 69
Jan 2 09:30:26 abcserv smartd[695]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 28 to 29
Jan 8 22:47:29 abcserv systemd-modules-load[271]: Inserted module 'ecryptfs'
Jan 8 22:47:29 abcserv systemd-modules-load[271]: Inserted module 'coretemp'
Jan 8 22:47:29 abcserv systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 8 22:47:29 abcserv systemd[1]: Started udev Coldplug all Devices.
Jan 8 22:47:29 abcserv systemd[1]: Started Load/Save Random Seed.
Jan 8 22:47:29 abcserv systemd[1]: Started Flush Journal to Persistent Storage.
Jan 8 22:47:29 abcserv systemd[1]: Started Create Static Device Nodes in /dev.
Jan 8 22:47:29 abcserv systemd[1]: Starting udev Kernel Device Manager...
Jan 8 22:47:29 abcserv systemd[1]: Started udev Kernel Device Manager.
Jan 8 22:47:29 abcserv systemd[1]: Reached target Swap.
Jan 8 22:47:29 abcserv systemd-modules-load[271]: Inserted module 'zfs'
Jan 8 22:47:29 abcserv kernel: [ 0.000000] microcode: microcode updated early to revision 0x700000d, date = 2016-10-12
Jan 8 22:47:29 abcserv kernel: [ 0.000000] Linux version 4.9.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23)
Jan 8 22:47:29 abcserv systemd[1]: Started Load Kernel Modules.
Jan 8 22:47:29 abcserv systemd[1]: Starting Apply Kernel Variables...
Jan 8 22:47:29 abcserv systemd[1]: Started Apply Kernel Variables.
Jan 8 22:47:29 abcserv systemd[1]: Started Set the console keyboard layout.
Jan 8 22:47:29 abcserv systemd[1]: Reached target Local File Systems (Pre).
Jan 8 22:47:29 abcserv systemd[1]: Starting File System Check on /dev/disk/by-uuid/780642d0-1ab6-1ab1-99bb-1ab49287542ac...
Jan 8 22:47:29 abcserv systemd[1]: Started File System Check Daemon to report status.
Jan 8 22:47:29 abcserv systemd-fsck[532]: /dev/sda2: recovering journal
Jan 8 22:47:29 abcserv systemd-fsck[532]: /dev/sda2: clean, 24584/4890624 files, 522756/19531264 blocks
Jan 8 22:47:29 abcserv systemd[1]: Started File System Check on /dev/disk/by-uuid/780642d0-1ab6-1ab1-99bb-1ab49287542ac.
Jan 8 22:47:29 abcserv systemd[1]: Mounting /home...
Jan 8 22:47:29 abcserv systemd[1]: Mounted /home.
Jan 8 22:47:29 abcserv systemd[1]: Reached target Local File Systems.
Jan 8 22:47:29 abcserv systemd[1]: Started ifup for eno4.
Jan 8 22:47:29 abcserv systemd[1]: Started Permit User Sessions.
Jan 8 22:47:29 abcserv sensors[724]: i350bb-pci-0500
Jan 8 22:47:29 abcserv sensors[724]: Adapter: PCI adapter
Jan 8 22:47:29 abcserv sensors[724]: loc1: +48.0°C (high = +120.0°C, crit = +110.0°C)
Jan 8 22:47:29 abcserv sensors[724]: coretemp-isa-0000
Jan 8 22:47:29 abcserv sensors[724]: Adapter: ISA adapter
Jan 8 22:47:29 abcserv sensors[724]: Physical id 0: +43.0°C (high = +82.0°C, crit = +104.0°C)
Jan 8 22:47:29 abcserv sensors[724]: Core 0: +43.0°C (high = +82.0°C, crit = +104.0°C)
Jan 8 22:47:29 abcserv sensors[724]: Core 1: +43.0°C (high = +82.0°C, crit = +104.0°C)
Jan 8 22:47:29 abcserv sensors[724]: Core 2: +43.0°C (high = +82.0°C, crit = +104.0°C)
Jan 8 22:47:29 abcserv sensors[724]: Core 3: +43.0°C (high = +82.0°C, crit = +104.0°C)
Jan 8 22:47:29 abcserv systemd[1]: Started Initialize hardware monitoring sensors.
Jan 8 22:47:29 abcserv cron[715]: (CRON) INFO (Running @reboot jobs)
Jan 8 22:47:29 abcserv kernel: [ 10.368782] sda: sda1 sda2 sda3
Jan 8 22:47:29 abcserv kernel: [ 10.370050] sd 0:0:0:0: [sda] Attached SCSI disk
Jan 8 22:47:29 abcserv kernel: [ 10.420626] sdc: sdc1 sdc9
Jan 8 22:47:29 abcserv kernel: [ 10.421374] sd 4:0:0:0: [sdc] Attached SCSI disk
Jan 8 22:47:29 abcserv kernel: [ 10.428766] sdb: sdb1 sdb9
Jan 8 22:47:29 abcserv kernel: [ 10.429695] sd 3:0:0:0: [sdb] Attached SCSI disk
Jan 8 22:47:29 abcserv systemd[1]: Started System Logging Service.
Jan 8 22:47:29 abcserv systemd[1]: Started Login Service.
Jan 8 22:47:29 abcserv systemd[1]: Started LSB: disk temperature monitoring daemon.
Jan 8 22:47:29 abcserv systemd[1]: Started Getty on tty1.
Jan 8 22:47:29 abcserv systemd[1]: Reached target Login Prompts.
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sda [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sda [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Jan 8 22:47:29 abcserv systemd[1]: Started OpenBSD Secure Shell server.
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sda [SAT], no SMART Error Log, ignoring -l error (override with -T permissive)
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sdb, type changed from 'scsi' to 'sat'
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sdb [SAT], opened
Jan 8 22:47:29 abcserv zfs-mount[707]: Mounting ZFS filesystem(s) .
Jan 8 22:47:29 abcserv prometheus[708]: time="2018-01-08T22:47:29+01:00" level=info msg="Starting prometheus (version=1.5.2+ds, branch=debian/sid, revision=1.5.2+ds-2+b3)" source="main.go:75"
Jan 8 22:47:29 abcserv grafana-server[695]: t=2018-01-08T22:47:29+0100 lvl=info msg="Starting Grafana" logger=server version=4.6.2 commit=8db5f08 compiled=2017-11-16T10:19:25+0100
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sdc, type changed from 'scsi' to 'sat'
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sdc [SAT], opened
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sdc [SAT], not found in smartd database.
Jan 8 22:47:29 abcserv systemd[1]: nmbd.service: Supervising process 1228 which is not our child. We'll most likely not notice when it exits.
Jan 8 22:47:29 abcserv exim4[713]: Starting MTA: exim4.
Jan 8 22:47:29 abcserv exim4[713]: ALERT: exim paniclog /var/log/exim4/paniclog has non-zero size, mail system possibly broken
Jan 8 22:47:29 abcserv systemd[1]: Started LSB: exim Mail Transport Agent.
Jan 8 22:47:29 abcserv systemd[1]: Started LSB: Mount ZFS filesystems and volumes.
Jan 8 22:47:29 abcserv systemd[1]: Starting LSB: ZFS Event Daemon...
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sdc [SAT], is SMART capable. Adding to "monitor" list.
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sdd, type changed from 'scsi' to 'sat'
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sdd [SAT], opened
Jan 8 22:47:29 abcserv grafana-server[695]: t=2018-01-08T22:47:29+0100 lvl=info msg="Initializing CleanUpService" logger=cleanup
Jan 8 22:47:29 abcserv grafana-server[695]: t=2018-01-08T22:47:29+0100 lvl=info msg="Initializing Alerting" logger=alerting.engine
Jan 8 22:47:29 abcserv zed[1259]: ZFS Event Daemon 0.7.0-133_g35df0bb55 (PID 1259)
Jan 8 22:47:29 abcserv systemd[1]: Started LSB: Network share ZFS datasets and volumes..
Jan 8 22:47:29 abcserv smartd[711]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 54 to 40
Jan 8 22:47:31 abcserv prometheus[708]: time="2018-01-08T22:47:31+01:00" level=info msg="Loading series map and head chunks..." source="storage.go:373"
Jan 8 22:47:34 abcserv prometheus[708]: time="2018-01-08T22:47:34+01:00" level=warning msg="Persistence layer appears dirty." source="persistence.go:815"
Jan 8 22:47:34 abcserv prometheus[708]: time="2018-01-08T22:47:34+01:00" level=warning msg="Starting crash recovery. Prometheus is inoperational until complete." source="crashrecovery.go:40"
Jan 8 22:47:34 abcserv prometheus[708]: time="2018-01-08T22:47:34+01:00" level=warning msg="To avoid crash recovery in the future, shut down Prometheus with SIGTERM or a HTTP POST to /-/quit." source="crashrecovery.go:41"
Jan 8 22:47:34 abcserv prometheus[708]: time="2018-01-08T22:47:34+01:00" level=info msg="Scanning files." source="crashrecovery.go:55"
Jan 8 22:47:34 abcserv smartd[711]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 75 to 76
Jan 8 22:47:34 abcserv smartd[711]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 69 to 91
Jan 8 22:47:34 abcserv smartd[711]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 31 to 9
Jan 8 22:47:34 abcserv smartd[711]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 54 to 39
Jan 8 22:47:37 abcserv sh[550]: DHCPREQUEST of 192.168.1.32 on eno2 to 255.255.255.255 port 67
Jan 8 22:47:37 abcserv sh[550]: DHCPOFFER of 192.168.1.32 from 192.168.1.254
Jan 8 22:47:37 abcserv dhclient[583]: DHCPREQUEST of 192.168.1.32 on eno2 to 255.255.255.255 port 67
Jan 8 22:47:37 abcserv dhclient[583]: DHCPOFFER of 192.168.1.32 from 192.168.1.254
Jan 8 22:47:37 abcserv sh[550]: DHCPACK of 192.168.1.32 from 192.168.1.254
Jan 8 22:47:37 abcserv dhclient[583]: DHCPACK of 192.168.1.32 from 192.168.1.254
Jan 8 22:47:37 abcserv systemd[1]: Starting Samba SMB Daemon...
Jan 8 22:47:37 abcserv systemd[1]: smbd.service: Supervising process 1501 which is not our child. We'll most likely not notice when it exits.
Jan 8 22:47:37 abcserv systemd[1]: Started Samba SMB Daemon.
Jan 8 22:47:37 abcserv dhclient[583]: bound to 192.168.1.32 -- renewal in 41921 seconds.
Jan 8 22:47:37 abcserv sh[550]: bound to 192.168.1.32 -- renewal in 41921 seconds.
Jan 8 22:47:37 abcserv prometheus[708]: time="2018-01-08T22:47:37+01:00" level=warning msg="Recovered metric node_vmstat_kswapd_inodesteal{instance=\"localhost:9100\", job=\"node\"}, fingerprint 50871399a8ac3990: recovered 95 chunks from series file, recovered 1 chunks from checkpoint." source="crashrecovery.go:364"
Jan 8 22:47:37 abcserv systemd[1]: Reloading OpenBSD Secure Shell server.
Jan 8 22:47:37 abcserv systemd[1]: Reloaded OpenBSD Secure Shell server.
Jan 8 22:47:37 abcserv sh[550]: eno2=eno2
Jan 8 22:47:37 abcserv prometheus[708]: time="2018-01-08T22:47:37+01:00" level=info msg="90000 files scanned." source="crashrecovery.go:77"
Jan 8 22:47:39 abcserv systemd[1]: Reached target Multi-User System.
Jan 8 22:47:39 abcserv systemd[1]: Reached target Graphical Interface.
Jan 8 22:47:39 abcserv systemd[1]: Starting Update UTMP about System Runlevel Changes...
Jan 8 22:47:39 abcserv systemd[1]: Started Update UTMP about System Runlevel Changes.
Jan 8 22:47:39 abcserv prometheus[708]: time="2018-01-08T22:47:39+01:00" level=info msg="140000 files scanned." source="crashrecovery.go:77"
Jan 8 22:47:39 abcserv smartd[711]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 94 to 93
Jan 8 22:47:39 abcserv smartd[711]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 91
Jan 8 22:47:39 abcserv smartd[711]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 9
Jan 8 22:47:40 abcserv smartd[711]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 29 to 28
Jan 8 22:47:40 abcserv smartd[711]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 253
Jan 8 22:47:40 abcserv systemd[1]: Started Daily apt upgrade and clean activities.
Jan 8 22:47:40 abcserv systemd[1]: Startup finished in 14.063s (kernel) + 14.181s (userspace) = 28.244s.
Jan 8 22:47:45 abcserv prometheus[708]: time="2018-01-08T22:47:45+01:00" level=info msg="File scan complete. 283815 series found." source="crashrecovery.go:83"
Jan 8 22:47:45 abcserv prometheus[708]: time="2018-01-08T22:47:45+01:00" level=info msg="Checking for series without series file." source="crashrecovery.go:85"
Jan 8 22:47:45 abcserv prometheus[708]: time="2018-01-08T22:47:45+01:00" level=info msg="Check for series without series file complete." source="crashrecovery.go:130"
Jan 8 22:47:45 abcserv prometheus[708]: time="2018-01-08T22:47:45+01:00" level=info msg="Cleaning up archive indexes." source="crashrecovery.go:402"
NB:
code:
# smartctl -a /dev/sdc | grep ass
SMART overall-health self-assessment test result: PASSED