Hi,
I have been running a Linux installation as my "daily driver" for a while now, and it mostly works well, except that my NVMe drive is causing problems. I get freezes, and it took a while to figure out where they come from: it looks like the drive drops itself off the PCIe bus, or the kernel removes it.
With kernel 4.13 I got messages that the drive was "stuck in D3", which seems to be related to the drive's APST power management. You can disable this with a kernel parameter for the nvme_core module, but that does not seem to work completely.
I have now upgraded everything to kernel 4.17 with a Threadripper patch that rewrites the bridge, so that I can pass my NVIDIA card through to a VM. But even with this new kernel the problem persists.
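To check whether the parameter actually took effect, something like the sketch below can read the value back after boot. It assumes the option in question is `nvme_core.default_ps_max_latency_us` (setting it to 0 is the usual way to disable APST) and that the kernel exposes it under /sys/module. The function names are my own and only for illustration.

```python
from pathlib import Path
from typing import Optional

# Assumed sysfs location of the APST latency limit; a value of 0 disables APST.
APST_PARAM = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def read_apst_max_latency(param_path: Path = APST_PARAM) -> Optional[int]:
    """Return the configured maximum APST latency in microseconds.

    Returns None when the file is missing or does not contain an integer.
    """
    try:
        return int(param_path.read_text().strip())
    except (OSError, ValueError):
        return None


def apst_disabled(param_path: Path = APST_PARAM) -> Optional[bool]:
    """True if APST is disabled, False if enabled, None if unknown."""
    latency = read_apst_max_latency(param_path)
    return None if latency is None else latency == 0


if __name__ == "__main__":
    state = apst_disabled()
    if state is None:
        print("Could not read the nvme_core parameter")
    else:
        print("APST disabled" if state else "APST still enabled")
```

If this reports that APST is still enabled, the parameter probably never reached the kernel command line, which would explain why disabling it appears to have no effect.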
Messages:
code:
14-06-18 09:19 hephaestus kernel [ 1455.458781] pcieport 0000:40:01.2: AER: Multiple Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1455.637663] pcieport 0000:40:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400a(Transmitter ID)
14-06-18 09:19 hephaestus kernel [ 1455.637669] pcieport 0000:40:01.2: device [1022:1453] error status/mask=00001040/00006000
14-06-18 09:19 hephaestus kernel [ 1455.637673] pcieport 0000:40:01.2: [ 6] Bad TLP
14-06-18 09:19 hephaestus kernel [ 1455.637676] pcieport 0000:40:01.2: [12] Replay Timer Timeout
14-06-18 09:19 hephaestus kernel [ 1455.639758] nvme 0000:42:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=4200(Transmitter ID)
14-06-18 09:19 hephaestus kernel [ 1455.639762] nvme 0000:42:00.0: device [144d:a804] error status/mask=00001000/00006000
14-06-18 09:19 hephaestus kernel [ 1455.639765] nvme 0000:42:00.0: [12] Replay Timer Timeout
14-06-18 09:19 hephaestus kernel [ 1455.731922] pcieport 0000:40:01.2: AER: Multiple Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1455.759125] pcieport 0000:40:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400a(Transmitter ID)
14-06-18 09:19 hephaestus kernel [ 1455.759130] pcieport 0000:40:01.2: device [1022:1453] error status/mask=00001040/00006000
14-06-18 09:19 hephaestus kernel [ 1455.759134] pcieport 0000:40:01.2: [ 6] Bad TLP
14-06-18 09:19 hephaestus kernel [ 1455.759137] pcieport 0000:40:01.2: [12] Replay Timer Timeout
14-06-18 09:19 hephaestus kernel [ 1455.759142] nvme 0000:42:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=4200(Transmitter ID)
14-06-18 09:19 hephaestus kernel [ 1455.759145] nvme 0000:42:00.0: device [144d:a804] error status/mask=00001100/00006000
14-06-18 09:19 hephaestus kernel [ 1455.759148] nvme 0000:42:00.0: [ 8] RELAY_NUM Rollover
14-06-18 09:19 hephaestus kernel [ 1455.759151] nvme 0000:42:00.0: [12] Replay Timer Timeout
14-06-18 09:19 hephaestus kernel [ 1455.761207] pcieport 0000:40:01.2: AER: Multiple Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1455.761223] pcieport 0000:40:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400a(Receiver ID)
14-06-18 09:19 hephaestus kernel [ 1455.761225] pcieport 0000:40:01.2: device [1022:1453] error status/mask=00000040/00006000
14-06-18 09:19 hephaestus kernel [ 1455.761228] pcieport 0000:40:01.2: [ 6] Bad TLP
14-06-18 09:19 hephaestus kernel [ 1455.763349] nvme 0000:42:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=4200(Transmitter ID)
14-06-18 09:19 hephaestus kernel [ 1455.763351] nvme 0000:42:00.0: device [144d:a804] error status/mask=00001000/00006000
14-06-18 09:19 hephaestus kernel [ 1455.763354] nvme 0000:42:00.0: [12] Replay Timer Timeout
14-06-18 09:19 hephaestus kernel [ 1455.765489] pcieport 0000:40:01.2: AER: Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1455.765493] pcieport 0000:40:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400a(Transmitter ID)
14-06-18 09:19 hephaestus kernel [ 1455.765495] pcieport 0000:40:01.2: device [1022:1453] error status/mask=00001040/00006000
14-06-18 09:19 hephaestus kernel [ 1455.765498] pcieport 0000:40:01.2: [ 6] Bad TLP
14-06-18 09:19 hephaestus kernel [ 1455.765502] pcieport 0000:40:01.2: [12] Replay Timer Timeout
14-06-18 09:19 hephaestus kernel [ 1455.766774] pcieport 0000:40:01.2: AER: Multiple Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1455.794966] pcieport 0000:40:01.2: can't find device of ID0000
14-06-18 09:19 hephaestus kernel [ 1455.799770] pcieport 0000:40:01.2: AER: Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1455.799775] pcieport 0000:40:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400a(Transmitter ID)
14-06-18 09:19 hephaestus kernel [ 1455.799779] pcieport 0000:40:01.2: device [1022:1453] error status/mask=00001000/00006000
14-06-18 09:19 hephaestus kernel [ 1455.799782] pcieport 0000:40:01.2: [12] Replay Timer Timeout
14-06-18 09:19 hephaestus kernel [ 1459.000838] pcieport 0000:40:01.2: AER: Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1459.000843] pcieport 0000:40:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400a(Transmitter ID)
14-06-18 09:19 hephaestus kernel [ 1459.000848] pcieport 0000:40:01.2: device [1022:1453] error status/mask=00001000/00006000
14-06-18 09:19 hephaestus kernel [ 1459.000851] pcieport 0000:40:01.2: [12] Replay Timer Timeout
14-06-18 09:19 hephaestus kernel [ 1459.099833] pcieport 0000:40:01.2: AER: Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1459.099838] pcieport 0000:40:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400a(Transmitter ID)
14-06-18 09:19 hephaestus kernel [ 1459.099842] pcieport 0000:40:01.2: device [1022:1453] error status/mask=00001000/00006000
14-06-18 09:19 hephaestus kernel [ 1459.099844] pcieport 0000:40:01.2: [12] Replay Timer Timeout
14-06-18 09:19 hephaestus kernel [ 1459.143835] pcieport 0000:40:01.2: AER: Multiple Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1459.215092] pcieport 0000:40:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400a(Receiver ID)
14-06-18 09:19 hephaestus kernel [ 1459.215096] pcieport 0000:40:01.2: device [1022:1453] error status/mask=00000040/00006000
14-06-18 09:19 hephaestus kernel [ 1459.215099] pcieport 0000:40:01.2: [ 6] Bad TLP
14-06-18 09:19 hephaestus kernel [ 1459.324242] nvme 0000:42:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=4200(Transmitter ID)
14-06-18 09:19 hephaestus kernel [ 1459.324246] nvme 0000:42:00.0: device [144d:a804] error status/mask=00001100/00006000
14-06-18 09:19 hephaestus kernel [ 1459.324248] nvme 0000:42:00.0: [ 8] RELAY_NUM Rollover
14-06-18 09:19 hephaestus kernel [ 1459.324251] nvme 0000:42:00.0: [12] Replay Timer Timeout
14-06-18 09:19 hephaestus kernel [ 1459.404020] pcieport 0000:40:01.2: AER: Multiple Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1459.433444] pcieport 0000:40:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400a(Transmitter ID)
14-06-18 09:19 hephaestus kernel [ 1459.433447] pcieport 0000:40:01.2: device [1022:1453] error status/mask=000010c0/00006000
14-06-18 09:19 hephaestus kernel [ 1459.433449] pcieport 0000:40:01.2: [ 6] Bad TLP
14-06-18 09:19 hephaestus kernel [ 1459.433451] pcieport 0000:40:01.2: [ 7] Bad DLLP
14-06-18 09:19 hephaestus kernel [ 1459.433455] pcieport 0000:40:01.2: [12] Replay Timer Timeout
14-06-18 09:19 hephaestus kernel [ 1459.433459] pcieport 0000:40:01.2: AER: Multiple Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1459.468986] pcieport 0000:40:01.2: AER: Multiple Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1459.468996] pcieport 0000:40:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400a(Receiver ID)
14-06-18 09:19 hephaestus kernel [ 1459.468999] pcieport 0000:40:01.2: device [1022:1453] error status/mask=00002040/00006000
14-06-18 09:19 hephaestus kernel [ 1459.469002] pcieport 0000:40:01.2: [ 6] Bad TLP
14-06-18 09:19 hephaestus kernel [ 1459.469005] pcieport 0000:40:01.2: AER: Multiple Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1459.469014] pcieport 0000:40:01.2: can't find device of ID0000
14-06-18 09:19 hephaestus kernel [ 1459.469015] pcieport 0000:40:01.2: AER: Multiple Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1459.469024] pcieport 0000:40:01.2: can't find device of ID0000
14-06-18 09:19 hephaestus kernel [ 1459.469025] pcieport 0000:40:01.2: AER: Corrected error received: id=0000
14-06-18 09:19 hephaestus kernel [ 1459.469033] pcieport 0000:40:01.2: can't find device of ID0000
14-06-18 09:19 hephaestus kernel [ 1459.473850] dpc 0000:40:01.2:pcie008: DPC containment event, status:0x1f0d source:0x4200
14-06-18 09:19 hephaestus kernel [ 1459.473853] dpc 0000:40:01.2:pcie008: DPC ERR_FATAL detected, remove downstream devices
14-06-18 09:19 hephaestus kernel [ 1459.540934] nvme1n1: detected capacity change from 512110190592 to 0
14-06-18 09:19 hephaestus kernel [ 1459.541329] print_req_error: I/O error, dev nvme1n1, sector 462867120
14-06-18 09:19 hephaestus kernel [ 1459.541342] Aborting journal on device nvme1n1p3-8.
14-06-18 09:19 hephaestus kernel [ 1459.541349] Buffer I/O error on dev nvme1n1p3, logical block 57704448, lost sync page write
14-06-18 09:19 hephaestus kernel [ 1459.541351] JBD2: Error -5 detected when updating journal superblock for nvme1n1p3-8.
14-06-18 09:19 hephaestus kernel [ 1459.541376] Buffer I/O error on dev nvme1n1p3, logical block 0, lost sync page write
14-06-18 09:19 hephaestus kernel [ 1459.541381] EXT4-fs error (device nvme1n1p3): ext4_journal_check_start:61: Detected aborted journal
14-06-18 09:19 hephaestus kernel [ 1459.541383] EXT4-fs (nvme1n1p3): Remounting filesystem read-only
14-06-18 09:19 hephaestus kernel [ 1459.541390] EXT4-fs (nvme1n1p3): previous I/O error to superblock detected
14-06-18 09:19 hephaestus kernel [ 1459.541394] Buffer I/O error on dev nvme1n1p3, logical block 0, lost sync page write
14-06-18 09:19 hephaestus kernel [ 1459.541399] EXT4-fs (nvme1n1p3): ext4_writepages: jbd2_start: 9223372036854775807 pages, ino 2113244; err -30
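For anyone reading along, the status/mask pairs in these messages can be decoded by hand. The sketch below assumes the bit layout of the PCIe AER Correctable Error Status register (the bits 6, 7, 8 and 12 match the names the kernel prints above); the function name and the returned structure are my own. Note that the kernel in this log spells bit 8 "RELAY_NUM Rollover", whereas the sketch uses the name "REPLAY_NUM Rollover".

```python
# Bit positions of the PCIe AER Correctable Error Status register.
# The same layout applies to the Correctable Error Mask register.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(status_mask: str) -> dict:
    """Split a 'status/mask' hex pair into reported and masked error names.

    An error counts as 'reported' when its status bit is set and its mask
    bit is clear, and as 'masked' when both bits are set.
    """
    status_hex, mask_hex = status_mask.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)
    reported, masked = [], []
    for bit, name in CORRECTABLE_BITS.items():
        if status & (1 << bit):
            (masked if mask & (1 << bit) else reported).append(name)
    return {"reported": reported, "masked": masked}


if __name__ == "__main__":
    for pair in ("00001040/00006000", "000010c0/00006000", "00002040/00006000"):
        print(pair, decode_correctable(pair))
```

All of these are data link layer errors on the link between the root port 0000:40:01.2 and the drive at 0000:42:00.0. In the 00002040 case the Advisory Non-Fatal bit is set but masked, which is consistent with the log showing only "Bad TLP" for that entry.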
lspci -vt:
code:
-+-[0000:40]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
 |           +-01.1-[41]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
 |           +-01.2-[42]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
 |           +-03.1-[43]--+-00.0  NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]
 |           |            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
 |           +-07.1-[44]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
 |           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
 |           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
 |           +-08.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
 |           \-08.1-[45]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
 |                        \-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
 \-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
             +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
             +-01.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-01.1-[01-07]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 43ba
             |               +-00.1  Advanced Micro Devices, Inc. [AMD] Device 43b6
             |               \-00.2-[02-07]--+-00.0-[03]--
             |                               +-04.0-[04]----00.0  Intel Corporation I211 Gigabit Network Connection
             |                               +-05.0-[05]----00.0  Intel Corporation Device 24fb
             |                               +-06.0-[06]----00.0  Intel Corporation I211 Gigabit Network Connection
             |                               \-07.0-[07]--
             +-02.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-03.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-03.1-[08]--+-00.0  NVIDIA Corporation GP104 [GeForce GTX 1080]
             |            \-00.1  NVIDIA Corporation GP104 High Definition Audio Controller
             +-04.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-07.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-07.1-[09]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
             |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
             |            \-00.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
             +-08.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-08.1-[0a]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
             |            +-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
             |            \-00.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
             +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
             +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
             +-18.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
             +-18.1  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
             +-18.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
             +-18.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
             +-18.4  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
             +-18.5  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
             +-18.6  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric Device 18h Function 6
             +-18.7  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
             +-19.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
             +-19.1  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
             +-19.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
             +-19.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
             +-19.4  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
             +-19.5  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
             +-19.6  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric Device 18h Function 6
             \-19.7  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
Does anyone have an idea what else I can try? I am at the point where I am about to give up on the NVMe drive and just put in a SATA SSD instead. I do not know whether the cause is the Samsung drive, the motherboard chipset, or the way Linux drives it. There are quite a few topics about NVMe and Linux out there, but I have not found a satisfying answer yet.
The nasty part is that sometimes it happens right away, and after the next reboot the machine is stable all day. Those really are the most frustrating kind of faults.
[ Edited 8% by Sandor_Clegane on 17-06-2018 20:55 ]
Less alienation, more cooperation.