#+TITLE: RAM failures: How to detect and fix
#+DATE: <2022-10-26 Wed>
#+LANGUAGE: en

* RAM failures: How to detect and fix

#+BEGIN_abstract
If your system breaks in unexpected ways and you can't understand why,
the cause might be faulty RAM. In this article, I will try to explain
why RAM fails, how to detect it, and how to potentially fix it without
going too deep into details.
#+END_abstract

#+ATTR_HTML: :loading lazy
[[file:../../../public/images/thinkpad_ram_fail.jpg]]

After years of using my laptop, I started running into segfaults and
other problems more and more often. It was not critical, but it was
annoying, and it turned out that the cause was a broken RAM module. I
tested the module and, after confirming that it was broken, removed it
completely. That left me with only 2GiB of RAM, but I came to the
conclusion that 2GiB is enough for my work if I apply some
optimizations to my GNU+Linux system.

** What RAM is used for
RAM holds the code and data of running programs. When you run a binary,
its CPU instructions are loaded into RAM, and the CPU fetches and
executes them from there. While it runs, the program also keeps its
intermediate results, such as variables and data structures, in RAM.
When the program finishes, the memory that held its instructions and
intermediate results is released.

[[[https://en.wikipedia.org/wiki/Random-access_memory][Wikipedia: Random-access memory]]]

** Why RAM can be damaged
I can't say exactly why RAM gets damaged; overheating is probably one of
the factors. There is also a chance that the RAM itself is not damaged
at all. In fact, it seems that the chips themselves rarely break. What
can be "damaged" are the contacts of the RAM module, and that is easy to
fix: take out the RAM modules and clean their contacts with an
eraser. But how do you find out that you have a problem in the first
place?

** How to understand that RAM is broken
The most obvious sign of a RAM failure is [[https://computerhope.com/beep.htm][BIOS signaling]] that it is
broken. The BIOS beeps a special code through the PC speaker; your
motherboard manual explains what each code means. This usually happens
when a RAM module is "completely" broken, in which case the computer
won't start at all.

If the system boots just fine, but you experience problems along the
way, such as random segfaults, program crashes, and kernel panics, you
might have broken segments of RAM. To detect such segments you can use
the programs listed below. Checking RAM is usually not a fast process,
so you will probably need to leave your device running for several
hours.

*** Memtest86+
Memtest86+ is started from the GRUB menu instead of your OS. It has to
run this way because it needs access to the whole range of RAM, and a
running system would occupy part of that range. It runs many different
tests over every segment of your RAM and, while doing so, lists the
broken segments it finds so that you can write them down.

You can install it using the package manager of your GNU+Linux
distribution, such as apt. The package is usually called
~memtest86+~. There is a small caveat, though: old versions don't work
on UEFI systems.

If it doesn't work, you can download a newer version of Memtest86+ from
the official website, write it to a USB stick, and boot from that. It
should work on both UEFI and BIOS systems.

[[[https://memtest.org/][Official Website]]]

*** Memtester
Memtester has the same purpose as Memtest86+, but it runs while your
system is running. Because of that, it can't check the whole RAM range,
only the amount of memory you specify, which is limited by the RAM that
is free at the moment you run the program. It can be installed using
your package manager of choice; the package is called ~memtester~.

[[[https://pyropus.ca./software/memtester/][Official Website]]]

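As a rough sketch of how to pick the amount of memory to test, the
Python snippet below reads ~MemAvailable~ from the contents of
~/proc/meminfo~ and builds a ~memtester~ command line that leaves some
headroom for the running system. The helper names, the 256MiB headroom,
and the single loop are my assumptions for illustration; memtester
itself is invoked as ~memtester <size> [loops]~ and normally needs root
to lock the memory it tests.

#+BEGIN_SRC python
import re


def available_mib(meminfo_text):
    """Return MemAvailable from /proc/meminfo contents, in MiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found")
    return int(match.group(1)) // 1024


def memtester_command(meminfo_text, headroom_mib=256, loops=1):
    """Build a memtester command that tests the available RAM minus headroom."""
    size_mib = available_mib(meminfo_text) - headroom_mib
    if size_mib <= 0:
        raise ValueError("not enough free memory to test")
    return ["memtester", f"{size_mib}M", str(loops)]


# Typical use on a running system (hypothetical, not executed here):
#   with open("/proc/meminfo") as f:
#       print(" ".join(memtester_command(f.read())))
#+END_SRC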
** How to fix broken RAM
#+ATTR_HTML: :loading lazy
[[file:../../../public/images/thinkpad_ram_repair.jpg]]

First of all, if neither Memtest86+ nor memtester shows any errors,
congratulations! You most likely don't have any problems with your RAM.

If the tests show a small number of errors, like one or two, you can
tell the Linux kernel to ignore those segments of RAM, so that programs
never use them and keep working reliably. To do that, use the ~memmap~
kernel argument in your GRUB configuration. Each entry has the form
~SIZE$ADDRESS~ and reserves ~SIZE~ bytes starting at ~ADDRESS~;
hexadecimal values need the ~0x~ prefix. For example:
~memmap=0x100000$0x762ce9c38420,0x100000$0x34e03060,0x100000$0x87fce060,0x100000$0x23c63060,0x100000$0x87b6c060~.
Keep in mind that ~$~ is a special character in the GRUB configuration
file and has to be escaped there. There is also a GRUB configuration
variable called ~GRUB_BADRAM~, but it looks like it is deprecated and
~memmap~ is preferred.

For more details about blacklisting bad segments of RAM, read [[https://unix.stackexchange.com/questions/75059/how-to-blacklist-a-correct-bad-ram-sector-according-to-memtest86-error-indicati][this
comprehensive Unix & Linux Stack Exchange answer]].

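The ~memmap~ value can be assembled mechanically from the failing
addresses reported by the memory tester. The Python sketch below is only
an illustration: the function name and the choice to reserve one
1MiB-aligned region (~0x100000~ bytes, the size used in the example
above) around each address are my assumptions, and the addresses in the
usage comment are made up.

#+BEGIN_SRC python
REGION_SIZE = 0x100000  # 1 MiB per reserved region (assumption)


def memmap_argument(bad_addresses, region_size=REGION_SIZE):
    """Build a memmap= kernel argument reserving a region around each address."""
    # Align every address down to a region boundary and drop duplicates,
    # so several errors inside one region produce a single entry.
    starts = sorted({addr - addr % region_size for addr in bad_addresses})
    entries = [f"{region_size:#x}${start:#x}" for start in starts]
    return "memmap=" + ",".join(entries)


# memmap_argument([0x34e03060, 0x87fce060])
# returns "memmap=0x100000$0x34e00000,0x100000$0x87f00000"
#+END_SRC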
If the tests show a large number of errors, like many thousands, one of
your RAM sticks is probably broken. To find out which one, you can look
at the failing addresses or rerun the test with only specific sticks
installed and see whether the errors are gone.

Be aware that if you are left with a single RAM stick, there is a chance
that the computer will only boot with it in a specific slot. Read your
motherboard manual if something doesn't work.

If you have a RAM stick with tons of errors, you can try to repair it. I
can't tell exactly how it is done or why it is done that way, but you
can find videos on fixing RAM sticks on YouTube and other
resources. Here is [[https://youtu.be/KVR91p-Bd6M][a link to one such video]].

** RAM Optimizations of GNU+Linux system
If your RAM was broken and you are left with much less memory than you
expected, don't rush to buy new RAM sticks. There is a chance that the
system will work completely fine even with less RAM. Linux is pretty
good at running on low-end machines, and it has several ways to handle a
lack of memory. It is common nowadays for people to own devices that are
far more powerful than their tasks require, like a gaming laptop with a
very powerful CPU and GPU that is used mostly to render text in a text
editor.

*** Swap
Swap is a partition (or a file) on your drive that is used when RAM
runs out. It is used for other purposes too, and having it is
recommended on most GNU+Linux systems.

You can configure how readily the Linux kernel uses swap by changing
~swappiness~. You can read about changing that setting, and about swap
in general, at the link below.

[[[https://wiki.archlinux.org/title/swap][Arch Linux Wiki: Swap]]]

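To make the ~swappiness~ setting a bit more concrete, here is a small
Python sketch that reads the current value from
~/proc/sys/vm/swappiness~ and formats the line you would put into a
sysctl configuration file to change it persistently. The function names
are hypothetical, and any value you pass is your own choice rather than
a recommendation; see the wiki page linked above for guidance.

#+BEGIN_SRC python
from pathlib import Path

SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def current_swappiness(path=SWAPPINESS_PATH):
    """Return the kernel's current vm.swappiness value as an integer."""
    return int(Path(path).read_text().strip())


def sysctl_line(value):
    """Format a sysctl configuration entry that sets vm.swappiness."""
    if value < 0:
        raise ValueError("swappiness cannot be negative")
    return f"vm.swappiness = {value}"
#+END_SRC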
*** Zram
Zram creates a compressed block device in RAM that is typically used as
swap, which puts it between plain RAM and disk swap in terms of
performance. It helps your system stay responsive if it swaps a lot, at
the cost of higher CPU usage for compression. I use Zram on a machine
with 2GiB of RAM and 16GiB of swap, and it works great even with many
programs open at the same time (text editor, browser, Docker container,
messenger).

[[[https://wiki.archlinux.org/title/Improving_performance#zram_or_zswap][Arch Linux Wiki: Zram]]]

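To check how much memory zram actually saves, the kernel exposes
per-device statistics in ~/sys/block/zram0/mm_stat~. Its first three
fields are the uncompressed data size, the compressed data size, and the
total memory used by the device, all in bytes. The Python sketch below
computes a compression ratio from them; the device name ~zram0~ and the
helper names are assumptions, and the numbers in the usage comment are
invented.

#+BEGIN_SRC python
def parse_mm_stat(text):
    """Return the first three mm_stat fields (all in bytes) as a dict."""
    fields = [int(x) for x in text.split()]
    if len(fields) < 3:
        raise ValueError("unexpected mm_stat format")
    return {
        "orig_data_size": fields[0],
        "compr_data_size": fields[1],
        "mem_used_total": fields[2],
    }


def compression_ratio(stats):
    """Uncompressed size divided by the memory zram really uses."""
    if stats["mem_used_total"] == 0:
        return 0.0
    return stats["orig_data_size"] / stats["mem_used_total"]


# Typical use (hypothetical, not executed here):
#   with open("/sys/block/zram0/mm_stat") as f:
#       print(compression_ratio(parse_mm_stat(f.read())))
#+END_SRC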
*** Less bloat software
As an alternative, you can simply use less bloated software, so you
don't need so much RAM in the first place. In many cases, good software
doesn't require a lot of RAM, while poorly written software often leaks
memory, so you need many GiBs of RAM to use it comfortably. Among the
most bloated programs are web browsers such as Chromium and Firefox, and
browser-based apps built with Electron, such as Slack, VSCode, and other
proprietary products.

** Conclusions
Now you know what to do when you suspect a RAM failure. This knowledge
is also useful for testing used memory sticks that you buy from someone
else.