You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
当写入到一行 invalidate 状态的 cache line 时便会使用到 store buffer。写如果要继续执行,CPU 需要先发出一条 read-invalid 消息(因为需要确保所有其它缓存了当前内存地址的 CPU 的 cache line 都被 invalidate 掉),然后将写推入到 store buffer 中,当最终 cache line 达到当前 CPU 时再执行这个写操作。
62
+
63
+
CPU 存在 store buffer 的直接影响是,当 CPU 提交一个写操作时,这个写操作不会立即写入到 cache 中。因而,无论什么时候 CPU 需要从 cache line 中读取,都需要先扫描它自己的 store buffer 来确认是否存在相同的 line,因为有可能当前 CPU 在这次操作之前曾经写入过 cache,但该数据还没有被刷入过 cache(之前的写操作还在 store buffer 中等待)。需要注意的是,虽然 CPU 可以读取其之前写入到 store buffer 中的值,但其它 CPU 并不能在该 CPU 将 store buffer 中的内容 flush 到 cache 之前看到这些值。即 store buffer 是不能跨核心访问的,CPU 核心看不到其它核心的 store buffer。
从结果上来讲,memory barrier 是必须的。一个 store barrier 会把 store buffer flush 掉,确保所有的写操作都被应用到 CPU 的 cache。一个 read barrier 会把 invalidation queue flush 掉,也就确保了其它 CPU 的写入对执行 flush 操作的当前这个 CPU 可见。再进一步,MMU 没有办法扫描 store buffer,会导致类似的问题。这种效果对于单线程处理器来说已经是会发生的了。
mesi 协议解决了多核环境下,内存多层级带来的问题。使得 cache 层对于 CPU 来说可以认为是透明的,不存在的。单一地址的变量的写入,可以以线性的逻辑进行理解。
147
162
148
-
Store Buffer:
163
+
但 mesi 协议有两个问题没有解决,一种是 RMW 操作,或者叫 CAS;一种是 ADD 操作。因为这两种操作都需要先读到原始值,进行修改,然后再写回到内存中。
149
164
150
-
当写入到一行 invalidate 状态的 cache line 时便会使用到 store buffer。写如果要继续执行,CPU 需要先发出一条 read-invalid 消息(因为需要确保所有其它缓存了当前内存地址的 CPU 的 cache line 都被 invalidate 掉),然后将写推入到 store buffer 中,当最终 cache line 达到当前 CPU 时再执行这个写操作。
165
+
同时,在 CPU 架构中我们看到 CPU 除了 cache 这一层之外,还存在 store buffer,而 store buffer 导致的内存乱序问题,mesi 协议是解决不了的,这是 memory consistency 范畴讨论的问题。
151
166
152
-
CPU 存在 store buffer 的直接影响是,当 CPU 提交一个写操作时,这个写操作不会立即写入到 cache 中。因而,无论什么时候 CPU 需要从 cache line 中读取,都需要先扫描它自己的 store buffer 来确认是否存在相同的 line,因为有可能当前 CPU 在这次操作之前曾经写入过 cache,但该数据还没有被刷入过 cache(之前的写操作还在 store buffer 中等待)。需要注意的是,虽然 CPU 可以读取其之前写入到 store buffer 中的值,但其它 CPU 并不能在该 CPU 将 store buffer 中的内容 flush 到 cache 之前看到这些值。即 store buffer 是不能跨核心访问的,CPU 核心看不到其它核心的 store buffer。
从结果上来讲,memory barrier 是必须的。一个 store barrier 会把 store buffer flush 掉,确保所有的写操作都被应用到 CPU 的 cache。一个 read barrier 会把 invalidation queue flush 掉,也就确保了其它 CPU 的写入对执行 flush 操作的当前这个 CPU 可见。再进一步,MMU 没有办法扫描 store buffer,会导致类似的问题。这种效果对于单线程处理器来说已经是会发生的了。
172
+
|#LoadLoad|#LoadStore|
173
+
|-|-|
174
+
|#StoreLoad|#StoreStore|
175
+
176
+
具体说明:
177
+
178
+
|barrier name|desc|
179
+
|-|-|
180
+
|#LoadLoad|阻止不相关的 Load 操作发生重排|
181
+
|#LoadStore|阻止 Store 被重排到 Load 之前|
182
+
|#StoreLoad|阻止 Load 被重排到 Store 之前|
183
+
|#StoreStore|阻止 Store 被重排到 Store 之前|
184
+
185
+
需要注意的是,这里所说的重排都是内存一致性范畴,即全局视角的不同变量操作。如果是同一个变量的读写的话,显然是不会发生重排的,因为不按照 program order 来执行程序的话,相当于对用户的程序发生了破坏行为。
159
186
160
187
## lfence, sfence, mfence
161
188
189
+
Intel x86/x64 平台,total store ordering(但未来不一定还是 TSO):
190
+
191
+
mm_lfence("load fence": wait for all loads to complete)
mm_sfence("store fence": wait for all stores to complete)
196
+
197
+
>The Semantics of a Store Fence
198
+
>On all Intel and AMD processors, a store fence has two effects:
199
+
200
+
>Serializes all memory stores. All stores that precede the store fence in program order will become globally observable before any stores that succeed the store fence in program order become globally observable. Note that a store fence is ordered, by definition, with other store fences.
201
+
>In the cycle that follows the cycle when the store fence instruction retires, all stores that precede the store fence in program order are guaranteed to be globally observable. Any store that succeeds the store fence in program order may or may not be globally observable immediately after retiring the store fence. See Section 7.4 of the Intel optimization manual and Section 7.5 of the Intel developer manual Volume 2. Global observability will be discussed in the next section.
202
+
>However, that “SFENCE operation” is not actually a fully store serializing operation even though there is the word SFENCE in its name. This is clarified by the emphasized part. I think there is a typo here; it should read “store fence” not “store, fence.” The manual specifies that while clearing IA32_RTIT_CTL.TraceEn serializes all previous stores with respect to all later stores, it does not by itself ensure global observability of previous stores. It is exactly for this reason why the Linux kernel uses a full store fence (called wmb, which is basically SFENCE) after clearing TraceEn.
203
+
204
+
mm_mfence("mem fence": wait for all memory operations to complete)
> std::memory_order specifies how memory accesses, including regular, non-atomic memory accesses, are to be ordered around an atomic operation. Absent any constraints on a multi-core system, when multiple threads simultaneously read and write to several variables, one thread can observe the values change in an order different from the order another thread wrote them. Indeed, the apparent order of changes can even differ among multiple reader threads. Some similar effects can occur even on uniprocessor systems due to compiler transformations allowed by the memory model.
171
228
172
-
std::memory_order specifies how memory accesses, including regular, non-atomic memory accesses, are to be ordered around an atomic operation. Absent any constraints on a multi-core system, when multiple threads simultaneously read and write to several variables, one thread can observe the values change in an order different from the order another thread wrote them. Indeed, the apparent order of changes can even differ among multiple reader threads. Some similar effects can occur even on uniprocessor systems due to compiler transformations allowed by the memory model.
229
+
> The default behavior of all atomic operations in the library provides for sequentially consistent ordering (see discussion below). That default can hurt performance, but the library's atomic operations can be given an additional std::memory_order argument to specify the exact constraints, beyond atomicity, that the compiler and processor must enforce for that operation.
173
230
174
-
The default behavior of all atomic operations in the library provides for sequentially consistent ordering (see discussion below). That default can hurt performance, but the library's atomic operations can be given an additional std::memory_order argument to specify the exact constraints, beyond atomicity, that the compiler and processor must enforce for that operation.
0 commit comments