FALSE SHARING
Recently, I decided to analyze the speed of some of my C projects to identify areas for improvement. During this process, I came across a concept that caught my attention: False Sharing.
False Sharing is a performance issue that can affect concurrent programs. It occurs when multiple threads access and modify different variables that, coincidentally, share the same cache line. This phenomenon leads to frequent cache invalidations, slowing down the program even if the variables are not logically related.
Technical Context#
To better understand this issue, it is useful to consider how modern processors handle memory:
- Cache hierarchy: Processors have cache levels (L1, L2, L3) that minimize latency when accessing memory.
- Cache lines: Main memory is organized into blocks called cache lines, typically 64 bytes in size (though this can vary by architecture).
- Access and coherence: When a thread accesses a variable, the cache line containing it is loaded into the CPU’s cache. If another thread modifies a different variable within the same cache line, the entire line is invalidated in other caches. This forces it to be reloaded from main memory or the modifying cache, negatively impacting performance.
Example#
#include <stdio.h>
#include <pthread.h>
#include <time.h>

#define ITERATIONS 1000000

struct point {
    int x;
    int y;
} point;

pthread_mutex_t mutex_x = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex_y = PTHREAD_MUTEX_INITIALIZER;

void *
sum_x(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++)
    {
        /* Each thread takes its own lock, so the mutexes do not
           serialize the two threads against each other; point.x and
           point.y still share a cache line. */
        pthread_mutex_lock(&mutex_x);
        point.x++;
        pthread_mutex_unlock(&mutex_x);
    }
    return NULL;
}

void *
sum_y(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++)
    {
        pthread_mutex_lock(&mutex_y);
        point.y++;
        pthread_mutex_unlock(&mutex_y);
    }
    return NULL;
}

int main(void)
{
    pthread_t thread1, thread2;
    for (int iter = 0; iter < ITERATIONS; iter++)
    {
        point.x = 0;
        point.y = 0;
        pthread_create(&thread1, NULL, sum_x, NULL);
        pthread_create(&thread2, NULL, sum_y, NULL);
        pthread_join(thread1, NULL);
        pthread_join(thread2, NULL);
    }
    pthread_mutex_destroy(&mutex_x);
    pthread_mutex_destroy(&mutex_y);
    return 0;
}
- Thread 1 updates point.x while thread 2 updates point.y.
- Even though point.x and point.y are independent variables, they share the same cache line.
- Every time one thread modifies its variable, the cache line is invalidated in the other thread's cache, forcing costly synchronization operations between caches.
This has three main effects:
- Performance degradation: Latency increases due to frequent cache invalidations.
- Higher memory bus usage: Communication between processor cores increases.
- Scaling difficulty: The problem can worsen as more threads are added.
How to Mitigate This Effect?#
There are multiple ways to mitigate False Sharing in C for this specific case. Here are four options, followed by their performance results under different compiler optimization levels:
- Align variables to 64 bytes:
struct point {
    int x;
    int y __attribute__((aligned(64)));
} point;
This solution separates variables into different cache lines, eliminating False Sharing. It is memory-efficient and significantly improves performance.
- Align the entire struct:
struct point {
    int x;
    int y;
} __attribute__((aligned(64))) point;
Aligns the struct as a whole to 64 bytes. While it improves overall access, it does not eliminate False Sharing between x and y, which remain adjacent within the same cache line.
- Add explicit padding:
struct point {
    int x;
    char padding[64];
    int y;
} point;
Physically separates x and y in memory. While it eliminates False Sharing, it increases memory usage and can negatively impact performance.
- Use independent global variables:
int x;
int y;
Avoids the shared struct, but note that two plain adjacent globals may still be placed on the same cache line by the compiler and linker; guaranteed separation also requires explicit alignment (for example, C11 _Alignas(64)). It may not always be practical either.
Performance Analysis#
Using perf, I analyzed the impact of each solution:
perf record -o perf_program1.data ./program
perf report -i perf_program1.data
For diagnosing false sharing specifically, perf also provides a dedicated mode, perf c2c ("cache to cache"), which reports contended cache lines directly.
Base Case#
The perf report looks like this:
8,72% prueba1 libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
7,75% prueba1 [kernel.kallsyms] [k] memset
5,79% prueba1 libc.so.6 [.] __GI___pthread_mutex_unlock_usercnt
3,25% prueba1 [kernel.kallsyms] [k] clear_page_rep
2,78% prueba1 [kernel.kallsyms] [k] unmap_page_range
2,46% prueba1 [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
2,11% prueba1 [kernel.kallsyms] [k] native_write_msr
1,45% prueba1 [kernel.kallsyms] [k] psi_group_change
Case with y Alignment#
Aligning y to 64 bytes eliminates False Sharing:
19,27% prueba2 libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
19,15% prueba2 libc.so.6 [.] __GI___pthread_mutex_unlock_usercnt
10,31% prueba2 [kernel.kallsyms] [k] memset
9,63% prueba2 [kernel.kallsyms] [k] clear_page_rep
5,38% prueba2 [kernel.kallsyms] [k] native_write_msr
1,98% prueba2 [kernel.kallsyms] [k] __list_del_entry_valid
1,29% prueba2 [kernel.kallsyms] [k] copy_process
1,20% prueba2 [kernel.kallsyms] [k] __memcg_kmem_charge_page
1,20% prueba2 [kernel.kallsyms] [k] alloc_vmap_area
1,06% prueba2 [kernel.kallsyms] [k] memcpy
0,85% prueba2 prueba2 [.] sum_x
0,84% prueba2 [kernel.kallsyms] [k] try_charge_memcg
With the cache bottleneck gone, explicit synchronization now dominates resource usage.
Case with Struct Alignment#
Aligning the entire struct does not eliminate False Sharing between x and y, as they remain in the same cache line:
19,74% prueba4 libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
19,24% prueba4 libc.so.6 [.] __GI___pthread_mutex_unlock_usercnt
5,86% prueba4 [kernel.kallsyms] [k] native_write_msr
5,85% prueba4 [kernel.kallsyms] [k] memset
2,59% prueba4 [kernel.kallsyms] [k] clear_page_rep
2,05% prueba4 [kernel.kallsyms] [k] unmap_page_range
1,31% prueba4 [kernel.kallsyms] [k] psi_group_change
1,26% prueba4 [kernel.kallsyms] [k] memcpy
0,95% prueba4 [kernel.kallsyms] [k] copy_process
0,86% prueba4 [kernel.kallsyms] [k] native_read_msr
0,84% prueba4 [kernel.kallsyms] [k] memcg_slab_post_alloc_hook
0,84% prueba4 prueba4 [.] sum_x
0,83% prueba4 [kernel.kallsyms] [k] retbleed_return_thunk
Although it slightly improves overall performance by optimizing global access to the struct, false sharing persists, limiting the gains.
Case with Explicit Padding#
Adding explicit padding eliminates False Sharing but introduces additional costs due to the increased struct size:
22,11% prueba5 libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
21,95% prueba5 libc.so.6 [.] __GI___pthread_mutex_unlock_usercnt
7,58% prueba5 [kernel.kallsyms] [k] native_write_msr
5,12% prueba5 [kernel.kallsyms] [k] memset
2,57% prueba5 [kernel.kallsyms] [k] clear_page_rep
1,50% prueba5 [kernel.kallsyms] [k] copy_process
1,40% prueba5 [kernel.kallsyms] [k] psi_group_change
1,05% prueba5 [kernel.kallsyms] [k] native_read_msr
0,89% prueba5 prueba5 [.] sum_x
0,81% prueba5 [kernel.kallsyms] [k] retbleed_return_thunk
0,76% prueba5 [kernel.kallsyms] [k] ___slab_alloc
0,74% prueba5 [kernel.kallsyms] [k] _raw_spin_lock
The increased memory usage reduces cache data density, penalizing prefetching and limiting improvements.
Case with Independent Variables#
Using independent global variables should eliminate False Sharing, but performance does not improve significantly:
32,54% prueba6 libc.so.6 [.] __GI___pthread_mutex_unlock_usercnt
21,81% prueba6 libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
6,89% prueba6 [kernel.kallsyms] [k] memset
5,90% prueba6 prueba6 [.] sum_y
4,81% prueba6 [kernel.kallsyms] [k] native_write_msr
1,00% prueba6 [kernel.kallsyms] [k] _raw_spin_lock
0,97% prueba6 [kernel.kallsyms] [k] clear_page_rep
This is due to the overhead of explicit synchronization, and possibly because the two unaligned globals still end up on the same cache line.
Conclusion of perf#
The perf results, together with the measured times for the different configurations, reveal how alignment strategies and data layout impact performance. Let us go through each case.
In the base case, the variables x and y are in the same struct without any specific alignment. This causes them to share the same cache line, leading to false sharing when threads concurrently access these variables. The perf report shows significant usage of functions related to synchronization, such as native_queued_spin_lock_slowpath and memset, indicating that threads are constantly invalidating cache lines. This behavior explains the relatively high execution time (123.93 seconds), as false sharing introduces significant overhead.
When the variable y is aligned to 64 bytes, false sharing is eliminated because x and y occupy different cache lines. This results in a significant improvement in execution time (83.21 seconds). The perf report shows that functions like native_queued_spin_lock_slowpath disappear, while pthread_mutex_lock and pthread_mutex_unlock dominate the cycle consumption, reflecting that explicit synchronization is now the main cost. This configuration turns out to be the most efficient, as it eliminates the false sharing bottleneck without introducing significant overhead.
In the struct-alignment case, the perf report shows no native_queued_spin_lock_slowpath or other clear indicators of cache contention; instead, synchronization functions like pthread_mutex_lock and pthread_mutex_unlock are the primary bottlenecks. Even so, aligning the entire struct to 64 bytes does not eliminate the false sharing between x and y, because both variables remain within the same cache line. The alignment slightly improves overall access to the struct, but with the false sharing still present, the gains are limited. Performance is also slightly worse than aligning only the individual variable, likely due to the increased memory usage, which adds pressure on the cache.
When explicit padding is used to separate x and y in memory, false sharing disappears. However, this strategy introduces an additional cost due to the increased size of the struct, which reduces cache data density and complicates prefetching. At -O0 this produces the worst time observed (127.01 seconds), although the picture improves considerably at higher optimization levels. The perf analysis confirms that, while false sharing is eliminated, the time is still dominated by synchronization functions (pthread_mutex_lock and pthread_mutex_unlock), limiting the improvements.
Finally, using independent global variables instead of a struct should, in principle, eliminate false sharing. However, the performance achieved (124.81 seconds) is comparable to the base case. The perf report shows a high proportion of time spent in pthread_mutex_lock and pthread_mutex_unlock, reflecting the overhead of explicit synchronization. There is also a simpler suspect: without explicit alignment, the linker is free to place the two globals next to each other, so they may well still share a cache line, which would explain why this configuration behaves almost exactly like the base case.
Results#
In this section, I show the execution times in seconds for the program with the different variable layouts, over 1,000,000 iterations, at each compiler optimization level.
-O0:
configuration | nº iterations | time (s) |
---|---|---|
Base case | 1000000 | 123.933620 |
Align y | 1000000 | 83.217787 |
Align complete struct | 1000000 | 99.541299 |
Explicit padding (char[64]) | 1000000 | 127.010401 |
Variables outside the struct | 1000000 | 124.819451 |
-O1:
configuration | nº iterations | time (s) |
---|---|---|
Base case | 1000000 | 120.383218 |
Align y | 1000000 | 79.891089 |
Align complete struct | 1000000 | 104.526728 |
Explicit padding (char[64]) | 1000000 | 79.828506 |
Variables outside the struct | 1000000 | 105.029095 |
-O2:
configuration | nº iterations | time (s) |
---|---|---|
Base case | 1000000 | 120.622384 |
Align y | 1000000 | 80.128776 |
Align complete struct | 1000000 | 103.547142 |
Explicit padding (char[64]) | 1000000 | 79.989479 |
Variables outside the struct | 1000000 | 103.280250 |
-O3:
configuration | nº iterations | time (s) |
---|---|---|
Base case | 1000000 | 120.887245 |
Align y | 1000000 | 79.762786 |
Align complete struct | 1000000 | 103.815620 |
Explicit padding (char[64]) | 1000000 | 79.586391 |
Variables outside the struct | 1000000 | 101.085188 |
Why do optimizations from -O1 onward improve execution time?#
When a program is compiled with optimizations starting from -O1, the compiler begins to make fundamental adjustments that significantly improve performance. At this level, redundant calculations are eliminated, instructions are reorganized to better utilize CPU resources, and memory accesses are optimized, reducing latency. These automatic changes complement manual optimizations, such as data alignment, by allowing the hardware to operate more efficiently, especially in cases where problems like false sharing have been mitigated.
At higher levels of optimization, such as -O2 and -O3, the compiler introduces more advanced techniques, such as vectorization, aggressive branch elimination, and function inlining, which further reduce execution overhead. However, the impact on execution times stabilizes, as the main bottleneck (such as explicit synchronization in multithreaded programs) cannot always be eliminated with code optimization alone. This highlights the importance of combining careful data design with the compiler's capabilities to achieve the best possible performance.
Final Reflection#
Performance in concurrent applications is deeply linked to how data is structured and accessed in memory. As we have seen, issues like False Sharing can arise from small oversights in data structure design but have a disproportionate impact on performance. Fortunately, with a methodical approach and tools like perf, it is possible to identify these issues and address them with simple but effective optimizations.
This article reflects my current understanding of the topic and my intention to share useful knowledge. However, if there is anything that can be improved, corrected, or expanded, I would be happy to receive your feedback. Feel free to contact me via email to share your comments, experiences, or any corrections you consider necessary.
Glossary of terms:#
- False Sharing: Occurs when multiple threads access different variables that share the same cache line. This causes unnecessary conflicts in the cache, degrading performance.
- Cache Line: The smallest unit of data that the CPU cache exchanges with main memory. Its size is typically 64 bytes in most modern processors.
- Data Alignment: A technique that organizes data in memory so that it matches the cache line boundaries, minimizing conflicts between threads.
- Padding: The process of adding extra bytes between elements of a data structure to prevent them from sharing cache lines.
- Compiler Optimization: Automatic adjustments made by the compiler at different levels (-O1, -O2, -O3) to improve a program's performance without altering its functionality.
- Prefetching: A technique used by processors to load data into the cache before it is needed, reducing access latency.
- Explicit Synchronization: The use of primitives like mutexes or semaphores to coordinate access to shared resources between threads.
- Cache Data Density: The amount of useful data that can be stored in the cache simultaneously; it affects overall performance in systems that depend on memory access speed.