Overview
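TLS (Thread Local Storage) gives each thread its own independent storage that is not shared with the other threads of the process: the same TLS variable refers to a different storage location in each thread. Implementing it requires support from the programming language, the compiler, and the linker.
Declaring TLS Variables
In a high-level language, a variable is declared as a TLS variable in a multi-threaded program. Modifications made by one thread do not affect the other threads, and the variable can be read and written in ordinary assignment statements just like a regular variable.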
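The behaviour described above can be illustrated with a minimal C++ sketch. The names `counter`, `bump`, and `bump_in_new_thread` are hypothetical and exist only for this example; it assumes a C++11 toolchain with thread support.

```cpp
#include <cassert>
#include <thread>

// Each thread gets its own copy of `counter`, starting at 0.
thread_local int counter = 0;

// Increments the calling thread's copy `n` times and returns its value.
int bump(int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        ++counter;
    }
    return counter;
}

// Runs bump(n) on a fresh thread and returns the value that thread observed.
int bump_in_new_thread(int n) {
    int seen = -1;
    std::thread worker([&seen, n] { seen = bump(n); });
    worker.join();
    return seen;
}
```

Calling `bump` on the main thread and then `bump_in_new_thread` shows that the new thread starts from its own zero-initialized copy and leaves the main thread's copy unchanged.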
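Dynamic Management of Threads
Apart from the main thread, threads are created and destroyed while the program runs. When a shared object is loaded, there is no way to know when threads will be created or how many there will be, so libc has to manage TLS variables dynamically. Because each thread accesses a TLS variable at a different address, libc must manage the address space of TLS variables and resolve their addresses at run time.
The compiler and linker place all TLS variables of a dynamic library in a single TLS program segment, so the offset of each TLS variable within that segment is fixed.
TLS Access Models
To make TLS access fast, four access models are used. In the order listed, each is faster than the previous one but applicable in fewer situations.
General Dynamic (GD): the general model. Every access to a TLS variable needs a function call to obtain its address. The variable may be referenced across dynamic libraries.
Local Dynamic (LD): the local model. When a function references several TLS variables, one function call obtains the address of the TLS program segment, and each variable's address is that base plus the variable's offset. The variable can only be referenced from within its own dynamic library.
Initial Exec (IE): indirect addressing through the thread pointer register and a TLS variable offset. Usable for variables in dynamic libraries that are loaded statically when the main program starts.
Local Exec (LE): direct addressing through the thread pointer register and a TLS variable offset. Usable only in the main program.
There is also TLSDESC, an optimized form of GD. Its main gains are faster access to TLS variables in the static TLS block, fewer registers clobbered by the function call, and fewer atomic operations.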
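How the Compiler Accesses TLS Variables
Compilers generally select among the five access methods with two options, tls-dialect and tls-model:
Code block |
---|
-mtls-dialect=trad -ftls-model=global-dynamic : GD
-mtls-dialect=trad -ftls-model=local-dynamic : LD
-mtls-dialect=trad -ftls-model=initial-exec : IE
-mtls-dialect=trad -ftls-model=local-exec : LE (main program only; not supported in dynamic libraries)
-mtls-dialect=desc : TLSDESC |
Note: these are the aarch64 options. On x86_64 the tls-dialect option takes different values: gnu and gnu2, corresponding to trad and desc respectively.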
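Besides the command-line options, GCC and Clang also accept a per-variable `tls_model` attribute. The sketch below is an illustration, not part of the original text: the variable names are hypothetical, and it assumes the code is linked into an executable, since local-exec is not valid in a dynamic library.

```cpp
#include <cassert>

// Per-variable counterpart of -ftls-model (GCC/Clang extension).
// Variable names are hypothetical; each one requests a different access model.
__thread int gd_var __attribute__((tls_model("global-dynamic"))) = 1;
__thread int ld_var __attribute__((tls_model("local-dynamic"))) = 2;
__thread int ie_var __attribute__((tls_model("initial-exec"))) = 3;
__thread int le_var __attribute__((tls_model("local-exec"))) = 4;

// Reads all four variables; the generated access sequence differs per model,
// but the observable values are the same.
int sum_tls_vars() {
    assert(le_var == 4);  // initial value comes from the TLS initialization image
    return gd_var + ld_var + ie_var + le_var;
}
```

When such code is linked into an executable, the linker is free to relax the more general models to cheaper ones, so the disassembly may not show four distinct sequences.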
TLS Variable Specifiers in C/C++
Specifier | Notes |
---|---|
__thread | Non-standard, but ubiquitous in GCC and Clang; cannot have dynamic initialization or destruction |
_Thread_local | A keyword standardized in C11; cannot have dynamic initialization or destruction |
thread_local | C11: a macro for _Thread_local via threads.h. C++11: a keyword, allows dynamic initialization and/or destruction |
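To make the difference in the table concrete, here is a small single-threaded sketch, assuming GCC or Clang in C++11 mode. `Tracker`, `plain_counter`, and `tracker_name_length` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// __thread accepts only constant initialization.
__thread int plain_counter = 42;

// C++11 thread_local also allows a type with a constructor and destructor.
struct Tracker {
    static int constructed;  // counts constructor runs (not thread-safe; demo only)
    std::string name;
    Tracker() : name("worker") { ++constructed; }
};
int Tracker::constructed = 0;

// Dynamically initialized once per thread, no later than the thread's first use.
thread_local Tracker tracker;

std::size_t tracker_name_length() {
    assert(!tracker.name.empty());
    return tracker.name.size();
}
```

Declaring `tracker` with `__thread` instead would be rejected by the compiler, because `Tracker` needs its constructor to run at run time.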
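Introduction to TLS Data Structures
Based on the different scenarios of statically and dynamically loaded shared objects, Drepper describes two memory-layout implementations whose basic principles are similar.
A dynamic thread vector, named dtv, is defined. Each element points to the location, in the current thread, of one dynamic library's TLS program segment contents. The array can grow dynamically.
A static block of contiguous memory, called the static TLS block, is allocated. Once allocated it cannot grow or shrink. Most of it stores TLS segment contents, and a small part stores the Thread Control Block (TCB), which points to the dtv array.
For each dynamically loaded library, a block of memory called a dynamic TLS block is allocated for the library's TLS program segment, and a dtv element points to it.
A dedicated CPU register holds the thread pointer (tp), for example the %fs segment register on x86_64 or the tpidr_el0 system register on aarch64.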
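The relationship between the TCB, the dtv, and the two kinds of TLS block can be modelled with a toy C++ sketch. Every type and function name here (`ToyDtv`, `ToyTcb`, `ToyThread`, `toy_tls_address`) is hypothetical and does not match any real libc; the real structures live in raw memory rather than in `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy dtv: one pointer per module, pointing at that module's TLS block
// for the current thread.
struct ToyDtv {
    std::vector<char*> modules;
};

// Toy TCB: holds the pointer to the dtv.
struct ToyTcb {
    ToyDtv* dtv = nullptr;
};

// One thread's TLS state: a fixed-size static block plus one dynamic block.
struct ToyThread {
    std::vector<char> static_block;
    std::vector<char> dynamic_block;
    ToyDtv dtv;
    ToyTcb tcb;

    ToyThread(std::size_t static_size, std::size_t dynamic_size)
        : static_block(static_size), dynamic_block(dynamic_size) {
        dtv.modules = {static_block.data(), dynamic_block.data()};
        tcb.dtv = &dtv;
    }
    ToyThread(const ToyThread&) = delete;
    ToyThread& operator=(const ToyThread&) = delete;
};

// Address of a TLS variable = start of its module's block + its fixed offset.
char* toy_tls_address(const ToyTcb& tcb, std::size_t module_idx, std::size_t offset) {
    assert(module_idx < tcb.dtv->modules.size());
    return tcb.dtv->modules[module_idx] + offset;
}
```

In this model, module 0 stands for a library placed in the static TLS block and module 1 for a library loaded later with its own dynamic TLS block.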
...
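TLS Data Layout
This section shows the aarch64 TLS data layout for both GNU and bionic. The GNU implementation resembles the structure in Figure 1. The bionic implementation defines the TCB differently and is otherwise similar.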
...
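Figure 3 shows the physical layout of a bionic thread stack, with addresses increasing from top to bottom. The static TLS block is allocated on the thread stack. The TCB has the size of the bionic_tcb structure rather than 16 bytes, and the static TLS block is immediately adjacent to the bionic_tcb structure.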
...
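Figure 4 shows the physical layout of a GNU thread stack, with addresses increasing from bottom to top. The static TLS block is allocated on the thread stack. The TCB has the size of the tcbhead_t structure (16 bytes), and the static TLS block is immediately adjacent to the tcbhead_t structure.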
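Static TLS Space
This involves the static TLS space, the TCB management space (bionic_tcb), the thread pointer (tp), and the dynamic thread vector (dtv).
Static TLS space: comprises the TLS data area and the TLS management area in one contiguous block of memory, usually allocated on the thread stack. Each dynamic library's tls_offset is computed relative to this space.
Thread pointer: stored in a dedicated register. In variant 1, libc and the compiler must agree on the space occupied by the TCB so that Local Exec (LE) accesses work. For example, bionic's TCB occupies 64 bytes, so the compiler must add an offset of 64 when it generates instructions that access the main program's TLS variables. Variant 2 fixes the offset at 0, so there is no such requirement.
TCB management space: records the thread pointer, the dynamic thread vector, thread-local data management addresses, and so on. On aarch64, bionic defines it as an array of nine 8-byte slots, 72 bytes in total. That is 8 bytes more than the offset between the thread pointer and the first TLS variable, so the thread pointer points at the second slot. The TCB management space is also subject to alignment, so its start address may differ from the start of the static TLS space; this may be specific to bionic's implementation.
Dynamic thread vector: records where each dynamic library's TLS program segment starts. Its address is stored in the space the thread pointer refers to.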
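How offsets relative to the static TLS space might be assigned can be sketched as a simple aligned bump allocation. `ToyStaticLayout` and `align_up` are hypothetical names for this illustration; the sketch only assumes that each reservation has a size and a power-of-two alignment.

```cpp
#include <cassert>
#include <cstddef>

// Rounds `value` up to the next multiple of `align` (a power of two).
constexpr std::size_t align_up(std::size_t value, std::size_t align) {
    return (value + align - 1) & ~(align - 1);
}

// Toy layout calculator: hands out offsets inside one contiguous block.
class ToyStaticLayout {
public:
    // Reserves `size` bytes at alignment `align`; returns the offset from
    // the start of the block.
    std::size_t reserve(std::size_t size, std::size_t align) {
        assert(align != 0 && (align & (align - 1)) == 0);
        const std::size_t offset = align_up(size_, align);
        size_ = offset + size;
        return offset;
    }

    // Total bytes needed for everything reserved so far.
    std::size_t size() const { return size_; }

private:
    std::size_t size_ = 0;
};
```

Reserving a 72-byte management area first and then two TLS segments yields the per-segment offsets and the total block size, which is the same kind of bookkeeping that a per-library tls_offset depends on.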
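bionic TLS Data Structure Initialization
The dynamic linker initializes the TLS data structures in two phases: while loading the main program (statically loaded libraries), and when the running program calls dlopen (dynamically loaded libraries).
TLS Initialization for Statically Loaded Libraries
While loading the main program, the dynamic linker uses two global variables, of types StaticTlsLayout and TlsModules, to initialize the TLS data structures.
The StaticTlsLayout variable computes the offset of each dependency's TLS program segment within the static TLS block.
The TlsModules variable keeps track of all dynamic libraries that have a TLS program segment.
During main-program loading, all TLS program segments are stored in the static TLS block. The steps are: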
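During linker initialization, call __libc_init_main_thread_early to initialize bionic_tcb and tpidr_el0, and call __init_tcb_dtv to initialize TLS_SLOT_DTV in bionic_tcb with a generation of 0.
Call linker_setup_exe_static_tls (assuming the main program has a TLS program segment):
a. Reserve space and a position in the StaticTlsLayout variable for the bionic_tcb object and the TLS segment.
b. Call register_tls_module to obtain a module id and add the module to the TlsModules variable.
c. Reserve space and a position for the bionic_tls object.
Load the main program's dependencies. For each one that has a TLS program segment, call soinfo::register_soinfo_tls to reserve TLS space and a position in the StaticTlsLayout variable and to add the library to the TlsModules variable.
Call linker_finalize_static_tls to compute the total size reserved in the StaticTlsLayout variable, which is the size of the static TLS block.
Call __allocate_thread_mapping to allocate memory for the static TLS block, on the main program's stack. The static TLS block contains bionic_tcb, bionic_tls, and the TLS program segments.
Call __init_static_tls to copy the TLS segment contents of the libraries in the TlsModules variable into the static TLS block.
Call bionic_tcb::copy_from_bootstrap to carry over the bionic_tcb contents initialized in the first step.
Call __init_tcb to update bionic_tcb.
Call __init_bionic_tls_ptrs to update the bionic_tls address.
Call __set_tls to store the static TLS block address in the tpidr_el0 register.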
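Process relocations to initialize the GOT entries of TLS variables:
GD: the relocation types are R_AARCH64_TLS_DTPMOD64 and R_AARCH64_TLS_DTPREL64. They initialize two adjacent GOT entries to the module id and to the variable's offset within its TLS segment, respectively.
LD: on aarch64, LD is implemented the same way as GD. On x86_64 it differs: the relocation type is R_X86_64_DTPMOD64, and the two adjacent GOT entries are initialized to the module id and an offset of 0.
IE: the relocation type is R_AARCH64_TLS_TPREL64. The GOT entry is initialized to the variable's offset in the static TLS block.
LE: there are no relocations or GOT entries, so nothing needs to be initialized.
TLSDESC: the relocation type is R_AARCH64_TLSDESC. The two adjacent GOT entries are initialized to the address of tlsdesc_resolver_static and the variable's offset in the static TLS block.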
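As the flow above shows, the dtv array is still empty after initialization; it is only created when a TLS variable is accessed. This lazy allocation has the following benefits:
For IE and LE accesses, the address of a TLS variable is computed directly from the thread pointer register and an offset, with no dtv lookup.
For GD and LD accesses, the address is obtained by calling __tls_get_addr, which allocates the dtv array according to the number of TLS libraries when it finds the dtv empty.
For TLSDESC accesses, tlsdesc_resolver_static returns the variable's offset in the static TLS block directly, so no dtv array needs to be allocated either.
As these cases show, TLSDESC is a significant optimization over GD for TLS variables in the static TLS block: TLSDESC simply returns the static-TLS-block offset stored in the GOT entry, whereas GD has to go through the dtv array to compute the address.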
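TLS Initialization for Dynamically Loaded Libraries
When bionic loads a library with dlopen, TLS is initialized as follows:
In the do_dlopen -> find_library -> soinfo::register_soinfo_tls -> register_tls_module flow, a module id is obtained and the library is added to the TlsModules variable.
In the soinfo::relocate -> plain_relocate -> plain_relocate_impl -> process_relocation -> process_relocation_impl flow, relocations initialize the GOT entries of TLS variables:
GD: the relocation types are R_AARCH64_TLS_DTPMOD64 and R_AARCH64_TLS_DTPREL64. They initialize two adjacent GOT entries to the module id and to the variable's offset within its TLS segment, respectively.
LD: on aarch64, LD is implemented the same way as GD.
IE/LE: not supported.
TLSDESC: the relocation type is R_AARCH64_TLSDESC. The two adjacent GOT entries are initialized to the address of tlsdesc_resolver_dynamic and the address of a TlsDynamicResolverArg variable.
a. The TlsIndex inside TlsDynamicResolverArg is initialized with the module id and the offset, where the offset is the variable's offset within its TLS program segment. The generation field is initialized to the library's generation, which indicates whether the library has been updated.
b. The TlsDynamicResolverArg variables are stored in the soinfo::tlsdescargs array. Because that array may be reallocated as it grows, Relocator::deferred_tlsdesc_relocs buffers the relocation information, and the GOT entries of the TLS variables are written only after all relocations of the library have been processed.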
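TLS Initialization During Thread Creation
Creating a thread with pthread_create requires copying all of the main program's TLS data structures (pthread_create -> __allocate_thread):
Call __allocate_thread_mapping to allocate the thread stack, which includes the static TLS block. (Allocate in order: stack guard, stack, static TLS, guard page)
Call __init_static_tls to copy the TLS segment contents of the libraries in the TlsModules variable into the static TLS block.
Call __init_tcb to update bionic_tcb.
Call __init_tcb_dtv to initialize TLS_SLOT_DTV in bionic_tcb with a generation of 0.
Call __init_bionic_tls_ptrs to update the bionic_tls address.
Call clone, passing it the static TLS block address; the kernel sets the tpidr_el0 register.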
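Implementation of __tls_get_addr
The GD and LD access models call __tls_get_addr to obtain the absolute address of a TLS variable. The function may have to update the dtv, and whether it does is controlled by three generation counters:
Global generation: stored in __libc_tls_generation_copy, a copy of TlsModules::generation. It is incremented each time a library with a TLS program segment is added. Library removal does not need to be handled.
dtv generation: stored in the first element of the dtv array and initialized to 0. Each time the dtv is updated, it is set to the global generation at that moment. If it differs from the global generation, new libraries have been loaded and the dtv contents must be updated.
Library generation: stored in TlsModule::first_generation and initialized to the global generation at the time the library was loaded. It is used to decide whether the library a dtv element points to has changed, that is, whether the element refers to a stale library.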
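Code block |
---|
struct TlsIndex {
  size_t module_id;
  size_t offset;
};
// ti lives in the dynamic library's GOT. It is initialized during relocation
// and occupies two GOT entries.
extern "C" void* __tls_get_addr(const TlsIndex* ti) {
  // Get the dtv array.
  TlsDtv* dtv = __get_tcb_dtv(__get_bionic_tcb());
  // Load the global generation.
  size_t generation = atomic_load(&__libc_tls_generation_copy);
  if (__predict_true(generation == dtv->generation)) {
    void* mod_ptr = dtv->modules[__tls_module_id_to_idx(ti->module_id)];
    if (__predict_true(mod_ptr != nullptr)) {
      // Fast path: no library change and the block is already allocated,
      // so return the address of the TLS variable.
      return static_cast<char*>(mod_ptr) + ti->offset + TLS_DTV_OFFSET;
    }
    // The library's dynamic TLS block is allocated lazily, on the first access
    // to one of its TLS variables; fall through to the slow path.
  }
  // A library changed or this is the first access: allocate and initialize
  // the dtv and the dynamic TLS block.
  return tls_get_addr_slow_path(ti);
} |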
The tls_get_addr_slow_path function allocates and initializes the dtv and the dynamic TLS block.
Code block |
---|
__attribute__((noinline)) static void* tls_get_addr_slow_path(const TlsIndex* ti) {
  TlsModules& modules = __libc_shared_globals()->tls_modules;
  bionic_tcb* tcb = __get_bionic_tcb();
  ScopedSignalBlocker ssb;
  // Take the write lock so that threads cannot modify the global
  // __libc_shared_globals()->tls_modules concurrently.
  ScopedWriteLock locker(&modules.rwlock);
  // Update the dtv array, reallocating it if necessary.
  update_tls_dtv(tcb);
  TlsDtv* dtv = __get_tcb_dtv(tcb);
  const size_t module_idx = __tls_module_id_to_idx(ti->module_id);
  void* mod_ptr = dtv->modules[module_idx];
  if (mod_ptr == nullptr) {
    // No block yet: allocate memory, copy the library's TLS segment contents
    // into it, and set the module pointer.
    const TlsSegment& segment = modules.module_table[module_idx].segment;
    mod_ptr = __libc_shared_globals()->tls_allocator.memalign(segment.alignment, segment.size);
    if (segment.init_size > 0) {
      memcpy(mod_ptr, segment.init_ptr, segment.init_size);
    }
    dtv->modules[module_idx] = mod_ptr;
    // Reports the allocation to the listener, if any.
    if (modules.on_creation_cb != nullptr) {
      modules.on_creation_cb(mod_ptr, static_cast<void*>(static_cast<char*>(mod_ptr) + segment.size));
    }
  }
  return static_cast<char*>(mod_ptr) + ti->offset + TLS_DTV_OFFSET;
} |
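update_tls_dtv allocates and updates the dtv array dynamically. The code is too long to list here; its steps are:
Condition for updating the dtv: the dtv generation differs from the global generation, which means libraries have changed and the dtv may point at stale libraries.
Reallocate the dtv array if the number of libraries with a TLS program segment exceeds the size of the dtv array:
a. Allocate a new dtv array sized for the current number of libraries.
b. Copy the contents of the old dtv array into the new one.
c. Call __set_tcb_dtv to switch to the new dtv array.
d. To keep the operation lock-free, do not free the old dtv array. Put it on a garbage list instead, to be reclaimed when the program exits.
Refresh the dtv elements of the libraries that live in the static TLS block.
Refresh the dtv elements of the libraries that use dynamic TLS blocks, on the condition that the library's generation is greater than the dtv generation (the dtv element points at a stale library):
a. Free the stale library's dynamic TLS block.
b. Zero the dtv element.
Set the dtv generation to the global generation.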
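The steps above can be sketched with a toy model. This is not bionic's code: all types and names (`ToyModule`, `ToyDtv`, `ToyGlobals`, `ToyTcb`, `toy_update_dtv`) are hypothetical, locking is omitted, and dynamic blocks are assumed to come from `malloc` so that `std::free` can release them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct ToyModule {
    std::size_t first_generation;  // global generation when the library was loaded
    void* static_block;            // non-null if the library lives in the static TLS block
};

struct ToyDtv {
    std::size_t generation = 0;
    std::vector<void*> modules;
};

struct ToyGlobals {
    std::size_t generation = 0;
    std::vector<ToyModule> module_table;
    std::vector<ToyDtv*> garbage;  // retired dtv arrays, kept until program exit
};

struct ToyTcb {
    ToyDtv* dtv;
};

void toy_update_dtv(ToyGlobals& g, ToyTcb& tcb) {
    assert(tcb.dtv != nullptr);
    ToyDtv* dtv = tcb.dtv;
    if (dtv->generation == g.generation) return;  // nothing changed

    // Grow the dtv if there are more TLS modules than slots.
    if (g.module_table.size() > dtv->modules.size()) {
        ToyDtv* bigger = new ToyDtv(*dtv);          // copy the old contents
        bigger->modules.resize(g.module_table.size(), nullptr);
        tcb.dtv = bigger;                           // switch to the new dtv
        g.garbage.push_back(dtv);                   // retire the old one, do not free it
        dtv = bigger;
    }

    for (std::size_t i = 0; i < g.module_table.size(); ++i) {
        const ToyModule& m = g.module_table[i];
        if (m.static_block != nullptr) {
            dtv->modules[i] = m.static_block;       // refresh static-TLS entries
        } else if (m.first_generation > dtv->generation) {
            std::free(dtv->modules[i]);             // stale dynamic block (may be null)
            dtv->modules[i] = nullptr;              // reallocated lazily on next access
        }
    }
    dtv->generation = g.generation;
}
```

In this sketch a slot that is cleared stays null until the next access allocates a fresh dynamic TLS block, which mirrors the lazy allocation described for the slow path.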
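Implementation of TLSDESC Access
TLSDESC obtains the offset of a TLS variable relative to the static TLS block in one of two ways: tlsdesc_resolver_static handles TLS variables in the static TLS block, and tlsdesc_resolver_dynamic handles TLS variables in dynamic TLS blocks. Both are written in assembly and do not follow the C/C++ calling convention for argument registers. Instead, the argument is passed in the register the convention uses for return values: x0 on aarch64, rax on x86_64.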
Code block |
---|
struct TlsIndex {
  size_t module_id;
  size_t offset;
};
/* Type used to represent a TLS descriptor in the GOT. */
struct TlsDescriptor {
  TlsDescResolverFunc* func;
  size_t arg;
};
// For tlsdesc_resolver_static, TlsDescriptor::arg is the relative offset of the TLS variable.
// For tlsdesc_resolver_dynamic, TlsDescriptor::arg is the address of a TlsDynamicResolverArg.
struct TlsDynamicResolverArg {
  size_t generation;
  TlsIndex index;
}; |
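tlsdesc_resolver_static is very simple: it returns TlsDescriptor::arg.
tlsdesc_resolver_dynamic refines the generation check and works with four generation counters:
Global generation: stored in __libc_tls_generation_copy, a copy of TlsModules::generation. It is incremented each time a library with a TLS program segment is added. Library removal does not need to be handled.
dtv generation: stored in the first element of the dtv array, dtv[0], and initialized to 0. Each time the dtv is updated, it is set to the global generation at that moment. If it differs from the global generation, new libraries have been loaded and the dtv contents must be updated.
Library generation: stored in TlsModule::first_generation and initialized to the global generation at the time the library was loaded. It is used to decide whether the library a dtv element points to has changed, that is, whether the element refers to a stale library.
GOT generation: TlsDynamicResolverArg::generation, referenced from the TLS variable's GOT entry and initialized to TlsModule::first_generation. As long as it is not greater than the dtv generation, the dtv does not need to be reallocated; otherwise the library's TLS program segment has not yet been set up in the dtv.
The steps of tlsdesc_resolver_dynamic are:
Fast path, taken when TlsDynamicResolverArg::generation <= dtv[0] && dtv[mod_id] != NULL:
a. Return the offset of dtv[mod_id] + TlsDynamicResolverArg::TlsIndex::offset relative to the static TLS block.
Slow path: call __tls_get_addr to obtain the absolute address of the TLS variable, and return its offset relative to the static TLS block.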
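The two paths can be sketched in C++ as follows. The real resolvers are written in assembly; this toy version uses hypothetical names (`ToyTlsIndex`, `ToyResolverArg`, `ToyDtv`, `toy_resolve_dynamic`), takes the slow path as a caller-supplied function standing in for __tls_get_addr, and assumes the returned offset is measured from a base pointer `tp`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ToyTlsIndex {
    std::size_t module_idx;
    std::size_t offset;
};

struct ToyResolverArg {
    std::size_t generation;  // library generation captured at relocation time
    ToyTlsIndex index;
};

struct ToyDtv {
    std::size_t generation;
    std::vector<char*> modules;
};

// Stand-in for __tls_get_addr: returns the absolute address of the variable.
using SlowPath = char* (*)(const ToyTlsIndex&);

// Byte distance from `base` to `addr`, computed on integers to avoid
// subtracting pointers into unrelated objects.
inline std::ptrdiff_t offset_from(const char* addr, const char* base) {
    return static_cast<std::ptrdiff_t>(reinterpret_cast<std::uintptr_t>(addr) -
                                       reinterpret_cast<std::uintptr_t>(base));
}

std::ptrdiff_t toy_resolve_dynamic(const ToyResolverArg& arg, const ToyDtv& dtv,
                                   const char* tp, SlowPath slow) {
    assert(tp != nullptr);
    const ToyTlsIndex& ti = arg.index;
    // Fast path: the dtv is at least as new as the library and the block exists.
    if (arg.generation <= dtv.generation && ti.module_idx < dtv.modules.size() &&
        dtv.modules[ti.module_idx] != nullptr) {
        return offset_from(dtv.modules[ti.module_idx] + ti.offset, tp);
    }
    // Slow path: obtain the absolute address, then convert it to an offset.
    return offset_from(slow(ti), tp);
}
```

Note that the fast path compares against the library's own generation rather than the global one, so loading an unrelated library does not push accesses to this library onto the slow path.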
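GNU TLS Data Structure Initialization
The principle is similar to bionic's TLS initialization, so it is not described in detail. The differences are:
When initializing TLS for statically loaded libraries, GNU reserves 144 bytes of space in the static TLS block.
When initializing TLS for dynamically loaded libraries, space is first taken from the reserved static TLS space, falling back to a dynamic TLS block when that runs out. This allows dlopen to load libraries whose TLS variables use the IE model.
The TLSDESC functions have different names: _dl_tlsdesc_return/_dl_tlsdesc_dynamic/_dl_tlsdesc_undefweak in GNU correspond to tlsdesc_resolver_static/tlsdesc_resolver_dynamic/tlsdesc_resolver_unresolved_weak in bionic.
The reserved static TLS space serves two purposes: it supports dynamically loaded libraries that use the IE model, and it speeds up the TLSDESC model.
On Linux, graphics acceleration libraries (OpenGL/EGL) are a typical user of the reserved static TLS space. Applications usually go through a graphics API dispatch library such as glvnd, which loads the OpenGL/EGL libraries with dlopen. Those libraries generally use IE-model TLS variables, usually a single pointer to a data structure, which keeps their use of the reserved static TLS space small.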
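Notes on the reserved static TLS space:
Allocation Timing: glibc tries to allocate static TLS space during relocation, for two relocation types: R_AARCH64_TLSDESC and R_AARCH64_TLS_TPREL.
Initialization Data: the static TLS space of every thread is initialized. The TLS data structures live on the thread stacks, some of which are user-provided and some system-allocated. This is implemented in _dl_init_static_tls.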
Concurrent Access: The allocation of reserved static TLS space is implemented in dl_open_worker_begin, and a large lock (dl_load_tls_lock) is acquired before calling the function. When initializing static TLS data, a lock (dl_stack_cache_lock) is obtained within the _dl_init_static_tls function to ensure thread safety.