Memory management ARM
Video Link: http://freevideolectures.com/Course/2341/Embedded-Systems
ARM processor Linux paging mechanism
The ARM MMU supports only two levels of page table address translation, while Linux models this as a three-level scheme to meet the memory-management needs of a 32-bit CPU.
Page sizes supported by ARM:
There are several: 1MB, 64KB, 4KB, and 1KB. On ARM, the Linux kernel uses a page size of 4KB, so the low 12 bits of a virtual address are reserved for the page offset. As can be seen from the figure, the page global directory index is 12 bits, the second-level index is 8 bits, and the page offset is 12 bits.
Under the ARM hardware paging scheme, then, the first-level global page directory has 4096 entries and each second-level table has 256 entries, and many of the bits in the second-level entries are used by the hardware.
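As a sanity check on that split, here is a minimal C sketch (illustrative only, not kernel code) decoding a 32-bit virtual address into the 12-bit first-level index, 8-bit second-level index, and 12-bit page offset:

```c
#include <stdint.h>

/* Decode a 32-bit virtual address the way the ARM hardware walk does
 * with 4KB pages: 12-bit L1 index, 8-bit L2 index, 12-bit page offset.
 * Illustrative sketch, not kernel code. */
static uint32_t l1_index(uint32_t va)    { return va >> 20; }           /* 0..4095 */
static uint32_t l2_index(uint32_t va)    { return (va >> 12) & 0xffu; } /* 0..255  */
static uint32_t page_offset(uint32_t va) { return va & 0xfffu; }        /* 0..4095 */
```

For example, the virtual address 0xC0123456 decodes to L1 index 0xC01, L2 index 0x23, and offset 0x456.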
The ARM Linux implementation adjusts this hardware scheme slightly. The first-level directory is treated as 2048 entries, each occupying 8 bytes (in other words, each entry holds two hardware pointers to two second-level page tables); the two hardware PTE tables are placed together contiguously, with the corresponding Linux status information saved behind them, so each logical PTE table actually has 512 entries (two hardware tables of 256 entries each). This way each logical PTE table takes exactly one page.
ARM linux page table layout is as follows:
In arch/arm/include/asm/pgtable.h, you can see the definitions of PTRS_PER_PTE, PTRS_PER_PMD, and PTRS_PER_PGD:
#define PTRS_PER_PTE 512
#define PTRS_PER_PMD 1
#define PTRS_PER_PGD 2048
Since the PGD has 2048 entries of 8 bytes each, it needs 16KB in total (4 * 4KB); i.e., on ARM Linux the PGD actually occupies four consecutive physical page frames.
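The arithmetic above can be checked with a small C sketch (illustrative only; the field order follows the description in these notes, with the two hardware tables first and the Linux status tables behind them):

```c
#include <stdint.h>

#define PTRS_PER_PGD    2048
#define PGD_ENTRY_BYTES 8
#define HW_PTE_ENTRIES  256

/* One "logical" second-level page as described above: two hardware
 * PTE tables walked by the MMU, followed by two parallel tables
 * holding the Linux-only status bits. Total: exactly one 4KB page. */
struct l2_page {
    uint32_t hw_pte[2][HW_PTE_ENTRIES];    /* 2 * 256 * 4 = 2048 bytes */
    uint32_t linux_pte[2][HW_PTE_ENTRIES]; /* 2 * 256 * 4 = 2048 bytes */
};
```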
http://embeddedma.blogspot.in/2013/06/arm-linux-build-process-page-table.html
startup memory
At start up, the kernel (usually) automatically determines the physical memory areas. However, the user can manually specify one or more physical memory areas using "mem=..." kernel command line options. When used, these override any automatically determined values. These options are parsed by early_mem(). The initial description of physical memory regions is stored in the global 'meminfo' structure, and each region is described as a 'bank'. Some initial physical memory is utilized by the bootmem allocator. It is from this pool of physical memory that the page tables are built, which allows the MMU to be turned on and the kernel switched over to virtual memory. Once this process is done, the bootmem pool is freed and all system pages are turned over to the various page and slab allocators of the system. A good reference for this is Mel Gorman's excellent book on the topic, Understanding the Linux Virtual Memory Manager: http://www.kernel.org/doc/gorman/html/understand/ (Chapter 5 covers the bootmem allocator).
page tables
The page tables are, unsurprisingly, initialized by paging_init(). ARM uses a somewhat weird way of mapping the Linux page tables onto the ARM hardware tables. This method is described in comments in the file arch/arm/include/asm/pgtable.h, with additional macros defined in pgtable-hwdef.h and page.h. Basically, Linux supports 4 levels (pgd, pud, pmd, and pte), and ARM maps this onto 2 levels (pgd/pmd and pte). The nomenclature in the code is hard to follow, because Linux generic code thinks that pgd is the top level of page tables, but internally the ARM code uses pmd macros to refer to the top hardware page table. Originally, Linux used 4KB mappings for ARM, but they have converted over to mostly 1MB mappings (at least for the Linux kernel). According to my colleague, Frank Rowand, bad things happen if a physical page is represented in the page table by more than one entry (for example, if a physical page has both an entry as a small page in a second-level page table, and is inside a region covered by a large "section" entry in a first-level page table).
At the hardware level, ARM supports two page table trees simultaneously, using the hardware registers TTBR0 and TTBR1. A virtual address is mapped to a physical address by the CPU depending on settings in the translation table base control register (TTBCR). This control register has a field which sets a split point in the address space. Addresses below the cutoff value are mapped through the page tables pointed to by TTBR0, and addresses above the cutoff value are mapped through TTBR1. TTBR0 is unique per-process, and corresponds to current->mm->pgd; that is, when a context switch occurs, the kernel sets TTBR0 from current->mm->pgd for the new process. TTBR1 is global for the whole system, and represents the page tables for the kernel; it is referenced in the global kernel variable swapper_pg_dir. Note that both of these are virtual addresses. You can find the physical address of a first-level page table by using the virt_to_phys() function on these addresses.
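The cutoff decision can be sketched as follows (a hypothetical helper, not a real kernel function; it assumes the split field N selects a boundary of 2^(32-N)):

```c
#include <stdint.h>

/* Hypothetical sketch: with a split value n (0 < n <= 7), virtual
 * addresses below 2^(32-n) are translated through TTBR0 (per-process
 * tables) and addresses at or above it through TTBR1 (kernel tables). */
static int translated_via_ttbr1(uint32_t va, unsigned int n)
{
    uint32_t cutoff = 1u << (32 - n);   /* e.g. n = 2 -> 0x40000000 */
    return va >= cutoff;
}
```

With n = 2, a low user address goes through TTBR0 while a kernel address such as 0xC0000000 goes through TTBR1.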
I found the following ARM reference material helpful in trying to understand the page table layout: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0333h/Babbhigi.html - this is the chapter on the MMU in the ARM1176JZ-S Technical Reference Manual. In particular, diagram 6.9 in section 6.11.2, showing ARMv6 section, supersection, and page translation, is very useful.
Note in particular that the second-level page table (PTE) layout is weird. The ARM hardware supports 1KB tables at this level (256 entries of 4 bytes each). However, the Linux kernel needs some bits that are not provided by some hardware (like DIRTY and YOUNG). These are synthesized by the ARM MMU code via permissions and faulting, effectively making them software-managed. They have to be stored outside the hardware tables, but still in a place where the kernel can get to them easily. In order to keep things aligned into units of 4KB, the kernel therefore keeps 2 hardware second-level page tables and 2 parallel arrays (totalling 512 entries) on a single page. When a new second-level page table is created, it is created 512 entries at a time, with the hardware entries at the top of the table, and the Linux software entries (synthesized flags and values) at the bottom.
The kernel has a mixture of 1MB mappings and 4KB mappings. This relieves some pressure on the TLB, which is used by the CPU to cache virtual-to-physical address translations.
paging_init()
The page tables and paging infrastructure are initialized as follows: paging_init() is called by setup_arch() after the meminfo structure has been initialized and the bootmem allocator is ready. It calls the following routines:
- memblock_set_current_limit()
- build_mem_type_table() - builds a table of memory types. This has the page protection flags that are available for different memory types, for the current ARM processor. Different ARM processors have changed what flags they use, and where they are located in the page table entries, over the years. The 'mem_types' table encapsulates the settings for the running processor.
- prepare_page_table() - this zeros out certain areas of the first-level page table (called pmd in this routine). For example, it zeros out the areas of the page table that will be covered by user-space (areas below the start of the kernel address space).
- map_lowmem() - create the memory mappings (page table entries) for the lower portions of kernel memory. This is the "normal" memory that will be used by the kernel for static code and data, stack, and regular dynamic allocations.
- devicemaps_init() - create the memory mappings for special CPU areas (e.g. cache flushing regions, and interrupt vectors) and reserved IO areas in the memory map.
- kmap_init() - create the memory mapping for highmem ('pkmap')
start_kernel()
setup_arch()
paging_init()
map_lowmem()
create_mapping()
alloc_init_pud() - for a range of pgd entries
alloc_init_section() - for a range of "pud" entries
*pmd = __pmd(<stuff>) - actually set the pmd/pgd entries for a SECTION mapping
or
alloc_init_pte()
early_pte_alloc() - to get a page for the pte table
pte_offset_kernel() - get address of 'linux' portion of pte table page
set_pte_ext(<stuff>) - create the individual PTE entries, for a range of entries
=> cpu_set_pte_ext() - macro to wrapper calling through a cpu-specific routine
=> processor.set_pte_ext() - function pointer to cpu-specific routine
=> cpu_armXXX_set_pte_ext() or cpu_procvX_set_pte_ext - cpu-specific-routine
set_pte_ext ('set page table entry extended') takes the following arguments:
- pte - pointer to pte entry to change (the address of the linux entry - not the 'hardware' entry)
- value - the value to place in the entry - this is usually created with the macro pfn_pte(x,y), where x has the page frame number (physical address of the page for this mapping), and y represents the protection flags for the page. This should be the 'linux' flags for the page.
- value2 - the value to place in the "extended" or "extra" entry
Nomenclature
- pgd = page global directory
- pud = page upper directory
- pmd = page middle directory
- pte = page table entry (this is confusing; 'pte' can refer to both the table and an entry in the table)
- ptep = pointer to pte entry
- pfn = page frame number = the base address (no bottom bits) of a physical page address
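A quick illustration of the pfn definition (assuming 4KB pages; illustrative, not kernel code):

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4KB pages */

/* The pfn is the physical address with the low PAGE_SHIFT bits dropped. */
static uint32_t phys_to_pfn(uint32_t phys) { return phys >> PAGE_SHIFT; }
static uint32_t pfn_to_phys(uint32_t pfn)  { return pfn << PAGE_SHIFT; }
```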
(http://pankaj-techstuff.blogspot.in/2007/12/initialization-of-arm-mmu-in-linux.html)
The kernel's execution starts at stext (arch/arm/kernel/head.S); at this point the kernel expects that the MMU, I-cache, and D-cache are all off. After looking up the processor type and machine number, we call __create_page_tables, which sets up the initial MMU tables that are used only during initialization of the MMU. Here we create MMU mapping tables for 4MB of memory starting from the kernel's text (more precisely, starting at the 1MB section in which the kernel's text starts). The following is the code snippet which creates these mappings:
-----------------------------
__create_page_tables:
	ldr	r5, [r8, #MACHINFO_PHYSRAM]	@ physram
	pgtbl	r4, r5				@ page table address
	mov	r0, r4
	mov	r3, #0
	add	r6, r0, #0x4000
1:	str	r3, [r0], #4
	str	r3, [r0], #4
	str	r3, [r0], #4
	str	r3, [r0], #4
	teq	r0, r6
	bne	1b
	ldr	r7, [r10, #PROCINFO_MMUFLAGS]	@ mmuflags
	mov	r6, pc, lsr #20			@ start of kernel section
	orr	r3, r7, r6, lsl #20		@ flags + kernel base
	str	r3, [r4, r6, lsl #2]		@ identity mapping
	add	r0, r4, #(TEXTADDR & 0xff000000) >> 18	@ start of kernel
	str	r3, [r0, #(TEXTADDR & 0x00f00000) >> 18]!
	add	r3, r3, #1 << 20
	str	r3, [r0, #4]!			@ KERNEL + 1MB
	add	r3, r3, #1 << 20
	str	r3, [r0, #4]!			@ KERNEL + 2MB
	add	r3, r3, #1 << 20
	str	r3, [r0, #4]			@ KERNEL + 3MB
-----------------------------
In this function we first of all find out the physical address where our RAM starts, and locate the address where we'll store our initial MMU tables (lines #1 and #2). We create these initial MMU tables 16KB below the kernel entry point. Lines #12-#15 create an identity mapping of 1MB starting from the kernel entry point (the physical address of kernel entry, i.e. stext). We don't create a second-level page table for these mappings; instead we specify in the first-level descriptor that these are section mappings (each section mapping covers 1MB). Similarly we create mappings for 4 more sections (these 4 are non-identity mappings, each 1MB as well) starting at TEXTADDR (the virtual address of the kernel entry point).
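In C terms, what the descriptor-building lines compute looks roughly like this (an illustrative sketch; MMU_FLAGS is a placeholder standing in for the processor-specific PROCINFO_MMUFLAGS value):

```c
#include <stdint.h>

#define SECTION_SHIFT 20
#define MMU_FLAGS     0x0c0eu  /* placeholder for PROCINFO_MMUFLAGS */

/* A first-level "section" descriptor: the 1MB-aligned physical base
 * OR'd with the MMU flags. The descriptor is stored at index
 * (va >> 20) in the 16KB first-level table. */
static uint32_t section_desc(uint32_t pa)
{
    return (pa & ~((1u << SECTION_SHIFT) - 1u)) | MMU_FLAGS;
}
```

So mapping the section containing physical address 0x80123456 produces a descriptor holding base 0x80100000 plus the flag bits.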
So the initial memory map looks something like this:
After the initial page tables are set up, the next step is to enable the MMU. This code is tricky and does a lot of deep magic.
----------------------------
	ldr	r13, __switch_data	@ address to jump to after mmu has been enabled
	adr	lr, __enable_mmu	@ return (PIC) address
	add	pc, r10, #PROCINFO_INITFUNC

	.type	__switch_data, %object
__switch_data:
	.long	__mmap_switched
	.long	__data_loc		@ r4
	.long	__data_start		@ r5
	.long	__bss_start		@ r6
	.long	_end			@ r7
	.long	processor_id		@ r4
	.long	__machine_arch_type	@ r5
	.long	cr_alignment		@ r6
	.long	init_thread_union+8192	@ sp
Line #1 puts the virtual address of __mmap_switched in r13; after enabling the MMU, the kernel will jump to the address in r13. The virtual address used here is not from the identity mapping, but is PAGE_OFFSET + the physical address of __mmap_switched. Since __mmap_switched starts referring to the virtual addresses of variables and functions, code from __mmap_switched onwards is no longer position-independent.
At lines #2-#3, we put the position-independent address of __enable_mmu in lr (the 'adr' pseudo-instruction translates to PC-relative addressing, which is why it is position-independent) and jump to an offset of PROCINFO_INITFUNC (12 bytes) into the __v6_proc_info structure (arch/arm/mm/proc-v6.S). At PROCINFO_INITFUNC in __v6_proc_info we have a branch to __v6_setup, which does the following setup for enabling the MMU:
- Clean and Invalidate D-cache and Icache, and invalidate TLB.
- Prepare the control register 1 (C1) value that needs to be written when enabling the MMU, and return that value.
--------------------------------------
__turn_mmu_on:
	mov	r0, r0
	mcr	p15, 0, r0, c1, c0, 0	@ write control reg
	mrc	p15, 0, r3, c0, c0, 0	@ read id reg
	mov	r3, r3
	mov	r3, r3
	mov	pc, r13
__turn_mmu_on just writes 'r0' to C1 to enable the MMU. Lines #4 and #5 are nops to make sure that the pipeline does not contain an invalid address access when C1 is written. At line #6, 'r13' is moved into 'pc' and we enter __mmap_switched. Now the MMU is on: every address is virtual, no physical addresses anymore. But the final kernel page tables are still not set up (they will be set up by paging_init, and the mappings created by __create_page_tables will be discarded); we are still running with the 4MB mapping that __create_page_tables set up for us. __mmap_switched copies the data segment if required, clears the BSS, and calls start_kernel. The page table mappings discussed above make sure that the position-dependent code of kernel startup runs peacefully; these mappings are overwritten at a later stage by paging_init() (start_kernel() -> setup_arch() -> paging_init()).
paging_init() populates the master L1 page table (init_mm) with linear mappings of the complete SDRAM and the SOC-specific address space (SOC-specific memory mappings are created by the mdesc->mapio() function; this function pointer is initialized by SOC-specific code, arch/arm/mach-*). So in the master L1 page table (init_mm) we have mappings which map virtual addresses in the range PAGE_OFFSET to (PAGE_OFFSET + SDRAM size) to physical addresses from SDRAM start to (SDRAM start + SDRAM size), plus the SOC-specific mappings created by the mdesc->mapio() function.
One more point worth noting here is that whenever a new process is created, a new L1 page table is allocated for it, and the kernel mappings (the SDRAM mapping and the SOC-specific mappings) are copied into it from the master L1 page table (init_mm). Each process has its own user-space mappings, so nothing needs to be copied for those.
Handling of mappings for the VMALLOC region is a bit tricky, because vmalloc virtual addresses are allocated when a process calls vmalloc(). If some processes were created before the process which called vmalloc(), their L1 page tables will have no mapping for the newly vmalloc'ed region. This is taken care of very simply: the mappings for the vmalloc'ed region are updated in the master L1 page table (init_mm), and when a process whose page tables lack the newly created mapping accesses the vmalloc'ed region, a page fault is generated. The kernel handles page faults in the VMALLOC region specially, by copying the mappings for the newly vmalloc'ed area into the page tables of the process which generated the fault.
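The fault-time fixup can be sketched like this (hypothetical names and a plain-array model of the L1 tables; the real kernel does this inside its page fault handler for addresses in the vmalloc range):

```c
#include <stdint.h>

/* Copy the missing first-level entry for a vmalloc address from the
 * master kernel table (init_mm's pgd) into the faulting process's
 * table. Returns 0 on success, -1 if the master has no mapping either
 * (a genuinely bad access). */
static int fixup_vmalloc_fault(uint32_t *proc_pgd,
                               const uint32_t *master_pgd, uint32_t va)
{
    uint32_t idx = va >> 20;            /* 1MB-granular first level */
    if (master_pgd[idx] == 0)
        return -1;                      /* no kernel mapping: real fault */
    proc_pgd[idx] = master_pgd[idx];    /* inherit the kernel mapping */
    return 0;
}
```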
Tweaks for integrating ARM 2-level page tables with the Linux implementation of 3-level ix86-style page tables
1. Linux assumes that it is dealing with 3-level page tables, while ARM has 2-level page tables. To handle this, for ARM, __pmd is defined as an identity macro in the ARCH include files (include/asm-arm/page.h):
-------------------------------------------------
#define __pmd(x) ((pmd_t) { (x) } )
--------------------------------------------------
So for ARM, Linux is told that the pmd has just one entry, effectively bypassing the pmd level.
2. Memory management code in Linux expects ix86-type page table entries; for example it uses the 'P' (present) and 'D' (dirty) bits, but ARM page table entries (PTEs) don't have these bits. As a workaround to provide ix86-style PTE flags, the ARM page table implementation tells Linux that the PGD has 2048 entries of 8 bytes each (whereas the ARM hardware level-1 page table has 4096 entries of 4 bytes each).
It also tells Linux that each PTE table has 512 entries (whereas an ARM hardware PTE table has 256 entries). This means that the PTE table exposed to Linux is actually 2 ARM PTE tables, arranged contiguously in memory. After these 2 ARM PTE tables (let's say h/w PTE table 1 and h/w PTE table 2), 2 more PTE tables (256 entries each, say Linux PTE table 1 and 2) are allocated in memory contiguous to ARM hardware page table 2. Linux PTE tables 1 and 2 contain ix86-style PTE flags corresponding to the entries in ARM PTE tables 1 and 2. So whenever Linux needs ix86-style PTE flags, it uses the entries in Linux PTE tables 1 and 2. ARM never looks into Linux PTE tables 1 and 2 during a hardware page table walk; it uses only the h/w PTE tables as mentioned above. Refer to include/asm-arm/pgtable.h (lines 20-76) for details of how ARM h/w PTE tables 1 and 2 and Linux PTE tables 1 and 2 are organized in memory.
So here we conclude the architecture-specific page table related stuff for ARM.