



### **Operating Systems**

#### 3. Memory Virtualization



#### Pablo Prieto Torralbo

DEPARTMENT OF COMPUTER ENGINEERING AND ELECTRONICS

> This material is published under: <u>Creative Commons BY-NC-SA 4.0</u>







# 3.1 Memory Virtualization - Address Space



# Early Days

- No memory abstraction.
- OS as a set of routines (standard library).
  - At address 0 in the example.
- One running program uses the rest of the memory.
- The program addresses physical memory directly.







# Multiprogramming



- CPU Virtualization was the illusion of having a private CPU per process.
- Memory Virtualization is the illusion of having a private memory per process.
- CPU: Multiple processes at a time.
  - OS switches between them saving the state of the CPU (Time sharing).
- Memory: Allow a program to use all memory and save it to disk when switching.
  - Too slow.
- Better: leave processes in memory and switch between them.



# Multiprogramming





- Each process has a portion of physical memory.
- Multiple programs reside concurrently in memory.
- Processes need Protection

How does a process know where it is in the memory?



## Address Space

- Easy to use abstraction of physical memory.
  - Illusion of private memory.
  - Virtual Address.
  - Extend LDE (Limited Direct Execution).
- Contains memory state of a program:
  - Code (instructions, static data).
  - Heap (data, dynamically-allocated memory).
  - Stack (local variables and routine parameters).

#### In OSTEP















Segment arrangement may vary.














## Address Space

- Abstraction provided by the OS.
  - The program is not actually located at physical addresses 0 through 64KB.
  - The OS loads it at an arbitrary physical address (in the example, Process A is at address 320KB).
  - When a process accesses an address from its address space (e.g. virtual address 1KB) → the OS, with hardware support, must reach the correct physical address (e.g. 321KB for Process A).
- This is the key to virtualization of memory.

| Physical address | Contents                            |
|------------------|-------------------------------------|
| 0KB              | Operating System (code, data, etc.) |
| 64KB             | (Free)                              |
| 128KB            | Process C (code, data, etc.)        |
| 192KB            | Process B (code, data, etc.)        |
| 256KB            | (Free)                              |
| 320KB            | Process A (code, data, etc.)        |
| 384KB            | (Free)                              |
| 448KB            | (Free)                              |
| 512KB            | (end of physical memory)            |

How does a process know where it is?

Processes don't know where they actually are: they only see virtual addresses.



# Address Space: Goals



- Transparency: invisible to the running programs.
  - Programs do not realize memory is virtualized.
- Efficiency:
  - Time efficient  $\rightarrow$  programs do not run much more slowly.
  - Space efficient  $\rightarrow$  not too much memory is used for supporting virtualization.
- Protection: to protect processes from one another and the OS itself from processes.
  - A process should not be able to access memory outside its address space.
  - Isolation among processes → one can fail without affecting others. Prevents harm.
    - Some modern Operating Systems isolate pieces of the OS from other pieces of the OS providing great reliability.





# 3.2 Memory Virtualization - Address Translation





- Virtualization of the CPU uses Limited Direct Execution:
  - Let the program run directly on the hardware.
  - Make the OS take control at critical points.
- Two main goals: Efficiency + Control.
- How can we efficiently and flexibly virtualize memory while ensuring protection (control)?
  - Hardware support  $\rightarrow$  address translation.
    - Transform each virtual address used by program instructions into the physical address where the information is actually located.
  - The OS manages memory, keeping track of which locations are used/free.





- Starting Assumptions:
  - User's address space is placed contiguously in physical memory.
  - Size of address space is less than the size of physical memory.
  - Each address space is exactly the same size.
- As with CPU virtualization, these assumptions are unrealistic, but we will relax them gradually.





#### Example #1:



- Assume the address of `x` is in register `ebx`.
  - Let's assume it is 0x3C00 (15KB, in the stack, near the bottom of a 16KB address space).












|         | Base          | Size |
|---------|---------------|------|
| Process | 32KB (0x8000) | 16KB |

#### Example #1

- Fetch instruction at addr 0x8128
- Exec load from addr 0xBC00
- Fetch instruction at addr 0x8132
- Exec, no load
- Fetch instruction at addr 0x8136
- Exec store to addr 0xBC00

Program (virtual addresses):

- `0x128: movl 0x3C00, %eax`
- `0x132: addl $0x3, %eax`
- `0x136: movl %eax, 0x3C00`







- Dynamic relocation, also referred to as base and bounds (two hardware registers):
  - Base register
  - Bounds/limit register
- As a process starts running, the OS decides where to load the address space in physical memory → stores that value in the base register.
- The bounds (limit) register holds the size of the address space → protection.
- Address translation:

```
Physical address = virtual address + base
```

- Every memory reference must be within bounds; otherwise the hardware raises an exception.





- Example #2: A process with an address space of 16KB loaded at physical address 32KB.
  - Base register: 32KB (32768)
  - Bounds register: 16KB (16384)

| Virtual Address | Physical Address      |
|-----------------|-----------------------|
| 0               | 32KB                  |
| 1KB             | 33KB                  |
| 6000            | 38768                 |
| 17100           | Fault (out of bounds) |
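The table above can be reproduced with a few lines of C. This is only a sketch of what the MMU does in hardware; the `mmu_regs` type and the `bb_translate` name are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the two MMU registers. */
typedef struct {
    uint32_t base;   /* physical address where the address space starts */
    uint32_t bounds; /* size of the address space, in bytes */
} mmu_regs;

/* Translates vaddr; returns false when the access is out of bounds
 * (real hardware would raise an exception instead). */
bool bb_translate(const mmu_regs *mmu, uint32_t vaddr, uint32_t *paddr)
{
    if (vaddr >= mmu->bounds) {
        return false;
    }
    *paddr = mmu->base + vaddr;
    return true;
}
```

With `base = 32768` and `bounds = 16384`, virtual address 6000 maps to 38768 and virtual address 17100 is rejected, as in Example #2.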





- The part of the processor that helps with address translation is called the Memory Management Unit (MMU)
  - Base/bounds registers.
  - Ability to translate virtual address and check if within bounds.
  - Privileged instructions to update base/bounds.
  - Privileged instructions to register exception handlers ("out of bounds" and "illegal instruction").
  - Ability to raise exceptions.





#### OS management:

- Find space at process creation. Easy with our assumptions. Free list.
- Reclaim all the process memory when it terminates. Back to the free list.
- Save and restore base/bounds registers when a context switch occurs (PCB).
- Provide exception handlers.
- It is possible for the OS to move an address space from one location in memory to another (when the process is not running).
  - memcpy





| OS boot<br>(kernel mode)                         | Hardware                                                                                                                                       | Program<br>(user mode) |
|--------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|
| Initialize trap table                            | Remember address of:<br>System call handler, Timer handler, <u>Illegal mem access handler, Illegal instruction handler</u>                     |                        |
| Start interrupt timer                            |                                                                                                                                                |                        |
|                                                  | Start timer. Interrupt after X ms.                                                                                                             |                        |
| Initialize process table<br>Initialize free list |                                                                                                                                                |                        |



| OS<br>(kernel mode) | Hardware | Program<br>(user mode) |
|---------------------|----------|------------------------|
| To start process A:<br>allocate entry in process table<br><u>allocate memory for process</u><br><u>set base/bounds registers</u><br>return from trap (into A) | | |
| | Restore registers of A<br>Move to user mode<br>Jump to A (initial) PC | |
| | | Process A runs<br>Fetch instruction |
| | Translate virtual address and perform fetch | |
| | | Execute instruction |
| | If explicit load/store:<br>Ensure address is in-bounds<br>Translate virtual address and perform load/store | |
| | | ... |
| | Timer interrupt:<br>move to kernel mode<br>Jump to interrupt handler | |
| Handle the trap:<br>Call switch routine:<br>save regs(A) to proc-struct(A) (including base/bounds)<br>restore regs(B) from proc-struct(B) (including base/bounds)<br>return from trap (into B) | | |
| | Restore registers of B<br>Move to user mode<br>Jump to B's PC | |





#### Pros:

- Fast and Simple.
- Offers protection.
- Little overhead (2 registers per process).

#### Cons:

- Not flexible.
- Wastes memory for large address spaces
  - Internal Fragmentation.







# 3.3 Memory Virtualization - Segmentation





- Segmentation: instead of a single base/bounds pair in the MMU, one pair per logical segment of the address space.
  - Avoids internal fragmentation.
  - Allows running programs whose address spaces don't fit entirely into memory.
- A segment is a contiguous portion of the address space. In our simple address space:
  - Code segment
  - Heap segment
  - Stack segment





















#### Which segment does an address refer to?

- Implicit approach:
  - The hardware determines the segment by noticing how the address was formed.
  - e.g. from the program counter (fetch) → code segment; from the stack pointer → stack segment.
- Explicit approach:
  - Top few bits of the virtual address (used in VAX/VMS).
  - e.g. in our example, with three segments, we need 2 bits of the 14-bit address. So address 0x1068 (binary 01 0000 0110 1000) has segment bits 01 and offset 0x068.



| Segment   | Number |
|-----------|--------|
| Code+data | 0      |
| Heap      | 1      |
| Stack     | 2      |



What about the stack? (grows backwards)

| Segment | Base | Size | Grows +? |
|---------|------|------|----------|
| Code    | 32KB | 2KB  | 1        |
| Heap    | 34KB | 2KB  | 1        |
| Stack   | 28KB | 2KB  | 0        |

Example: access to virtual address 15KB:

- Stack access (negative growth).
- Stack: 11 1100 0000 0000 (hex 0x3C00).
- Offset: 3KB  $\rightarrow$  in two's complement, -1KB.
- The offset is negative (two's complement): VA offset (3KB) minus max. segment size (4KB) equals real offset (-1KB).
  - In the example, physical address 27KB (base 28KB minus 1KB).
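A rough C sketch of segmented translation with the explicit approach follows. It assumes the 14-bit layout used above (2 segment bits, 12 offset bits, 4KB maximum segment size) and, because virtual address 15KB has top bits 11, it places the stack in slot 3 of a hypothetical four-entry table. The `segment` type and `seg_translate` name are illustrative, not a real hardware interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEG_SHIFT    12u     /* 14-bit address: 2 segment bits + 12 offset bits */
#define SEG_OFF_MASK 0xFFFu
#define MAX_SEG_SIZE 4096    /* maximum segment size: 4KB */

typedef struct {
    uint32_t base;           /* physical base address */
    uint32_t size;           /* current segment size in bytes */
    bool grows_positive;     /* false for the stack */
    bool valid;              /* false for unused table slots */
} segment;

/* Returns false on a segmentation violation. */
bool seg_translate(const segment table[4], uint32_t vaddr, uint32_t *paddr)
{
    uint32_t sn = (vaddr >> SEG_SHIFT) & 0x3u;
    int32_t offset = (int32_t)(vaddr & SEG_OFF_MASK);
    const segment *s = &table[sn];

    if (!s->valid) {
        return false;
    }
    if (s->grows_positive) {
        if ((uint32_t)offset >= s->size) {
            return false;
        }
    } else {
        offset -= MAX_SEG_SIZE;              /* e.g. 3KB - 4KB = -1KB */
        if ((uint32_t)(-offset) > s->size) {
            return false;
        }
    }
    *paddr = (uint32_t)((int32_t)s->base + offset);
    return true;
}
```

With code at 32KB, heap at 34KB and stack at 28KB (2KB each), virtual address 15KB translates to 27KB, matching the example.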





| Segment | Base          | Size | Grows +? |
|---------|---------------|------|----------|
| Code    | 32KB (0x8000) | 2KB  | 1        |
| Heap    | 34KB (0x8800) | 2KB  | 1        |
| Stack   | 28KB (0x7000) | 2KB  | 0        |

- Example #4 with Segmentation
  - Fetch instruction at addr 0x8128
  - Exec load from addr 0x6C00
  - Fetch instruction at addr 0x8132
  - Exec, no load
  - Fetch instruction at addr 0x8136
  - Exec store to addr 0x6C00

Program (virtual addresses):

- `0x128: movl 0x3C00, %eax`
- `0x132: addl $0x3, %eax`
- `0x136: movl %eax, 0x3C00`







### Support for sharing $\rightarrow$ protection bits

| Segment | Base | Size | Grows +? | Protection   | Notes        |
|---------|------|------|----------|--------------|--------------|
| Code    | 32KB | 2KB  | 1        | Read-Execute | Code sharing |
| Heap    | 34KB | 2KB  | 1        | Read-Write   |              |
| Stack   | 28KB | 2KB  | 0        | Read-Write   |              |

- Fine-grained vs. Coarse-grained Segmentation.
  - Large vs. small number of segments.
  - Flexibility vs. Cost.





## Segmentation - Fragmentation





# **Segmentation - Summary**



#### Pros:

- Easy and fast.
- Supports sparse address spaces (no internal fragmentation).
- Allows sharing and fine-grained protection.
- Little overhead (a few registers per process).

#### Cons:

- External Fragmentation.
  - Free-list Management reduces it, but it still exists.
- Complex Free Space Management.
- Segment growing could mean memcpy.





# 3.4 Memory Virtualization - Free Space Management



# Free Space Management

- Goal: minimize external fragmentation (without compacting).
- Manage a free list to keep track of free space.
- Basic mechanisms: splitting and coalescing.
- Needed by:
  - An OS that uses segmentation.
  - A user-level memory-allocation library  $\rightarrow$  heap.



































### Free Space Management









#### Binary buddy allocator:

- Free memory as a space of size  $2^{N}$ .
- e.g. request for a 7KB block.
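The rounding and splitting behind that request can be sketched in C as follows. The 64KB starting region used in the note below is an assumption (the figure is not available), and both function names are hypothetical.

```c
#include <stddef.h>

/* Smallest power-of-two block size that can hold `request` bytes. */
size_t buddy_block_size(size_t request)
{
    size_t block = 1;
    while (block < request) {
        block <<= 1;
    }
    return block;
}

/* Number of times a free region of `total` bytes (a power of two)
 * is halved until a block of the right size is obtained. */
unsigned buddy_splits(size_t total, size_t request)
{
    size_t block = buddy_block_size(request);
    unsigned splits = 0;
    while (total > block) {
        total >>= 1;
        splits++;
    }
    return splits;
}
```

A 7KB request is served with an 8KB block (1KB of internal fragmentation); starting from an assumed 64KB free region, that takes three splits (64KB → 32KB → 16KB → 8KB).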









# 3.5 Memory Virtualization - Paging



### Introduction

- Segmentation involves chopping up memory space into variable-sized pieces.
  - Is too coarse grained. Complex free space management.
- Paging chops up memory space into fixed-sized pieces.
  - We divide the address space into fixed-sized units called pages.
  - Correspondingly, the physical memory is viewed as an array of fixed-sized slots called page frames.
  - Each virtual page is independently mapped to a physical page.
  - More flexible and easier free-space management.





### Introduction



- What techniques do we need?
- How much space and overhead does it need?
- What is the correct page size?





- For segmentation:
  - high bits  $\rightarrow$  segment.
  - low bits  $\rightarrow$  offset.
- For paging:
  - high bits  $\rightarrow$  page.
  - low bits  $\rightarrow$  offset.

| VPN (high bits) | Offset (low bits)   |
|-----------------|---------------------|
| VA6 VA5         | VA4 VA3 VA2 VA1 VA0 |

#### How many bits?





| Page Size | Low Bits<br>(offset) | Virtual Address<br>Bits | High Bits (VPN) | Virtual Pages |
|-----------|----------------------|-------------------------|-----------------|---------------|
| 16 bytes  | 4                    | 10                      | 6               | 64            |
| 1KB       | 10                   | 20                      | 10              | 1K            |
| 1MB       | 20                   | 32                      | 12              | 4K            |
| 512 bytes | 9                    | 14                      | 5               | 32            |
| 4KB       | 12                   | 32                      | 20              | 1M            |


#### Where do we store the translations?









- Page table <u>per process</u> to record where each virtual page is placed in physical memory.
- Page table stores address translation for each virtual page of the address space.



The VPN is used as an index into the page table to get the Physical Frame Number (PFN).





### Page Table



- Real address space: 32 bits (4GB) or 64 bits...
- Page tables can be terribly large.
  - e.g. 32-bit address space with 4KB pages:
    - 20-bit VPN  $\rightarrow$  2<sup>20</sup> translations (~1 million per process).
    - 12-bit offset (4KB page size).
    - Assuming 4 bytes per page table entry (PTE)
      - $\rightarrow$  4MB of memory for each page table  $\rightarrow$  per Process!

Figure: x86 page table entry (32 bits, numbered 31 down to 0).

Info bits: valid, protection, present, reference, dirty...

The page table is not stored in the MMU, but in memory (kernel space).





#### Page Table at addr 0x2000

| Virtual Page | Physical Frame |
|--------------|----------------|
| 0            | 3              |
| 1            | 7              |
| 2            | 5              |
| 3            | 2              |

- Example #5 with Paging
  - Load (PT) from addr 0x02000
  - Fetch instruction at addr 0x0C128
  - Load (PT) from addr 0x0200C
  - Exec load from addr 0x0BC00
  - Load (PT) from addr 0x02000
  - Fetch instruction at addr 0x0C132
  - Exec, no load
  - Load (PT) from addr 0x02000
  - Fetch instruction at addr 0x0C136
  - Load (PT) from addr 0x0200C
  - Exec store to addr 0x0BC00

**Too Slow!** 

Program (virtual addresses):

- `0x128: movl 0xFC00, %eax`
- `0x132: addl $0x3, %eax`
- `0x136: movl %eax, 0xFC00`







- Being in memory, accessing a page table is slow.
  - Page-table base register (PTBR) points to the page table.
  - Page-table length register (PTLR) indicates the size of the page table.
- Every data/instruction access requires two memory accesses. One for the page table, one for the data/instruction.

movl 21, %eax

Implies:

```
VPN = (VirtualAddress & VPN_MASK) >> offset_bits
PTEAddr = PTBR + (VPN*sizeof(PTE))
```

First memory access to PTEAddr to get PFN

```
Offset = VirtualAddress & OFFSET_MASK
PhysAddr = (PFN << offset_bits) | Offset
```

Second memory access to get data in PhysAddr (and store it in %eax)
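The two steps above can be written as runnable C using the values of Example #5. The 16KB page size (14 offset bits) is not stated in the slides; it is inferred here from the physical addresses of that example (frame 3 starts at 0xC000), so treat it as an assumption. The array stands in for the page table stored in memory at 0x2000.

```c
#include <stdint.h>

#define PG_OFFSET_BITS 14u                          /* 16KB pages (assumed) */
#define PG_OFFSET_MASK ((1u << PG_OFFSET_BITS) - 1u)
#define PTE_SIZE       4u                           /* bytes per page table entry */
#define PTBR           0x2000u                      /* page-table base register */

/* Toy page table: index = VPN, value = PFN (0->3, 1->7, 2->5, 3->2). */
static const uint32_t page_table[4] = { 3, 7, 5, 2 };

/* Address read by the first memory access (the PTE).
 * vaddr must be a 16-bit virtual address. */
uint32_t pte_address(uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PG_OFFSET_BITS;
    return PTBR + vpn * PTE_SIZE;
}

/* Physical address used by the second memory access. */
uint32_t page_translate(uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PG_OFFSET_BITS;
    uint32_t offset = vaddr & PG_OFFSET_MASK;
    uint32_t pfn = page_table[vpn];   /* stands in for the load from pte_address() */
    return (pfn << PG_OFFSET_BITS) | offset;
}
```

For virtual address 0xFC00 the PTE is read from 0x200C and the data access goes to 0xBC00; the instruction at 0x128 is fetched from 0xC128 after reading the PTE at 0x2000.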



### Paging

### Pros:

- Very Flexible.
- No external Fragmentation.
- No need to move memory blocks.
- Easy Free Space Management.
  - Simple free list (valid bit).
  - Don't need to find contiguous memory.
  - No need to coalesce (fixed size pages).

### Cons:

- Expensive translation (too slow).
- Huge Overhead.







# 3.6 Memory Virtualization - Paging Improvements



### Remember...



| Mechanism     | Fragmentation       | Flexibility | Overhead | Speed | Free Space | AS bigger than<br>PMem |
|---------------|---------------------|-------------|----------|-------|------------|------------------------|
| Base & bounds | Internal (big)      | Small       | Small    | Fast  | Simple     | No                     |
| Segmentation  | External (variable) | Medium      | Small    | Fast  | Complex    | Yes                    |
| Paging        | Internal (small)    | High        | Big      | Slow  | Simple     | Yes                    |

Goal: reduce the impact of paging's big overhead and slow speed.



### Paging too slow

- Too slow  $\rightarrow$  translation steps.

#### For each mem reference:

- (cheap) extract VPN (virt page num) from VA (virt addr)
- (cheap) calculate addr of PTE (page table entry)
- (expensive) fetch PTE
- (cheap) extract PFN (page frame num)
- (cheap) build PA (phys addr)
- (expensive) fetch PA to register

Questions:

- Which steps are expensive?
- Which expensive step can we avoid?



### Paging too slow



`int sum = 0; for (i = 0; i < N; i++) { sum += a[i]; }`

| Virtual | Physical                        |
|---------|---------------------------------|
| 0x3000  | Load 0x100C (PT)<br>Load 0x7000 |
| 0x3004  | Load 0x100C (PT)<br>Load 0x7004 |
| 0x3008  | Load 0x100C (PT)<br>Load 0x7008 |
| 0x300C  | Load 0x100C (PT)<br>Load 0x700C |
| 0x3010  | Load 0x100C (PT)<br>Load 0x7010 |
|         |                                 |

#### Assume:

- 4KB pages (12-bit offset).
- Array `a` at address 0x3000.
- Page table at address 0x1000, so the translation of VPN 3 is the entry at address 0x100C.
  - Translation: VPN 3  $\rightarrow$  PFN 7.
- Only the data array accesses are considered.
- Take advantage of repetition/locality.
  - Common translation: 0x3000 → 0x7000.
- Use some kind of CPU cache for translations.





- The two memory access problem can be solved by the use of a special fast-lookup hardware cache called associative registers or translation look-aside buffers (TLBs).
- A TLB is part of the memory-management unit (MMU).
- It is an address-translation cache that stores popular virtual-to-physical address translations.





- Upon a virtual memory reference:
  - MMU first checks the TLB to see if the translation is stored therein.
  - TLB Hit (quick)
    - $\rightarrow$  Extract PFN and get physical address (PA).
  - TLB Miss (slow)
    - $\rightarrow$  access page table to find the translation.
    - $\rightarrow$  update TLB with the translation.
    - $\rightarrow$  extract the PFN and get the physical address (PA).





- Effective Access Time (EAT)
  - Associative lookup =  $\varepsilon$  nanoseconds.
  - Memory cycle time =  $\beta$  nanoseconds.
  - Hit ratio =  $\alpha$  (fraction of references that hit in the TLB).

$$EAT = (\beta + \varepsilon)\alpha + (2\beta + \varepsilon)(1 - \alpha) = 2\beta + \varepsilon - \alpha\beta$$

$$EAT \xrightarrow{\alpha \to 1} \beta + \varepsilon \qquad EAT \xrightarrow{\alpha \to 0} 2\beta + \varepsilon$$

In practice  $\beta \gg \varepsilon$.
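As a quick check of the formula, here is a direct C transcription. The function name is arbitrary and the numbers in the note below are illustrative, not measurements.

```c
/* alpha: TLB hit ratio in [0, 1]
 * beta:  memory cycle time (ns)
 * eps:   associative (TLB) lookup time (ns) */
double effective_access_time(double alpha, double beta, double eps)
{
    double hit_cost  = beta + eps;        /* TLB lookup + one memory access */
    double miss_cost = 2.0 * beta + eps;  /* TLB lookup + page table + data */
    return alpha * hit_cost + (1.0 - alpha) * miss_cost;
}
```

Assuming β = 100 ns and ε = 1 ns, a hit ratio of 1 gives 101 ns, a hit ratio of 0 gives 201 ns, and a hit ratio of 0.5 gives 151 ns.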





#### Page Table



| Virtual | Physical                        |
|---------|---------------------------------|
| 0x3000  | Load 0x100C (PT)<br>Load 0x7000 |
| 0x3004  | TLB hit<br>Load 0x7004          |
| 0x3008  | TLB hit<br>Load 0x7008          |
| 0x300C  | TLB hit<br>Load 0x700C          |
| 0x3010  | TLB hit<br>Load 0x7010          |
| ...     | ...                             |
| 0x4000  | Load 0x1010 (PT)<br>Load 0x6000 |

TLB contents:

| VPN | PFN |
|-----|-----|
| 3   | 7   |
| 4   | 6   |

- Example #6: Accessing an array
  - First entry (a[0]) at VPN=03.
  - 4KB pages.
  - PT at addr 0x1000

```
int sum = 0;
for (int i = 0; i < 4096; i++)
{
    sum += a[i];
}
```

- Just consider the data array accesses (ignore instructions and the `sum` and `i` variables).
  - How many TLB lookups per page? 4096/sizeof(int) = 1024.
  - How many TLB misses? If `a` is page-aligned (address of `a` % 4096 == 0), 4; otherwise 5.
  - Miss rate?  $4/4096 \approx 0.1\%$  or  $5/4096 \approx 0.12\%$.
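The miss counts can be checked with a short simulation. This sketch assumes an initially empty TLB that never has to evict during the scan and elements that do not straddle a page boundary; the function name is invented.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts TLB misses for a sequential scan of `count` elements of
 * `elem_size` bytes starting at virtual address `base`. */
size_t sequential_tlb_misses(uintptr_t base, size_t count,
                             size_t elem_size, size_t page_size)
{
    size_t misses = 0;
    uintptr_t last_vpn = 0;

    for (size_t i = 0; i < count; i++) {
        uintptr_t vpn = (base + i * elem_size) / page_size;
        if (i == 0 || vpn != last_vpn) {
            misses++;            /* first touch of a new page */
            last_vpn = vpn;
        }
    }
    return misses;
}
```

Scanning 4096 four-byte integers from the page-aligned address 0x3000 touches 4 pages, so 4 misses; starting at 0x3004 the array spills into a fifth page, so 5 misses.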





### TLB improves performance due to:

- spatial locality → Elements of the array are packed into pages.
- temporal locality → Quick re-referencing of memory items in time.

(like any cache)



- TLB is finite → need to replace an entry when installing a new one.
- Goal: Minimize miss rate (increase hit rate)
- Typical policies:
  - Least-recently-used (LRU)
  - Random (sometimes better than LRU!)
  - FIFO

Like most caches! More about replacement policies later





# TLB behavior

- When does TLB perform ok?
  - Sequential accesses can almost always hit in the TLB
  - Fast translation!
- What kind of pattern would be slow?
  - Highly random (no repeat accesses).
  - Sequential accesses that touch a different page each time and need more pages than the TLB holds, with LRU replacement.
    - e.g. 4KB pages, 4-entry TLB.
       Virtual address accesses to: 0x1000 0x2000 0x3000 0x4000 0x5000 0x6000...
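To see why that pattern is slow, the following sketch simulates a 4-entry fully associative TLB with LRU replacement. It is a toy model under those assumptions (the name `lru_tlb_misses` is hypothetical), not a description of any real TLB.

```c
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 4

/* Returns the number of TLB misses for the given sequence of VPNs. */
size_t lru_tlb_misses(const uint32_t *vpns, size_t n)
{
    uint32_t tag[TLB_ENTRIES];
    size_t last_use[TLB_ENTRIES];
    size_t used = 0, misses = 0;

    for (size_t t = 0; t < n; t++) {
        size_t hit = TLB_ENTRIES;   /* sentinel: not found */
        size_t victim = 0;          /* least recently used entry */

        for (size_t e = 0; e < used; e++) {
            if (tag[e] == vpns[t]) {
                hit = e;
            }
            if (last_use[e] < last_use[victim]) {
                victim = e;
            }
        }
        if (hit == TLB_ENTRIES) {
            misses++;
            hit = (used < TLB_ENTRIES) ? used++ : victim;
            tag[hit] = vpns[t];
        }
        last_use[hit] = t;
    }
    return misses;
}
```

Looping repeatedly over five pages (VPNs 1 to 5) misses on every access, because LRU always evicts the page that will be needed next; looping over four pages only misses on the first pass.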



### Who handles the TLB miss?

- Hardware
  - Needs to know the page-table location (PTBR).
  - Hardware-managed TLB, like CISC Intel x86 multi-level page table.
- Operating System
  - Software-managed TLB. RISC systems (MIPS, SPARC...)
  - On a miss, hardware raises an exception  $\rightarrow$  trap handler.
  - Special return-from-trap (same instruction, not next).
  - Avoid chain TLB misses from handler (TLB handler in unmapped physical memory  $\rightarrow$  always hit TLB).
  - Flexibility and Simplicity







- TLB contains virtual to physical translation valid for the current process.
- What happens if a process uses the cached TLB entries from another process?
  - Flush the TLB (set all entries as invalid)  $\rightarrow$  valid bit.
  - Address space identifier  $\rightarrow$  ASID field. (kind of PID).
    - Remember which entries are for each process.
    - Even with ASID, other processes "pollute" the TLB

Context Switches are expensive!







- Typical TLB: 32, 64 or 128 entries fully associative cache (search in parallel in all entries).
- TLB entry:

VPN | PFN | Other bits

- Other bits:
  - valid bit (valid translation or not) ≠ page table valid entry.
  - Protection bits (read/write/execute)
  - Address-space identifier
  - Dirty bit...
- Real MIPS TLB entry (32 bits address space, 4KB pages):





### Paging: Smaller Tables

- Page tables are too big and consume too much memory.
- Why do we want big virtual address spaces?
  - programming is easier
  - applications don't need to worry (as much) about fragmentation







### Simple Solution. Bigger Pages

- A 32-bit address space with 4KB pages and 4-byte PTEs means a 4MB page table.
- A 32-bit address space with 16KB pages (18-bit VPN and 14-bit offset) and 4-byte PTEs means a 1MB page table.
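Both sizes follow from one expression: number of entries (2^(VPN bits)) times the PTE size. A minimal C version, with a made-up function name:

```c
#include <stdint.h>

/* Size in bytes of a linear page table: one PTE per virtual page.
 * Requires offset_bits <= va_bits and va_bits - offset_bits < 64. */
uint64_t linear_page_table_bytes(unsigned va_bits, unsigned offset_bits,
                                 unsigned pte_bytes)
{
    uint64_t entries = UINT64_C(1) << (va_bits - offset_bits);
    return entries * pte_bytes;
}
```

`linear_page_table_bytes(32, 12, 4)` gives 4,194,304 bytes (4MB) and `linear_page_table_bytes(32, 14, 4)` gives 1,048,576 bytes (1MB).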

### Why don't we use bigger pages?

 Bigger Pages lead to more internal fragmentation (waste space within each page).

> Many architectures support multiple page sizes (4KB, 2MB, 1GB)



### Page Table – Wasted Space



Unused regions of the address space are invalid, but their entries are still present in the page table.







# Hybrid: Paging and Segmentation

- Reduce the amount of space allocated for page tables.
  - Wasted non-valid entries.
- One page table per segment  $\rightarrow$  page table of arbitrary size.

Virtual address layout (bits 31 down to 0): `Seg | VPN | Offset`

- `SN = (VirtualAddress & SEG_MASK) >> SN_SHIFT`
- `VPN = (VirtualAddress & VPN_MASK) >> VPN_SHIFT`
- `AddressOfPTE = Base[SN] + (VPN * sizeof(PTE))`



Known Segmentation Problems:

- No benefits on sparse Address Spaces (big segments)
- "External Fragmentation" (page table of variable size)



## Multi-level Page Tables

- Reduce the number of invalid regions in the page table converting linear page table into a tree-like page table (multi-level page table).
- Chop up page table into page-sized units.
- New structure called page directory.
  - Where the page of the page table is.
  - Or an entire page of the page table is invalid.




#### **Multi-level Page Tables**



- Example #7: 16KB address space, 64-byte pages, 4-byte PTE.
  - 14-bit virtual address space.
  - 8-bit VPN + 6-bit offset.
  - Linear page table: 256 entries ($2^8$), 1KB.
  - VPN = bits 13 to 6, split into Page Directory Index (bits 13 to 10) and Page Table Index (bits 9 to 6); Offset = bits 5 to 0.

Address space and linear page table (1KB):

| Virtual page | Contents | Valid | Prot | PFN |
|--------------|----------|-------|------|-----|
| 0            | code     | 1     | rx   | 10  |
| 1            | code     | 1     | rx   | 23  |
| 2            | free     | 0     |      |     |
| 3            | free     | 0     |      |     |
| 4            | heap     | 1     | rw   | 80  |
| 5            | heap     | 1     | rw   | 59  |
| 6            | free     | 0     |      |     |
| 7            | free     | 0     |      |     |
| 8 ... 251    | all free | 0     |      |     |
| 252          | free     | 0     |      |     |
| 253          | free     | 0     |      |     |
| 254          | stack    | 1     | rw   | 55  |
| 255          | stack    | 1     | rw   | 45  |

Page directory (10's of bytes):

| Index    | Valid | PFN |
|----------|-------|-----|
| 0        | 1     | 100 |
| 1 ... 14 | 0     |     |
| 15       | 1     | 101 |

Page of PT at PFN 100 (64 bytes):

| Index    | Valid | Prot | PFN |
|----------|-------|------|-----|
| 0        | 1     | rx   | 10  |
| 1        | 1     | rx   | 23  |
| 2        | 0     |      |     |
| 3        | 0     |      |     |
| 4        | 1     | rw   | 80  |
| 5        | 1     | rw   | 59  |
| 6 ... 15 | 0     |      |     |

Page of PT at PFN 101 (64 bytes):

| Index    | Valid | Prot | PFN |
|----------|-------|------|-----|
| 0 ... 13 | 0     |      |     |
| 14       | 1     | rw   | 55  |
| 15       | 1     | rw   | 45  |

Total ~140 bytes (page directory plus two pages of the page table), instead of 1KB for the linear page table.
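A two-level lookup for this example can be sketched in C as follows. To keep the sketch self-contained, a directory entry holds a pointer to its page of the page table instead of a PFN such as 100 or 101; that substitution, and all the type and function names, are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define ML_OFFSET_BITS 6u     /* 64-byte pages */
#define ML_INDEX_BITS  4u     /* 16 PTEs fit in one 64-byte page */
#define ML_INDEX_MASK  0xFu
#define ML_OFFSET_MASK 0x3Fu

typedef struct { bool valid; uint32_t pfn; } pte;
typedef struct { bool valid; const pte *table; } pde; /* pointer replaces the PFN */

/* Walks the page directory, then one page of the page table.
 * Returns false if either level marks the entry invalid. */
bool two_level_walk(const pde dir[16], uint32_t vaddr, uint32_t *paddr)
{
    uint32_t pd_index = (vaddr >> (ML_OFFSET_BITS + ML_INDEX_BITS)) & ML_INDEX_MASK;
    uint32_t pt_index = (vaddr >> ML_OFFSET_BITS) & ML_INDEX_MASK;

    const pde *d = &dir[pd_index];
    if (!d->valid) {
        return false;            /* whole page of the page table is absent */
    }
    const pte *e = &d->table[pt_index];
    if (!e->valid) {
        return false;
    }
    *paddr = (e->pfn << ML_OFFSET_BITS) | (vaddr & ML_OFFSET_MASK);
    return true;
}
```

With the entries of the example, virtual address 0x3F80 (VPN 254, first byte of the first stack page) selects directory entry 15 and page-table entry 14, which holds PFN 55, so the physical address is 0x0DC0.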







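- Example: 48-bit address space, 16KB pages, 4-byte PTE.
- One page holds 4096 PTEs (12 index bits per level), so the 34-bit VPN needs three levels.
- A TLB miss then costs 4 memory accesses (three page-table levels plus the data), so first look at the TLB!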



## Page Table. Reducing space



- Inverted Page tables (hash-table):
  - No multiple tables (one per process).
  - One single table with an entry for each physical page frame of the system.
    - Each entry has information of the process using the frame, and the virtual page mapped.
  - Expensive linear search  $\rightarrow$  complex search mechanisms.
  - Used in PowerPC (IBM).
- Swapping Page Tables to Disk:
  - Page table resides in kernel-reserved physical memory.
  - Even reducing its size, it could be too big.
  - Some systems place the page table in kernel virtual memory, so it can swap some page tables to disk (i.e. VAX/VMS).
  - More about swapping next...







# 3.7 Memory Virtualization - Beyond Physical Memory





## Mechanisms: Swapping

- We have assumed that every address space fits in physical memory.
  - Should we be aware of the physical memory available when programming?
- Indeed, we wish to support many concurrently-running large address spaces.
  - Not all pages will reside in physical memory.
  - We need a place to stash pages without great demand.
  - Should have more capacity (slower)  $\rightarrow$  usually a hard disk drive.



#### Swap Space


- The OS needs to reserve some space on the disk to allow moving pages → swap space.
  - OS reads from and writes to swap space in page-sized units.
  - OS needs disk address.
  - Swap space size determines the maximum number of memory pages in the system at a given time.

| Physical Memory |
|-----------------|
| Proc0 [VPN0]    |
| Proc1 [VPN2]    |
| Proc0 [VPN2]    |
| Proc2 [VPN0]    |
| Proc1 [VPN0]    |
| Proc0 [VPN1]    |
| Proc2 [VPN3]    |
| Proc4 [VPN0]    |

| Swap Space   |              |              |              |
|--------------|--------------|--------------|--------------|
| Proc1 [VPN1] | Proc3 [VPN3] |              | Proc4 [VPN2] |
| Proc1 [VPN3] | Proc3 [VPN0] |              |              |
| Proc0 [VPN3] |              |              |              |
|              |              | Proc0 [VPN4] | Proc4 [VPN3] |
|              | Proc2 [VPN2] |              |              |
| Proc3 [VPN1] | Proc2 [VPN1] |              |              |
|              | Proc2 [VPN4] | Proc3 [VPN4] |              |
| Proc3 [VPN2] |              | Proc4 [VPN1] | Proc4 [VPN4] |











#### Page Fault


- On a TLB miss, the hardware locates the page table in memory. If the page is not in physical memory (present bit is 0) → the OS is invoked to handle it (page-fault handler).
  - The OS swaps the page into memory from disk and updates the page table (present bit and PFN).
  - The retried access misses in the TLB again, but now hits in the page table and updates the TLB. The final retry hits in the TLB and accesses memory.
- During a page fault, the process is blocked (I/O).
- What if memory is full (no place for the swapped-in page)?
  - One or more pages need to be swapped out to disk.
  - This is known as replacement and requires a page-replacement policy.
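
The retry sequence of a page fault can be sketched as a small C simulation. This is a minimal sketch under stated assumptions, not how real hardware or any particular OS implements it: the names `access_page`, `page_fault_handler`, `tlb_lookup` and `tlb_insert` are hypothetical, the TLB is a tiny round-robin array, and a free frame is assumed to always be available (no replacement).

```c
#include <stdbool.h>

#define NUM_PAGES 8   /* hypothetical: tiny virtual address space */
#define TLB_SIZE  2   /* hypothetical: tiny TLB */

typedef struct { bool present; int pfn; } pte_t;
typedef struct { bool valid; int vpn; int pfn; } tlb_entry_t;

static pte_t page_table[NUM_PAGES];   /* all pages start on disk (present == 0) */
static tlb_entry_t tlb[TLB_SIZE];     /* all entries start invalid */
static int next_free_frame = 0;       /* assumption: memory never fills up */
static int tlb_next = 0;              /* round-robin TLB replacement */

/* Returns the PFN cached in the TLB for vpn, or -1 on a TLB miss. */
static int tlb_lookup(int vpn) {
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return tlb[i].pfn;
    return -1;
}

static void tlb_insert(int vpn, int pfn) {
    tlb[tlb_next] = (tlb_entry_t){ true, vpn, pfn };
    tlb_next = (tlb_next + 1) % TLB_SIZE;
}

/* Stand-in for the OS page-fault handler: "swaps in" the page and updates the PTE. */
static void page_fault_handler(int vpn) {
    page_table[vpn].pfn = next_free_frame++;
    page_table[vpn].present = true;
}

/* Accesses virtual page vpn (0 <= vpn < NUM_PAGES), retrying until the TLB hits.
   Returns the PFN and stores the number of attempts in *tries. */
int access_page(int vpn, int *tries) {
    *tries = 0;
    for (;;) {
        (*tries)++;
        int pfn = tlb_lookup(vpn);
        if (pfn >= 0)
            return pfn;                            /* TLB hit: access memory */
        if (!page_table[vpn].present)
            page_fault_handler(vpn);               /* page fault: OS brings the page in */
        else
            tlb_insert(vpn, page_table[vpn].pfn);  /* TLB miss, page table hit */
    }
}
```

In this sketch the first access to a page that is on disk takes three attempts (page fault, then TLB miss with page table hit, then TLB hit), and a second access to the same page takes one.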





# 3.8 Memory Virtualization-Swapping: Policies



## Which to Replace



- Physical memory is smaller than the combined address spaces of all the processes:
  - Physical memory acts as a cache for the virtual memory pages.
- How long does it take to access a 4-byte int?
  - RAM: tens of ns per int (depending on TLB hit).
  - Disk: tens of ms per int.
- We want to reduce the number of cache misses (a miss means fetching a page from disk).
  - Average memory access time: $AMAT = (P_{HIT} \cdot T_M) + (P_{MISS} \cdot T_D)$
  - The page chosen for eviction affects the hit rate.
- The OS decides which page to evict according to the replacement policy.
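
To get a feeling for how strongly the miss rate dominates the average access time, the AMAT formula can be evaluated directly. The function name `amat` and the timing values used in the note below are illustrative assumptions, not measurements.

```c
/* Average memory access time: P_HIT * T_M + P_MISS * T_D.
   p_hit is the hit probability; t_mem and t_disk must use the same time unit. */
double amat(double p_hit, double t_mem, double t_disk) {
    double p_miss = 1.0 - p_hit;
    return p_hit * t_mem + p_miss * t_disk;
}
```

For instance, assuming a memory access of 100 ns and a disk access of 10 ms (10,000,000 ns), `amat(0.9, 100.0, 1e7)` is about 1,000,090 ns (roughly 1 ms), while `amat(0.999, 100.0, 1e7)` is about 10,100 ns. Even a small miss rate is dominated by the disk term.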





- Optimal Replacement Policy: Replace the page that will be accessed furthest in the future.
- Access pattern: 0,1,2,0,1,3,0,3,1,2,1



- Hit rate: 54.5% (85.7% ignoring compulsory misses)
- WARNING: The future is not generally known...
  - But the optimal policy serves as an ideal comparison point for other policies.
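
The hit rate of the optimal policy can be checked with a short simulation. This is a sketch that assumes a cache of 3 pages (as in the FIFO and LRU tables of this section); `simulate_optimal` and `next_use` are hypothetical helper names.

```c
#include <stddef.h>

#define CACHE_SIZE 3   /* assumption: 3 page frames */

/* Index of the next use of `page` after position `from`, or n if it is never used again. */
static size_t next_use(const int *trace, size_t n, size_t from, int page) {
    for (size_t i = from + 1; i < n; i++)
        if (trace[i] == page)
            return i;
    return n;
}

/* Returns the number of hits of the optimal policy on the given access trace. */
int simulate_optimal(const int *trace, size_t n) {
    int cache[CACHE_SIZE];
    size_t used = 0;
    int hits = 0;

    for (size_t i = 0; i < n; i++) {
        int found = 0;
        for (size_t j = 0; j < used; j++)
            if (cache[j] == trace[i]) { found = 1; break; }
        if (found) { hits++; continue; }

        if (used < CACHE_SIZE) {          /* compulsory miss, free frame available */
            cache[used++] = trace[i];
            continue;
        }
        /* Evict the page whose next access is furthest in the future. */
        size_t victim = 0, furthest = 0;
        for (size_t j = 0; j < used; j++) {
            size_t nu = next_use(trace, n, i, cache[j]);
            if (nu > furthest) { furthest = nu; victim = j; }
        }
        cache[victim] = trace[i];
    }
    return hits;
}
```

Running it on the access pattern 0,1,2,0,1,3,0,3,1,2,1 gives 6 hits out of 11 accesses (54.5%). Note that the simulation needs the whole trace in advance, which is exactly what a real OS does not have.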



## Simple Policy: FIFO

- First-in, first-out replacement. Pages are kept in a queue in arrival order; when a replacement occurs, the page that entered the queue first (the oldest one) is evicted.
- Same access pattern: 0,1,2,0,1,3,0,3,1,2,1

| Access | Hit/Miss? | Evict | Cache State |
|--------|-----------|-------|-------------|
| 0      | Miss      | -     | 0           |
| 1      | Miss      | -     | 0,1         |
| 2      | Miss      | -     | 0,1,2       |
| 0      | Hit       | -     | 0,1,2       |
| 1      | Hit       | -     | 0,1,2       |
| 3      | Miss      | 0     | 1,2,3       |
| 0      | Miss      | 1     | 2,3,0       |
| 3      | Hit       | -     | 2,3,0       |
| 1      | Miss      | 2     | 3,0,1       |
| 2      | Miss      | 3     | 0,1,2       |
| 1      | Hit       | -     | 0,1,2       |

- Hit rate: 36.4% (57.1% ignoring compulsory misses)
- FIFO cannot determine the relevance of pages: a heavily used page is evicted just because it was loaded first.
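
The FIFO table above can be reproduced with a few lines of C. This is a minimal sketch assuming a cache of 3 pages; `simulate_fifo` is a hypothetical name, and the queue is kept as a circular buffer.

```c
#include <stddef.h>

#define CACHE_SIZE 3   /* 3 page frames, as in the table */

/* Returns the number of hits of FIFO replacement on the given access trace. */
int simulate_fifo(const int *trace, size_t n) {
    int cache[CACHE_SIZE];
    size_t used = 0;
    size_t oldest = 0;   /* slot holding the page that was loaded first */
    int hits = 0;

    for (size_t i = 0; i < n; i++) {
        int found = 0;
        for (size_t j = 0; j < used; j++)
            if (cache[j] == trace[i]) { found = 1; break; }
        if (found) { hits++; continue; }   /* a hit does not change the queue order */

        if (used < CACHE_SIZE) {           /* free frame available */
            cache[used++] = trace[i];
            continue;
        }
        cache[oldest] = trace[i];          /* evict the first-in page */
        oldest = (oldest + 1) % CACHE_SIZE;
    }
    return hits;
}
```

On the pattern 0,1,2,0,1,3,0,3,1,2,1 it returns 4 hits out of 11 accesses (36.4%), matching the table.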



### Using History: LRU



- Least Recently Used (LRU): if a page has been accessed in the near past, it is likely to be accessed again in the near future, so the page whose last access is the oldest is evicted.
- Same access pattern: 0,1,2,0,1,3,0,3,1,2,1

| Access | Hit/Miss? | Evict | Cache State |
|--------|-----------|-------|-------------|
| 0      | Miss      | -     | 0           |
| 1      | Miss      | -     | 0,1         |
| 2      | Miss      | -     | 0,1,2       |
| 0      | Hit       | -     | 1,2,0       |
| 1      | Hit       | -     | 2,0,1       |
| 3      | Miss      | 2     | 0,1,3       |
| 0      | Hit       | -     | 1,3,0       |
| 3      | Hit       | -     | 1,0,3       |
| 1      | Hit       | -     | 0,3,1       |
| 2      | Miss      | 0     | 3,1,2       |
| 1      | Hit       | -     | 3,2,1       |

- Hit rate: 54.5% (85.7% ignoring compulsory misses).
- Same as Optimal! (Only in this example.)
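
The LRU table can be reproduced in the same way. This is a sketch assuming a cache of 3 pages, kept ordered from least to most recently used like the "Cache State" column; `simulate_lru` is a hypothetical name.

```c
#include <stddef.h>

#define CACHE_SIZE 3   /* 3 page frames, as in the table */

/* Returns the number of hits of LRU replacement on the given access trace. */
int simulate_lru(const int *trace, size_t n) {
    int cache[CACHE_SIZE];   /* cache[0] is the LRU page, cache[used-1] the MRU page */
    size_t used = 0;
    int hits = 0;

    for (size_t i = 0; i < n; i++) {
        size_t pos = used;
        for (size_t j = 0; j < used; j++)
            if (cache[j] == trace[i]) { pos = j; break; }

        if (pos < used) {
            hits++;                 /* hit: the page moves to the MRU end */
        } else if (used < CACHE_SIZE) {
            used++;                 /* miss with a free frame: append at the MRU end */
        } else {
            pos = 0;                /* miss with a full cache: evict the LRU page */
        }
        /* Close the gap at pos and place the accessed page at the MRU end. */
        for (size_t j = pos; j + 1 < used; j++)
            cache[j] = cache[j + 1];
        cache[used - 1] = trace[i];
    }
    return hits;
}
```

On the pattern 0,1,2,0,1,3,0,3,1,2,1 it returns 6 hits out of 11 accesses (54.5%), the same as the optimal policy for this particular trace.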



## Using History: LRU

- Use history to guess the future. This family of policies is based on the principle of locality.
  - Usually programs access certain code and data frequently (loops).
  - Temporal locality: pages accessed in the near past are likely to be accessed in the near future.
  - Spatial locality: if a page P is accessed, pages around it (P-1, P+1) are likely to be accessed (data arrays).
- Main members of the history-based algorithms:
  - Least-Recently-Used (LRU): based on recency, replaces the least-recently-used page.
  - Least-Frequently-Used (LFU): based on access frequency, replaces the least-frequently-used page.
- The opposites of these algorithms exist:
  - Most-Recently-Used (MRU).
  - Most-Frequently-Used (MFU).
  - In most cases (not all), programs exhibit locality and these opposite algorithms do not perform well.





#### Policy Behavior - Workloads












## LRU Implementation



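- A perfect implementation must record a timestamp on every memory reference and store it in the PTE (too expensive).
- We need an approximation. Hardware support: a use bit (or reference bit).
  - Whenever a page is referenced  $\rightarrow$  ref bit set to 1.
- Counter implementation:
  - Keep a counter in each PTE.
  - At regular intervals, for each page, do:
    - if ref bit == 1, increase the counter.
    - if ref bit == 0, zero the counter.
    - regardless, set ref bit = 0.
- Clock Algorithm:
  - Pages are arranged in a circular list, with a clock hand pointing at one of them.
  - On eviction, the hand advances: a page with ref bit == 1 gets its bit cleared and is skipped; the first page found with ref bit == 0 is evicted.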







Clock algorithm walkthrough (figure sequence not reproduced):

- Eviction! → evict **page 2** because it has not been recently used.
- Page 0 is accessed...
- New eviction → evict **page 1** because it has not been recently used.
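
The clock algorithm can be sketched in C as follows. This is a minimal sketch, not the implementation of any particular OS: the names `frame_t`, `clock_state` and `clock_pick_victim` are hypothetical, and the number of frames is an arbitrary assumption.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_FRAMES 4   /* hypothetical number of physical frames */

typedef struct {
    int  page;   /* page currently held by the frame */
    bool ref;    /* reference (use) bit, set by hardware on each access */
} frame_t;

typedef struct {
    frame_t frames[NUM_FRAMES];
    size_t  hand;   /* position of the clock hand */
} clock_state;

/* Advances the hand until it finds a frame whose reference bit is 0.
   Frames with the bit set get it cleared (a "second chance") and are skipped.
   Returns the index of the frame to evict. */
size_t clock_pick_victim(clock_state *c) {
    for (;;) {
        size_t idx = c->hand;
        c->hand = (c->hand + 1) % NUM_FRAMES;
        if (!c->frames[idx].ref)
            return idx;
        c->frames[idx].ref = false;
    }
}
```

Assuming four frames holding pages 0 to 3, the hand at frame 0 and reference bits 1, 1, 0, 1, the first call picks the frame of page 2. If page 0 is then referenced again, the next call picks the frame of page 1, which is consistent with the eviction sequence in the slide captions (the initial bit values are an assumption). The loop always terminates, because after one full sweep every reference bit has been cleared.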



#### **Other Factors**


- Dirty Pages: Do we have to write to disk on eviction? (Assume page is both in RAM and Disk)
  - Not if the page is clean  $\rightarrow$  "free" eviction.
  - Track with a dirty bit (page has been modified).
  - The dirty bit can also be used by the page-replacement algorithm (e.g. prefer evicting clean pages).
- Prefetching: Instead of bringing pages "on demand", the OS guesses which page is about to be used.
  - Only when there is a reasonable chance of success (e.g. spatial locality).
  - A prefetch implies an eviction.



### When to Replace



- We can assume the OS waits until memory is full to replace, but there are many reasons not to do that.
- The OS keeps a small portion of the memory free proactively.
  - High watermark (HW) and Low watermark (LW).
  - When there are fewer than LW pages available, a background thread evicts pages until HW pages are available again.
  - This background thread is sometimes called swap daemon or page daemon.
- This way, replacement (swapping) does not slow down most of the page-faults.
- Writing to the swap partition can be done in clusters (groups) of pages at once (more efficient).
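
The watermark rule can be written as a one-line decision. This is only a sketch of the rule stated above; `pages_to_evict` is a hypothetical helper, and it assumes `low_wm <= high_wm`.

```c
#include <stddef.h>

/* Number of pages the background thread should evict, given the current
   number of free pages and the low/high watermarks (assumes low_wm <= high_wm). */
size_t pages_to_evict(size_t free_pages, size_t low_wm, size_t high_wm) {
    if (free_pages >= low_wm)
        return 0;                     /* enough free memory: the daemon stays asleep */
    return high_wm - free_pages;      /* evict until HW pages are free again */
}
```

With LW = 16 and HW = 64, for example, 10 free pages trigger the eviction of 54 pages, whereas 16 or more free pages trigger nothing. Evicting in a batch like this is what allows the writes to swap to be clustered.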





## Thrashing

- If processes do not have "enough" pages, the page-fault rate will be high. This leads to:
  - low CPU utilization.
  - operating system thinks it needs to increase the degree of multiprogramming.
  - another process is added to the system (less memory available per process).
  - more page-faults.
- Thrashing = processes are busy swapping pages in and out instead of doing useful work.
- Solutions:
  - admission control: run a reduced set of processes (less work is better than no work).
  - buy more memory...
  - Linux out-of-memory killer! → A daemon that chooses a memory-intensive process and kills it.





# 3.9 Memory Virtualization -Memory API





#### Stack:

Implicitly allocated/deallocated by the compiler → Automatic.
In C:

```
void func()
{
    int x; //declares an integer on the stack
    ...
}
```

- Compiler makes sure to make space on the stack when you call into func().
- When you return from the function, the compiler deallocates the memory.
- Information does not live beyond the call invocation.





#### Heap:

- Long-lived dynamic memory.
- Allocation and deallocation explicitly handled by the programmer (WARNING: bugs!)

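```
#include <stdlib.h>

void func()
{
    int *x = (int *) malloc(sizeof(int)); //pointer on the stack, int on the heap
    ...
    free(x); //explicitly release the heap memory
}
```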

- Stack allocation of a pointer, heap allocation at malloc().
- Heap memory deallocation at free().
  - free() takes no size argument: the size of each allocation must be tracked by the memory-allocation library.

```
#include <stdlib.h>
void free(void* ptr);
```



## The malloc() Call



The malloc() call allocates space on the heap: as many bytes as its single parameter indicates.

#include <stdlib.h>
void \*malloc(size\_t size);

- The programmer should not type the number of bytes directly, but compute it from the type and the number of elements to be allocated.

int \*x = malloc(10\*sizeof(int)); //allocate an array of 10 int

- The returned pointer is of type void \*; the programmer decides what to do with it (usually casting it to the correct pointer type).

int \*x = (int \*) malloc(10\*sizeof(int)); //allocate an array of 10 int
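
A slightly fuller usage sketch follows. The helper name `make_int_array` is hypothetical; the point it adds is that malloc() returns NULL when the allocation fails, so the result should be checked before use.

```c
#include <stdlib.h>

/* Allocates an array of n ints on the heap and sets every element to `value`.
   Returns NULL if the allocation fails. The caller must free() the result. */
int *make_int_array(size_t n, int value) {
    int *x = malloc(n * sizeof(int));   /* size computed from type and element count */
    if (x == NULL)
        return NULL;                    /* malloc() failed: nothing was allocated */
    for (size_t i = 0; i < n; i++)
        x[i] = value;                   /* malloc() does not initialize the memory */
    return x;
}
```

A caller would write something like `int *a = make_int_array(10, 0);`, test `a != NULL`, use the array, and finally call `free(a)`.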



#### Common Errors



- Forgetting to allocate memory.
- Not allocating enough memory.
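
Both errors typically show up when copying strings. The sketch below is an assumed illustration (the original slide code is not available here): `copy_string` is a hypothetical helper showing the correct version, with the two mistakes described in comments.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated copy of src, or NULL if the allocation fails.
   The caller must free() the result. */
char *copy_string(const char *src) {
    /* Forgetting to allocate: declaring `char *dst;` and calling
       strcpy(dst, src) directly would write through an uninitialized pointer. */

    /* Not allocating enough: malloc(strlen(src)) would leave no room for the
       terminating '\0', so strcpy() would write one byte past the buffer. */
    char *dst = malloc(strlen(src) + 1);
    if (dst == NULL)
        return NULL;
    strcpy(dst, src);
    return dst;
}
```

The "not enough memory" variant is especially dangerous because it often appears to work: the one-byte overflow may go unnoticed until it corrupts neighbouring data.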





### Common Errors



- Forgetting to initialize allocated memory
- Forgetting to free memory
  - Memory leak  $\rightarrow$  huge problem in long runs.
  - OS will take care of leaked memory when the process ends.
- Freeing memory incorrectly
  - Before you are done with it.
  - Freeing repeatedly (double free).
  - Calling free() with the wrong pointer.



#### Memory API



- malloc() and free() are library calls from the C library. The malloc library manages space within the virtual address space and, when needed, calls the corresponding OS system calls:
  - brk: sets the end of the heap (the program break) to a given address.
  - sbrk: moves the end of the heap by a given increment.
  - Depending on the direction of the movement, the heap grows (to serve malloc()) or shrinks (after free()).
- There are other memory management functions:
  - mmap can create an anonymous memory region within the program, associated with swap space rather than with a file (it can be treated like heap).
  - calloc allocates memory and fills it with zeroes.
  - realloc resizes a region; if needed, it allocates a new region and copies the old contents to it.
    - used to increase the size of an already allocated region.
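
As an illustration of realloc, the sketch below grows an array of ints. `grow_int_array` is a hypothetical helper; it assumes `new_n > old_n`. The temporary pointer matters: if realloc() fails it returns NULL and leaves the original region untouched, so overwriting the only pointer to it would leak the memory.

```c
#include <stdlib.h>

/* Grows *arr from old_n to new_n ints and zeroes the new elements.
   Returns 0 on success, or -1 on failure (in which case *arr is still valid). */
int grow_int_array(int **arr, size_t old_n, size_t new_n) {
    int *tmp = realloc(*arr, new_n * sizeof(int));
    if (tmp == NULL)
        return -1;                      /* original region is left untouched */
    for (size_t i = old_n; i < new_n; i++)
        tmp[i] = 0;                     /* realloc() does not initialize the new part */
    *arr = tmp;                         /* the region may have moved */
    return 0;
}
```

The old contents are preserved up to `old_n` elements, whether the region was extended in place or copied to a new location.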





# 3. Memory Virtualization Exercises





#### Exercise #1

There is a system that uses base and bounds as its memory virtualization mechanism. During the execution of a process, we observe the following address translations:

| Virtual Address       | Translation: Physical Address  |
|-----------------------|--------------------------------|
| 0x0308 (decimal: 776) | Valid: 0x3913 (decimal: 14611) |
| 0x0255 (decimal: 597) | Valid: 0x3860 (decimal: 14432) |
| 0x03A1 (decimal: 929) | Error: Segmentation Fault      |

What can we say about the value of the base register? How about the bounds register?





#### Exercise #2

There is a system that uses segmentation as its memory virtualization mechanism. Addresses are 32 bits wide, and the two most significant bits select the segment. The following table shows the base, bounds and protection values of each segment:

| Segment | Base   | Bounds | Protection |
|---------|--------|--------|------------|
| 0       | 0x1000 | 0x100  | Read       |
| 1       | 0x2000 | 0x200  | Read/Write |
| 2       | 0x5000 | 0x500  | Read/Write |

- Obtain the translation of the following memory accesses:
  - Load 0x00000010
  - Load 0x40000300
  - Load 0x80000300
  - Store 0x00000050
  - Store 0xC0000010



#### Exercise #3



| Virtual Page<br>Number | Valid | Reference | Dirty | Page Frame<br>Number |
|------------------------|-------|-----------|-------|----------------------|
| 0                      | 1     | 0         | 0     | 3                    |
| 1                      | 1     | 1         | 1     | 7                    |
| 2                      | 1     | 0         | 0     | 4                    |
| 3                      | 0     | 1         | 1     | 0                    |
| 4                      | 0     | 0         | 1     | 4                    |
| 5                      | 1     | 0         | 1     | 6                    |
| 6                      | 0     | 0         | 0     | 5                    |
| 7                      | 1     | 1         | 0     | 0                    |

Given the previous page table, and considering a page size of 1024 bytes, obtain the physical address (if possible) of each of the following virtual addresses. (Note: there is no need to handle page faults, if any.)

 a) 0x0356, b) 0x0DA8, c) 0x08F3, d) 0x14C3, e) 0x1F01





#### Exercise #4

There is a system that uses paging as its mechanism for memory virtualization. Specifically, it uses linear page tables (one single level). The size of the address space of each process is 4GB (32 bits) and the page size is 1KB. Each Page Table Entry (PTE) has: the page frame number of the translation (PFN), a valid bit V, a reference bit R and a dirty bit D. This system allows 2GB of physical memory at most.

PTE format: PFN | V | R | D

- a) How many entries does a page table have in this system? Is this always the case?
- b) How many pages does a page table occupy in memory?
- c) How much memory do the page tables occupy if there are 100 processes running in the system?



#### Exercise #5

| Virtual Page Number | Present bit | Reference bit | Dirty bit | Page Frame Number |
|---------------------|-------------|---------------|-----------|-------------------|
| 0                   | 1           | 0             | 0         | 3                 |
| 1                   | 1           | 1             | 1         | 2                 |
| 2                   | 1           | 0             | 0         | 4                 |
| 3                   | 1           | 0             | 1         | 7                 |
| 4                   | 0           | 0             | 1         | -                 |
| 5                   | 1           | 0             | 1         | 6                 |
| 6                   | 0           | 0             | 0         | -                 |
| 7                   | 1           | 1             | 0         | 0                 |

The address space is 16 bits wide and the page size is 1KB. There are 4 entries in the TLB with the following contents:

|     | VPN | PFN | LRU |
|-----|-----|-----|-----|
| TLB | 0   | 3   | 1   |
|     | 2   | 4   | 0   |
|     | 5   | 6   | 3   |
|     | 7   | 0   | 2   |

- The following addresses are accessed sequentially: 0x0356, 0x08F3, 0x14C3, 0x0DA8, 0x0180, 0x0E83
- The TLB has an LRU replacement policy. TLB hit time is 2ns, memory access time is 200ns and swapping time is 3000ns. Obtain the Effective Access Time (EAT) for this batch of requests and the hit-rate of the TLB.







#### Exercise #6

#### PAGE TABLE P1

| Virtual page no. | Present bit | Reference<br>bit | Dirty bit | Page frame<br>number |
|------------------|-------------|------------------|-----------|----------------------|
| 0                | 1           | 0                | 0         | 6                    |
| 1                | 1           | 1                | 0         | 7                    |
| ...              | ...         | ...              | ...       | ...                  |
| 5                | 1           | 1                | 1         | 0                    |
| 6                | 0           | 0                | 0         |                      |
| 7                | 1           | 0                | 0         | 3                    |
| ...              | ...         | ...              | ...       | ...                  |
| 30               | 1           | 1                | 1         | 5                    |
| 31               | 1           | 1                | 1         | 2                    |

#### PHYSICAL MEMORY

| Page<br>frame no. | Virtual<br>page | Process | LRU |
|-------------------|-----------------|---------|-----|
| 0                 | 5               | P1      | 4   |
| 1                 | 0               | P2      | 0   |
| 2                 | 31              | P1      | 1   |
| 3                 | 7               | P1      | 5   |
| 4                 | 31              | P2      | 3   |
| 5                 | 30              | P1      | 7   |
| 6                 | 0               | P1      | 2   |
| 7                 | 1               | P1      | 6   |

Two processes run simultaneously in a system with 16KB of physical memory and 2KB page size.

| TLB P1 | VPN | PFN | LRU | V |
|--------|-----|-----|-----|---|
|        | 0   | 6   | 1   | 1 |
|        | 5   | 0   | 2   | 1 |
|        | 7   | 3   | 3   | 1 |
|        | 31  | 2   | 0   | 1 |

| TLB P2 | VPN | PFN | LRU | V |
|--------|-----|-----|-----|---|
|        | 0   | 1   | 0   | 1 |
|        | 31  | 4   | 1   | 1 |
|        | 1   | 3   | 2   | 0 |
|        | 3   | 7   | 3   | 0 |

- Obtain the contents of both TLBs if the following addresses are accessed sequentially: P1-0x0356, P1-0x08F3, P2-0xFDA8, P1-0x346C
- Each TLB uses LRU as its replacement policy, the same as the physical memory.





#### Exercise #7

- There is a system with multi-level page tables (two levels), 15-bit virtual addresses and a 32-byte page size. Each page directory entry (PDE) and each page table entry (PTE) have the same structure: one valid bit followed by 7 bits that store the page frame number.
- The register that holds the address of the page directory has the decimal value 73.
- Below is a physical memory dump of three specific page frames:

- Translate the following addresses and, if the translation is valid, obtain the value returned by the load request (1 byte):
  - A) load 0x1787
  - B) load 0x2665