
# Introduction to Operating Systems

## Virtualizing the CPU

## Virtualizing Memory

## Concurrency

• one to load the value of the counter from memory into a register;
• one to increment it;
• one to store it back into memory (see the sketch below).
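
A minimal sketch of why this matters; the x86 mnemonics in the comment are illustrative:

```c
/* counter++ is not atomic: it becomes load / increment / store,
   so two threads can interleave between the steps and lose an update. */
volatile int counter = 0;

void increment(void) {
    counter++;
    /* roughly, on x86:
         mov  counter(%rip), %eax   ; load counter from memory
         add  $1, %eax              ; increment it in the register
         mov  %eax, counter(%rip)   ; store it back to memory */
}
```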

## Persistence

• First, call open() to open the file;
• Second, call write() to write data to the file;
• Third, call close() to close the file (a sketch follows).
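
A minimal sketch of these three steps (error handling mostly omitted; the file name is arbitrary):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    /* step 1: open (creating/truncating) the file */
    int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_TRUNC, S_IRWXU);
    if (fd < 0)
        return 1;                 /* open() can fail, e.g. on permissions */
    const char *msg = "hello world\n";
    write(fd, msg, strlen(msg));  /* step 2: write the data */
    close(fd);                    /* step 3: close the descriptor */
    return 0;
}
```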

## Design Goals

• Energy efficiency: increasingly important in our green world;
• Security: protection against malicious applications is critical;
• Portability: as operating systems run on smaller and smaller devices, portability matters as well.

## Some History

### The Modern Era

The Unix environment was friendly to programmers and developers alike, and it also provided a compiler for the C programming language. Programmers could easily write their own programs and share them, which made Unix very popular. It also helped that it was free.

Summary

# Virtualization

CPU virtualization:

• A Dialogue on Virtualization
• The Abstraction: The Process
• Interlude: Process API
• Mechanism: Limited Direct Execution
• CPU Scheduling
• Scheduling: The Multi-Level Feedback Queue
• Scheduling: Proportional Share
• Multi-CPU Scheduling
• Summary Dialogue on CPU Virtualization

Memory virtualization:

• A Dialogue on Memory Virtualization
• Interlude: Memory API
• Segmentation
• Free-Space Management
• Paging
• Paging: Faster Translations
• Paging: Smaller Tables
• Beyond Physical Memory: Swapping Mechanisms
• Beyond Physical Memory: Swapping Policies
• Complete Virtual Memory Systems
• Summary Dialogue on Memory Virtualization

## The Abstraction: The Process

TIP: USE TIME SHARING (AND SPACE SHARING)

• Low-level mechanisms (machinery): a mechanism is a low-level method or protocol that implements a needed piece of functionality. For example, we will later learn how to implement a context switch, which gives the OS the ability to stop running one program and start running another on a given CPU; this time-sharing mechanism is employed by all modern operating systems.
• The OS also contains some intelligent policies: a policy is an algorithm for making some kind of decision within the OS. For example, given a number of programs that could run on a CPU, which one should the OS run? A scheduling policy makes this decision, likely using historical information, workload knowledge, performance metrics, and so on.

### The Abstraction: A Process

TIP: SEPARATE POLICY AND MECHANISM

### Process API

• Create: an OS must include some method to create new processes;
• Destroy: as there is an interface for creation, systems also provide an interface to destroy processes forcefully. Of course, many processes will run and simply exit by themselves when complete; when they don't, however, the user may wish to kill them;
• Wait: sometimes it is useful to wait for a process to stop running;
• Miscellaneous Control: other than killing or waiting for a process, other controls are sometimes possible, e.g., suspending a process and then resuming it;
• Status: there are usually interfaces to get some status information about a process as well, such as how long it has run.

### Process Creation: A Little More Detail

### Process States

• Running: the process is running on a processor, i.e., it is executing instructions;
• Ready: the process is ready to run, but for some reason the OS has chosen not to run it at this given moment (this state appears in the traces below);
• Blocked: the process has performed some kind of operation that makes it not ready to run until some other event takes place. For example, when a process initiates an I/O request to a disk, it becomes blocked, and thus some other process can use the processor.

| Time | Process0 | Process1 | Notes |
| --- | --- | --- | --- |
| 1 | Running | Ready | |
| 2 | Running | Ready | |
| 3 | Running | Ready | |
| 4 | Running | Ready | Process0 now done |
| 5 | - | Running | |
| 6 | - | Running | |
| 7 | - | Running | |
| 8 | - | Running | Process1 now done |

| Time | Process0 | Process1 | Notes |
| --- | --- | --- | --- |
| 1 | Running | Ready | |
| 2 | Running | Ready | |
| 3 | Running | Ready | Process0 initiates I/O |
| 4 | Blocked | Running | Process0 is blocked, |
| 5 | Blocked | Running | so Process1 runs |
| 6 | Blocked | Running | |
| 7 | Ready | Running | I/O done |
| 8 | Ready | Running | Process1 now done |
| 9 | Running | - | |
| 10 | Running | - | Process0 now done |

### Data Structures

ASIDE: DATA STRUCTURE — THE PROCESS LIST

ASIDE: KEY PROCESS TERMS

## Process API

ASIDE: INTERLUDES

CRUX: HOW TO CREATE AND CONTROL PROCESSES

• fork()
• exec()

• wait()

### The fork() System Call

The fork() system call is used to create a new process. But be forewarned: it is certainly the strangest routine you will ever call. More specifically, you have a running program whose code looks like the sketch below; type it in and run it.
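
A sketch of such a program, modeled on the chapter's p1.c:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    printf("hello world (pid:%d)\n", (int) getpid());
    int rc = fork();
    if (rc < 0) {              /* fork failed */
        fprintf(stderr, "fork failed\n");
        exit(1);
    } else if (rc == 0) {      /* child: fork() returned 0 */
        printf("hello, I am child (pid:%d)\n", (int) getpid());
    } else {                   /* parent: fork() returned the child's PID */
        printf("hello, I am parent of %d (pid:%d)\n", rc, (int) getpid());
    }
    return 0;
}
```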

The CPU scheduler determines which process runs at a given moment in time. Because the scheduler is complex, we usually cannot make strong assumptions about what it will choose to do, e.g., which process it runs first. This nondeterminism leads to some interesting problems, particularly in multi-threaded programs, which we will discuss when we get to concurrency.

### The wait() System Call

### The exec() System Call

The fork() system call is strange; its companion, exec(), is not so normal either. What it does: given the name of an executable and some arguments, it loads the code (and static data) from that executable and overwrites the current code segment with it; the heap, the stack, and other parts of the program's memory space are re-initialized. The OS then runs the program, passing in any arguments as the argv of the process. Thus, it does not create a new process; rather, it transforms the currently running program (p3) into a different running program (wc). After the exec() in the child, it is almost as if p3 never ran; a successful call to exec() never returns.
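
A sketch of the p3-to-wc transformation described above, modeled on the chapter's p3.c:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int rc = fork();
    if (rc == 0) {                      /* child */
        char *myargs[3];
        myargs[0] = strdup("wc");       /* program to run: word count */
        myargs[1] = strdup("p3.c");     /* file to count */
        myargs[2] = NULL;               /* marks end of array */
        execvp(myargs[0], myargs);      /* on success, never returns */
        printf("this shouldn't print out\n");
    } else if (rc > 0) {
        wait(NULL);                     /* parent waits for the child */
    }
    return 0;
}
```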

### Why? Motivating The API

TIP: GETTING IT RIGHT

The shell is just a user program. It shows you a prompt and then waits for you to type something into it. You type a command; in most cases, the shell then figures out where in the file system the executable resides, calls fork() to create a new child process to run the command, calls some variant of exec() to run it, and then waits for the command to complete by calling wait(). When the child completes, the shell returns from wait() and prints out a prompt again, ready for your next command.

The separation of fork() and exec() lets the shell do a whole bunch of useful things easily. For example: wc p3.c > 1.txt
The way the shell accomplishes this is quite simple: when the child is created, before calling exec(), the shell closes stdout and opens the file 1.txt. By doing so, any output from the soon-to-be-running program is sent to the file instead of the screen; a sketch follows.
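
A sketch of this redirection trick, modeled on the chapter's p4.c; because the lowest free descriptor is handed out first, the open() below reuses just-closed descriptor 1 (stdout):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int rc = fork();
    if (rc == 0) {                 /* child: redirect stdout to 1.txt */
        close(STDOUT_FILENO);
        open("./1.txt", O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
        char *myargs[3] = { strdup("wc"), strdup("p3.c"), NULL };
        execvp(myargs[0], myargs); /* wc's output now lands in 1.txt */
    } else if (rc > 0) {
        wait(NULL);
    }
    return 0;
}
```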

Unix pipes are implemented in a similar way, but with the pipe() system call. In this case, the output of one process is connected to an in-kernel pipe (i.e., a queue), and the input of another process is connected to that same pipe; thus, the output of one process is seamlessly used as input to the next, and long and useful chains of commands can be strung together. For example: grep -o foo file | wc -l
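
A toy sketch of the pipe() call itself (not the shell's actual plumbing code): the parent writes into the kernel queue and the child reads from it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    pipe(fds);                     /* fds[0]: read end, fds[1]: write end */
    int rc = fork();
    if (rc == 0) {                 /* child: read from the pipe */
        close(fds[1]);
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
        buf[n > 0 ? n : 0] = '\0';
        printf("child got: %s", buf);
        exit(0);
    }
    close(fds[0]);                 /* parent: write into the pipe */
    const char *msg = "hello through the pipe\n";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);
    wait(NULL);
    return 0;
}
```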

### Process Control And Users

Ctrl+C sends SIGINT (2), interrupting (and by default terminating) the process.
Ctrl+Z sends SIGTSTP (20), stopping (pausing) the process; it can be resumed later.

### Useful Tools

• ps
• top
• sar
• kill
• killall

### Summary

ASIDE: KEY PROCESS API TERMS

• Each process has a name; in most systems, that name is a number known as the process ID (PID);
• The fork() system call is used in Unix systems to create a new process. The creator is called the parent; the newly created process is called the child;
• The wait() system call allows a parent to wait for its child to complete execution;
• The exec() system call allows a child to break free from its similarity to its parent and execute an entirely new program;
• A Unix shell commonly uses fork(), exec(), and wait() to launch user commands; the separation of fork() and exec() enables features such as I/O redirection and pipes;
• Process control is available in the form of signals, which can cause jobs to stop, continue, or even terminate;
• Which processes can be controlled by a particular person is encapsulated in the notion of a user; the OS allows multiple users onto the system and ensures that each user can only control their own processes;
• A superuser can control all processes. For safety's sake, do not operate directly as this user.

## Mechanism: Limited Direct Execution

• The first is performance: how can we implement virtualization without adding excessive overhead to the system?
• The second is control: how can we run processes efficiently while retaining control over the CPU? Control is particularly important to the OS, as it is in charge of resources; without control, a process could run forever and take over the machine, or access information it should not be allowed to access.

### Basic Technique: Limited Direct Execution

Direct Execution Protocol (Without Limits)

| OS | Program |
| --- | --- |
| Create entry for process list | |
| Allocate memory for program | |
| Load program into memory | |
| Set up stack with argc/argv | |
| Clear registers | |
| Execute call main() | |
| | Run main() |
| | Execute return from main |
| Free memory of process | |
| Remove from process list | |

• If we just run a program, how can the OS make sure the program doesn't do anything we don't want it to do, while still running it efficiently?
• When we are running a process, how does the OS stop it from running and switch to another process, thus implementing the time sharing we require to virtualize the CPU?

### Problem 1: Restricted Operations

Limited Direct Execution Protocol

| OS @ boot (kernel mode) | Hardware |
| --- | --- |
| initialize trap table | |
| | remember address of syscall handler |

| OS @ run (kernel mode) | Hardware | Program (user mode) |
| --- | --- | --- |
| Create entry for process list | | |
| Allocate memory for program | | |
| Setup user stack with argv | | |
| Fill kernel stack with reg/PC | | |
| return-from-trap | | |
| | restore regs (from kernel stack) | |
| | move to user mode | |
| | | Run main() |
| | | Call system call |
| | | trap into OS |
| | save regs (to kernel stack) | |
| | move to kernel mode | |
| Handle trap | | |
| Do work of syscall | | |
| return-from-trap | | |
| | restore regs (from kernel stack) | |
| | move to user mode | |
| | | return from main |
| | | trap (via exit()) |
| Free memory of process | | |
| Remove from process list | | |

• At boot time, the kernel initializes the trap table, and the CPU remembers its location for subsequent use;
• the kernel does so via a privileged instruction.

The kernel sets a few things up (allocating a process-list entry, allocating memory, ...) before starting the process with the return-from-trap instruction, which switches the CPU to user mode and begins running the process. When the process wishes to issue a system call, it traps back into the OS, which handles it and once again returns control to the process via return-from-trap. The process then completes its work and returns from main(); this usually returns into some stub code, which exits the program properly. At that point the OS cleans up, and we are done.

### Problem 2: Switching Between Processes

#### A Cooperative Approach: Wait For System Calls

#### A Non-Cooperative Approach: The OS Takes Control

#### Saving and Restoring Context

Limited Direct Execution Protocol (Timer Interrupt)

| OS @ boot (kernel mode) | Hardware |
| --- | --- |
| initialize trap table | |
| | remember addresses of syscall handler and timer handler |
| start interrupt timer | |
| | start timer; interrupt CPU in X ms |

| OS @ run (kernel mode) | Hardware | Program (user mode) |
| --- | --- | --- |
| | | Process A |
| | timer interrupt | |
| | save regs(A) → k-stack(A) | |
| | move to kernel mode | |
| Handle the trap | | |
| Call switch() routine | | |
| save regs(A) → proc_t(A) | | |
| restore regs(B) ← proc_t(B) | | |
| switch to k-stack(B) | | |
| return-from-trap (into B) | | |
| | restore regs(B) ← k-stack(B) | |
| | move to user mode | |
| | | Process B |

### Summary: CPU Virtualization Terms

• The CPU should support at least two modes of execution: a restricted user mode and a privileged (non-restricted) kernel mode;
• Typical user applications run in user mode, and use a system call to trap into the kernel to request operating system services;
• The trap instruction saves register state carefully, changes the hardware status to kernel mode, and jumps into the OS to a pre-specified destination: the trap table;
• When the OS finishes servicing a system call, it returns to the user program via another special return-from-trap instruction, which reduces privilege and returns control to the instruction after the trap that jumped into the OS;
• The trap tables must be set up by the OS at boot time, and it must be ensured that they cannot readily be modified by user programs. All of this is part of the limited direct execution protocol, which is designed to run programs efficiently without losing OS control;
• Once a program is running, the OS must use hardware mechanisms (the timer interrupt) to ensure a user program does not run forever; this is the non-cooperative approach to CPU scheduling;
• Sometimes the OS, during a timer interrupt or system call, might wish to switch from running the current process to a different one, a low-level technique known as a context switch.

## CPU Scheduling

### Workload Assumptions

• Each job runs for the same amount of time;
• All jobs arrive at the same time;
• Once started, each job runs to completion;
• All jobs only use the CPU (they perform no I/O);
• The run-time of each job is known.

### Scheduling Metrics

### First In, First Out (FIFO)

### Shortest Job First (SJF)

### Shortest Time-to-Completion First (STCF)

### A New Metric: Response Time

### Round Robin

For three jobs that arrive at the same time and each need 5 time units (RR time slice = 1), the average response time is:

• Round Robin (RR): $\frac{0+1+2}{3}=1$
• Shortest Job First (SJF): $\frac{0+5+10}{3}=5$

• SJF and STCF optimize turnaround time, but are bad for response time;
• RR optimizes response time, but is bad for turnaround time.

### Incorporating I/O

## Scheduling: The Multi-Level Feedback Queue

• First, it would like to optimize turnaround time, which is done by running shorter jobs first; unfortunately, the OS generally doesn't know how long a job will run, which is exactly the knowledge that algorithms like SJF and STCF require;
• Second, it would like to make the system feel responsive to interactive users and thus minimize response time; unfortunately, algorithms like RR reduce response time but are terrible for turnaround time.

### Basic Rules

• Rule 1: If Priority(A) > Priority(B), A runs (B doesn't).
• Rule 2: If Priority(A) = Priority(B), A and B run in RR.

### How To Change Priority

• Rule 3: When a job enters the system, it is placed at the highest priority (the topmost queue).
• Rule 4a: If a job uses up an entire time slice while running, its priority is reduced (i.e., it moves down one queue).
• Rule 4b: If a job gives up the CPU before the time slice is up, it stays at the same priority level.

#### Example 1: A Single Long-Running Job

#### Example 2: Along Came A Short Job

A (black) is running in the lowest-priority queue; B (gray) arrives at T=100 and is therefore inserted into the highest queue. Because its run time is short (20 ms), B completes within two time slices, before ever reaching the bottom queue; then A resumes running (at low priority).

### The Priority Boost

• Rule 5: After some time period S, move all the jobs in the system to the topmost queue.

### Better Accounting

## Lottery Scheduling

### Ticket Mechanisms

## Multi-CPU Scheduling

### Multiprocessor Architecture


### Don't Forget Synchronization

Simple List Delete Code:
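
The snippet this label refers to is, in the chapter, a simple list pop; a sketch is below. If two threads call List_Pop() at the same time, both can grab the same head and free() it twice, which is exactly why synchronization is needed:

```c
#include <stdlib.h>

typedef struct __node_t {
    int value;
    struct __node_t *next;
} node_t;

node_t *head;                 /* shared list head */

int List_Pop(void) {
    node_t *tmp = head;       /* remember old head ... */
    int value = head->value;  /* ... and its value */
    head = head->next;        /* advance head past the old element */
    free(tmp);                /* free old head; a second concurrent
                                 caller may double-free right here */
    return value;
}
```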

Cache Affinity

### Single-Queue Scheduling

### Multi-Queue Scheduling

CRUX: HOW TO DEAL WITH LOAD IMBALANCE
How should a multi-queue multiprocessor scheduler handle load imbalance, so as to better achieve its desired scheduling goals?

### Linux Multiprocessor Schedulers

• O(1) scheduler
• the Completely Fair scheduler(CFS)
• the BF Scheduler (BFS)

Both O(1) and CFS use multiple queues, while BFS uses a single queue, showing that both approaches can succeed. The O(1) scheduler is a priority-based scheduler (similar to the MLFQ discussed earlier), changing a process's priority over time and then scheduling those with the highest priority in order to meet various scheduling objectives; interactivity is a particular focus. CFS, in contrast, is a deterministic proportional-share approach (more like the stride scheduling discussed earlier). BFS, the only single-queue approach among the three, is also proportional-share, but is based on a more complicated scheme known as Earliest Eligible Virtual Deadline First (EEVDF).

Summary

## Summary Dialogue on CPU Virtualization

Summary Dialogue on CPU Virtualization: http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-dialogue.pdf

## A Dialogue on Memory Virtualization

A dialogue on Memory Virtualization: http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-vm.pdf

## Address Spaces

Early Systems

### Multiprogramming and Time Sharing

### The Address Space

• code
• stack
• heap

THE CRUX: HOW TO VIRTUALIZE MEMORY
How can the OS build this abstraction of a private, potentially large address space for multiple running processes (all sharing memory) on top of a single, physical memory?

### Goals

TIP: THE PRINCIPLE OF ISOLATION
Isolation is a key principle in building reliable systems. If two entities are properly isolated from one another, this implies that one can fail without affecting the other. Operating systems strive to isolate processes from each other and in this way prevent one from harming the other. By using memory isolation, the OS further ensures that running programs cannot affect the operation of the underlying OS. Some modern OS’s take isolation even further, by walling off pieces of the OS from other pieces of the OS. Such microkernels thus may provide greater reliability than typical monolithic kernel designs.

Summary

## Memory API

CRUX: HOW TO ALLOCATE AND MANAGE MEMORY
In UNIX/C programs, understanding how to allocate and manage memory is critical in building robust and reliable software. What interfaces are commonly used? What mistakes should be avoided?

Types of Memory

### The malloc() Call

The malloc() call is quite simple: you pass it a size asking for some room on the heap, and it either succeeds and hands you back a pointer to the newly-allocated space, or fails and returns NULL. Type man malloc at the command line for help.

The single parameter malloc() takes is of type size_t, which simply describes how many bytes you need. However, most programmers do not type in a number directly (in fact, doing so would be considered poor form); instead, various routines and macros are used. For example, to allocate space for a double-precision floating point value, you simply do this:
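
A minimal sketch (the surrounding main() and the stored value are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    double *d = (double *) malloc(sizeof(double)); /* room for one double */
    if (d == NULL) {          /* malloc() returns NULL on failure */
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    *d = 3.14;
    printf("%f\n", *d);
    free(d);                  /* give the space back when done */
    return 0;
}
```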

TIP: WHEN IN DOUBT, TRY IT OUT
If you aren’t sure how some routine or operator you are using behaves, there is no substitute for simply trying it out and making sure it behaves as you expect. While reading the manual pages or other documentation is useful, how it works in practice is what matters. Write some code and test it! That is no doubt the best way to make sure your code behaves as you desire. Indeed, that is what we did to double-check the things we were saying about sizeof() were actually true!

The free() Call

### Common Errors

TIP: IT COMPILED OR IT RAN ≠ IT IS CORRECT
Just because a program compiled(!) or even ran once or many times correctly does not mean the program is correct. Many events may have conspired to get you to a point where you believe it works, but then something changes and it stops. A common student reaction is to say (or yell) “But it worked before!” and then blame the compiler, operating system, hardware, or even (dare we say it) the professor. But the problem is usually right where you think it would be, in your code. Get to work and debug it before you blame those other components.

Forgetting To Allocate Memory

Not Allocating Enough Memory

Forgetting to Initialize Allocated Memory

Forgetting To Free Memory

Freeing Memory Before You Are Done With It

Freeing Memory Repeatedly

ASIDE: WHY NO MEMORY IS LEAKED ONCE YOUR PROCESS EXITS
When you write a short-lived program, you might allocate some space using malloc(). The program runs and is about to complete: is there need to call free() a bunch of times just before exiting? While it seems wrong not to, no memory will be “lost” in any real sense. The reason is simple: there are really two levels of memory management in the system.
The first level of memory management is performed by the OS, which hands out memory to processes when they run, and takes it back when processes exit (or otherwise die). The second level of management is within each process, for example within the heap when you call malloc() and free(). Even if you fail to call free() (and thus leak memory in the heap), the operating system will reclaim all the memory of the process (including those pages for code, stack, and, as relevant here, heap) when the program is finished running. No matter what the state of your heap in your address space, the OS takes back all of those pages when the process dies, thus ensuring that no memory is lost despite the fact that you didn’t free it.
Thus, for short-lived programs, leaking memory often does not cause any operational problems (though it may be considered poor form). When you write a long-running server (such as a web server or database management system, which never exit), leaked memory is a much bigger issue, and will eventually lead to a crash when the application runs out of memory. And of course, leaking memory is an even larger issue inside one particular program: the operating system itself. Showing us once again: those who write the kernel code have the toughest job of all…

Calling free() Incorrectly

Summary

### Underlying OS Support

Other Calls

## Address Translation

HOW TO EFFICIENTLY AND FLEXIBLY VIRTUALIZE MEMORY
How can we build an efficient virtualization of memory? How do we provide the flexibility needed by applications? How do we maintain control over which memory locations an application can access, and thus ensure that application memory accesses are properly restricted? How do we do all of this efficiently?

Assumptions

### An Example

• Fetch instruction at address 128
• Fetch instruction at address 132
• Execute this instruction (no memory reference)
• Fetch the instruction at address 135
• Execute this instruction (store to address 15 KB)

### Dynamic (Hardware-based) Relocation

• base register
• bounds register (sometimes called the limit register)

ASIDE: SOFTWARE-BASED RELOCATION
In the early days, before hardware support arose, some systems performed a crude form of relocation purely via software methods. The basic technique is referred to as static relocation, in which a piece of software known as the loader takes an executable that is about to be run and rewrites its addresses to the desired offset in physical memory.
For example, if an instruction was a load from address 1000 into a register (e.g., movl 1000, %eax), and the address space of the program was loaded starting at address 3000 (and not 0, as the program thinks), the loader would rewrite the instruction to offset each address by 3000 (e.g., movl 4000, %eax). In this way, a simple static relocation of the process’s address space is achieved.
However, static relocation has numerous problems. First and most importantly, it does not provide protection, as processes can generate bad addresses and thus illegally access other process’s or even OS memory; in general, hardware support is likely needed for true protection [WL+93]. Another negative is that once placed, it is difficult to later relocate an address space to another location

TIP: HARDWARE-BASED DYNAMIC RELOCATION
With dynamic relocation, a little hardware goes a long way. Namely, a base register is used to transform virtual addresses (generated by the program) into physical addresses. A bounds (or limit) register ensures that such addresses are within the confines of the address space. Together they provide a simple and efficient virtualization of memory.
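
A worked sketch of the base-and-bounds translation just described (the register values are invented for illustration):

```c
#include <stdint.h>
#include <stdio.h>

#define BASE   32768u   /* hypothetical base register */
#define BOUNDS 16384u   /* hypothetical bounds: size of the address space */

/* physical = virtual + base, after a bounds check */
uint32_t translate(uint32_t vaddr) {
    if (vaddr >= BOUNDS) {
        fprintf(stderr, "fault: address %u out of bounds\n", vaddr);
        return 0;       /* real hardware would raise an exception here */
    }
    return vaddr + BASE;
}

int main(void) {
    printf("vaddr 1000 -> paddr %u\n", translate(1000));  /* 33768 */
    translate(20000);                                     /* faults */
    return 0;
}
```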

### Example Translations

### Hardware Support: A Summary

### Operating System Issues

Summary

## Segmentation

THE CRUX: HOW TO SUPPORT A LARGE ADDRESS SPACE
How do we support a large address space with (potentially) a lot of free space between the stack and the heap? Note that in our examples, with tiny (pretend) address spaces, the waste doesn’t seem too bad. Imagine, however, a 32-bit address space (4 GB in size); a typical program will only use megabytes of memory, but still would demand that the entire address space be resident in memory.

### Generalized Base/Bounds

ASIDE: THE SEGMENTATION FAULT
The term segmentation fault or violation arises from a memory access on a segmented machine to an illegal address. Humorously, the term persists, even on machines with no support for segmentation at all. Or not so humorously, if you can’t figure out why your code keeps faulting.

### Which Segment Are We Referring To?

### Support for Sharing

### Fine-grained vs. Coarse-grained Segmentation

### OS Support

• The first is an old one: what should the OS do on a context switch? You should have a good guess by now: the segment registers must be saved and restored. Clearly, each process has its own virtual address space, and the OS must make sure to set up these registers correctly before letting the process run again.
• The second is OS interaction when segments grow (or shrink). For example, a program may call malloc() to allocate an object. In some cases, the existing heap will be able to service the request, and thus malloc() will find free space for the object and return a pointer to it to the caller. In other cases, however, the heap segment itself may need to grow; in that case, the memory-allocation library performs a system call to grow the heap (e.g., the traditional UNIX sbrk() system call). The OS will then provide more space, updating the segment-size register to the new (bigger) size and informing the library of success; the library can then allocate space for the new object and return successfully to the calling program. Note that the OS could reject the request if no more physical memory is available, or if it decides the calling process already has too much.
• The last, and perhaps most important, issue is managing free space in physical memory. When a new address space is created, the OS has to be able to find space in physical memory for its segments. Previously, we assumed that each address space was the same size, so physical memory could be thought of as a bunch of slots into which processes would fit; now, each process has several segments, and each segment might be a different size.

Summary

## Free-Space Management

CRUX: HOW TO MANAGE FREE SPACE
How should free space be managed, when satisfying variable-sized requests? What strategies can be used to minimize fragmentation? What are the time and space overheads of alternate approaches?

Assumptions

### Low-level Mechanisms

Splitting and Coalescing

Tracking The Size Of Allocated Regions

Embedding A Free List

Growing The Heap

Basic Strategies

### Other Approaches

ASIDE: GREAT ENGINEERS ARE REALLY GREAT
Engineers like Jeff Bonwick (who not only wrote the slab allocator mentioned herein but also was the lead of an amazing file system, ZFS) are the heart of Silicon Valley. Behind almost any great product or technology is a human (or small group of humans) who are way above average in their talents, abilities, and dedication. As Mark Zuckerberg (of Facebook) says: “Someone who is exceptional in their role is not just a little better than someone who is pretty good. They are 100 times better.” This is why, still today, one or two people can start a company that changes the face of the world forever (think Google, Apple, or Facebook). Work hard and you might become such a “100x” person as well. Failing that, work with such a person; you’ll learn more in a day than most learn in a month. Failing that, feel sad.

Buddy Allocation

The reason buddy allocation works well is that it is simple to determine the buddy of a particular block. Consider the addresses of the blocks in the free space above: if you think about it carefully, you'll see that the addresses of each buddy pair differ by only a single bit. And with that, you have a basic understanding of how the binary buddy allocation scheme works.

*Other Ideas*

## Paging

HOW TO VIRTUALIZE MEMORY WITH PAGES
How can we virtualize memory with pages, so as to avoid the problems of segmentation? What are the basic techniques? How do we make those techniques work well, with minimal space and time overheads?

### A Simple Example And Overview

• virtual page 0 -> physical frame 3
• vp 1 -> pf 7
• vp 2 -> pf 5
• vp 3 -> pf 2

• virtual page number(VPN)
• page offset
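
A sketch of this address split for the chapter's tiny example (a 64-byte address space with 16-byte pages, so a 2-bit VPN and a 4-bit offset):

```c
#include <stdio.h>

#define OFFSET_BITS 4
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)   /* 0xF */

int main(void) {
    unsigned vaddr  = 21;                    /* binary 010101 */
    unsigned vpn    = vaddr >> OFFSET_BITS;  /* high bits: 01 -> page 1 */
    unsigned offset = vaddr & OFFSET_MASK;   /* low bits: 0101 -> byte 5 */
    printf("VPN=%u offset=%u\n", vpn, offset);
    return 0;
}
```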

### Where Are Page Tables Stored?

A 20-bit VPN implies that the OS has to manage 2^20 translations for each process; assuming each page table entry needs 4 bytes to hold the physical translation plus anything else useful, we get an enormous 4MB of memory for each page table. That is pretty large. Now imagine 100 processes running: the OS would need 400MB of memory just for all those address translations! Even in the modern era, when machines have gigabytes of memory, it seems a little crazy to use a large chunk of it just for translations, no? And we won't even think about how big such a page table would be for a 64-bit address space; that would be terrible, and perhaps would scare you off entirely.

ASIDE: DATA STRUCTURE — THE PAGE TABLE
One of the most important data structures in the memory management subsystem of a modern OS is the page table. In general, a page table stores virtual-to-physical address translations, thus letting the system know where each page of an address space actually resides in physical memory. Because each address space requires such translations, in general there is one page table per process in the system. The exact structure of the page table is either determined by the hardware (older systems) or can be more flexibly managed by the OS (modern systems).

### What's Actually In The Page Table?

• a present bit (P);
• a user/supervisor bit (U/S), which determines whether user-mode processes can access the page;
• a few bits (PWT, PCD, PAT, G) that determine how hardware caching works for the page;
• an accessed bit (A);
• a dirty bit (D);
• the page frame number (PFN).

ASIDE: WHY NO VALID BIT?
You may notice that in the Intel example, there are no separate valid and present bits, but rather just a present bit (P). If that bit is set (P=1), it means the page is both present and valid. If not (P=0), it means that the page may not be present in memory (but is valid), or may not be valid. An access to a page with P=0 will trigger a trap to the OS; the OS must then use additional structures it keeps to determine whether the page is valid (and thus perhaps should be swapped back in) or not (and thus the program is attempting to access memory illegally). This sort of judiciousness is common in hardware, which often just provide the minimal set of features upon which the OS can build a full service.

### Paging: Also Too Slow

### A Memory Trace

• vpn 39 -> pfn 7
• vpn 40 -> pfn 8
• vpn 41 -> pfn 9
• vpn 42 -> pfn 10

Summary

## Paging: Faster Translations (TLBs)

Translation Lookaside Buffer

HOW TO SPEED UP ADDRESS TRANSLATION
How can we speed up address translation, and generally avoid the extra memory reference that paging seems to require? What hardware support is required? What OS involvement is needed?

### TLB Basic Algorithm
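
A runnable toy of the hit/miss control flow the chapter's algorithm describes (the structures, sizes, and replacement choice below are all made up for illustration):

```c
#include <stdbool.h>
#include <stdio.h>

#define TLB_SIZE    4
#define OFFSET_BITS 12                       /* 4KB pages */

typedef struct { unsigned vpn, pfn; bool valid; } tlb_entry_t;

tlb_entry_t tlb[TLB_SIZE];
unsigned page_table[16];                     /* toy page table: VPN -> PFN */

bool tlb_lookup(unsigned vpn, unsigned *pfn) {
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;               /* hit: translation cached */
            return true;
        }
    return false;                            /* miss */
}

unsigned translate(unsigned vaddr) {
    unsigned vpn = vaddr >> OFFSET_BITS;
    unsigned offset = vaddr & ((1u << OFFSET_BITS) - 1);
    unsigned pfn;
    if (!tlb_lookup(vpn, &pfn)) {            /* miss: walk the page table */
        pfn = page_table[vpn];
        tlb[vpn % TLB_SIZE] =                /* ... and cache the result */
            (tlb_entry_t){ .vpn = vpn, .pfn = pfn, .valid = true };
    }
    return (pfn << OFFSET_BITS) | offset;
}

int main(void) {
    page_table[1] = 7;                       /* map VPN 1 -> PFN 7 */
    printf("0x%x -> 0x%x (miss, then cached)\n", 0x1234, translate(0x1234));
    printf("0x%x -> 0x%x (hit)\n", 0x1238, translate(0x1238));
    return 0;
}
```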

### Example: Accessing An Array

TIP: USE CACHING WHEN POSSIBLE
Caching is one of the most fundamental performance techniques in computer systems, one that is used again and again to make the “commoncase fast” [HP06]. The idea behind hardware caches is to take advantage of locality in instruction and data references. There are usually two types of locality: temporal locality and spatial locality. With temporal locality, the idea is that an instruction or data item that has been recently accessed will likely be re-accessed soon in the future. Think of loop variables or instructions in a loop; they are accessed repeatedly over time. With spatial locality, the idea is that if a program accesses memory at address x, it will likely soon access memory near x. Imagine here streaming through an array of some kind, accessing one element and then the next. Of course, these properties depend on the exact nature of the program, and thus are not hard-and-fast laws but more like rules of thumb. Hardware caches, whether for instructions, data, or address translations (as in our TLB) take advantage of locality by keeping copies of memory in small, fast on-chip memory. Instead of having to go to a (slow) memory to satisfy a request, the processor can first check if a nearby copy exists in a cache; if it does, the processor can access it quickly (i.e., in a few CPU cycles) and avoid spending the costly time it takes to access memory (many nanoseconds).
You might be wondering: if caches (like the TLB) are so great, why don’t we just make bigger caches and keep all of our data in them? Unfortunately, this is where we run into more fundamental laws like those of physics. If you want a fast cache, it has to be small, as issues like the speed-of-light and other physical constraints become relevant. Any large cache by definition is slow, and thus defeats the purpose. Thus, we are stuck with small, fast caches; the question that remains is how to best use them to improve performance.

### Who Handles The TLB Miss?

ASIDE: RISC VS. CISC
In the 1980’s, a great battle took place in the computer architecture community. On one side was the CISC camp, which stood for Complex Instruction Set Computing; on the other side was RISC, for Reduced Instruction Set Computing [PS81]. The RISC side was spear-headed by David Patterson at Berkeley and John Hennessy at Stanford (who are also co-authors of some famous books [HP06]), although later John Cocke was recognized with a Turing award for his earliest work on RISC [CM00]. CISC instruction sets tend to have a lot of instructions in them, and each instruction is relatively powerful. For example, you might see a string copy, which takes two pointers and a length and copies bytes from source to destination. The idea behind CISC was that instructions should be high-level primitives, to make the assembly language itself easier to use, and to make code more compact.
RISC instruction sets are exactly the opposite. A key observation behind RISC is that instruction sets are really compiler targets, and all compilers really want are a few simple primitives that they can use to generate high-performance code. Thus, RISC proponents argued, let’s rip out as much from the hardware as possible (especially the microcode), and make what’s left simple, uniform, and fast.
In the early days, RISC chips made a huge impact, as they were noticeably faster [BC91]; many papers were written; a few companies were formed (e.g., MIPS and Sun). However, as time progressed, CISC manufacturers such as Intel incorporated many RISC techniques into the core of their processors, for example by adding early pipeline stages that transformed complex instructions into micro-instructions which could then be processed in a RISC-like manner. These innovations, plus a growing number of transistors on each chip, allowed CISC to remain competitive. The end result is that the debate died down, and today both types of processors can be made to run fast.

ASIDE: TLB VALID BIT ≠ PAGE TABLE VALID BIT
A common mistake is to confuse the valid bits found in a TLB with those found in a page table. In a page table, when a page-table entry (PTE) is marked invalid, it means that the page has not been allocated by the process, and should not be accessed by a correctly-working program. The usual response when an invalid page is accessed is to trap to the OS, which will respond by killing the process.
A TLB valid bit, in contrast, simply refers to whether a TLB entry has a valid translation within it. When a system boots, for example, a common initial state for each TLB entry is to be set to invalid, because no address translations are yet cached there. Once virtual memory is enabled, and once programs start running and accessing their virtual address spaces, the TLB is slowly populated, and thus valid entries soon fill the TLB.
The TLB valid bit is quite useful when performing a context switch too, as we'll discuss further below. By setting all TLB entries to invalid, the system can ensure that the about-to-be-run process does not accidentally use a virtual-to-physical translation from a previous process.

### TLB Contents: What's In There?

### TLB Issue: Context Switches

HOW TO MANAGE TLB CONTENTS ON A CONTEXT SWITCH
When context-switching between processes, the translations in the TLB for the last process are not meaningful to the about-to-be-run process. What should the hardware or OS do in order to solve this problem?

### Issue: Replacement Policy

HOW TO DESIGN TLB REPLACEMENT POLICY
Which TLB entry should be replaced when we add a new TLB entry? The goal, of course, being to minimize the miss rate (or increase hit rate) and thus improve performance.

### A Real TLB Entry

The MIPS R4000 supports a 32-bit address space with 4KB pages, so we would expect a 20-bit VPN and a 12-bit offset in a typical virtual address. However, as you can see in the TLB, there are only 19 bits for the VPN: it turns out user addresses only come from half of the address space (the rest is reserved for the kernel), so only 19 bits of VPN are needed. The VPN translates to up to a 24-bit physical frame number (PFN), which can support systems with up to 64GB of (physical) main memory (2^24 4KB pages).

There are a few other interesting bits in the MIPS TLB. We see a global bit (G), used for pages that are globally shared among processes; if the global bit is set, the ASID is ignored. We also see the 8-bit ASID, which the OS can use to distinguish between address spaces. One question for you: what should the OS do if there are more than 256 (2^8) processes running at a time? Finally, we see 3 coherence (C) bits, which determine how a page is cached by the hardware; a dirty bit, which is marked when the page has been written to; and a valid bit, which tells the hardware whether there is a valid translation present in the entry. There is also a page mask field, which supports multiple page sizes; we'll see later why having larger pages can be useful. Finally, some of the 64 bits are unused.

MIPS TLBs usually have 32 or 64 entries, most of which are used by user processes as they run; however, a few are reserved for the OS. A wired register can be set by the OS to tell the hardware how many slots of the TLB to reserve for it. The OS uses these reserved mappings for code and data it wants to access during critical times, when a TLB miss would be problematic.

• TLBP, which probes the TLB to see whether a particular translation is in there;
• TLBR, which reads the contents of a TLB entry into registers;
• TLBWI, which replaces a specific TLB entry;
• TLBWR, which replaces a random TLB entry.

Summary

## Paging: Smaller Tables

HOW TO MAKE PAGE TABLES SMALLER?
Simple array-based page tables (usually called linear page tables) are too big, taking up far too much memory on typical systems. How can we make page tables smaller? What are the key ideas? What inefficiencies arise as a result of these new data structures?

### Simple Solution: Bigger Pages

ASIDE: MULTIPLE PAGE SIZES
As an aside, do note that many architectures (e.g., MIPS, SPARC, x86-64) now support multiple page sizes. Usually, a small (4KB or 8KB) page size is used. However, if a “smart” application requests it, a single large page (e.g., of size 4MB) can be used for a specific portion of the address space, enabling such applications to place a frequently-used (and large) data structure in such a space while consuming only a single TLB entry. This type of large page usage is common in database management systems and other high-end commercial applications. The main reason for multiple page sizes is not to save page table space, however; it is to reduce pressure on the TLB, enabling a program to access more of its address space without suffering from too many TLB misses. However, as researchers have shown [N+02], using multiple page sizes makes the OS virtual memory manager notably more complex, and thus large pages are sometimes most easily used simply by exporting a new interface to applications to request large pages directly

### Hybrid Approach: Paging and Segments

TIP: USE HYBRIDS
When you have two good and seemingly opposing ideas, you should always see if you can combine them into a hybrid that manages to achieve the best of both worlds. Hybrid corn species, for example, are known to be more robust than any naturally-occurring species. Of course, not all hybrids are a good idea; see the Zeedonk (or Zonkey), which is a cross of a Zebra and a Donkey. If you don’t believe such a creature exists, look it up, and prepare to be amazed.

### Multi-level Page Tables

When building a data structure, one should always consider time-space trade-offs in its construction. Usually, if you wish to make access to a particular data structure faster, you will have to pay a space-usage penalty for the structure.

A Detailed Multi-Level Example

More Than Two Levels

The Translation Process: Remember the TLB

### Inverted Page Tables

### Swapping the Page Tables to Disk

Summary

## Beyond Physical Memory: Mechanisms

Beyond Physical Memory (Swap) Mechanisms: http://pages.cs.wisc.edu/~remzi/OSTEP/vm-beyondphys.pdf

THE CRUX: HOW TO GO BEYOND PHYSICAL MEMORY
How can the OS make use of a larger, slower device to transparently provide the illusion of a large virtual address space?

ASIDE: STORAGE TECHNOLOGIES
We’ll delve much more deeply into how I/O devices actually work later (see the chapter on I/O devices). So be patient! And of course the slower device need not be a hard disk, but could be something more modern such as a Flash-based SSD. We’ll talk about those things too. For now, just assume we have a big and relatively-slow device which we can use to help us build the illusion of a very large virtual memory, even bigger than physical memory itself.

Swap Space

### The Present Bit

ASIDE: SWAPPING TERMINOLOGY AND OTHER THINGS
Terminology in virtual memory systems can be a little confusing and variable across machines and operating systems. For example, a page fault more generally could refer to any reference to a page table that generates a fault of some kind: this could include the type of fault we are discussing here, i.e., a page-not-present fault, but sometimes can refer to illegal memory accesses. Indeed, it is odd that we call what is definitely a legal access (to a page mapped into the virtual address space of a process, but simply not in physical memory at the time) a “fault” at all; really, it should be called a page miss. But often, when people say a program is “page faulting”, they mean that it is accessing parts of its virtual address space that the OS has swapped out to disk.
We suspect the reason that this behavior became known as a “fault” relates to the machinery in the operating system to handle it. When something unusual happens, i.e., when something the hardware doesn’t know how to handle occurs, the hardware simply transfers control to the OS, hoping it can make things better. In this case, a page that a process wants to access is missing from memory; the hardware does the only thing it can, which is raise an exception, and the OS takes over from there. As this is identical to what happens when a process does something illegal, it is perhaps not surprising that we term the activity a “fault.”

The Page Fault

### What If Memory Is Full?

### Page Fault Control Flow

• First, the page was both present and valid (lines 18-21); in this case, the TLB-miss handler can simply grab the PFN from the PTE, retry the instruction (this time resulting in a TLB hit), and thus continue as described before.
• In the second case (lines 22-23), the page-fault handler must be run; although this was a legitimate page for the process to access (it is valid, after all), it is not present in physical memory.
• Third, the access could be to an invalid page, due for example to a bug in the program (lines 13-14). In this case, no other bits in the PTE really matter; the hardware traps this invalid access, and the OS trap handler runs, likely terminating the offending process.

### When Replacements Really Occur

TIP: DO WORK IN THE BACKGROUND
When you have some work to do, it is often a good idea to do it in the background to increase efficiency and to allow for grouping of operations. Operating systems often do work in the background; for example, many systems buffer file writes in memory before actually writing the data to disk. Doing so has many possible benefits: increased disk efficiency, as the disk may now receive many writes at once and thus better be able to schedule them; improved latency of writes, as the application thinks the writes completed quite quickly; the possibility of work reduction, as the writes may need never to go to disk (i.e., if the file is deleted); and better use of idle time, as the background work may possibly be done when the system is otherwise idle, thus better utilizing the hardware.

Summary

## Beyond Physical Memory: Policies

Beyond Physical Memory (Swap) Policies: http://pages.cs.wisc.edu/~remzi/OSTEP/vm-beyondphys-policy.pdf

THE CRUX: HOW TO DECIDE WHICH PAGE TO EVICT
How can the OS decide which page (or pages) to evict from memory? This decision is made by the replacement policy of the system, which usually follows some general principles (discussed below) but also includes certain tweaks to avoid corner-case behaviors.

### Cache Management

• $T_{M}$: the cost of accessing memory
• $T_{D}$: the cost of accessing disk
• $P_{miss}$: the probability of not finding the data in the cache (a miss), ranging from 0.0 to 1.0; sometimes we express it as a percentage
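
From these three, the chapter's average memory access time follows: $AMAT = T_{M} + (P_{miss} \times T_{D})$. As a worked example with assumed values: if $T_{M} = 100\,\text{ns}$ and $T_{D} = 10\,\text{ms}$, even a miss rate of just 0.1% ($P_{miss} = 0.001$) gives $AMAT \approx 10.1\,\mu\text{s}$, roughly a hundred times slower than memory alone, which is why the miss rate dominates.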

### The Optimal Replacement Policy

TIP: COMPARING AGAINST OPTIMAL IS USEFUL
Although optimal is not very practical as a real policy, it is incredibly useful as a comparison point in simulation or other studies. Saying that
your fancy new algorithm has a 80% hit rate isn’t meaningful in isolation; saying that optimal achieves an 82% hit rate (and thus your new approach is quite close to optimal) makes the result more meaningful and gives it context. Thus, in any study you perform, knowing what the optimal is lets you perform a better comparison, showing how much improvement is still possible, and also when you can stop making your policy better, because it is close enough to the ideal.

ASIDE: TYPES OF CACHE MISSES
In the computer architecture world, architects sometimes find it useful to characterize misses by type, into one of three categories: compulsory, capacity, and conflict misses, sometimes called the Three C’s [H87]. A compulsory miss (or cold-start miss [EF78]) occurs because the cache is
empty to begin with and this is the first reference to the item; in contrast, a capacity miss occurs because the cache ran out of space and had to evict an item to bring a new item into the cache. The third type of miss (a conflict miss) arises in hardware because of limits on where an item can be placed in a hardware cache, due to something known as set-associativity; it does not arise in the OS page cache because such caches are always fully-associative, i.e., there are no restrictions on where in memory a page can be placed. See H&P for details [HP06].

### A Simple Policy: FIFO

Belady (of the optimal policy) and colleagues found an interesting reference stream that behaved a little unexpectedly [BNS69]. The memory-reference stream: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5. The replacement policy they were studying was FIFO. The interesting part: how the cache hit rate changed when moving from a cache size of 3 to 4 pages.
In general, you would expect the cache hit rate to increase (get better) when the cache gets larger. But in this case, with FIFO, it gets worse! Calculate the hits and misses yourself and see. This odd behavior is generally referred to as Belady’s Anomaly (to the chagrin of his co-authors).
Some other policies, such as LRU, don’t suffer from this problem. Can you guess why? As it turns out, LRU has what is known as a stack property [M+70]. For algorithms with this property, a cache of size N + 1 naturally includes the contents of a cache of size N. Thus, when increasing the cache size, hit rate will either stay the same or improve. FIFO and Random (among others) clearly do not obey the stack property, and thus are susceptible to anomalous behavior.

### Another Simple Policy: Random

### Using History: LRU (Least Recently Used)

ASIDE: TYPES OF LOCALITY
There are two types of locality that programs tend to exhibit. The first is known as spatial locality, which states that if a page P is accessed, it is likely the pages around it (say P − 1 or P + 1) will also likely be accessed. The second is temporal locality, which states that pages that have been accessed in the near past are likely to be accessed again in the near future. The assumption of the presence of these types of locality plays a large role in the caching hierarchies of hardware systems, which deploy many levels of instruction, data, and address-translation caching to help programs run fast when such locality exists.
Of course, the principle of locality, as it is often called, is no hard-and-fast rule that all programs must obey. Indeed, some programs access memory (or disk) in rather random fashion and don’t exhibit much or any locality in their access streams. Thus, while locality is a good thing to keep in mind while designing caches of any kind (hardware or software), it does not guarantee success. Rather, it is a heuristic that often proves useful in the design of computer systems.

### Implementing Historical Algorithms

CRUX: HOW TO IMPLEMENT AN LRU REPLACEMENT POLICY
Given that it will be expensive to implement perfect LRU, can we approximate it in some way, and still obtain the desired behavior?

### Approximating LRU

### Considering Dirty Pages

### Other VM Policies

Thrashing

Summary

## Complete Virtual Memory Systems

Complete Virtual Memory Systems: http://pages.cs.wisc.edu/~remzi/OSTEP/vm-complete.pdf

THE CRUX: HOW TO BUILD A COMPLETE VM SYSTEM
What features are needed to realize a complete virtual memory system? How do they improve performance, increase security, or otherwise improve the system?

### VAX/VMS Virtual Memory

The VAX-11 minicomputer architecture was introduced in the late 1970s by Digital Equipment Corporation (DEC), which played a major role in the computer industry during the minicomputer era. The architecture was realized in a number of implementations, including the VAX-11/780 and the less powerful VAX-11/750.

The VAX-11 provided a 32-bit virtual address space per process, divided into 512-byte pages; thus, a virtual address consisted of a 23-bit VPN and a 9-bit offset. Further, the upper two bits of the VPN were used to differentiate which segment the page resided within; thus, as discussed before, the system was a hybrid of paging and segmentation.

ASIDE: THE CURSE OF GENERALITY
Operating systems often have a problem known as the curse of generality, where they are tasked with general support for a broad class of applications and systems. The fundamental result of the curse is that the OS is not likely to support any one installation very well. In the case of VMS, the curse was very real, as the VAX-11 architecture was realized in a number of different implementations. It is no less real today, where Linux is expected to run well on your phone, a TV set-top box, a laptop computer, desktop computer, and a high-end server running thousands of processes in a cloud-based datacenter.

ASIDE: WHY NULL POINTER ACCESSES CAUSE SEG FAULTS
You should now have a good understanding of exactly what happens on a null-pointer dereference. A process generates a virtual address of 0, by doing something like this:
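
```c
int *p = NULL;  /* set p = 0 */
*p = 10;        /* try to store 10 through the null pointer (address 0) */
```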
The hardware tries to look up the VPN (also 0 here) in the TLB, and suffers a TLB miss. The page table is consulted, and the entry for VPN 0 is found to be marked invalid. Thus, we have an invalid access, which transfers control to the OS, which likely terminates the process (on UNIX systems, processes are sent a signal which allows them to react to such a fault; if uncaught, however, the process is killed).

A page table entry (PTE) in the VAX contains the following bits: a valid bit, a protection field (4 bits), a modify (dirty) bit, a field reserved for OS use (5 bits), and a physical frame number (PFN) storing the location of the page in physical memory. The astute reader might notice: there is no reference bit! Thus, the VMS replacement algorithm must make do without hardware support in determining which pages are active.

ASIDE: EMULATING REFERENCE BITS
As it turns out, you don’t need a hardware reference bit in order to get some notion of which pages are in use in a system. In fact, in the early 1980’s, Babaoglu and Joy showed that protection bits on the VAX can be used to emulate reference bits [BJ81]. The basic idea: if you want to gain some understanding of which pages are actively being used in a system, mark all of the pages in the page table as inaccessible (but keep around the information as to which pages are really accessible by the process, perhaps in the “reserved OS field” portion of the page table entry). When a process accesses a page, it will generate a trap into the OS; the OS will then check if the page really should be accessible, and if so, revert the page to its normal protections (e.g., read-only, or read-write). At the time of a replacement, the OS can check which pages remain marked inaccessible, and thus get an idea of which pages have not been recently used.
The key to this “emulation” of reference bits is reducing overhead while still obtaining a good idea of page usage. The OS must not be too aggressive in marking pages inaccessible, or overhead would be too high. The OS also must not be too passive in such marking, or all pages will end up referenced; the OS will again have no good idea which page to evict.

Another optimization used in VMS also helps overcome its small page size. In particular, with such small pages, disk I/O during swapping could be highly inefficient, as disks do better with large transfers. To make swapping I/O more efficient, VMS added a number of optimizations, most importantly clustering: with clustering, VMS groups large batches of pages together from the global dirty list and writes them to disk in one fell swoop (thus making them clean). Clustering is used in most modern systems, as the freedom to place pages anywhere within swap space lets the OS group pages, perform fewer and bigger writes, and thus improve performance.

VMS had two other now-standard tricks: demand zeroing and copy-on-write. We now describe these lazy optimizations. One form of laziness in VMS is demand zeroing of pages. To understand this better, consider the example of adding a page to your address space, say in your heap. In a naive implementation, the OS responds to a request to add a page to the heap by finding a page in physical memory, zeroing it, and then mapping it into your address space. But the naive implementation can be costly, particularly if the page never gets used by the process.

TIP: BE LAZY
Being lazy can be a virtue in both life as well as in operating systems. Laziness can put off work until later, which is beneficial within an OS for a number of reasons. First, putting off work might reduce the latency of the current operation, thus improving responsiveness; for example, operating systems often report that writes to a file succeeded immediately, and only write them to disk later in the background. Second, and more importantly, laziness sometimes obviates the need to do the work at all; for example, delaying a write until the file is deleted removes the need to do the write at all. Laziness is also good in life: for example, by putting off your OS project, you may find that the project specification bugs are worked out by your fellow classmates; however, the class project is unlikely to get canceled, so being too lazy may be problematic, leading to a late project, bad grade, and a sad professor. Don’t make professors sad!

### The Linux Virtual Memory System

• the user portion, where user program code, stack, heap, and other parts reside;
• the kernel portion, where kernel code, stacks, heap, and other parts reside.

• user virtual address range: 0 - 0xBFFFFFFF
• kernel virtual address range: 0xC0000000 - 0xFFFFFFFF

One interesting aspect of Linux is that it contains two types of kernel addresses: kernel logical addresses (the kernel's main, directly-mapped space) and kernel virtual addresses.

Intel x86 allows multiple page sizes to be used, not just the standard 4KB page; specifically, recent designs support 2MB and even 1GB pages in hardware. Thus, over time, Linux has evolved to allow applications to utilize these huge pages.

An interesting aspect of Linux's huge page support is how it was done incrementally. At first, Linux developers knew such support was only important for a few applications, such as large databases with stringent performance demands; thus, the decision was made to allow applications to explicitly request memory allocations with large pages (through the mmap() or shmget() calls). This way, most applications would be unaffected.

The Linux page cache is unified, keeping pages in memory from three primary sources. These entities are kept in a page cache hash table, allowing quick lookup when said data is needed.

• memory-mapped files;
• file data and metadata from devices: usually accessed by directing read() and write() calls to the file system;
• heap and stack pages that comprise each process: sometimes called anonymous memory, because there is no named file underneath it, but rather swap space.

The Linux version of the 2Q replacement algorithm solves this by keeping two lists and dividing memory between them. When accessed for the first time, a page is placed on one queue (called the inactive list in Linux); when it is re-referenced, the page is promoted to the other queue (the active list in Linux). When replacement needs to take place, candidates for replacement are taken from the inactive list. Linux also periodically moves pages from the bottom of the active list to the inactive list, keeping the active list at about two-thirds of the total page cache size.

Linux would ideally manage these lists in perfect LRU order but, as discussed in earlier chapters, doing so is costly. Thus, as in many operating systems, an approximation of LRU is used.

ASIDE: THE UBIQUITY OF MEMORY-MAPPING
Memory mapping predates Linux by some years, and is used in many places within Linux and other modern systems. The idea is simple: by calling mmap() on an already opened file descriptor, a process is returned a pointer to the beginning of a region of virtual memory where the contents of the file seem to be located. By then using that pointer, a process can access any part of the file with a simple pointer dereference.
Accesses to parts of a memory-mapped file that have not yet been brought into memory trigger page faults, at which point the OS will page in the relevant data and make it accessible by updating the page table of the process accordingly (i.e., demand paging).
Every regular Linux process uses memory-mapped files, even if the code in main() does not call mmap() directly, because of how Linux loads code from the executable and shared library code into memory. Below is the (highly abbreviated) output of the pmap command line tool, which shows what different mappings comprise the virtual address space of a running program (the shell, in this example, tcsh). The output shows four columns: the virtual address of the mapping, its size, the protection bits of the region, and the source of the mapping:
```
0000000000400000    372K r-x--  tcsh
00000000019d5000   1780K rw---  [ anon ]
00007f4e7cf06000   1792K r-x--  libc-2.23.so
00007f4e7d2d0000     36K r-x--  libcrypt-2.23.so
00007f4e7d508000    148K r-x--  libtinfo.so.5.9
00007f4e7d731000    152K r-x--  ld-2.23.so
00007f4e7d932000     16K rw---  [ stack ]
```
As you can see from this output, the code from the tcsh binary, as well as code from libc, libcrypt, libtinfo, and code from the dynamic linker itself (ld.so) are all mapped into the address space. Also present are two anonymous regions, the heap (the second entry, labeled anon) and the stack (labeled stack). Memory-mapped files provide a straightforward and efficient way for the OS to construct a modern address space.

ASLR is such a useful defense for user-level programs that it has also been incorporated into the kernel, in a feature unimaginatively called kernel address space layout randomization (KASLR). However, it turns out the kernel may have even bigger problems to deal with.

Other Security Problems: Meltdown And Spectre

# Concurrency

## Concurrency and Threads

### Why It Gets Worse: Shared Data

TIP: KNOW AND USE YOUR TOOLS
You should always learn new tools that help you write, debug, and understand computer systems. Here, we use a neat tool called a disassembler. When you run a disassembler on an executable, it shows you what assembly instructions make up the program. For example, if we wish to understand the low-level code to update a counter (as in our example), we run objdump (Linux) to see the assembly code: objdump -d main.
Doing so produces a long listing of all the instructions in the program, neatly labeled (particularly if you compiled with the -g flag), which includes symbol information in the program. The objdump program is just one of many tools you should learn how to use; a debugger like gdb, memory profilers like valgrind or purify, and of course the compiler itself are others that you should spend time to learn more about; the better you are at using your tools, the better systems you’ll be able to build.

### The Heart Of The Problem: Uncontrolled Scheduling

TIP: USE ATOMIC OPERATIONS
Atomic operations are one of the most powerful underlying techniques in building computer systems, from the computer architecture, to concurrent code (what we are studying here), to file systems (which we’ll study soon enough), database management systems, and even distributed systems [L+93].
The idea behind making a series of actions atomic is simply expressed with the phrase “all or nothing”; it should either appear as if all of the actions you wish to group together occurred, or that none of them occurred, with no in-between state visible. Sometimes, the grouping of many actions into a single atomic action is called a transaction, an idea developed in great detail in the world of databases and transaction processing [GR92].
In our theme of exploring concurrency, we’ll be using synchronization primitives to turn short sequences of instructions into atomic blocks of execution, but the idea of atomicity is much bigger than that, as we will see. For example, file systems use techniques such as journaling or copyon-write in order to atomically transition their on-disk state, critical for operating correctly in the face of system failures. If that doesn’t make sense, don’t worry — it will, in some future chapter

### The Wish For Atomicity

THE CRUX: HOW TO SUPPORT SYNCHRONIZATION
What support do we need from the hardware in order to build useful synchronization primitives? What support do we need from the OS? How can we build these primitives correctly and efficiently? How can programs use them to get the desired results?

### One More Problem: Waiting For Another

ASIDE: KEY CONCURRENCY TERMS CRITICAL SECTION, RACE CONDITION, INDETERMINATE, MUTUAL EXCLUSION
These four terms are so central to concurrent code that we thought it worthwhile to call them out explicitly. See some of Dijkstra’s early work [D65,D68] for more details.
A critical section is a piece of code that accesses a shared resource, usually a variable or data structure.
A race condition (or data race [NM92]) arises if multiple threads of execution enter the critical section at roughly the same time; both attempt to update the shared data structure, leading to a surprising (and perhaps undesirable) outcome.
An indeterminate program consists of one or more race conditions; the output of the program varies from run to run, depending on which threads ran when. The outcome is thus not deterministic, something we usually expect from computer systems.
To avoid these problems, threads should use some kind of mutual exclusion primitives; doing so guarantees that only a single thread ever enters a critical section, thus avoiding races, and resulting in deterministic program outputs.

### Summary: Why in OS Class?

CRUX: HOW TO CREATE AND CONTROL THREADS
What interfaces should the OS present for thread creation and control? How should these interfaces be designed to enable ease of use as well as utility?

### Thread Creation

• thread: a pointer to a structure of type pthread_t; we'll use this structure to interact with the thread, so we need to pass it to pthread_create() in order to initialize it.
• attr: used to specify any attributes this thread might have. A few examples include setting the stack size or information about the scheduling priority of the thread. An attribute is initialized with a separate call to pthread_attr_init(); in most cases, however, the defaults will be fine, and we simply pass in NULL.
• start routine: the third argument is the most complex, but it is really just asking: which function should this thread start running? In C, we call this a function pointer, and this one tells us the following: a function name (start_routine) is expected, which is passed a single argument of type void * and returns a value of type void * (i.e., a void pointer).
• arg: the argument to be passed to the function where the thread begins execution. You might ask: why do we need these void pointers? The answer is simple: having a void pointer as the argument of start_routine allows us to pass in any type of argument; having it as a return value allows the thread to return any type of result.

### Thread Completion

• The first argument is of type pthread_t and is used to specify which thread to wait for. This variable is initialized by the thread creation routine (when you pass a pointer to it as an argument to pthread_create()); if you keep it around, you can use it to wait for that thread to terminate.
• The second argument is a pointer to the return value you expect to get back. Because the routine can return anything, it is defined to return a pointer to void; because the pthread_join() routine changes the value of the passed-in argument, you need to pass in a pointer to that value, not just the value itself.
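
A minimal sketch tying the arguments above together (the thread function, its argument, and the return value are invented for illustration); compile with -pthread:

```c
#include <pthread.h>
#include <stdio.h>

void *mythread(void *arg) {
    printf("thread got: %s\n", (char *) arg);
    return (void *) 42;            /* any pointer-sized result */
}

int main(void) {
    pthread_t t;                   /* arg 1: handle used to interact with it */
    pthread_create(&t, NULL,       /* arg 2: NULL = default attributes */
                   mythread,       /* arg 3: function to start running */
                   "hello");       /* arg 4: argument passed to it */
    void *ret;
    pthread_join(t, &ret);         /* wait for it; collect its return value */
    printf("thread returned %ld\n", (long) ret);
    return 0;
}
```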

## Locks

### Locks: The Basic Idea

The semantics of the lock() and unlock() routines are simple. Calling the routine lock() tries to acquire the lock; if no other thread holds the lock (i.e., it is free), the thread will acquire the lock and enter the critical section; this thread is sometimes said to be the owner of the lock. If another thread then calls lock() on that same lock variable (mutex in this example), it will not return while the lock is held by another thread; in this way, other threads are prevented from entering the critical section while the first thread that holds the lock is in there.

The name the POSIX library uses for a lock is a mutex, as it is used to provide mutual exclusion between threads, i.e., if one thread is in the critical section, it excludes the others from entering until it has completed the section. Thus, when you see the following POSIX threads code, you should understand that it is doing the same thing as above (a sketch follows):
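
A minimal sketch of that POSIX usage (balance and deposit() are illustrative):

```c
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int balance = 0;

void deposit(void) {
    pthread_mutex_lock(&lock);    /* acquire; blocks while another thread holds it */
    balance = balance + 1;        /* critical section */
    pthread_mutex_unlock(&lock);  /* release */
}
```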

Building A Lock

Evaluating Locks

### Controlling Interrupts

A Failed Attempt: Just Using Loads/Stores

Trace: No Mutual Exclusion

| Thread 1 | Thread 2 |
| --- | --- |
| call lock() | |
| while (flag == 1) | |
| *interrupt: switch to Thread 2* | |
| | call lock() |
| | while (flag == 1) |
| | flag = 1; |
| *interrupt: switch to Thread 1* | |
| flag = 1; // set flag to 1 (too!) | |

### Building Working Spin Locks with Test-And-Set

Peterson later refined Dekker's approach. Once again, only loads and stores are used, and the idea is to ensure that two threads never enter a critical section at the same time. Below is Peterson's algorithm (for two threads); see if you can understand the code.
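
A sketch of Peterson's two-thread algorithm (self is the calling thread's id, 0 or 1). Note that modern compilers and CPUs reorder memory operations, so this is illustrative, not something to deploy as-is:

```c
int flag[2];   /* flag[i] == 1: thread i wants to enter */
int turn;      /* whose turn is it to yield? */

void init(void) {
    flag[0] = flag[1] = 0;
    turn = 0;
}

void lock(int self) {
    flag[self] = 1;          /* announce interest */
    turn = 1 - self;         /* politely let the other go first */
    while (flag[1 - self] == 1 && turn == 1 - self)
        ;                    /* spin while the other wants in and holds the turn */
}

void unlock(int self) {
    flag[self] = 0;          /* no longer interested */
}
```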

The test-and-set instruction does the following: it returns the old value pointed to by old_ptr, and simultaneously updates that value to new. The key, of course, is that this sequence of operations is performed atomically. It is called "test and set" because it enables you to test the old value (which is what is returned) while simultaneously setting the memory location to a new value. As it turns out, this slightly more powerful instruction is enough to build a simple spin lock, as shown below.
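
A sketch of that spin lock, using the C11 atomic_exchange() as a stand-in for the hardware test-and-set instruction:

```c
#include <stdatomic.h>

typedef struct { atomic_int flag; } lock_t;    /* 0: free, 1: held */

void init(lock_t *lock) { atomic_store(&lock->flag, 0); }

void lock(lock_t *lock) {
    /* atomically store 1 and get back the old value: test-and-set */
    while (atomic_exchange(&lock->flag, 1) == 1)
        ;   /* spin: the old value was 1, so another thread held the lock */
}

void unlock(lock_t *lock) { atomic_store(&lock->flag, 0); }
```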

TIP: THINK ABOUT CONCURRENCY AS A MALICIOUS SCHEDULER
From this example, you might get a sense of the approach you need to take to understand concurrent execution. What you should try to do is to pretend you are a malicious scheduler, one that interrupts threads at the most inopportune of times in order to foil their feeble attempts at building synchronization primitives. What a mean scheduler you are! Although the exact sequence of interrupts may be improbable, it is possible, and that is all we need to demonstrate that a particular approach does not work. It can be useful to think maliciously! (at least, sometimes)

### Evaluating Spin Locks

### Compare-And-Swap

The basic idea of compare-and-swap is to test whether the value at the address specified by ptr is equal to expected; if so, update the memory location pointed to by ptr with the new value; if not, do nothing. In either case, return the original value at that memory location, thus allowing the code calling compare-and-swap to know whether it succeeded or not.
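
A sketch of the same spin lock built on compare-and-swap, here via the GCC/Clang __sync_val_compare_and_swap() builtin:

```c
typedef struct { int flag; } lock_t;   /* 0: free, 1: held */

void lock(lock_t *lock) {
    /* store 1 only if flag was 0 (expected); returns the old value */
    while (__sync_val_compare_and_swap(&lock->flag, 0, 1) == 1)
        ;   /* spin: we saw 1, so the lock was already held */
}

void unlock(lock_t *lock) { lock->flag = 0; }
```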

Load-linked operates much like a typical load instruction, simply fetching a value from memory and placing it in a register. The key difference comes with the store-conditional, which only succeeds (and updates the value stored at the address just load-linked from) if no intervening store to the address has taken place. In the case of success, the store-conditional returns 1 and updates the value at ptr to value; if it fails, the value at ptr is not updated and 0 is returned.

The lock() code is the only interesting piece. First, a thread spins waiting for the flag to be set to 0 (thus indicating the lock is not held). Once so, the thread tries to acquire the lock via the store-conditional; if it succeeds, the thread has atomically changed the flag's value to 1 and thus can proceed into the critical section.
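
A toy model of that control flow; real load-linked/store-conditional is a CPU feature (e.g., on MIPS), so the two helpers below only mimic its behavior and are not actually atomic:

```c
#include <stdbool.h>

static int ll_value;                     /* toy: what load-linked last saw */

int LoadLinked(int *ptr) { return ll_value = *ptr; }

bool StoreConditional(int *ptr, int value) {
    if (*ptr != ll_value)
        return false;                    /* an intervening store happened */
    *ptr = value;
    return true;                         /* success: update performed */
}

typedef struct { int flag; } lock_t;

void lock(lock_t *lock) {
    while (1) {
        while (LoadLinked(&lock->flag) == 1)
            ;                            /* spin until the lock looks free */
        if (StoreConditional(&lock->flag, 1))
            return;                      /* we set it to 1 atomically: done */
        /* otherwise we lost the race: try it all over again */
    }
}

void unlock(lock_t *lock) { lock->flag = 0; }
```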

TIP: LESS CODE IS BETTER CODE (LAUER’S LAW)
Programmers tend to brag about how much code they wrote to do something. Doing so is fundamentally broken. What one should brag about, rather, is how little code one wrote to accomplish a given task. Short, concise code is always preferred; it is likely easier to understand and has fewer bugs. As Hugh Lauer said, when discussing the construction of the Pilot operating system: “If the same people had twice as much time, they could produce as good of a system in half the code.” [L81] We’ll call this Lauer’s Law, and it is well worth remembering. So next time you’re bragging about how much code you wrote to finish the assignment, think again, or better yet, go back, rewrite, and make the code as clear and concise as possible.

### Too Much Spinning: What Now?

THE CRUX: HOW TO AVOID SPINNING
How can we develop a lock that doesn’t needlessly waste time spinning on the CPU?

### A Simple Approach: Just Yield, Baby

### Using Queues: Sleeping Instead Of Spinning

ASIDE: MORE REASON TO AVOID SPINNING: PRIORITY INVERSION
One good reason to avoid spin locks is performance: as described in the main text, if a thread is interrupted while holding a lock, other threads that use spin locks will spend a large amount of CPU time just waiting for the lock to become available. However, it turns out there is another interesting reason to avoid spin locks on some systems: correctness. The problem to be wary of is known as priority inversion, which unfortunately is an intergalactic scourge, occurring on Earth [M15] and Mars [R97]!
Let’s assume there are two threads in a system. Thread 2 (T2) has a high scheduling priority, and Thread 1 (T1) has lower priority. In this example, let’s assume that the CPU scheduler will always run T2 over T1, if indeed both are runnable; T1 only runs when T2 is not able to do so (e.g., when T2 is blocked on I/O).
Now, the problem. Assume T2 is blocked for some reason. So T1 runs, grabs a spin lock, and enters a critical section. T2 now becomes unblocked (perhaps because an I/O completed), and the CPU scheduler immediately schedules it (thus descheduling T1). T2 now tries to acquire the lock, and because it can’t (T1 holds the lock), it just keeps spinning. Because the lock is a spin lock, T2 spins forever, and the system is hung.
Just avoiding the use of spin locks, unfortunately, does not avoid the problem of inversion (alas). Imagine three threads, T1, T2, and T3, with T3 at the highest priority, and T1 the lowest. Imagine now that T1 grabs a lock. T3 then starts, and because it is higher priority than T1, runs immediately (preempting T1). T3 tries to acquire the lock that T1 holds, but gets stuck waiting, because T1 still holds it. If T2 starts to run, it will have higher priority than T1, and thus it will run. T3, which is higher priority than T2, is stuck waiting for T1, which may never run now that T2 is running. Isn’t it sad that the mighty T3 can’t run, while lowly T2 controls the CPU? Having high priority just ain’t what it used to be.
You can address the priority inversion problem in a number of ways. In the specific case where spin locks cause the problem, you can avoid using spin locks (described more below). More generally, a higher-priority thread waiting for a lower-priority thread can temporarily boost the lower thread’s priority, thus enabling it to run and overcoming the inversion, a technique known as priority inheritance. A last solution is simplest: ensure all threads have the same priority

### Different OS, Different Support

The code in lowlevellock.h from the nptl library (part of the gnu libc library) is interesting for a few reasons. First, it uses a single integer to track both whether the lock is held and the number of waiters on the lock; thus, if the lock value is negative, the lock is held. Second, the code snippet shows how to optimize for the common case where there is no contention for the lock: with only one thread acquiring and releasing the lock, very little work is done.

## Lock-based Concurrent Data Structures

CRUX: HOW TO ADD LOCKS TO DATA STRUCTURES
When given a particular data structure, how should we add locks to it, in order to make it work correctly? Further, how do we add locks such that the data structure yields high performance, enabling many threads to access the structure at once, i.e., concurrently?

### Concurrent Counters
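
A sketch of the simple locked counter this section begins with (modeled on the chapter's counter structure):

```c
#include <pthread.h>

typedef struct {
    int value;
    pthread_mutex_t lock;
} counter_t;

void counter_init(counter_t *c) {
    c->value = 0;
    pthread_mutex_init(&c->lock, NULL);
}

void counter_increment(counter_t *c) {
    pthread_mutex_lock(&c->lock);    /* one lock guards the whole structure */
    c->value++;
    pthread_mutex_unlock(&c->lock);
}

int counter_get(counter_t *c) {
    pthread_mutex_lock(&c->lock);
    int rc = c->value;
    pthread_mutex_unlock(&c->lock);
    return rc;
}
```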

TIP: MORE CONCURRENCY ISN’T NECESSARILY FASTER
If the scheme you design adds a lot of overhead (for example, by acquiring and releasing locks frequently, instead of once), the fact that it is more concurrent may not be important. Simple schemes tend to work well, especially if they use costly routines rarely. Adding more locks and complexity can be your downfall. All of that said, there is one way to really know: build both alternatives (simple but less concurrent, and complex but more concurrent) and measure how they do. In the end, you can’t cheat on performance; your idea is either faster, or it isn’t.