CLRS Hash

2026-03-21

study-notes CS CLRS

以下内容参考过 Solutions to Introduction to Algorithms Third Edition

散列表是支持动态集合操作的数据结构，支持以下操作：

INSERT，将一个新的元素放入集合中。
DELETE，从集合中移除一个特定的元素。
SEARCH，在集合中查找一个特定的元素。

以下的定义和 CLRS 的定义略不同，CLRS 中的散列表就是动态集合，键是元素，值是卫星数据。

或者说，散列表是一种通过散列函数（Hash Function）将特定的键（Key）映射到表中的一个位置来访问记录的数据结构。

键值对是由一个键和一个值组成的数据项，其中键是唯一的标识符，而值是与该键相关联的数据。

Direct-address tables

全域 $U = \{0, 1, ..., m-1\}$ 指的是键的所有取值集合。

直接寻址法是一种简单的散列表实现方法，适用于全域范围较小且连续的情况。它使用一个数组来存储键值对，其中数组的索引直接对应于键的值。

class DirectAddressTable{
    int table[m]

    public:

    void insert(int key, int value):
        table[key] = value
    
    void delete(int key):
        table[key] = null

    int search(int key):
        return table[key]
}

显然三个操作时间复杂度为 O(1)。然而，直接寻址法的空间复杂度为 O(M)。如果键范围很大但实际使用的键较少，这种方法会浪费大量的空间。

11.1-1
Suppose that a dynamic set S isrepresented byadirect-address table T of length m.
Describe a procedure that finds the maximum element of S. What is the worst-case
performance of your procedure?

要找到动态集合 S 中的最大元素，可以直接从后向前遍历表 T，第一个非空元素就是最大的元素，索引就是顺序

11.1-2
A bit vector is simply an array of bits (0s and 1s). A bit vector of length m takes
much less space than an array of m pointers. Describe how to use a bit vector
to represent a dynamic set of distinct elements with no satellite data. Dictionary
operations should run in $O(1)$ time.

因为不需要存储指向卫星数据的指针，直接用 0 表示该值为索引值的元素不存在，用 1 表示存在，其他和直接寻址法一样。

11.1-3
Suggest how to implement a direct-address table in which the keys of stored el
ements do not need to be distinct and the elements can have satellite data. All
three dictionary operations (INSERT, DELETE, and SEARCH) should run in $O(1)$
time. (Don’t forget that DELETE takes as an argument a pointer to an object to be
deleted, not a key.)

直接寻址法且允许键不唯一，那么可以在每个索引位置创建一个双向链表，存储具有相同键的元素。

INSERT 操作将新元素添加到列表中，DELETE 操作直接删除该链表节点，SEARCH 操作返回链表中第一个匹配的元素。显然符合 O(1) 时间复杂度。

11.1-4 ?
Wewish to implement a dictionary by using direct addressing on a huge array. At
the start, the array entries may contain garbage, and initializing the entire array
is impractical because of its size. Describe a scheme for implementing a direct
address dictionary on a huge array. Each stored object should use $O(1)$ space;
the operations SEARCH, INSERT, and DELETE should take $O(1)$ time each; and
initializing the data structure should take $O(1)$ time. (Hint: Use an additional array,
treated somewhat like a stack whose size is the number of keys actually stored in
the dictionary, to help determine whether a given entry in the huge array is valid or
not.)

TLDR: $dense[sparse[key]] = key$

新建一个栈 $S$ ，存储实际的键值对，在大数组 $T$ 上按照直接寻址法存储指向 $S$ 的对应索引的指针。初始时，栈 $S$ 为空，大数组 $T$ 不需要处理。

通过检查 $S$ 所存的键是否与 $T$ 上的索引匹配来判断 $T$ 上的索引是否有效。（或者简单点，不在 $S$ 中存键只存值，直接以指针有效与否来判断，但脏数据可能意外与有效的指针碰撞）

INSERT 操作将键值对添加到栈顶，然后在大数组 $T$ 上存储指向该键值对的指针。
DELETE 操作将栈顶元素与要删除的元素交换，然后弹出栈顶元素，并在大数组 $T$ 上将待删除索引置空，将交换后的元素的索引更新为新的位置。
SEARCH 操作直接访问大数组 $T$ 上的对应索引，如果指针有效则返回对应的键值对。

我没想出来，尽管我在 Double-Array Trie 中已经见过类似的 “check” 机制…

这种东西似乎叫做 Sparse Set，稀疏集合。

除了 INSERT、DELETE 和 SEARCH 操作之外，实际上还有个隐含的操作是初始化。虽然一般而言，分配到的总是会已经清空的内存块，但在编译器优化方面非常有用，例如寄存器分配等。

见 https://www.geeksforgeeks.org/dsa/sparse-set/ https://research.swtch.com/sparse

Hash Table

显然直接寻址法的空间复杂度对于全域 $U = \{0, 1, ..., m-1\}$ 来说是 $O(M)$ 。

显然压缩的方法是将多个键映射到同一个索引位置，也就是散列函数 $h: U \to \{0, 1, ..., m-1\}$ ，其中 $m \ll |U|$ 。

那么显然会有冲突（Collision），也就是不同的键被映射到同一个索引位置。

除了如何使得散列函数分布均匀之外，接下来的问题就剩下如何处理冲突了。下面会提供两种常见的处理冲突的方法：链地址法（Separate Chaining）和开放地址法（Open Addressing）。

Collision resolution by chaining

顾名思义，就是在每个索引位置创建一个链表，存储所有被映射到该位置的键值对。

以下 $n$ 为链表长度

INSERT 操作将新元素添加到链表中，时间复杂度为 $O(1)$ 。
SEARCH 操作需要遍历链表，最坏情况下时间复杂度为 $O(n)$ 。
DELETE 操作：
- 已知元素指针：
- - 若双向链表，则直接删除，时间复杂度为 $O(1)$ 。
- - 若单向链表，则需要遍历链表找到前驱节点，时间复杂度为 $O(n)$ 。
- 已知元素键，必须先进行 SEARCH 操作找到元素指针，然后再按已知元素指针的方式删除，时间复杂度为 $O(n)$ 。

下面进行具体分析：

以下 $n$ 为散列表 $T$ 的总元素数量， $m$ 为槽位的数量。

假设描述链的平均长度——散列表 $T$ 的装载因子 $\alpha = \frac{n}{m}$ ，其中 $n$ 是存储的元素数量， $m$ 是散列表的大小。

若所有键存储在一个槽位（Solt）中，则装载因子 $\alpha = n, (0 \leq \alpha \leq 1)$ ，时间复杂度为 $O(n)$ 。

假设散列函数 $h$ 将每个键均匀地随机分布在 $m$ 个槽位中，称为简单均匀散列（Simple Uniform Hashing），那么，对于 $j = 0, 1, ..., m-1$ ，槽位 $T[j]$ 中的元素数量 $n_j$ 的期望值为 $E[n_j] = \frac{n}{m} = \alpha$ 。

假设计算散列函数 $h(k)$ 的时间复杂度为 $O(1)$ ，那么：

在简单均匀散列的假设下，对于链表法，一次不成功 SEARCH 操作的期望时间复杂度为 $O(1 + \alpha)$ 。

由于任意键均匀分布在任意槽位上，而任意槽位的链表长度为 $\alpha$ ，而遍历完链表才能确定不成功，所以包括计算散列函数的时间复杂度在内，时间复杂度为 $O(1 + \alpha)$ 。

在简单均匀散列的假设下，对于链表法，一次成功 SEARCH 操作的期望时间复杂度为 $O(1 + \alpha)$ 。

设 $x_i$ 是第 $i$ 个被插入的元素， $i = 1, 2, ..., n$ ，设 $k_i = x_i.key$ 是元素 $x_i$ 的键，定义 $X_ij = I\{h(k_i) = h(k_j)\}$ ，在简单均匀散列的假设下，两个元素冲突的期望值为 $E[X_ij] = P\{h(k_i) = h(k_j)\} = \frac{1}{m}$ 。

由于比 $x_i$ 早插入的元素不可能在 $x_i$ 的查找路径上，只有比 $x_i$ 后插入的才可能在查找路径上（因为链表是头插法），但不一定会在查找路径上，只有当 $x_i$ 和 $x_j$ 冲突时才会在查找路径上，所以：

\sum_{j=i+1}^n E[X_ij] = \sum_{j=i+1}^n P\{h(k_i) = h(k_j)\} = \sum_{j=i+1}^n \frac{1}{m} = \frac{n - i}{m}

那么所有元素的期望查找长度为：

E = \frac{1}{n} \sum_{i=1}^n (1 + \sum_{j=i+1}^n E[X_ij]) = 1 + \frac{1}{nm} \sum_{i=1}^n (n - i) = 1 + \frac{1}{nm} ( \sum_{i=1}^n n - \sum_{i=1}^n i ) = 1 + \frac{1}{nm} ( n^2 - \frac{n(n+1)}{2} ) = 1 + \frac{n-1}{2m} = 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}

所以一次成功 SEARCH 操作的期望时间复杂度为 $O(1 + \frac{\alpha}{2} - \frac{\alpha}{2n}) = O(1 + \alpha)$ 。

若散列表的槽数至少与元素数量成正比，也就是动态调整散列表大小以保持装载因子 $\alpha$ 在一个常数范围内，那么 $\alpha = \frac{n}{m} = \frac{O(m)}{m} = O(1)$ ，使得采用双向链表的情况下，三种操作的平均时间复杂度为 $O(1)$ 。

11.2-1
Suppose we use a hash function h to hash n distinct keys into an array T of
length m. Assuming simple uniform hashing, what is the expected number of
collisions? More precisely, what is the expected cardinality of $\{\{k, l\}: k \neq l, h(k) = h(l)\}$ ?

上文提到 $E[X_ij] = P\{h(k_i) = h(k_j)\} = \frac{1}{m}$ ，所以：

E = \sum_{i=1}^n \sum_{j=i+1}^n E[X_ij] = \sum_{i=1}^n \sum_{j=i+1}^n P\{h(k_i) = h(k_j)\} = \sum_{i=1}^n \sum_{j=i+1}^n \frac{1}{m} = \frac{n(n-1)}{2m}

11.2-2
Demonstrate what happens when we insert the keys 5,28,19,15,20,33,12,17,10
into a hash table with collisions resolved by chaining. Let the table have 9 slots,
and let the hash function be $h(k) = k \bmod 9$ .

$h(k) = k \bmod 9$ 的结果是：

$h(5) = 5$
$h(28) = 1$
$h(19) = 1$
$h(15) = 6$
$h(20) = 2$
$h(33) = 6$
$h(12) = 3$
$h(17) = 8$
$h(10) = 1$

所以槽位上应该是:

0: null
1: 10 -> 19 -> 28
2: 20
3: 12
4: null
5: null
6: 33 -> 15
7: null
8: 17

11.2-3
Professor Marley hypothesizes that he can obtain substantial performance gains by
modifying the chaining scheme to keep each list in sorted order. How does the pro
fessor’s modification affect the running time for successful searches, unsuccessful
searches, insertions, and deletions?

成功搜索没有区别（链表不能二分），是 $\Theta(1 + \alpha)$
失败搜索更快（当已搜索值大于/小于可以直接判断失败），但应该依旧是 $\Theta(1 + \alpha)$
插入更慢，是 $\Theta(1 + \alpha)$ （需要找到插入点）
删除和之前相同

11.2-4
Suggest how to allocate and deallocate storage for elements within the hash table
itself by linking all unused slots into a free list. Assume that one slot can store
a flag and either one element plus a pointer or two pointers. All dictionary and
free-list operations should run in $O(1)$ expected time. Does the free list need to be
doubly linked, or does a singly linked free list suffice?

注意 one slot can store a flag and either one element plus a pointer or two pointers 指的是一个槽位可以存一个标志位和 “一个元素+一个指针” 或者 “两个指针”

既然需要自由链表操作为 $O(1)$ ，那么绕不开 DELETE 需要双向链表，也就是作为自由槽位的时候，将 flag 设置为 0，必须有两个指针指向前驱和后驱节点。

但是当其存储元素，作为冲突链，不作为自由槽位的时候，需要有指向下一个元素的指针。插入时应该从自由链表弹出一个节点，头插到冲突链；删除时如果拿到的是元素（冲突链节点）的指针，没法获得前驱节点，只能作为单链表在 $O(1 + \alpha)$ 删除。

但注意 allocate and deallocate storage for elements within the hash table itself 和 $O(1)$ expected time

意思是 $n \le m$ ，那么 $\alpha = \frac{n}{m} \le 1$ ，确实 $O(1 + \alpha) = O(1)$ 。

这么细…？

11.2-5
Suppose that we are storing a set of $n$ keys into a hash table of size $m$ . Show that if
the keys are drawn from a universe $U$ with $|U| \gt nm$ ,then $U$ hasasubset of size n
consisting of keys that all hash to the same slot, so that the worst-case searching
time for hashing with chaining is $\Theta(n)$ .

设 $U_j = \{x∈U∣h(x)=j\}$ ，那么 $\sum_{j=1}^{m} U_j = |U| \gt nm$

那么 $E(|U_j|) = \frac{\sum_{j=1}^{m} U_j}{m} \gt n$ ，显然存在 $|U_j| \gt n$

所以确实存在大小为 $n$ 的子集，全部映射到同一个槽。

11.2-6
Suppose wehave stored $n$ keys in ahash table of size $m$ , with collisions resolved by
chaining, and that we know the length of each chain, including the length $L$ of the
longest chain. Describe a procedure that selects a key uniformly at random from
among the keys in the hash table and returns it in expected time $O(L \cdot (1 + \frac{1}{\alpha}))$ .

查到是类似蒙特卡洛拒绝采样，IGNORED

Hash functions

A good hash function satisfies (approximately) the assumption of simple uniform
hashing: each key is equally likely to hash to any of the m slots, independently of
where any other key has hashed to.

一个好的哈希函数应（尽可能）满足简单均匀散列假设：每个键等可能映射到任意槽位，且与其他键映射到哪个槽位无关

哈希表的性能主要取决于冲突的分布情况。在满足简单均匀散列假设的前提下，对于任意大小为 $n$ 的键子集，哈希值在槽位间均匀分布，保证期望搜索时间为 $O(1)$ 。

然而在实际应用中，输入键往往并非从全域中均匀随机选取，而是可能包含特定模式或高度相似的数据。这意味着哈希函数的设计必须在输入键非常相似或存在规律时，也能将其映射为截然不同的哈希值。

Interpreting keys as natural numbers

数组索引只能是自然数，但键可能是字符串，需要将其转为自然数。可以用 ASCII 或者 UTF-8 等。

The division method

简而言之就是 $h(k) = k \mod m$ 。

重点在于如何选取 $m$ 。书中提到 $m = 2^p$ 时， $h(k)$ 实际上就是 $k$ 的二进制表示的低 $k$ 位数字； $m = 2^p - 1$ 时，如果字符串是以 $2^p$ 为基数来编码字符， $h(k)$ 就是取所有字符编码的和再模 $m$ 。（详情见 11.3-3）

The multiplication method

简而言之是 $h(k) = \lfloor m (k A \mod 1) \rfloor$ ，其中 $k A \mod 1$ 指的是取 $kA$ 的小数部分。

书中的内容是用位运算代替浮点数计算。首先要求 $m = 2^p$ ，设字长为 $w$ ，限制 $A = \frac{s}{2^w}, 0 \lt s \lt 2^w$ ， $k \cdot s$ 得到 $r_1 2^w + r_0$ ，两个长度为 $w$ 的整数相乘结果长度为 $2w$ ，其中 $r_1$ 是乘积的高位字，$r_0 是乘积的低位字，最后取 $r_0$ 的最高 $p$ 位。

~~虽然我不知道为什么~~，书中提到推荐 $A \approx (\sqrt 5 - 1) / 2 = 0.618 \dots$ 。

Universal Hashing

以上两种方法都可以针对性的进行碰撞，如果知道 $m$ 或者 $A$ 和 $m$ ，因为它们是确定性散列函数。

对于一组有限散列函数 $\mathcal{H}$ ，能将键映射到 $\{0, 1, 2, 3 \dots, m - 1 \}$ ，且对于不同的键 $k, l \in U$ ，满足 $h(k) = h(l)$ 的散列函数不超过 $| \mathcal{H} | / m$ ，也就是任选散列函数，发生 $h(k) = h(l)$ 的概率不超过 $1 / m$ 。

$n_i$ 表示链表 $T[i]$ 的长度。

证明：如果 $h$ 选自 $\mathcal{H}$ ，将 $n$ 个键散列到大小为 $m$ 的表 $T$ 中，用链接法处理冲突。如果 $k$ 不在表中，则 $k$ 被散列到的链表期望长度 $E[n_h(k)]$ 至多为 $\alpha = n / m$ 。如果 $k$ 在表中，则包含 $k$ 的链表的期望长度 $E[n_h(k)]$ 至多为 $1 + \alpha$