集合覆盖贪心算法

在即将结课的斯坦福算法公开课上我遇到了一些特别难的题目(画风突变的那种难),觉得还蛮有意思,因此做一个随手的记录。

问题 1

问题描述

In the set cover problem, you are given m sets $ S_1,...,S_m$ each a subset of a common set \(U\) with size \(|U|=n\). The goal is to select as few of these sets as possible, subject to the constraint that the union of the chosen sets is all of \(U\). (You can assume that $ ∪^m_{i=1} S_i=U $.) For example, if the given sets are {1,2}, {2,3}, and {3,4}, then the optimal solution consists of the sets {1,2} and {3,4}.

Here is a natural iterative greedy algorithm. First, initialize \(C=\emptyset\), where \(C\) denotes the sets chosen so far. The main loop is: as long as the union \(∪_{S \in C}S\) of the sets chosen so far is not the entire set \(U\):

Let \(S_i\) be a set that includes the maximum-possible number of elements not in previously-chosen sets (i.e., that maximizes \(|S_i−∪_{S \in C}S|\)). Add \(S_i\) to \(C\).

Consider the following statement: for every instance of the set cover problem (with \(|U|=n\)), this greedy algorithm returns a set cover with size at most \(f(n)\) times that of an optimal (minimum-size) set cover. Which of the following is the smallest choice of the function \(f(n)\) for which this statement is true?

  • 2
  • \(O(log \ n)\)
  • \(O(\sqrt{n})\)
  • \(O(n)\)

例子

首先来个例子

\[\begin{split} U&=[1,16],U \in N \\\ S_1 &= \{1\} \\\ S_2 &= \{16\} \\\ S_3 &= \{2,15\} \\\ S_4 &= \{3,4,13,14\} \\\ S_5 &= \{5,6,7,8,9,10,11,12\} \\\ S_6 &= \{9,10,11,12,13,14,15,16\} \\\ S_7 &= \{1,2,3,4,5,6,7,8\}\end{split}\]

显然,最优解应该是集合 \(S_6,S_7\)。但在 Greedy algorithm 中,我们可能会依次选择 \(S_5,S_4,S_3,S_2,S_1\)。直觉上我们基本可以判断 \(f(n)=O(log \ n)\)

解答

证明的关键在于 Greedy algorithm 每一次迭代中能新包含的元素个数与最优解的不等关系。我们设最优解(在上面例子中是 2,因为选择了 \(S_6,S_7\) 两个集合)为 \(OPT\)。设在第 \(i\) 次迭代运算之前,尚未包含的元素个数为 \(x_i\),在 \(i\) 次迭代运算时,选择的集合为 \(S_i\),新包含的元素个数为 \(y_i\),注意 \(y_i \le |S_i|,x_{i+1}=x_i-y_i\)。Greedy algorithm 经过 \(k\) 次运算,最终得到的结果是 \(k\),即结果为 \({S_1,S_2,...,S_k}\)

引理

\[\frac{x_i}{OPT} \le y_i\]

证明:当 \(i=1\) 时,此时 \(C=\emptyset\), Greedy algorithm 会选择包含元素最大的子集,记为 \(S_1=S_{max}\),那么此时 \(y_1=|S_{max}|\)

因为 \(|S_{max}| \ge S_1,S_2,...,S_m\)\(x_1=n\),显然有 \(\frac{x_1}{OPT} \le y_1\)

接下来只要证明当 \(\frac{x_i}{OPT} \le y_i\) 时,\(\frac{x_{i+1}}{OPT} \le y_{i+1}\) 即可。

假设我们已经进行了 \(i\) 次迭代。我们定义一个子问题,记 \(S = S_1 \bigcup S_2 \bigcup ...S_i\)\(\hat{S_{i+1}}= S_{i+1}-S,\hat{S_{i+2}}= S_{i+2}-S,...\hat{S_{m}}= S_{m}-S\)。子问题即在这些子集 \(\hat{S_{i+1}},\hat{S_{i+2}},...\hat{S_{m}}\) 求最小覆盖。定义该子问题的最优解为 \(\hat{OPT}\),显然我们有 \(OPT \ge \hat{OPT}\)。根据前述,我们有 \(\frac{\hat{x_1}}{\hat{OPT}} \le \hat{y_1}\)。而且根据子问题定义有 \(x_{i+1}=\hat{x_1},y_{i+1}=\hat{y_1}\),所以

\[\frac{x_{i+1}}{OPT} \le \frac{x_{i+1}}{\hat{OPT}} =\frac{\hat{x_1}}{\hat{OPT}} \le \hat{y_1} = y_{i+1}\]

证毕。

计算

对于引理,我们可以理解为每一次迭代都会新包含剩下未包含元素的 \(\frac 1{OPT}\) 。那么有

\[\begin{split} n(1-\frac 1{OPT})^k < 1 \\\ n(1- \frac 1{OPT})^{k-1} > 1\\\ n(1-\frac 1{OPT})^k \approx 1 \\\ n(1- \frac 1{OPT})^{OPT} \approx n \frac 1 e \end{split}\]

上述的最后两个式子取 \(log\) 再相除就有 \(f(n) = \frac k {OPT} = O(log \ n)\)

问题 2

Suppose you are given m sets $ S_1,...,S_m$, each a subset of a common set \(U\). The goal is to choose 2 of the m sets, \(S_i\) and \(S_j\), to maximize the size \(|S_i∪S_j|\) of their union. One natural heuristic is to use a greedy algorithm:

  1. choose the first set \(S_i\) to be as large as possible, breaking ties arbitrarily;

  2. choose the second set Sj to maximize \(|S_i∪S_j|\) (i.e., as the set that includes as many elements as possible that are not already in \(S_i\)), again breaking ties arbitrarily.

For example, if the given sets are {1,2}, {2,3}, and {3,4}, then the algorithm might pick the set {1,2} in the first step; if it does so, it definitely picks the set {3,4} in the second step (for an objective function value of 4).

Consider the following statement: for every instance of the above problem, the greedy algorithm above chooses two sets \(S_i,S_j\) such that \(|S_i∪S_j|\) is at least \(c\) times the maximum size of the union of two of the given sets. Which of the following is the largest choice of the constant \(c\) for which this statement is true?

  • 2/3
  • 1
  • 1/2
  • 3/4

例子

还是先来一个例子说明一下。

\[\begin{split} U&=[1,8],U \in N \\\ S_1 &= \{2,7\} \\\ S_2 &= \{3,4,5,6\} \\\ S_3 &= \{1,2,3,4\} \\\ S_4 &= \{5,6,7,8\} \end{split}\]

最优解为 \(|S_3 \bigcup S_4|=8\),但是在 Greedy algorithm 中,我们可能会依次选择 \(|S_2 \bigcup S_1|=6\)。直觉上,我们可以判断 3/4 是正确答案。

解答

设最优解其选择的集合为 \(S_i,S_j\)。在 Greedy algorithm 中,选择的集合依次是 \(S_1,S_2\)。根据定义,\(S_1\) 是所有集合中最包含元素最多的。那么就有

实际上这里问题 2 可以转化为问题 1。即只要记 \(U = S_1 \bigcup S_2 \bigcup S_i \bigcup S_j\)。此时,问题 1 中的 \(OPT=2\)。根据之前的结论我们有

\[\begin{split} \frac {x_1}2 \le y_1 \\\ \frac {x_2}2 \le y_2 \end{split}\]

其中 \(x_1 = |S_i \bigcup S_j|,y_1 = |S_1|,x_2 = |S_i \bigcup S_j|-|S_1|,y_2=|S_1 \bigcup S_2| - |S_1|\),所以有

\[\begin{split} |S_1| \ge \frac{|S_i \bigcup S_j|}2 \\ 2|S_1 \bigcup S_2| - |S_1| \ge |S_i \bigcup S_j|\end{split}\]

将上式相加

\[2 |S_1 \bigcup S_2| \ge \frac32 |S_i \bigcup S_j|\]

因此得到答案 3/4