196 | 196 | "StaffScheduling": [Staff Scheduling], |
197 | 197 | "SteinerTree": [Steiner Tree], |
198 | 198 | "SteinerTreeInGraphs": [Steiner Tree in Graphs], |
| 199 | + "MinimumExternalMacroDataCompression": [Minimum External Macro Data Compression], |
199 | 200 | "StringToStringCorrection": [String-to-String Correction], |
200 | 201 | "StrongConnectivityAugmentation": [Strong Connectivity Augmentation], |
201 | 202 | "SubgraphIsomorphism": [Subgraph Isomorphism], |
@@ -4983,6 +4984,88 @@ A classical NP-complete problem from Garey and Johnson @garey1979[Ch.~3, p.~76], |
4983 | 4984 | ] |
4984 | 4985 | } |
4985 | 4986 |
|
| 4987 | +#{ |
| 4988 | + let x = load-model-example("MinimumExternalMacroDataCompression") |
| 4989 | + let alpha-size = x.instance.alphabet_size |
| 4990 | + let s = x.instance.string |
| 4991 | + let n = s.len() |
| 4992 | + let h = x.instance.pointer_cost |
| 4993 | + let alpha-map = range(alpha-size).map(i => str.from-unicode(97 + i)) |
| 4994 | + let s-str = s.map(c => alpha-map.at(c)).join("") |
| 4995 | + let opt-val = metric-value(x.optimal_value) |
| 4996 | + [ |
| 4997 | + #problem-def("MinimumExternalMacroDataCompression")[ |
| 4998 | + Given a finite alphabet $Sigma$ of size $k$, a string $s in Sigma^*$ of length $n$, and a pointer cost $h in ZZ^+$, find a dictionary string $D in Sigma^*$ and a compressed string $C in (Sigma union {p_1, dots, p_n})^*$, where each $p_i$ is a pointer referencing a contiguous substring of $D$, such that $s$ can be obtained from $C$ by replacing every pointer with its referenced substring, minimizing the total cost $|D| + |C| + (h - 1) times$ (number of pointer occurrences in $C$). |
| 4999 | + ][ |
| 5000 | + A classical NP-hard data compression problem, listed as SR22 in Garey and Johnson @garey1979. The macro model of data compression was introduced by #cite(<storer1977>, form: "prose"), who proved NP-completeness via transformation from Vertex Cover. #cite(<storer1982>, form: "prose") provided a comprehensive analysis of the macro compression framework, showing that NP-completeness persists even when $h$ is any fixed integer $gt.eq 2$, when the alphabet has $gt.eq 3$ symbols, and when $D$ contains no pointers (the "external" variant). Practical LZ-family compression algorithms (LZ77, LZSS, LZ78) are restricted forms of this general macro model. The related Smallest Grammar Problem is APX-hard @charikar2005.#footnote[No algorithm improving on brute-force enumeration is known for optimal external macro compression.] |
| 5001 | + |
| 5002 | + *Example.* Let $Sigma = {#alpha-map.join(", ")}$ and $s = #s-str$ (length #n) with pointer cost $h = #h$. |
| 5003 | + |
| 5004 | + #pred-commands( |
| 5005 | + "pred create --example MinimumExternalMacroDataCompression -o min-emdc.json", |
| 5006 | + "pred solve min-emdc.json", |
| 5007 | + "pred evaluate min-emdc.json --config " + x.optimal_config.map(str).join(","), |
| 5008 | + ) |
| 5009 | + |
| 5010 | + #figure({ |
| 5011 | + let blue = graph-colors.at(0) |
| 5012 | + let green = graph-colors.at(1) |
| 5013 | + let cell(ch, highlight: false, ptr: false) = { |
| 5014 | + let fill = if ptr { green.transparentize(70%) } else if highlight { blue.transparentize(70%) } else { white } |
| 5015 | + box(width: 0.5cm, height: 0.55cm, fill: fill, stroke: 0.5pt + luma(120), |
| 5016 | + align(center + horizon, text(8pt, weight: "bold", ch))) |
| 5017 | + } |
| 5018 | + let ptr-cell(label) = { |
| 5019 | + box(width: 1.5cm, height: 0.55cm, fill: green.transparentize(70%), stroke: 0.5pt + luma(120), |
| 5020 | + align(center + horizon, text(7pt, weight: "bold", label))) |
| 5021 | + } |
| 5022 | + // D = first d-len symbols of s (one copy of the repeating pattern; assumes pattern length = alphabet size) |
| 5023 | + let d-len = alpha-size |
| 5024 | + let d-syms = s.slice(0, d-len) |
| 5025 | + // C = n / d-len pointers, each referencing D[0..d-len] |
| 5026 | + let num-ptrs = calc.div-euclid(n, d-len) |
| 5027 | + align(center, stack(dir: ttb, spacing: 0.5cm, |
| 5028 | + // Source string |
| 5029 | + stack(dir: ltr, spacing: 0pt, |
| 5030 | + box(width: 1.5cm, height: 0.5cm, align(right + horizon, text(8pt)[$s: quad$])), |
| 5031 | + ..s.map(c => cell(alpha-map.at(c))), |
| 5032 | + ), |
| 5033 | + // Dictionary D |
| 5034 | + stack(dir: ltr, spacing: 0pt, |
| 5035 | + box(width: 1.5cm, height: 0.5cm, align(right + horizon, text(8pt)[$D: quad$])), |
| 5036 | + ..d-syms.map(c => cell(alpha-map.at(c), highlight: true)), |
| 5037 | + ), |
| 5038 | + // Compressed string C = num-ptrs pointers |
| 5039 | + stack(dir: ltr, spacing: 0pt, |
| 5040 | + box(width: 1.5cm, height: 0.5cm, align(right + horizon, text(8pt)[$C: quad$])), |
| 5041 | + ..range(num-ptrs).map(_ => ptr-cell[$arrow.r D[0..#d-len]$]), |
| 5042 | + ), |
| 5043 | + )) |
| 5044 | + }, |
| 5045 | + caption: [Minimum External Macro Data Compression: with $s = #s-str$ (length #n) and pointer cost $h = #h$, the optimal compression stores $D = #s-str.slice(0, alpha-size)$ (#alpha-size symbols) and uses #calc.div-euclid(n, alpha-size) pointers in $C$, achieving cost $#alpha-size + #calc.div-euclid(n, alpha-size) + (#h - 1) times #calc.div-euclid(n, alpha-size) = #opt-val$ vs.~uncompressed cost #n.], |
| 5046 | + ) <fig:emdc> |
| 5047 | + |
| 5048 | + This instance has a repeating pattern of length #alpha-size, allowing the dictionary $D$ to store one copy and the compressed string $C$ to reference it via pointers. Each pointer costs $h = #h$ (the pointer symbol itself plus $h - 1 = #(h - 1)$ extra), so the total cost is $|D| + |C| + (h - 1) times |"pointers"| = #alpha-size + #calc.div-euclid(n, alpha-size) + #(h - 1) times #calc.div-euclid(n, alpha-size) = #opt-val$, saving $#(n - int(opt-val))$ over the uncompressed cost of #n. |
| 5049 | + ] |
| 5050 | + ] |
| 5051 | +} |
| 5052 | + |
| 5053 | +#reduction-rule("MinimumExternalMacroDataCompression", "ILP")[ |
| 5054 | + The compression problem decomposes into a dictionary selection (which symbols appear at which positions in $D$) and a string partitioning (which segments of $s$ are literals vs.~pointers). Both are naturally expressed with binary variables and linear constraints. The partition structure is modeled as a flow on a DAG whose nodes are string positions and whose arcs are candidate segments. |
| 5055 | +][ |
| 5056 | + _Construction._ For alphabet $Sigma$ of size $k$, string $s$ of length $n$, and pointer cost $h$: |
| 5057 | + |
| 5058 | + _Variables:_ (1) Binary $d_(j,c) in {0,1}$ for each dictionary position $j in {0, dots, n-1}$ and symbol $c in Sigma$: $d_(j,c) = 1$ iff $D[j] = c$. (2) Binary $u_j in {0,1}$: $u_j = 1$ iff dictionary position $j$ is used. (3) Binary $ell_i in {0,1}$ for each string position $i$: $ell_i = 1$ iff position $i$ is covered by a literal. (4) Binary $p_(i,lambda,delta) in {0,1}$ for each valid triple $(i, lambda, delta)$ with $lambda >= 1$, $i + lambda <= n$, and $delta + lambda <= n$: $p_(i,lambda,delta) = 1$ iff positions $[i, i + lambda)$ are covered by a pointer referencing $D[delta .. delta + lambda)$. |
| 5059 | + |
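| 5060 | + _Constraints:_ (1) Dictionary one-hot (at most one symbol per position): $sum_(c in Sigma) d_(j,c) <= 1$ for all $j$. (2) Linking: $d_(j,c) <= u_j$ for all $j, c$. (3) Contiguity: $u_(j+1) <= u_j$ for all $j < n - 1$. (4) Partition flow: each literal $ell_i$ is an arc from node $i$ to node $i + 1$ and each pointer $p_(i,lambda,delta)$ is an arc from node $i$ to node $i + lambda$; one unit of flow leaves node $0$, enters node $n$, and is conserved at every intermediate node, so the selected segments partition ${0, dots, n-1}$. (5) Pointer matching: $p_(i,lambda,delta) <= d_(delta+r, s[i+r])$ for all offsets $r in {0, dots, lambda - 1}$. |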
| 5061 | + |
| 5062 | + _Objective:_ Minimize $sum_j u_j + sum_i ell_i + h sum_(i,lambda,delta) p_(i,lambda,delta)$. |
| 5063 | + |
| 5064 | + _Correctness._ ($arrow.r.double$) An optimal $(D, C)$ pair determines a feasible ILP assignment: set $d_(j,c) = 1$ for each symbol in $D$, $u_j = 1$ for used positions, and activate the corresponding literal or pointer variables for each $C$-slot. The partition flow is satisfied by construction. ($arrow.l.double$) Any feasible ILP solution defines a valid dictionary (one-hot + contiguity) and a valid partition of $s$ into literal and pointer segments (flow conservation + matching), with cost equal to the objective. |
| 5065 | + |
| 5066 | + _Solution extraction._ Read $D$ from the $d_(j,c)$ indicators. Walk through the active segments (via $ell_i$ and $p_(i,lambda,delta)$) to reconstruct $C$. |
| 5067 | +] |
| 5068 | + |
4986 | 5069 | #{ |
4987 | 5070 | let x = load-model-example("MinimumFeedbackArcSet") |
4988 | 5071 | let nv = x.instance.graph.num_vertices |
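
The cost model in the problem definition above, $|D| + |C| + (h - 1) times$ (number of pointers), can be sanity-checked outside Typst with a small standalone sketch. This is not part of the diff and does not use the repository's `pred` tooling. The `Pointer` class, the `expand` and `cost` helpers, and the instance used in the usage note (`"abcdef"` repeated three times with `h = 2`) are hypothetical names and values chosen for illustration; the real example instance is loaded by `load-model-example` and is not shown in this hunk.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Pointer:
    """Hypothetical pointer slot: references D[start : start + length]."""
    start: int
    length: int


# A slot of C is either a literal symbol (a 1-character string) or a Pointer.
Slot = Union[str, Pointer]


def expand(d: str, c: List[Slot]) -> str:
    """Rebuild s from (D, C) by replacing each pointer with its substring of D."""
    parts = []
    for slot in c:
        if isinstance(slot, Pointer):
            end = slot.start + slot.length
            if slot.length < 1 or slot.start < 0 or end > len(d):
                raise ValueError(f"pointer {slot} falls outside D")
            parts.append(d[slot.start:end])
        else:
            parts.append(slot)
    return "".join(parts)


def cost(d: str, c: List[Slot], h: int) -> int:
    """|D| + |C| + (h - 1) * (number of pointer occurrences in C)."""
    num_pointers = sum(isinstance(slot, Pointer) for slot in c)
    return len(d) + len(c) + (h - 1) * num_pointers


def ilp_objective(d: str, c: List[Slot], h: int) -> int:
    """Same cost written as in the ILP: used dictionary cells + literals + h per pointer."""
    num_pointers = sum(isinstance(slot, Pointer) for slot in c)
    num_literals = len(c) - num_pointers
    return len(d) + num_literals + h * num_pointers
```

With the assumed instance, `expand("abcdef", [Pointer(0, 6)] * 3)` rebuilds the 18-symbol string, and `cost("abcdef", [Pointer(0, 6)] * 3, 2)` evaluates to 6 + 3 + 1 × 3 = 12, against 18 for the all-literal encoding with an empty dictionary. `ilp_objective` returns the same value, which mirrors why the ILP objective charges $h$ per pointer rather than $h - 1$: the pointer's own slot in $C$ is folded into that coefficient.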
|