codingjourney.github.io/the-levenshtein-puzzle.html at master · codingjourney/codingjourney.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
<!DOCTYPE html>
<html lang="en">
<head>
                <title>meandering journey</title>
        <meta charset="utf-8" />
        <link href="./theme/css/main.css" type="text/css" rel="stylesheet" />
                <link href="http://meandering.journey.sk/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="meandering journey Full Atom Feed" />
                                                <link href="http://meandering.journey.sk/feeds/misc.atom.xml" type="application/atom+xml" rel="alternate" title="meandering journey Categories Atom Feed" />
                                        </head>

<body id="index" class="home">
  <div class="all">
  <div class="extra-info">

    <aside>
  <h3>About the blog</h3>
  A platform to practice my written communication skills.
  The range of topics tends to surprise even myself.
</aside>
<aside>
  <h3>About the author</h3>
  My name is Ján Hušták. I live near
  <a href="http://maps.google.com/maps?q=bratislava&z=6">Bratislava</a>.
  I've been developing software professionally since 1998.
  The Java platform has served me well but I don't dwell on it.
</aside>
<aside>
  <h3>Links</h3>
  <a href="http://www.journey.sk">Main site</a>,
  <a href="http://coding.journey.sk">projects page</a>,
  <a href="https://github.com/codingjourney">GitHub</a>.
  Sorry, no social networks. I do read mail sent to
  coding &#97;&#116; journey.sk.
</aside>
    <aside id="tags">
  <h3>Tags</h3>
          <a href="./tag/motivation.html">motivation</a>
      -     <a href="./tag/htpc.html">HTPC</a>
      -     <a href="./tag/openbsd.html">OpenBSD</a>
      -     <a href="./tag/qt.html">Qt</a>
      -     <a href="./tag/upsheet.html">upsheet</a>
      -     <a href="./tag/python.html">Python</a>
      -     <a href="./tag/kde.html">KDE</a>
      -     <a href="./tag/cloud-computing.html">cloud computing</a>
      -     <a href="./tag/caldav.html">CalDAV</a>
      -     <a href="./tag/howto.html">howto</a>
      -     <a href="./tag/jetty.html">Jetty</a>
      -     <a href="./tag/craftsmanship.html">craftsmanship</a>
      -     <a href="./tag/meta.html">meta</a>
      -     <a href="./tag/music.html">music</a>
      -     <a href="./tag/it-misadventures.html">IT misadventures</a>
      -     <a href="./tag/algorithms.html">algorithms</a>
      -     <a href="./tag/android.html">Android</a>
      -     <a href="./tag/cups.html">CUPS</a>
      -     <a href="./tag/security.html">security</a>
      -     <a href="./tag/html5.html">HTML5</a>
    </aside>
    <aside class="links">
  <h3>Recent articles</h3>
      <a href="./much-more-fun-with-planning-poker.html">Much more fun with Planning poker</a>
      <a href="./the-child-that-grew-too-fast.html">The child that grew too fast</a>
      <a href="./mare-nostrum-at-konzerthaus.html">Mare Nostrum at Konzerthaus</a>
      <a href="./too-much-fun-with-planning-poker.html">Too much fun with Planning poker</a>
      <a href="./long-time-no-blog.html">Long time no blog</a>
      <a href="./what-i-did-last-summer.html">What I did last summer</a>
      <a href="./october-2013-is-here.html">October 2013 is here</a>
      <a href="./october-2013.html">October 2013</a>
    </aside>

  </div><!-- /.extra-info -->
  <div class="main-column">
  <header id="banner" class="body">
    <h1><a href=".">meandering<img src="./theme/images/logo.png"/>journey</a></h1>
  </header><!-- /#banner -->
        <section id="content" class="body">
  <header>
    <h3>
      <a href="the-levenshtein-puzzle.html" rel="bookmark"
         title="Permalink to The Levenshtein puzzle">The Levenshtein puzzle</a></h2>

  </header>
  <footer class="post-info">
        Published on <abbr class="published" title="2013-06-09T04:30:00"> Sun 09 June 2013 </abbr> under
        <a href="./tag/algorithms.html">algorithms</a>      </footer><!-- /.post-info -->
  <div class="entry-content">
    <p>I've read a few blog posts recently that mentioned the concept of Levenshtein distance. It's a measure of the difference between two strings <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">defined as</a> <em>"the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other"</em>. The definition is very straightforward but when I thought about calculating it I saw no immediately obvious way. Rather than looking it up, I decided to discover an algorithm on my own, with nothing but the definition to start from.</p>
<p>After a period of requisite head-scratching and ad-hoc attempts I identified two trivial corner cases: identical words (<em>d=0</em>) and words with no letters in common (<em>d=max(first.length, second.length)</em>, let's call them <em>orthogonal</em>). Then came the crucial realization: any pair of words can be chopped up into a sequence of identical and orthogonal sub-words:</p>
<p style="font-family: monospace">
d<span style="color: #e08020">ar</span>t<span style="color: red">e</span>d<br/>
st<span style="color: #e08020">ar</span><span style="color: red">e</span>
</p>

<p>The total distance is then the sum of the distances of the orthogonal parts. Note that an orthogonal pair may consist of one empty and one non-empty string as well, such as "t" vs. "" in the example above.</p>
<p>Trouble is, there may be more than one way to slice the words:</p>
<p style="font-family: monospace">
b<span style="color: #e08020">a</span>r<span style="color: red">b</span>a<span style="color: #6060e0">ra</span><br/>
<span style="color: #e08020">a</span><span style="color: red">b</span><span style="color: #6060e0">ra</span>cadabra
</p>

<p style="font-family: monospace">
<span style="color: #e08020">b</span><span style="color: red">a</span><span style="color: #6060e0">r</span>bara<br/>
a<span style="color: #e08020">b</span>r<span style="color: red">a</span>cadab<span style="color: #6060e0">r</span>a
</p>

<p style="font-family: monospace">
b<span style="color: #e08020">a</span><span style="color: red">r</span><span style="color: #6060e0">b</span>a<span style="color: #a06060">ra</span><br/>
<span style="color: #e08020">a</span>b<span style="color: red">r</span>acada<span style="color: #6060e0">b</span><span style="color: #a06060">ra</span>
</p>

<p>and so on. The distances corresponding to these splits are 10, 11 and 8, respectively. The actual minimal distance is 6; discovering the correct split (or splits) is left as an exercise for the reader. The way my algorithm goes about it is a straightforward exhaustive trawling of the solution space. In pseudo-Python:</p>
<div class="codehilite"><pre><span class="n">def</span> <span class="n">distance</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span><span class="o">:</span>
  <span class="n">result</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">left</span><span class="p">.</span><span class="n">length</span><span class="p">,</span> <span class="n">right</span><span class="p">.</span><span class="n">length</span><span class="p">)</span>
  <span class="k">for</span> <span class="n">each</span> <span class="n">index</span> <span class="n">x</span> <span class="n">in</span> <span class="n">left</span><span class="o">:</span>
     <span class="n">letter</span> <span class="o">=</span> <span class="n">left</span><span class="p">[</span><span class="n">x</span><span class="p">]</span>
     <span class="k">for</span> <span class="n">each</span> <span class="n">location</span> <span class="n">y</span> <span class="n">of</span> <span class="n">letter</span> <span class="n">in</span> <span class="n">right</span><span class="o">:</span>
       <span class="n">subLeft</span>  <span class="o">=</span> <span class="n">left</span><span class="p">[</span><span class="n">x</span><span class="o">:</span><span class="p">]</span>
       <span class="n">subRight</span> <span class="o">=</span> <span class="n">right</span><span class="p">[</span><span class="n">y</span><span class="o">:</span><span class="p">]</span>
       <span class="n">beforeMatch</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
       <span class="n">match</span> <span class="o">=</span> <span class="n">length</span> <span class="n">of</span> <span class="n">the</span> <span class="n">identical</span> <span class="n">prefix</span> <span class="n">of</span> <span class="n">subLeft</span> <span class="n">and</span> <span class="n">subRight</span>
       <span class="n">afterMatch</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">subLeft</span><span class="p">[</span><span class="n">match</span><span class="o">:</span><span class="p">],</span> <span class="n">subRight</span><span class="p">[</span><span class="n">match</span><span class="o">:</span><span class="p">])</span>
       <span class="n">newDistance</span> <span class="o">=</span> <span class="n">beforeMatch</span> <span class="o">+</span> <span class="n">afterMatch</span>
       <span class="n">result</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">newDistance</span><span class="p">)</span>
  <span class="k">return</span> <span class="n">result</span>
</pre></div>


<p>As you can see, it's a recursive function not amenable to tail-call optimization so it's prone to overflowing the stack, among other things. There's a ton of potential for performance improvement. One thing I actually have in my implementation is that when beforeMatch &gt;= result I don't go into recursion as it can't possibly produce a lower newDistance. This mini-optimization is omitted from the pseudo-code for clarity.</p>
<p>Other than that, proper ordering seems to be the key. The algorithm is asymmetric in that it "starts from" the left word and tries to "reach" the right one. Should it always start from the shorter word or from the longer one? Or from the word with fewer unique letters or more unique letters? Should the letters with most locations in the starting word be tried first? Or the ones closest to its beginning?</p>
<p>A proper complexity analysis (or a robust set of test pairs with known Levenshtein distances) would answer those questions, increasing the likelihood of encountering good matches early, cutting off bigger branches of the search tree. Alas, I have no time for such work, no matter how much fun it would be and how much I'd learn from it. I've solved the puzzle itself and the optimization paths are all well trodden by more capable explorers, I'm sure. Also, given that my solution completely ignores the distribution of letters in the "target" word, there's bound to be a fundamentally better, more symmetric approach. I'm looking forward to reading up on it :-)</p>
<p>My original implementation is in JavaScript with a <a href="./static/snippets/levenshtein.html">web page</a> for easy testing of correctness. I later re-wrote it as command-line scripts in <a href="./static/snippets/levenshtein.js">node.js</a>, <a href="./static/snippets/levenshtein.py">Python</a> and <a href="./static/snippets/levenshtein.go">Go</a> in order to compare performance. Surprisingly, Go seems to be only about 33 % faster than both node.js and Python. Mind you, I don't know any of those languages intimately enough to be sure I didn't screw something up performance-wise so the comparison is very rough and not serious in any way. Tasting the different langages' flavors was great fun, though, and I'm itching for C and Drools implementations if I find the time. A functional variant in Scala or Clojure would also be nice but swapping between the imperative and functional mind-sets is <em>really</em> time-consuming.</p>
  </div><!-- /.entry-content -->
</section>
        <footer id="contentinfo" class="body">
                <address id="about" class="vcard body">
                Proudly powered by <a href="http://getpelican.com/">Pelican</a>,
                which takes great advantage of <a href="http://python.org">Python</a>.
                </address><!-- /#about -->
        </footer><!-- /#contentinfo -->
  </div><!-- /.main-column -->
  </div><!-- /.all -->
</body>
</html>