For some reason it occurred to me tonight to find the longest common substring in various classical novels.
There’s a bunch of interesting algorithmics involved here. The naive O(n3) approach is slower than a slow snail on a slow day. Dynamic programming gets you to an easy-to-understand O(n2) solution, but that’s still impractically slow for story-length strings. Things start getting really clever with Ukkonen’s algorithm and now we’re into practical runtime territory. You could get a constant-factor speedup by working on sequences of intern’d words rather than chars, but the above algorithm runs in under a minute as-is so I didn’t bother.
I downloaded and normalised some books from Project Gutenberg and ran my longest-common-substring on them.
- Emma vs Pride and Prejudice: “when the ladies returned to the drawing room”
- Dracula vs Frankenstein: “between two and three in the morning”
- Dracula vs Pride and Prejudice: “but there was much to be talked of”
- Dracula vs. Fiat Money Inflation in France: “the difficulties and dangers of”
- Treasure Island vs Dracula: “threw himself on his knees and held”
- Treasure Island vs Jekyll & Hyde: “and to make assurance doubly sure”
- US Constitution vs Dracula: “to the best of my ability”
.. and last, but most self-referentially not least …
- Emma vs Frankenstein: “there is a great difference”.