Last week I got curious about “skip-words”: words that spell another word when you remove every other letter. Some examples are: good –> go and great –> get. Here’s how I found some interesting skip-words.
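Checking a candidate is easy in Python, since an extended slice with a step of two keeps every other letter (a minimal sketch):

```python
# An extended slice with step 2 keeps the letters at even positions,
# which is the same as removing every other letter.
def skip(word):
    return word[::2]

print(skip('good'))   # -> 'go'
print(skip('great'))  # -> 'get'
```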
Start with a lot of words. My list is based on TWL06 for Scrabble.
>>> import wordsegment as ws
>>> ws.load()
>>> words = ws.WORDS
The word list includes all inflected forms of words, including plurals and conjugations.
>>> len(words)
178758
>>> from pprint import pprint
>>> pprint(words[:10])
['aa', 'aah', 'aahed', 'aahing', 'aahs', 'aal', 'aalii', 'aaliis', 'aals', 'aardvark']
Python’s slicing feature conveniently lets you indicate a start, stop, and step. So we can use a step of two to find all skip-2 words.
>>> wordset = set(words)
>>> skip2 = [w for w in words if w[::2] in wordset]
>>> len(skip2)
8482
Now 8,482 words is a lot to review, and we need some way of finding the interesting ones. Interesting pairs are likely made of more common words, so we can start with the sum of the popularity of each word. The wordsegment module contains a mapping from words to counts reflecting their popularity on the internet. The raw sum, however, is biased towards pairs where only one word is very popular. That bias can be corrected by penalizing the sum by the difference in popularity. Here’s the score function for skip-2 words.
>>> def score(word):
...     word_count = ws.UNIGRAMS.get(word, 0)
...     word2 = word[::2]
...     word2_count = ws.UNIGRAMS.get(word2, 0)
...     total = word_count + word2_count
...     penalty = abs(word_count - word2_count)
...     return (word, word2), total - penalty
...
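To see how the penalty corrects the bias, here’s a self-contained sketch of the same scoring idea with made-up counts (the real ones come from ws.UNIGRAMS, and the article’s function also returns the word pair). Note that total - penalty simplifies to twice the smaller count, so a pair only scores well when both words are popular:

```python
# Hypothetical counts, for illustration only -- not values from wordsegment.
counts = {'when': 1000, 'we': 900, 'zoo': 800, 'zo': 5}

def score(word, word2):
    a = counts.get(word, 0)
    b = counts.get(word2, 0)
    # (a + b) - abs(a - b) == 2 * min(a, b):
    # popular pairs win, lopsided pairs are punished.
    return (a + b) - abs(a - b)

print(score('when', 'we'))  # both common:  1900 - 100 = 1800
print(score('zoo', 'zo'))   # one is rare:  805 - 795  = 10
```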
With our score function, we can find the most common words using a Counter object.
>>> from collections import Counter
>>> scores = Counter(dict(map(score, skip2)))
>>> pairs = [p for p, _ in scores.most_common()]
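This works because Counter accepts any mapping of keys to numeric values, and most_common() sorts entries by those values in descending order. A tiny sketch with made-up scores:

```python
from collections import Counter

# Counter can wrap any mapping, including one keyed by tuples.
# The scores here are invented, just to show the sorting behavior.
scores = Counter({('our', 'or'): 1800, ('zoo', 'zo'): 10, ('may', 'my'): 1500})
print([pair for pair, _ in scores.most_common()])
# -> [('our', 'or'), ('may', 'my'), ('zoo', 'zo')]
```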
The ten highest scoring word and skip-2 word pairs are then:
>>> pprint(pairs[:10])
[('our', 'or'), ('may', 'my'), ('when', 'we'), ('also', 'as'), ('been', 'be'), ('its', 'is'), ('into', 'it'), ('two', 'to'), ('buy', 'by'), ('good', 'go')]
And if we want to find larger words, we can filter by length:
>>> big_pairs = [p for p in pairs if len(p[0]) >= 7]
>>> pprint(big_pairs[:10])
[('support', 'spot'), ('supports', 'spot'), ('greatest', 'gets'), ('learned', 'land'), ('counties', 'cute'), ('princess', 'pics'), ('torture', 'true'), ('shoulder', 'sole'), ('targeted', 'tree'), ('courtesy', 'cuts')]
Here are the top-5 skip-2 word pairs for lengths eight through ten:
>>> for n in range(8, 11):
...     print('Word length:', n)
...     sub_pairs = [p for p in pairs if len(p[0]) == n]
...     pprint(sub_pairs[:5])
...
Word length: 8
[('supports', 'spot'), ('greatest', 'gets'), ('counties', 'cute'), ('princess', 'pics'), ('shoulder', 'sole')]
Word length: 9
[('pregnancy', 'penny'), ('thesaurus', 'tears'), ('situation', 'stain'), ('footnotes', 'fonts'), ('salvation', 'slain')]
Word length: 10
[('blackberry', 'baker'), ('situations', 'stain'), ('plagiarism', 'pairs'), ('superseded', 'spree'), ('supersedes', 'spree')]
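The same approach generalizes beyond skip-2: a slice step of n finds skip-n words. Here’s a sketch using a tiny stand-in word set (the article uses the full TWL06 list):

```python
# Skip-n generalization: w[::n] keeps every nth letter.
# A small hand-picked word set, for illustration only.
wordset = {'someone', 'see', 'general', 'gel', 'cat'}

def skip_pairs(words, n, wordset):
    # Keep words whose every-nth-letter reduction is a different word.
    return [(w, w[::n]) for w in words if w[::n] in wordset and w[::n] != w]

print(skip_pairs(sorted(wordset), 3, wordset))
# -> [('general', 'gel'), ('someone', 'see')]
```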
There are some interesting combinations in those lists.
I think my favorite that I’ve found so far is barbarian –> brain.
Last updated on February 16, 2022.