Using a Different Corpus

WordSegment makes it easy to use a different corpus for word segmentation.

If you simply want to “teach” the algorithm a single phrase it doesn’t know, then read this StackOverflow answer.
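
One common trick is to bump the unknown word’s unigram count so the segmenter prefers it. Here’s a minimal sketch; fizzlebottom is a made-up word, and the count is arbitrary, it only needs to be large enough to win:

import wordsegment

wordsegment.load()

# Give the unknown word a generous unigram count so it outscores
# the alternative splits. The exact value is not important.
wordsegment.UNIGRAMS['fizzlebottom'] = 1e6

print(wordsegment.segment('afizzlebottomappears'))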

Now, let’s get a new corpus. For this example, we’ll use the text from Jane Austen’s Pride and Prejudice.

import requests

response = requests.get('https://www.gutenberg.org/ebooks/1342.txt.utf-8')

text = response.text

print(len(text))
717573

Great. We’ve got a new corpus for wordsegment. Now let’s look at what parts of the API we need to change. There’s one function and two dictionaries: wordsegment.clean, wordsegment.BIGRAMS, and wordsegment.UNIGRAMS. We’ll work on these in reverse.

import wordsegment
wordsegment.load()
print(type(wordsegment.UNIGRAMS), type(wordsegment.BIGRAMS))
<class 'dict'> <class 'dict'>
print(list(wordsegment.UNIGRAMS.items())[:3])
print(list(wordsegment.BIGRAMS.items())[:2])
[('biennials', 37548.0), ('verplank', 48349.0), ('tsukino', 19771.0)]
[('personal effects', 151369.0), ('basic training', 294085.0)]

Ok, so wordsegment.UNIGRAMS is just a dictionary mapping unigrams to their counts. Let’s write a function to tokenize our text.

import re

def tokenize(text):
    # Match runs of ASCII letters; everything else is discarded.
    pattern = re.compile('[a-zA-Z]+')
    return (match.group(0) for match in pattern.finditer(text))

print(list(tokenize("Wait, what did you say?")))
['Wait', 'what', 'did', 'you', 'say']
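
Note that the pattern keeps case but drops digits and apostrophes, so contractions split apart:

print(list(tokenize("Don't stop")))
['Don', 't', 'stop']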

Now we’ll build our dictionaries.

from collections import Counter

wordsegment.UNIGRAMS.clear()
wordsegment.UNIGRAMS.update(Counter(tokenize(text)))
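
As a quick sanity check that the counts took, look up a word we know occurs in the novel (the exact count depends on the file Project Gutenberg serves):

print(wordsegment.UNIGRAMS.get('Elizabeth', 0))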

def pairs(iterable):
    # Yield adjacent values joined by a space: the corpus bigrams.
    iterator = iter(iterable)
    values = [next(iterator)]
    for value in iterator:
        values.append(value)
        yield ' '.join(values)
        del values[0]

wordsegment.BIGRAMS.clear()
wordsegment.BIGRAMS.update(Counter(pairs(tokenize(text))))
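
To see what pairs yields, run it over a short token list:

print(list(pairs(['want', 'of', 'a', 'wife'])))
['want of', 'of a', 'a wife']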

That’s it.

Now, by default, wordsegment.segment lowercases all input and removes punctuation. Our corpus keeps its capitals, so we’ll also have to change the clean function. The heaviest hammer is simply to replace it with the identity function, which performs no sanitization of the input to segment at all.

from wordsegment import _segmenter

def identity(value):
    # Pass the input through unchanged, disabling all cleaning.
    return value

_segmenter.clean = identity
print(wordsegment.segment('wantofawife'))
['want', 'of', 'a', 'wife']
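
If the identity function is too blunt for your input, a middle ground is a cleaner of your own that strips non-letters but preserves case. A sketch, where clean_keep_case is our own helper rather than part of wordsegment:

def clean_keep_case(text):
    # Keep only the letters, preserving case; the default clean
    # lowercases everything as well.
    return ''.join(letter for letter in text if letter.isalpha())

_segmenter.clean = clean_keep_case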

If you find this behaves poorly, then you may need to change the _segmenter.total variable to reflect the total of all unigram counts. The segmenter divides counts by this total to estimate word probabilities, so a total left over from the default corpus will skew the scores. In our case that’s simply:

_segmenter.total = float(sum(wordsegment.UNIGRAMS.values()))

WordSegment doesn’t require any fancy machine learning training algorithms. Simply update the unigram and bigram count dictionaries and you’re ready to go.
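
For instance, after rebuilding the dictionaries and updating the total, you can try a phrase straight from the novel; the result will depend on the exact text fetched:

print(wordsegment.segment('itisatruthuniversallyacknowledged'))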