Python: Load Dict Fast from File
The Python wordsegment module uses two text files to store unigram and bigram count data. The files currently store records separated by newline characters with fields separated by tabs.
with open('../wordsegment_data/unigrams.txt', 'r') as reader:
    print repr(reader.readline())
'the\t23135851162\n'
When the wordsegment module is imported, these files are read from disk and used to construct a Python dict mapping each word to its count. The loading code works like so:
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    dict((word, float(number)) for word, number in lines)
1 loops, best of 3: 286 ms per loop
Since we’re talking about performance, here are some details about my platform.
import subprocess
print subprocess.check_output([
    '/usr/sbin/sysctl', '-n', 'machdep.cpu.brand_string'
])
import sys
print sys.version
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
2.7.10 (default, May 25 2015, 13:06:17)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)]
Loading the files in about a second is plenty fast for me, but I wondered if there was a faster way. Here are a few things I tried.
Simply reading all the lines from the file takes 27ms:
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = [line for line in reader]
10 loops, best of 3: 26.6 ms per loop
Another way to accomplish the same thing:
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = reader.read().split('\n')
10 loops, best of 3: 20.7 ms per loop
That’s roughly 30% faster, but reading is only a small part of the 286ms total. What takes the majority of the time?
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    for word, number in lines:
        pass
10 loops, best of 3: 115 ms per loop
So splitting each line takes nearly 90ms. That’s a bit surprising to me. What else takes so long?
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    for word, number in lines:
        float(number)
10 loops, best of 3: 167 ms per loop
Wow, 51ms just to convert strings to floats. Maybe later we can optimize that. Finally, the last chunk must be constructing the dict.
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    result = dict()
    for word, number in lines:
        result[word] = float(number)
1 loops, best of 3: 254 ms per loop
Calling __setitem__ repeatedly avoids building the (word, count) tuples that the dict constructor consumes. Let’s compare that with the constructor itself.
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    dict((word, float(number)) for word, number in lines)
1 loops, best of 3: 303 ms per loop
This isn’t Python 2.6 compatible, but what about a dict comprehension?
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    {word: float(number) for word, number in lines}
1 loops, best of 3: 275 ms per loop
It’s a bit disappointing that the constructor is slower than calling __setitem__ repeatedly. But maybe that just reflects how much optimization has gone into making __setitem__ really fast.
Here’s a breakdown of how long various steps are taking:
| Operation | Time |
|---|---|
| Read file and parse lines | 26ms |
| Split lines by tab character | 90ms |
| Convert strings to floats | 50ms |
| Creating dict(...) | 135ms |
Unfortunately, constructing the dict is hard to optimize, so let’s look at the other steps. If we stored the counts on disk in binary format then we could avoid parsing them. If we did so, we might likewise store the words in a separate file. Let’s convert our unigrams file into two.
with open('../wordsegment_data/unigrams.txt') as reader:
    pairs = [line.split('\t') for line in reader]

words = [pair[0] for pair in pairs]
counts = [float(pair[1]) for pair in pairs]

with open('words.txt', 'wb') as writer:
    writer.write('\n'.join(words))

from array import array

values = array('d')
values.fromlist(counts)

with open('counts.bin', 'wb') as writer:
    values.tofile(writer)
Now we have two files: words.txt and counts.bin. The first stores words separated by newline characters in ASCII. The latter stores double-precision floating-point numbers in binary. Together we can use these to construct our dict.
from itertools import izip as zip
# %%timeit
with open('words.txt', 'rb') as lines, open('counts.bin', 'rb') as counts:
    words = lines.read().split('\n')
    values = array('d')
    # Read the 333,333 packed doubles, one count per word.
    values.fromfile(counts, 333333)
    dict(zip(words, values))
10 loops, best of 3: 106 ms per loop
Wow. We started at 286ms and worked our way down to 106ms. That’s 62% faster. The key to the speedup is separating the dict keys and values and using fast methods to parse each. Reading words from the file now uses str.split, which is actually faster than Python’s built-in buffered-file readline mechanism. The array module parses counts directly from the binary-formatted file. Finally, the dict constructor is called with the words and values izipped together. I tried using the __setitem__ trick here too (sketched below), but the results were within error of one another and I prefer this style.
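For reference, here is roughly what that __setitem__ variant looks like: a minimal sketch assuming the same words.txt and counts.bin files written above, with the array and izip imports still in scope.
# %%timeit
with open('words.txt', 'rb') as lines, open('counts.bin', 'rb') as counts:
    words = lines.read().split('\n')
    values = array('d')
    values.fromfile(counts, 333333)
    result = {}
    for word, value in zip(words, values):
        result[word] = value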
At the end of the day, I’m not that impressed. A 62% speedup is nice, but I expected to improve things by 10x, not 2x. Even with this speedup, you’ll notice a delay on module import. And now the format of the files is funky; they don’t play nice with grep, etc. I’m going to leave things as-is for now.
I’d be happy to hear what others have tried. Note in this case that I don’t care how long it takes to write the files. That would be another interesting thing to benchmark.
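If you did want to time the write side, here is a minimal sketch in the same style as the cells above. It reuses the words and counts lists built earlier and writes to scratch filenames (words_scratch.txt and counts_scratch.bin are hypothetical names, chosen to avoid clobbering the real files).
# %%timeit
# Write the words as newline-separated ASCII and the counts as packed doubles.
with open('words_scratch.txt', 'wb') as writer:
    writer.write('\n'.join(words))
values = array('d')
values.fromlist(counts)
with open('counts_scratch.bin', 'wb') as writer:
    values.tofile(writer)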
I also tried formatting the dict in a Python module which would be parsed on import. This was actually a little slower than the initial code. My guess is the Python interpreter is doing roughly the same thing.
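For the curious, here is roughly what I mean: a minimal sketch that writes the counts as a dict literal into a hypothetical module named unigrams_data, reusing the words and counts lists from earlier.
# Write the unigram counts as a dict literal in a Python module.
# The module name unigrams_data is made up for this sketch.
with open('unigrams_data.py', 'w') as writer:
    writer.write('unigrams = {\n')
    for word, count in zip(words, counts):
        writer.write('    %r: %r,\n' % (word, count))
    writer.write('}\n')

# Loading is then just an import; the interpreter parses the dict literal.
from unigrams_data import unigrams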