Fastest way to serialize an array in Python
To kick-start the discussion, let's first assume that we have the following 2-D numpy array:

```python
import numpy as np

matrix = np.random.random((10000, 384)).astype(np.float32)
```
Approaches tried
- The built-in pickle

  In Python, the built-in pickle module is often used to serialize/deserialize data. Here's how it works:

  ```python
  import pickle

  pickled_dumped = pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL)
  pickle.loads(pickled_dumped)
  ```
- MessagePack

  MessagePack is known to be a fast and compact serialization format, but it can't serialize a numpy array directly, so let's first assume that we'll use a nested Python list instead:

  ```python
  matrix_as_list = matrix.tolist()
  ```

  Here's how it works:

  ```python
  import msgpack

  # Note how we set `use_single_float` to make sure float32 is used
  msgpack_dumped = msgpack.dumps(matrix_as_list, use_single_float=True)
  msgpack.loads(msgpack_dumped)
  ```
- pyarrow

  pyarrow is the Python library for Apache Arrow, which enables fast serialization through zero-copy memory sharing. (Note that `pyarrow.serialize` has been deprecated in newer pyarrow releases, so this snippet needs an older version.)

  ```python
  import pyarrow

  pyarrow_dumped = pyarrow.serialize(matrix).to_buffer().to_pybytes()
  pyarrow.deserialize(pyarrow_dumped)
  ```
- Non-general DIY serialization

  This is not a general serialization solution: it can only serialize/deserialize a 2-D numpy array.

  ```python
  import struct

  def serialize(mat):
      n_rows = len(mat)
      n_columns = len(mat[0])
      shape = struct.pack('>II', n_rows, n_columns)
      return shape + mat.tobytes()

  def deserialize(data):
      n_rows, n_columns = struct.unpack('>II', data[:8])
      mat = np.frombuffer(data, dtype=np.float32, offset=8)
      mat.shape = (n_rows, n_columns)
      return mat

  diy_dumped = serialize(matrix)
  deserialize(diy_dumped)
  ```
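Whichever approach you pick, it's worth a quick sanity check that the round trip is lossless. Here's a minimal check using pickle; the same pattern applies to the other three approaches:

```python
import pickle

import numpy as np

matrix = np.random.random((10000, 384)).astype(np.float32)

# Round-trip through pickle and verify nothing was lost
restored = pickle.loads(pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL))
assert restored.dtype == np.float32
assert np.array_equal(restored, matrix)
```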
Comparison
Now let's compare these approaches on the following metrics:

- Size (of the serialized data, in bytes)
- T_dump (serialization time)
- T_load (deserialization time)
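Measurements like these can be collected with a small timeit harness along the following lines (a sketch; the repeat counts here are arbitrary, not the exact setup used for the table below). It's shown for pickle, and the same `benchmark` helper works for the other approaches:

```python
import pickle
import timeit

import numpy as np

matrix = np.random.random((10000, 384)).astype(np.float32)

def benchmark(dump, load):
    """Return (size in bytes, seconds per dump, seconds per load)."""
    data = dump(matrix)
    t_dump = timeit.timeit(lambda: dump(matrix), number=10) / 10
    t_load = timeit.timeit(lambda: load(data), number=10) / 10
    return len(data), t_dump, t_load

size, t_dump, t_load = benchmark(
    lambda m: pickle.dumps(m, protocol=pickle.HIGHEST_PROTOCOL),
    pickle.loads,
)
```

For pickle, the size should come out slightly above the 15,360,000 bytes of raw float32 data (10000 × 384 × 4).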
| Approach | Size | T_dump | T_load |
|---|---|---|---|
| pickle | 15360156 | 15.7 ms | 6.01 ms |
| MessagePack | 19230003 | 97.6 ms | 179 ms |
| pyarrow | 15360704 | 17.3 ms | 32.8 µs |
| DIY | 15360008 | 12.5 ms | 2.32 µs |
Conclusion
MessagePack produces the largest output and takes the longest to serialize/deserialize. This may be related to the way MessagePack encodes floating-point numbers: each float32 takes 5 bytes (a one-byte type marker plus the 4-byte value), not 4 bytes.
The DIY solution has only 8 bytes of size overhead (the two uint32 values encoding the shape of the 2-D array) and is the fastest one. It's not a general solution, but it demonstrates that we can fall back to this approach when serialization size and speed are the bottleneck.
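If you like the DIY route but need more than two dimensions, the same idea extends naturally: store the number of dimensions followed by each dimension's size before the raw bytes. Here's a sketch (still float32-only, an assumption baked into the format):

```python
import struct

import numpy as np

def serialize_nd(mat):
    # Header: number of dimensions, then each dimension's size (big-endian uint32)
    header = struct.pack(f'>I{mat.ndim}I', mat.ndim, *mat.shape)
    return header + mat.tobytes()

def deserialize_nd(data):
    # Read the dimension count, then the shape, then view the raw payload
    ndim, = struct.unpack_from('>I', data)
    shape = struct.unpack_from(f'>{ndim}I', data, offset=4)
    mat = np.frombuffer(data, dtype=np.float32, offset=4 + 4 * ndim)
    return mat.reshape(shape)

tensor = np.random.random((16, 8, 4)).astype(np.float32)
restored = deserialize_nd(serialize_nd(tensor))
```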
pyarrow is the fastest general solution I tried in this experiment, though its size overhead is slightly larger than pickle's.
The good thing about pyarrow is that deserialization time doesn't grow linearly with the size of the array: it stays around 32 µs even when I make the array 5 times larger. This is because pyarrow uses zero-copy memory sharing to avoid moving the array around, and it's also why I use np.frombuffer in the DIY approach.
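You can observe the zero-copy behaviour of np.frombuffer directly: it produces a view over the existing buffer rather than a fresh allocation, so the resulting array neither owns nor copies its data:

```python
import numpy as np

data = np.arange(4, dtype=np.float32).tobytes()
view = np.frombuffer(data, dtype=np.float32)

# The array is a read-only view into `data`, not a copy
assert not view.flags.owndata
assert not view.flags.writeable
```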
But if what you have is not a small number of big arrays but a massive number of small arrays, pyarrow may not be the right choice: even for a 2-D array with only one row, deserialization still takes about 20 µs, so there is some fixed per-call overhead.
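One possible workaround in that situation (a suggestion, not something benchmarked above) is to batch the small arrays into one big array, pay the fixed cost once, and split them apart after deserializing:

```python
import pickle

import numpy as np

# 1000 tiny (1, 384) arrays -- the worst case for per-call overhead
small_arrays = [np.random.random((1, 384)).astype(np.float32) for _ in range(1000)]

# Serialize them as a single (1000, 384) array instead of 1000 separate payloads
batched = np.concatenate(small_arrays, axis=0)
payload = pickle.dumps(batched, protocol=pickle.HIGHEST_PROTOCOL)

# After deserializing, slice the rows back out (each slice is a view, not a copy)
restored = pickle.loads(payload)
rows = [restored[i:i + 1] for i in range(restored.shape[0])]
```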