sketchgraphs.data.flat_array
This module implements helpers to read the pre-processed data files. Due to the large size of the data, using plain pickle files is inefficient (and leads to extremely long loading times). We implement a memory-mappable compressed format that scales to large files while offering both compression and constant-time access to elements in the dataset.
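The idea behind such a layout can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions, not the module's actual on-disk format: each object is pickled and compressed with zlib (the codec is an assumption), the results are concatenated into one byte buffer, and an offsets list of length n + 1 records where each element starts and ends. The names `build_flat` and `get_item` are hypothetical.

```python
import pickle
import zlib


def build_flat(objects):
    """Serialize each object and concatenate the results into one buffer."""
    offsets = [0]
    chunks = []
    for obj in objects:
        # Assumption: zlib stands in for whatever codec the real format uses.
        blob = zlib.compress(pickle.dumps(obj))
        chunks.append(blob)
        offsets.append(offsets[-1] + len(blob))
    return offsets, b"".join(chunks)


def get_item(offsets, data, index):
    """Decode a single element without decoding the rest of the buffer."""
    start, end = offsets[index], offsets[index + 1]
    return pickle.loads(zlib.decompress(data[start:end]))


offsets, data = build_flat([{"a": 1}, [1, 2, 3], "sketch"])
second = get_item(offsets, data, 1)
```

Because only the slice belonging to the requested element is decompressed and unpickled, the lookup cost does not depend on the element's position in the buffer.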
Classes
This class implements a flat pickle-serialized array of Python objects backed by a byte buffer.
Functions
sketchgraphs.data.flat_array.human_bytes(num_bytes)
Formats a byte count as a human-readable value (using KB, MB, GB units).
- Parameters
num_bytes (int) – An integer representing the number of bytes
- Returns
number – The fractional number of bytes in the given unit
str – The unit of the returned value
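A formatter with this kind of (value, unit) return shape might look roughly like the sketch below. It is not the library's implementation: the function name `human_bytes_sketch`, the 1024-based scaling, and the "B" unit for small values are all assumptions made for illustration.

```python
def human_bytes_sketch(num_bytes):
    """Return (value, unit) for a byte count.

    Assumptions: units scale by 1024, and counts below 1 KB use "B".
    """
    units = ["B", "KB", "MB", "GB"]
    value = float(num_bytes)
    for unit in units[:-1]:
        if value < 1024:
            return value, unit
        value /= 1024
    # Anything at or above 1 GB stays expressed in GB.
    return value, units[-1]


value, unit = human_bytes_sketch(3 * 1024 * 1024)
label = f"{value:.1f} {unit}"
```

Returning the number and the unit separately leaves the choice of precision to the caller, as in the `label` line above.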
sketchgraphs.data.flat_array.load_dictionary_flat(data)
Loads a flat dictionary from the given data buffer.
sketchgraphs.data.flat_array.load_flat_array(path)
Loads a flat array from the given path.
- Parameters
path (file-like object, string, or pathlib.Path) – The file to read. Must be compatible with np.load.
- Returns
A FlatSerializedArray containing the data.
- Return type
FlatSerializedArray
sketchgraphs.data.flat_array.merge_raw_list(offset_arrays, data_arrays)
Merges a list of raw flat lists (as produced by raw_list_flat) into a single raw flat list.
- Parameters
offset_arrays (list of np.ndarray) – A list of arrays representing the offsets.
data_arrays (list of np.ndarray) – A list of the same length as offset_arrays representing the data.
- Returns
np.ndarray – An array of offsets, of length one plus the number of elements in the data.
np.ndarray – An array of bytes, representing the concatenated serialized data.
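The merge step can be illustrated with plain Python lists and bytes in place of NumPy arrays. This is a sketch only: `merge_raw_sketch` is a hypothetical name, and it assumes each input offsets list starts at 0 and ends at the length of its data chunk.

```python
def merge_raw_sketch(offset_lists, data_chunks):
    """Concatenate several (offsets, data) pairs into a single pair."""
    merged_offsets = [0]
    for offsets in offset_lists:
        base = merged_offsets[-1]
        # Skip each list's leading 0 and shift the rest by the bytes already merged.
        merged_offsets.extend(base + o for o in offsets[1:])
    return merged_offsets, b"".join(data_chunks)


offsets, data = merge_raw_sketch([[0, 2, 5], [0, 4]], [b"aabbb", b"cccc"])
```

The merged offsets keep the "one plus the number of elements" length, so element i of the combined list is still `data[offsets[i]:offsets[i + 1]]`.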
sketchgraphs.data.flat_array.pack_dictionary_flat(dict_)
Saves a dictionary into a flat structure which is compatible with the structure used for FlatSerializedArray.
This structure is a light extension of the FlatSerializedArray, and is able to encode the array inline so that the memory used by the array can be laid out directly. This is mainly used to serialize a set of related arrays.
- Parameters
dict_ (dict) – An arbitrary dictionary containing the elements to be serialized.
- Returns
An array of bytes representing the serialized data.
- Return type
np.ndarray
sketchgraphs.data.flat_array.pack_list_flat(offsets, data_bytes)
Packs the given offsets and corresponding data array into a flat array format.
This function simply adds some metadata headers in order to create a contiguous packed format. See also raw_list_flat to obtain the raw offsets and data_bytes, or save_list_flat for serializing a flat array.
- Parameters
offsets (array_like) – Array of offsets delineating each element in the data_bytes array
data_bytes (array_like) – Array of bytes containing the raw serialized data
- Returns
An array of bytes representing the raw value.
- Return type
np.ndarray
sketchgraphs.data.flat_array.raw_list_flat(data, nthreads=None)
Serializes the provided list into an array of bytes and offsets.
Note that this function only provides the raw offsets and serialized bytes. It is advised to use save_list_flat or similar to encode other associated metadata.
- Parameters
data (iterable) – An iterable containing the data to be serialized
nthreads (int, optional) – If not None, the number of threads to be used for performing the serialization in parallel
- Returns
np.ndarray – An array of offsets, of length one plus the number of elements in the data
np.ndarray – An array of bytes, representing the concatenated serialized data
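How an optional thread count could fit into this kind of serialization is sketched below. This is not the library's code: `raw_list_sketch` is a hypothetical name, plain pickle without compression is an assumption, and Python lists and bytes stand in for the NumPy arrays the real function returns.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def raw_list_sketch(data, nthreads=None):
    """Pickle every element and return (offsets, concatenated bytes)."""
    items = list(data)
    if nthreads is None:
        blobs = [pickle.dumps(item) for item in items]
    else:
        # Executor.map preserves input order, so offsets stay aligned with the data.
        with ThreadPoolExecutor(max_workers=nthreads) as pool:
            blobs = list(pool.map(pickle.dumps, items))

    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    return offsets, b"".join(blobs)


offsets, data_bytes = raw_list_sketch(["a", {"b": 2}, (3, 4)], nthreads=2)
first = pickle.loads(data_bytes[offsets[0]:offsets[1]])
```

The serial and threaded paths produce the same output, so the thread count affects only how the work is scheduled, not the resulting layout.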
sketchgraphs.data.flat_array.save_list_flat(data, nthreads=None)
Saves a list of Python objects into a flat bytes format.