sketchgraphs.data.flat_array

This module implements helpers to read the pre-processed data files. Due to the large size of the data, using plain pickle files is inefficient (and leads to extremely long loading times). We implement a memory-mappable compressed format that scales to large files while offering both compression and constant-time access to elements in the dataset.

Classes

FlatSerializedArray(offsets, pickle_data)

This class implements a flat, pickle-serialized array of Python objects backed by a byte buffer.
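The idea behind an offsets-plus-buffer array can be sketched as follows. This is an illustration only, not the library's implementation: FlatArraySketch is a hypothetical stand-in, a plain list and a bytes object replace the NumPy arrays, and the sketch assumes the offsets array has one more entry than there are elements (as stated for raw_list_flat and merge_raw_list).

```python
import pickle


class FlatArraySketch:
    """Hypothetical stand-in: objects pickled back to back in one buffer."""

    def __init__(self, offsets, pickle_data):
        self.offsets = offsets          # number of elements + 1 entries
        self.pickle_data = pickle_data  # bytes-like buffer holding every pickle

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start, end = self.offsets[idx], self.offsets[idx + 1]
        # Only the requested slice is decoded; the rest of the buffer is untouched.
        return pickle.loads(self.pickle_data[start:end])


# Build a two-element buffer by hand to exercise the sketch.
first = pickle.dumps({"id": 0})
second = pickle.dumps([1, 2, 3])
array = FlatArraySketch(
    [0, len(first), len(first) + len(second)],
    first + second,
)
second_item = array[1]
```

Because each lookup reads two offsets and decodes one slice, access cost does not depend on the element's position, which is what makes this layout suitable for memory-mapped files.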

Functions

sketchgraphs.data.flat_array.human_bytes(num_bytes)

Converts a byte count into a human-readable value and unit (KB, MB, or GB).

Parameters

num_bytes (int) – An integer representing the number of bytes

Returns

  • number – The fractional number of bytes in the given unit

  • str – The unit of the returned value
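The two-value return shape can be illustrated with a small stand-in. The function below is a hypothetical sketch, not the library's code: the name human_bytes_sketch, the 1024 divisor, and the "B" unit for small values are all assumptions.

```python
def human_bytes_sketch(num_bytes, base=1024):
    """Hypothetical stand-in returning (value, unit); base 1024 is an assumption."""
    units = ["B", "KB", "MB", "GB"]  # "B" is an assumption; the source lists KB, MB, GB
    value = float(num_bytes)
    for unit in units[:-1]:
        if value < base:
            return value, unit
        value /= base
    # Anything at or above one GB is reported in GB.
    return value, units[-1]


value, unit = human_bytes_sketch(1536)
label = f"{value:.1f} {unit}"
```

Returning the number and the unit separately leaves the formatting (precision, spacing) to the caller.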

sketchgraphs.data.flat_array.load_dictionary_flat(data)

Loads a flat dictionary from the given data buffer.

Parameters

data (Union[np.ndarray, str]) – ndarray of bytes representing the underlying data for the flat dictionary, or a string representing the filename from which to load the dictionary.

Returns

A dictionary containing elements serialized in the data buffer.

Return type

dict

sketchgraphs.data.flat_array.load_flat_array(path)

Loads a flat array from the given path.

Parameters

path (file-like object, string, or pathlib.Path) – The file to read. Must be compatible with np.load.

Returns

A FlatSerializedArray containing the data.

Return type

FlatSerializedArray

sketchgraphs.data.flat_array.merge_raw_list(offset_arrays, data_arrays)

Merges a list of raw flat lists (as produced by raw_list_flat) into one single raw flat list.

Parameters
  • offset_arrays (list of np.ndarray) – A list of arrays representing the offsets.

  • data_arrays (list of np.ndarray) – A list of the same length as offset_arrays representing the data.

Returns

  • np.ndarray – An array of offsets, of length one plus the number of elements in the data.

  • np.ndarray – An array of bytes, representing the concatenated serialized data.
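Merging raw flat lists amounts to concatenating the data and shifting each list's offsets by the number of bytes that precede it. The sketch below shows that arithmetic under stated assumptions: merge_raw_list_sketch is a hypothetical name, plain lists and bytes stand in for the NumPy arrays, and every input offsets list is assumed to start at 0 and end at the length of its data.

```python
def merge_raw_list_sketch(offset_arrays, data_arrays):
    """Hypothetical stand-in: merge several (offsets, data) pairs into one."""
    merged_offsets = [0]
    for offsets in offset_arrays:
        base = merged_offsets[-1]  # bytes already occupied by earlier lists
        # Skip the leading 0 of each list; it coincides with `base`.
        merged_offsets.extend(base + offset for offset in offsets[1:])
    return merged_offsets, b"".join(data_arrays)


# Two raw lists: one with two elements (3 and 2 bytes), one with a single element.
offsets_a, data_a = [0, 3, 5], b"abcde"
offsets_b, data_b = [0, 2], b"fg"
merged_offsets, merged_data = merge_raw_list_sketch(
    [offsets_a, offsets_b], [data_a, data_b]
)
last_element = merged_data[merged_offsets[2]:merged_offsets[3]]
```

The merged offsets keep the "one plus the number of elements" length, so the result can be indexed exactly like any single raw flat list.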

sketchgraphs.data.flat_array.pack_dictionary_flat(dict_)

Saves a dictionary into a flat structure which is compatible with the structure used for FlatSerializedArray.

This structure is a light extension of the FlatSerializedArray, and is able to encode the array inline so that the memory used by the array can be laid-out directly. This is mainly used to serialize a set of related arrays.

Parameters

dict_ (dict) – An arbitrary dictionary containing the elements to be serialized.

Returns

An array of bytes representing the serialized data.

Return type

np.ndarray

sketchgraphs.data.flat_array.pack_list_flat(offsets, data_bytes)

Packs the given offsets and corresponding data array into a flat array format.

This function simply adds some metadata headers in order to create a contiguous packed format. See also raw_list_flat to obtain the raw offsets and data_bytes, or save_list_flat for serializing a flat array.

Parameters
  • offsets (array_like) – Array of offsets delineating each element in the data_bytes array

  • data_bytes (array_like) – Array of bytes containing the raw serialized data

Returns

An array of bytes representing the raw value.

Return type

np.ndarray

sketchgraphs.data.flat_array.raw_list_flat(data, nthreads=None)

Serializes the provided list into an array of bytes and offsets.

Note that this function only provides the raw offsets and serialized bytes. It is advised to use save_list_flat or similar to encode other associated metadata.

Parameters
  • data (iterable) – An iterable containing the data to be serialized

  • nthreads (int, optional) – If not None, the number of threads to be used for performing the serialization in parallel

Returns

  • np.ndarray – An array of offsets, of length one plus the number of elements in the data

  • np.ndarray – An array of bytes, representing the concatenated serialized data
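The relationship between the two returned arrays can be sketched with the standard library alone. This is a single-threaded illustration under assumptions, not the library's implementation: raw_list_flat_sketch is a hypothetical name, pickle is assumed as the serializer (following the FlatSerializedArray description), and a list and a bytes object stand in for the NumPy arrays.

```python
import pickle
from itertools import accumulate


def raw_list_flat_sketch(data):
    """Hypothetical stand-in: pickle each item and record where each one ends."""
    blobs = [pickle.dumps(item) for item in data]
    # Running total of blob lengths, with a leading 0: len(offsets) == len(blobs) + 1.
    offsets = [0, *accumulate(len(blob) for blob in blobs)]
    return offsets, b"".join(blobs)


items = ["a", {"b": 1}, (2, 3)]
offsets, data_bytes = raw_list_flat_sketch(items)
# Element i occupies data_bytes[offsets[i]:offsets[i + 1]].
middle = pickle.loads(data_bytes[offsets[1]:offsets[2]])
```

Storing cumulative end positions rather than per-element lengths means a reader needs no scan to locate an element: two adjacent offsets give its byte range directly.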

sketchgraphs.data.flat_array.save_list_flat(data, nthreads=None)

Saves a list of Python objects into a flat bytes format.

Parameters
  • data (list) – A list of Python objects to serialize

  • nthreads (int, optional) – If not None, the number of processes to use.

Returns

A numpy array of bytes representing the serialized data.

Return type

np.ndarray