Datasets¶

We provide three datasets at various levels of pre-processing in order to enable a wide variety of uses while ensuring that it is possible to quickly get started on this data. We describe each dataset below.

Tarballs¶

The raw json data as obtained from Onshape. This is provided as a set of 128 tar archives which are compressed using zstandard. They total about 43GB of data. In general, we only recommend these for advanced usage, as they are heavy and require extensive processing to manipulate. Users who wish to directly access the json from python may be interested in the utility function sketchgraphs.pipeline.make_sketch_dataset.load_json_tarball, which iterates through a single compressed tarball and enumerates the sketches in the tarball. The data is available for download here.

Filtered sequence¶

The filtered sequences are the most convenient dataset to use, and are provided ready to use with ML models. They are stored in a custom binary format, which can be read using sketchgraphs.data.flat_array.load_dictionary_flat, by for example executing:

>>> from sketchgraphs.data import flat_array
>>> data = flat_array.load_dictionary_flat('sg_t16_train.npy')
>>> data['sequences']
FlatArray(len=9179789, mem=5.3 GB)

In addition to the sequences, which may be accessed through the data['sequences'] key, the sequence files also contain an integer array of the same length which record the length of each sequence, accessed through the data['sequence_lengths'] key, and a structured array which contains an identifier uniquely identifying the sketch, accessed through the data['sketch_ids'] key. The latter is an array of tuples, the first element being the document id of the sketch, the second being the part index in that document, and the last being the sketch index in that part.

This data is pre-split into a train, test and validation set (the training set is formed from the shards 1-120, the validation set from the shards 121-124, and the testing set from the shards 125-128). Additionally, we also provide pre-computed quantization statistics, computed on the training set using the sketchgraphs.pipeline.make_quantization_statistics script. These quantization statistics are used by the models in order to handle continuous parameters.

Full sequence¶

As a middle-ground between the filtered sequences and the json tarballs, we also provide the full sequences, which are directly converted from the json tarballs with no filtering (except the exclusing of empty sketches). This full sequence file is available here (warning: 15GB download), and contains the sequence representation of all the sketches in the dataset. We note that although it is mostly equivalent to the data contained in the tarballs for ML purposes, it is much smaller as many of the original identifiers are discarded and renamed to sequential indices.

The full sequence may be accessed in the same fashion as the filtered sequences (using the sketchgraphs.data.flat_array module). Note that the full sequence file contains all sketches, and does not provide any train / test split. Additionally, users should be warned that the dataset contains substantial outliers (for example, the largest sketch in the dataset contains more than 350 thousand operations, whereas the 99th percentile is “only” 640 operations).