🚀 Streaming v0.5.0

Streaming v0.5.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.5.0

New Features

🆕 Cold Shard Eviction. ( #219 )

Dynamically delete least recently used shards in order to keep disk usage under a specified limit. This is enabled by setting the StreamingDataset argument cache_limit. See the shuffling guide for more details.

from streaming import StreamingDataset

dataset = StreamingDataset(
    ...,  # other arguments elided
    cache_limit='100gb',  # cap on local disk usage for cached shards
)
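
To make the eviction policy concrete, here is a minimal sketch of least-recently-used eviction under a byte limit. It is a toy model only, not the library's implementation: the class name LRUShardCache, its access method, and the integer byte sizes are all assumptions made for illustration.

```python
from collections import OrderedDict


class LRUShardCache:
    """Toy model of least-recently-used shard eviction (hypothetical, not library code)."""

    def __init__(self, cache_limit: int) -> None:
        self.cache_limit = cache_limit  # maximum total bytes to keep on disk
        self.shards = OrderedDict()     # shard name -> size in bytes, coldest first
        self.usage = 0                  # total bytes currently cached

    def access(self, name: str, size: int) -> list:
        """Record an access to a shard and return the names of any evicted shards."""
        if name in self.shards:
            self.shards.move_to_end(name)  # mark as most recently used
            return []
        self.shards[name] = size
        self.usage += size
        evicted = []
        # Drop the coldest shards until usage fits, never the shard just added.
        while self.usage > self.cache_limit and len(self.shards) > 1:
            cold_name, cold_size = self.shards.popitem(last=False)
            self.usage -= cold_size
            evicted.append(cold_name)
        return evicted
```

In this model, touching a shard moves it to the "hot" end of the ordering, so a shard that was downloaded early but read recently survives longer than one that was downloaded later and never revisited.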

🤙 Fetch sample using NumPy style indexing. ( #120 )

Users can now randomly access samples using NumPy-style indexing with StreamingDataset. For example,

import numpy as np
from streaming import StreamingDataset

dataset = StreamingDataset(local=local, remote=remote)

dataset[0]  # Fetch sample 0
dataset[-1]  # Fetch last sample
dataset[[10, 20]]  # Fetch sample 10 and 20
dataset[slice(1, 10, 2)]  # Fetch sample 1, 3, 5, 7, and 9
dataset[5:0:-1]  # Fetch sample 5, 4, 3, 2, 1
dataset[np.array([4, 7])]  # Fetch sample 4 and 7
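
For readers curious how this kind of indexing can be layered on top of a single-sample fetch, the following is a rough sketch. It assumes a hypothetical base class (ArraySketch) whose subclasses provide __len__ and get_item; these names are illustrative and are not confirmed to match the library's own "Array" class.

```python
from operator import index as as_int
from typing import Any


class ArraySketch:
    """Hypothetical base class: subclasses supply __len__ and get_item(idx)."""

    def __len__(self) -> int:
        raise NotImplementedError

    def get_item(self, idx: int) -> Any:
        raise NotImplementedError

    def __getitem__(self, at: Any) -> Any:
        if isinstance(at, slice):
            # slice.indices() resolves omitted and negative bounds for this length.
            return [self.get_item(i) for i in range(*at.indices(len(self)))]
        if isinstance(at, (str, bytes)):
            raise TypeError('indices must be integers, slices, or sequences of them')
        try:
            idx = as_int(at)  # accepts Python ints and NumPy integer scalars
        except TypeError:
            # Lists, tuples, and arrays: index each element recursively.
            return [self[i] for i in at]
        if idx < 0:
            idx += len(self)  # negative indices count from the end
        if not 0 <= idx < len(self):
            raise IndexError(f'index {at} is out of range for size {len(self)}')
        return self.get_item(idx)


class Squares(ArraySketch):
    """Tiny stand-in dataset: sample i is i squared."""

    def __len__(self) -> int:
        return 10

    def get_item(self, idx: int) -> int:
        return idx * idx
```

With this stand-in, Squares()[5:0:-1] visits indices 5, 4, 3, 2, 1, mirroring the slice example above. Returning a plain list for multi-sample requests is a choice made for this sketch.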

🦾 Any S3 compatible object store. ( #265 )

Streaming now supports any S3-compatible object store, that is, any object store that exposes the S3 API. Examples include Cloudflare R2, CoreWeave, and Backblaze B2. Set the environment variable S3_ENDPOINT_URL to the endpoint of the object store you are using. Details on how to configure credentials can be found here.
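
As a small illustration, the variable can be set from Python before the dataset is created. The endpoint below is a placeholder in the format Cloudflare R2 uses; the account ID is not real, and the right URL for your provider is an assumption you should check against its documentation.

```python
import os

# Placeholder endpoint: replace <accountid> with the value from your provider.
# Set this before constructing the dataset so downloads use the custom endpoint.
os.environ['S3_ENDPOINT_URL'] = 'https://<accountid>.r2.cloudflarestorage.com'
```

Exporting the same variable in the shell that launches the job should work equally well.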

🦾 Azure cloud blob storage. ( #256 )

Streaming now supports Azure Blob Storage. Details on how to configure credentials can be found here.

Bug Fixes

Wait for download and ready thread to finish before terminating job. ( #286 )
Fixed length calculation to use resampled epoch size, not underlying num samples. ( #278 )
Fixed mypy errors by adding a py.typed marker file. ( #245 )
Create a new boto3 session per thread to avoid sharing resources. ( #251 )

🔧 API changes

The argument samples_per_epoch has been renamed to epoch_size in StreamingDataset to better distinguish the actual number of underlying samples as serialized from the number of observed samples when iterating (which may differ due to weighting of sub-datasets).
The argument samples has been renamed to choose in Stream to better distinguish the underlying samples from the resampled data.
The argument keep_raw has been removed from StreamingDataset in the process of finalizing the design for shard eviction (see the newly added cache_limit parameter).
The default value of predownload in StreamingDataset has changed: it is now derived from the batch size and the number of canonical nodes instead of the previous constant value of 100_000. This prevents predownloaded shards from being evicted before they are ever used.
The default value of num_canonical_nodes in StreamingDataset is now 64 times the number of nodes of the initial run, instead of the number of nodes of the initial run, to increase data source diversity and improve convergence.
The default value of shuffle_algo in StreamingDataset has changed from py1b to py1s, as py1s requires fewer shards to be downloaded during iteration. More details about the different shuffling algorithms can be found here.
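
For code written against an earlier release, the renames and the removal above can be handled mechanically. The helpers below are a hypothetical sketch and are not part of the library: they take a dict of keyword arguments in the old spelling and return one in the new spelling.

```python
def migrate_dataset_kwargs(kwargs: dict) -> dict:
    """Hypothetical helper: translate pre-0.5.0 StreamingDataset keyword arguments."""
    new = dict(kwargs)  # copy so the caller's dict is left untouched
    if 'samples_per_epoch' in new:
        # Renamed in 0.5.0; an existing epoch_size entry would be overwritten.
        new['epoch_size'] = new.pop('samples_per_epoch')
    new.pop('keep_raw', None)  # removed in 0.5.0; cache_limit governs eviction now
    return new


def migrate_stream_kwargs(kwargs: dict) -> dict:
    """Hypothetical helper: translate pre-0.5.0 Stream keyword arguments."""
    new = dict(kwargs)
    if 'samples' in new:
        new['choose'] = new.pop('samples')  # renamed in 0.5.0
    return new
```

The changed defaults (predownload, num_canonical_nodes, shuffle_algo) need no code change, but passing the old values explicitly is one way to keep earlier behavior if a run depends on it.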

What's Changed

Redesign shard index by @knighton in #236
Propagate an exception raise by a thread to its caller by @karan6181 in #241
Raise descriptive error message when index.json is corrupted by @karan6181 in #242
Rename "samples" to "choose" (distinguish underlying vs resampled) by @knighton in #243
Added py.typed to indicate that the repository has typing annotations by @karan6181 in #245
Add "Array" base class, which provides numpy-style indexing. by @knighton in #120
Better organize code by @knighton in #246
Update readthedocs python version to 3.9 by @karan6181 in #249
Create a new boto3 session per thread by @karan6181 in #251
Bump uvicorn from 0.21.1 to 0.22.0 by @dependabot in #253
Add support for Cloudflare R2 cloud storage by @hlky in #255
Fix typo in documentation's conversion pile.py link by @ouhenio in #259
Add support for Azure cloud storage by @hlky in #256
Fix slack link in readme by @growlix in #262
Bugfix in user_guide.md sample code by @tginart in #263
Add Stream usage example to README by @hanlint in #266
Update Stream documentation by @karan6181 in #267
Update README.md - slack by @ejyuen in #273
Bump fastapi from 0.95.1 to 0.95.2 by @dependabot in #269
Cold shard eviction by @knighton in #219
Update slack link with a URL shortener by @karan6181 in #274
Bump pydantic from 1.10.7 to 1.10.8 by @dependabot in #276
Bump yamllint from 1.31.0 to 1.32.0 by @dependabot in #277
Fix SD length calculation when resampling by @knighton in #278
Fixed performance degradation when not doing shard eviction by @karan6181 in #279
Derived predownload value using batch size and NCN by @karan6181 in #280
Support any S3-compatible object store (R2, Coreweave, Backblaze, etc.) by @abhi-mosaic in #265
Update docs pypi package and Improved documentation by @karan6181 in #281
Change the default number of canonical nodes by @karan6181 in #282
Set predownload value correctly for all usecase by @karan6181 in #283
Add documentation for MDSWriter, conversion scripts, and supported format by @karan6181 in #232
Ensure int64 by @knighton in #284
Wait for thread job to finish and Fixed filelock directory structure by @karan6181 in #286
Bump fastapi from 0.95.2 to 0.96.0 by @dependabot in #287
Bump version to 0.5.0 by @karan6181 in #289
Remove github action workflow concurrency check by @karan6181 in #290

New Contributors

@hlky made their first contribution in #255
@ouhenio made their first contribution in #259
@growlix made their first contribution in #262
@tginart made their first contribution in #263
@hanlint made their first contribution in #266
@abhi-mosaic made their first contribution in #265

Full Changelog: v0.4.1...v0.5.0
