v0.5.0
🚀 Streaming v0.5.0
Streaming v0.5.0 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.5.0
New Features
🆕 Cold Shard Eviction. ( #219 )
Dynamically delete the least recently used shards to keep disk usage under a specified limit. Enable this by setting the StreamingDataset argument cache_limit. See the shuffling guide for more details.
from streaming import StreamingDataset
dataset = StreamingDataset(
    cache_limit='100gb',
    # ... other arguments
)
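To illustrate what least-recently-used eviction means here, the following is a minimal sketch of the bookkeeping only. The ShardCache class, its method names, and the byte sizes are hypothetical and are not the library's implementation; a real cache would also delete the evicted files from disk.

```python
from collections import OrderedDict


class ShardCache:
    """Toy LRU bookkeeping for shard files (hypothetical, not Streaming's code)."""

    def __init__(self, cache_limit_bytes):
        self.cache_limit_bytes = cache_limit_bytes
        self.shards = OrderedDict()  # shard name -> size in bytes, coldest first
        self.usage = 0

    def touch(self, name, size):
        """Record an access to a shard and return the names evicted to make room."""
        if name in self.shards:
            self.shards.move_to_end(name)  # mark as most recently used
            return []
        self.shards[name] = size
        self.usage += size
        evicted = []
        # Evict the coldest shards first, but never the one just requested.
        while self.usage > self.cache_limit_bytes and len(self.shards) > 1:
            cold_name, cold_size = self.shards.popitem(last=False)
            self.usage -= cold_size
            evicted.append(cold_name)
        return evicted


cache = ShardCache(cache_limit_bytes=250)
cache.touch('shard.0', 100)
cache.touch('shard.1', 100)
cache.touch('shard.0', 100)            # shard.0 becomes the most recently used
evicted = cache.touch('shard.2', 100)  # usage would be 300 > 250, so evict
print(evicted)
```

Here shard.1 is the coldest entry when shard.2 arrives, so it is the one dropped.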
🤙 Fetch sample using NumPy style indexing. ( #120 )
Users can now randomly access samples in a StreamingDataset using NumPy-style indexing. For example:
import numpy as np
from streaming import StreamingDataset
dataset = StreamingDataset(local=local, remote=remote)
dataset[0] # Fetch sample 0
dataset[-1] # Fetch last sample
dataset[[10, 20]] # Fetch sample 10 and 20
dataset[slice(1, 10, 2)] # Fetch sample 1, 3, 5, 7, and 9
dataset[5:0:-1] # Fetch sample 5, 4, 3, 2, 1
dataset[np.array([4, 7])] # Fetch sample 4 and 7
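For readers curious how this kind of indexing can be layered on top of a single "fetch one sample" method, here is a rough sketch. IndexableSamples and get_item are hypothetical names chosen for illustration, and the in-memory list stands in for shard-backed storage; this is not the library's Array class.

```python
import operator


class IndexableSamples:
    """Hypothetical container: only get_item() knows how to load one sample."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __len__(self):
        return len(self._samples)

    def get_item(self, idx):
        return self._samples[idx]

    def __getitem__(self, at):
        if isinstance(at, slice):
            # slice.indices() resolves defaults and negative bounds for us.
            return [self[i] for i in range(*at.indices(len(self)))]
        if isinstance(at, (str, bytes)):
            raise TypeError('indices must be integers, slices, or sequences of them')
        try:
            # Accepts Python ints and anything with __index__ (e.g. NumPy integers).
            idx = operator.index(at)
        except TypeError:
            # Lists, tuples, and arrays: resolve each element recursively.
            return [self[i] for i in at]
        if idx < 0:
            idx += len(self)  # negative indices count from the end
        if not 0 <= idx < len(self):
            raise IndexError(f'index {at} out of range for size {len(self)}')
        return self.get_item(idx)


data = IndexableSamples(f'sample {i}' for i in range(12))
print(data[-1], data[[10, 11]], data[5:0:-1])
```

Routing every form of index through one scalar path keeps bounds checking in a single place.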
🦾 Any S3 compatible object store. ( #265 )
Streaming now supports any S3-compatible object store, that is, any object store that exposes the S3 API. Examples include Cloudflare R2, Coreweave, and Backblaze B2. Set the environment variable S3_ENDPOINT_URL to the endpoint of the object store you are using. Details on how to configure credentials can be found here.
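As a rough sketch of how an endpoint override like this is typically consumed, the helper below turns S3_ENDPOINT_URL into keyword arguments for an S3 client. The helper name s3_client_kwargs and the example URL are hypothetical, and how Streaming reads the variable internally is an assumption; endpoint_url itself is a standard boto3 client parameter.

```python
import os


def s3_client_kwargs(env=None):
    """Build client kwargs, adding endpoint_url only when the variable is set."""
    env = os.environ if env is None else env
    kwargs = {}
    endpoint = env.get('S3_ENDPOINT_URL')
    if endpoint:
        kwargs['endpoint_url'] = endpoint
    return kwargs


# Hypothetical endpoint; substitute the URL documented by your provider.
custom = s3_client_kwargs({'S3_ENDPOINT_URL': 'https://s3.example.com'})
default = s3_client_kwargs({})
print(custom, default)

# With boto3 installed, the result could be passed along like this:
# client = boto3.client('s3', **s3_client_kwargs())
```

When the variable is unset, the kwargs stay empty and the client falls back to the default AWS endpoint.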
🦾 Azure cloud blob storage. ( #256 )
Streaming now supports Azure Blob Storage. Details on how to configure credentials can be found here.
Bug Fixes
- Wait for the download and ready threads to finish before terminating the job. ( #286 )
- Fixed length calculation to use resampled epoch size, not underlying num samples. ( #278 )
- Fixed mypy errors by adding a py.typed marker file. ( #245 )
- Create a new boto3 session per thread to avoid sharing resources. ( #251 )
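The last fix reflects a general pattern: give each thread its own session object instead of sharing one. One common way to do that is thread-local storage, sketched below. The get_session helper is hypothetical and the plain object factory stands in for something like boto3.session.Session; this is not necessarily how the fix was implemented.

```python
import threading

_local = threading.local()


def get_session(factory):
    """Return the calling thread's session, creating it on first use."""
    if not hasattr(_local, 'session'):
        _local.session = factory()
    return _local.session


results = {}


def worker(name):
    # Two calls in the same thread should yield the same session object.
    results[name] = (get_session(object), get_session(object))


threads = [threading.Thread(target=worker, args=(name,)) for name in ('a', 'b')]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(results['a'][0] is results['a'][1])  # reused within a thread
print(results['a'][0] is results['b'][0])  # not shared across threads
```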
🔧 API changes
- The argument samples_per_epoch has been renamed to epoch_size in StreamingDataset to better distinguish the actual number of underlying samples as serialized from the number of observed samples when iterating (which may differ due to weighting of sub-datasets).
- The argument samples has been renamed to choose in Stream to better distinguish underlying samples from resampled data.
- The argument keep_raw has been removed from StreamingDataset in the process of finalizing the design for shard eviction (see the newly added cache_limit parameter).
- The default value of predownload in StreamingDataset was updated; it is now derived from the batch size and the number of canonical nodes instead of the previous constant value of 100_000. This prevents predownloaded shards from being evicted before they are ever used.
- The default value of num_canonical_nodes in StreamingDataset was changed from the number of nodes of the initial run to 64 times that number, to increase data source diversity and improve convergence.
- The default value of shuffle_algo in StreamingDataset was changed from py1b to py1s, as it requires fewer shards to be downloaded during iteration. More details about the different shuffling algorithms can be found here.
What's Changed
- Redesign shard index by @knighton in #236
- Propagate an exception raise by a thread to its caller by @karan6181 in #241
- Raise descriptive error message when index.json is corrupted by @karan6181 in #242
- Rename "samples" to "choose" (distinguish underlying vs resampled) by @knighton in #243
- Added py.typed to indicate that the repository has typing annotations by @karan6181 in #245
- Add "Array" base class, which provides numpy-style indexing. by @knighton in #120
- Better organize code by @knighton in #246
- Update readthedocs python version to 3.9 by @karan6181 in #249
- Create a new boto3 session per thread by @karan6181 in #251
- Bump uvicorn from 0.21.1 to 0.22.0 by @dependabot in #253
- Add support for Cloudflare R2 cloud storage by @hlky in #255
- Fix typo in documentation's conversion pile.py link by @ouhenio in #259
- Add support for Azure cloud storage by @hlky in #256
- Fix slack link in readme by @growlix in #262
- Bugfix in user_guide.md sample code by @tginart in #263
- Add Stream usage example to README by @hanlint in #266
- Update Stream documentation by @karan6181 in #267
- Update README.md - slack by @ejyuen in #273
- Bump fastapi from 0.95.1 to 0.95.2 by @dependabot in #269
- Cold shard eviction by @knighton in #219
- Update slack link with a URL shortener by @karan6181 in #274
- Bump pydantic from 1.10.7 to 1.10.8 by @dependabot in #276
- Bump yamllint from 1.31.0 to 1.32.0 by @dependabot in #277
- Fix SD length calculation when resampling by @knighton in #278
- Fixed performance degradation when not doing shard eviction by @karan6181 in #279
- Derived predownload value using batch size and NCN by @karan6181 in #280
- Support any S3-compatible object store (R2, Coreweave, Backblaze, etc.) by @abhi-mosaic in #265
- Update docs pypi package and Improved documentation by @karan6181 in #281
- Change the default number of canonical nodes by @karan6181 in #282
- Set predownload value correctly for all usecase by @karan6181 in #283
- Add documentation for MDSWriter, conversion scripts, and supported format by @karan6181 in #232
- Ensure int64 by @knighton in #284
- Wait for thread job to finish and Fixed filelock directory structure by @karan6181 in #286
- Bump fastapi from 0.95.2 to 0.96.0 by @dependabot in #287
- Bump version to 0.5.0 by @karan6181 in #289
- Remove github action workflow concurrency check by @karan6181 in #290
New Contributors
- @hlky made their first contribution in #255
- @ouhenio made their first contribution in #259
- @growlix made their first contribution in #262
- @tginart made their first contribution in #263
- @hanlint made their first contribution in #266
- @abhi-mosaic made their first contribution in #265
Full Changelog: v0.4.1...v0.5.0