What's Changed
- split the code and move to a monorepo by @severo in #210
- Docker by @severo in #214
- Send docker images to ecr by @severo in #218
- Rename to datasets server by @severo in #221
- Use kubernetes by @severo in #227
- Add datasets-server-worker to the Kube cluster by @severo in #236
- Nginx proxy by @severo in #245
- feat: 🎸 upgrade datasets to 2.2.0 by @severo in #246
- feat: 🎸 upgrade the docker images to use datasets 2.2.0 by @severo in #247
- feat: 🎸 upgrade datasets to 2.2.1 by @severo in #253
- feat: 🎸 use images with datasets 2.2.1 by @severo in #254
- Add metrics by @severo in #258
- feat: 🎸 upgrade images to get /prometheus endpoint by @severo in #262
- fix: 🐛 add support for mongodb+srv:// URLs using dnspython by @severo in #263
- Prod env by @severo in #266
- feat: 🎸 upgrade images by @severo in #267
- fix: 🐛 fix loop by @severo in #268
- feat: 🎸 upgrade image by @severo in #269
- fix: 🐛 fix the query to get the list of jobs in the queue by @severo in #271
- Upgrade worker by @severo in #272
- Add service monitor by @severo in #260
- fix: 🐛 fix nfs mount by @severo in #274
- feat: 🎸 add the admin service (to run admin scripts) by @severo in #275
- feat: 🎸 enable monitoring in prod by @severo in #276
- fix: 🐛 the block list must be a comma-separated list by @severo in #278
- Fix ram in prod by @severo in #280
- feat: 🎸 upgrade images by @severo in #281
- fix: 🐛 disable the metrics about cache and queue by @severo in #282
- feat: 🎸 upgrade images by @severo in #283
- test: 💍 fix test by @severo in #284
- feat: 🎸 update prod values by @severo in #285
- perf: ⚡️ reduce the number of workers by @severo in #287
- fix: 🐛 increase resources for api, and block big datasets by @severo in #289
- feat: 🎸 upgrade datasets to 2.2.2 (and minor upgrades) by @severo in #290
- feat: 🎸 update docker images by @severo in #291
- Fix valid endpoint query by @severo in #292
- Update docker images by @severo in #294
- feat: 🎸 add indexes in mongo by @severo in #295
- feat: 🎸 update docker images by @severo in #296
- Reenable metrics by @severo in #298
- feat: 🎸 update docker images by @severo in #299
- fix: 🐛 disable cache and queue metrics for now by @severo in #300
- feat: 🎸 update the docker images by @severo in #303
- perf: ⚡️ increase the number of replicas for the API by @severo in #304
- feat: 🎸 block two datasets by @severo in #305
- ci: 🎡 use cache (gha) when building the docker images by @severo in #313
- ci: 🎡 use cache with poetry by @severo in #314
- ci: 🎡 launch e2e after docker build, and use the images by @severo in #316
- feat: 🎸 use only one uvicorn worker per api pod by @severo in #317
- feat: 🎸 adapt the value of resources based on monitoring by @severo in #321
- feat: 🎸 upgrade dependencies by @severo in #322
- Respond to datasets-server.huggingface.co by @severo in #328
- Optimize the query behind /splits by @severo in #329
- feat: 🎸 update the docker image for api by @severo in #330
- feat: 🎸 use the tls certificate with two domains by @severo in #331
- fix: 🐛 optimize the query to get the list of valid datasets by @severo in #333
- feat: 🎸 update api docker image by @severo in #335
- feat: 🎸 update dependencies to update libcache and libqueue by @severo in #336
- feat: 🎸 update docker image by @severo in #337
- feat: 🎸 add an index to optimize the distinct query by @severo in #338
- feat: 🎸 update docker image by @severo in #339
- Add metrics endpoint to admin by @severo in #340
- Expose admin metrics by @severo in #341
- fix: 🐛 give every servicemonitor its name by @severo in #342
- ci: 🎡 use reusable workflows, and conditional runs on path by @severo in #344
- Be more explicit about the current docker images by @severo in #345
- Be more explicit about the current docker images by @severo in #346
- ci: 🎡 fix the file extension by @severo in #347
- ci: 🎡 checkout the repo before accessing a file by @severo in #348
- ci: 🎡 fix missing replace by @severo in #349
- feat: 🎸 remove old domain datasets-server.huggingface.tech by @severo in #351
- Remove the datasets blocklist and re-enqueue server errors by @severo in #352
- feat: 🎸 upgrade libqueue and libcache by @severo in #353
- Fix worker by @severo in #354
- feat: 🎸 update images by @severo in #356
- feat: 🎸 increase resources for the workers by @severo in #357
- feat: 🎸 update the resources by trial and error by @severo in #358
- fix: 🐛 adapt the pods resources by @severo in #359
- feat: 🎸 use the new certificate by @severo in #360
- fix: 🐛 ensure the NUMBA_CACHE_DIR is set by @severo in #361
- fix: 🐛 use a new name for the numba cache preparation by @severo in #362
- Allow none path in audio by @severo in #363
- fix: 🐛 don't mark empty splits as stalled by @severo in #366
- docs: ✏️ add doc about k8 by @severo in #370
- Fix dockerfiles by @severo in #372
- Add timestamp type by @severo in #374
- feat: 🎸 upgrade datasets to 2.3.1 by @severo in #375
- fix: 🐛 fix the log name by @severo in #377
- feat: 🎸 upgrade datasets (and dependencies) by @severo in #381
- feat: 🎸 adjust the prod resources by @severo in #383
- feat: use new cache locations (to have empty ones) by @severo in #385
- feat: 🎸 increase the log verbosity to help debug by @severo in #405
- fix: 🐛 rename "stalled" into "stale" by @severo in #406
- feat: 🎸 revert docker images to previous state by @severo in #408
- Revert two commits by @severo in #409
- Fallback to other image formats if JPEG generation fails by @mariosasko in #410
- Fix stale by @severo in #411
- Don't share the cache for the datasets modules by @severo in #414
- fix: 🐛 set the modules cache inside /tmp by @severo in #418
- feat: 🎸 add basis for the docs by @severo in #421
- Create the OpenAPI spec by @severo in #424
- feat: 🎸 publish openapi.json from the reverse proxy by @severo in #426
- wording tweak by @julien-c in #433
- Add /first-rows endpoint by @severo in #431
- 442 500 error if not ready by @severo in #443
- 404 improve error messages by @severo in #444
- Add two endpoints to openapi by @severo in #445
- docs: ✏️ multiple fixes on the openapi spec by @severo in #448
- docs: ✏️ nit by @severo in #449
- fix: 🐛 add cpu for the first-rows worker by @severo in #452
- fix: 🐛 increase cpu limit for split worker, and reduce per ds by @severo in #453
- Improve technical routes response by @severo in #454
- feat: 🎸 update docker images by @severo in #456
- feat: 🎸 move two technical endpoints from api to admin by @severo in #457
- fix: 🐛 remove the conflict for the admin domain bw dev and prod by @severo in #460
- fix: 🐛 fix domains (we had to ask for them to Route53) by @severo in #461
- refactor: 💡 move ingress to the root in values by @severo in #462
- feat: 🎸 add a script to refresh the canonical datasets by @severo in #463
- feat: 🎸 move the admin endpoints under /admin/ by @severo in #467
- feat: 🎸 revert to remove the /admin prefix by @severo in #469
- feat: 🎸 upgrade datasets to 2.4.0 by @severo in #470
- fix: 🐛 fix target name by @severo in #471
- feat: 🎸 fix the servicemonitor url by @severo in #472
- chore: 🤖 move /infra/charts/datasets-server to /chart by @severo in #476
- feat: 🎸 change the format of the error responses by @severo in #477
- feat: 🎸 add a target by @severo in #478
- feat: 🎸 use main instead of master to load datasets by @severo in #479
- Stop the count by @lhoestq in #481
- Update ephemeral namespace by @severo in #483
- Add error code by @severo in #482
- docs: ✏️ The docs have been moved to notion.so by @severo in #485
- Add cache reports endpoint by @severo in #487
- feat: 🎸 update docker by @severo in #489
- Optimize reports pagination by @severo in #490
- Add error code to metrics by @severo in #492
- fix: 🐛 endpoint is reserved in prometheus by @severo in #494
- Allow multiple uvicorn workers by @severo in #497
- Add auth to api endpoints by @severo in #495
- Use hub ci for tests by @severo in #499
- ci: 🎡 separate docker workflows by @severo in #500
- ci: 🎡 copy less files to the dockerfiles by @severo in #501
- refactor: 💡 use pathlib instead of os.path by @severo in #503
- Add valid next and is valid next by @severo in #504
- Add valid next and is valid next to the doc by @severo in #505
- docs: ✏️ fix duplicate paths by @severo in #506
- docs: ✏️ add the expected X-Error-Code values by @severo in #508
- Add expected x error code headers by @severo in #509
- docs: ✏️ fix list and sequence features by @severo in #512
- test: 💍 test cookie authentication by @severo in #514
- Private token handling by @LysandreJik in #517
- Use fixtures in tests by @severo in #515
- test: 💍 enable two tests by @severo in #519
- Reduce responses size by @severo in #520
- Update tools by @severo in #521
- ci: 🎡 fix the names to have a better coherence by @severo in #522
- ci: 🎡 restore Makefile in the docker image by @severo in #523
- feat: 🎸 rename the tags of the /admin/metrics by @severo in #524
- ci: 🎡 only copy the scripts targets to the Makefile in docker by @severo in #527
- feat: 🎸 change the prod resources by @severo in #529
- fix: 🐛 handle the case where two jobs exist for the same ds by @severo in #530
- feat: 🎸 give priority to datasets that have no started jobs yet by @severo in #531
- Fix the `datasets` config parameters by @lhoestq in #533
- feat: 🎸 tweak prod parameters by @severo in #536
- Update safety by @severo in #537
- 👽️ moon-landing will return 404 for auth-check instead of 403 by @coyotte508 in #535
- feat: 🎸 return 404 for /healthcheck and /metrics by @severo in #541
- feat: 🎸 add auth for /admin by @severo in #542
- fix: 🐛 add missing annotations by @severo in #543
- feat: 🎸 update certificate by @severo in #544
- feat: 🎸 support OPTIONS requests (CORS pre-flight requests) by @severo in #538
- test: 💍 fix e2e tests since /healthcheck is not public anymore by @severo in #547
- feat: 🎸 remove deprecated workers (splits, datasets) by @severo in #549
- docs: ✏️ update the docs by @severo in #550
- feat: 🎸 remove temporary routes (-next) by @severo in #551
- Use whoami to protect admin routes by @severo in #553
- docs: ✏️ remove extra char by @severo in #556
- docs: ✏️ add a mention to postman by @severo in #557
- docs: ✏️ add reference to page on RapidAPI by @severo in #558
- chore: 🤖 add a stale bot by @severo in #565
- rework doc by @severo in #566
- feat: 🎸 don't close issues with tag "keep" by @severo in #569
- docs: ✏️ update and simplify the README/INSTALL/CONTRIBUTING doc by @severo in #570
- chore: 🤖 add an issue template by @severo in #573
- refactor: 💡 remove unused value by @severo in #574
- feat: 🎸 remove support for .env files by @severo in #572
- chore: 🤖 add license and other files before going opensource by @severo in #571
- docs: ✏️ fix the docs to only use datasets server, not ds api by @severo in #575
- refactor: 💡 remove dead code and TODO comments by @severo in #576
- Fix dependency vulnerabilities by @severo in #577
- Use json logs in nginx by @severo in #579
- feat: 🎸 upgrade datasets to 2.5.1 by @severo in #580
- Hot fix webhook v1 by @severo in #581
- Fix private to public by @severo in #582
- Simplify code snippet in docs by @albertvillanova in #583
- docs: ✏️ improve the onboarding by @severo in #586
- Details by @severo in #589
- fix: 🐛 restore the check on the webhook payload by @severo in #591
- 587 fix list of images or audio by @severo in #592
- fix: 🐛 fix the dependencies for macos m1/m2 by @severo in #593
- ci: push the images to Docker Hub in the public organization hf by @severo in #595
- Add section for macos by @severo in #597
- feat: 🎸 add a query on the features of the datasets by @severo in #598
- feat: 🎸 change the format of the image cells in /first-rows by @severo in #600
- docs: ✏️ add sections by @severo in #596
- Support Sequence of dicts by @severo in #603
- chore: 🤖 upgrade safety by @severo in #604
- fix: 🐛 fix tests for the Sequence cells by @severo in #605
- test: 💍 add tests for missing fields and None value by @severo in #606
- feat: 🎸 upgrade hub webhook client to v2 by @severo in #607
- feat: 🎸 8 splits workers by @severo in #609
- feat: 🎸 make the queue agnostic to the types of jobs by @severo in #608
- feat: 🎸 fix vulnerabilities by upgrading tensorflow by @severo in #610
- feat: 🎸 remove obsolete DATASETS_REVISION by @severo in #611
- Manage the environment variables and configuration more robustly by @severo in #612
- feat: 🎸 change the number of pods by @severo in #613
- refactor: 💡 setup everything in the configs by @severo in #615
- Details by @severo in #616
- Fix metrics by @severo in #618
- fix: 🐛 mount the assets directory by @severo in #619
- Fix api metrics by @severo in #620
- test: 💍 missing change in e2e by @severo in #621
- fix: 🐛 fix hf-token by @severo in #622
- feat: 🎸 sort the configs alphabetically by @severo in #623
- Store and compare worker+dataset repo versions by @severo in #624
- feat: 🎸 only sleep for 5 seconds by @severo in #625
- Limit the started jobs per "dataset namespace" by @severo in #626
- feat: 🎸 change mongo indexes (following cloud recommendations) by @severo in #627
- Update pr docs actions by @mishig25 in #632
- ci: 🎡 remove the token for codecov since the repo is public by @severo in #633
- Add migration job by @severo in #636
- fix: 🐛 fix the truncation by @severo in #638
- feat: 🎸 update dependencies to fix vulnerabilities by @severo in #639
- Revert "Update pr docs actions" by @mishig25 in #641
- Force job by @severo in #642
- feat: 🎸 upgrade huggingface_hub to 0.11.0 by @severo in #643
- Standardize Helms Charts by @XciD in #635
- Refactor common cache entry by @severo in #634
- feat: 🎸 upgrade datasets by @severo in #644
- Replace safety with pip audit by @severo in #645
- feat: 🎸 upgrade to datasets 2.7.1 by @severo in #646
- fix: 🐛 install missing dependency by @severo in #647
- Implement generic processing steps by @severo in #650
- Fix ask access by @severo in #652
- feat: 🎸 cancel-jobs must be a POST request, not a GET by @severo in #653
- Simplify docker by @severo in #654
- Merge the workers that rely on the datasets library by @severo in #656
- feat: 🎸 upgrade from python 3.9.6 to 3.9.15 by @severo in #658
- feat: 🎸 add parquet worker by @severo in #651
- feat: 🎸 update the production parameters by @severo in #662
- feat: 🎸 add method to get the duration of the jobs per dataset by @severo in #663
- docs: ✏️ fix doc by @severo in #664
- Fix empty commits by @severo in #665
- feat: 🎸 upgrade datasets to 2.8.0 by @severo in #666
- feat: 🎸 give each worker its own version + upgrade to 2.0.0 by @severo in #667
- Split Worker into WorkerLoop, WorkerFactory and Worker by @severo in #668
- chore: 🤖 speed-up docker build by @severo in #669
- Small tweaks on Helm charts by @n1t0 in #649
- feat: 🎸 update the HF webhook content by @severo in #671
- feat: 🎸 allow more concurrent jobs for the same namespace by @severo in #675
- fix: 🐛 only check webhook payload for what we are interested in by @severo in #676
- ci: 🎡 fix app token by @severo in #678
- Create children in generic worker by @severo in #677
- Create endpoint /dataset-info by @severo in #670
- feat: 🎸 add /sizes by @severo in #679
- chore: 🤖 add --no-cache (poetry) and --no-cache-dir (pip) by @severo in #680
- feat: 🎸 increase number of workers for a moment by @severo in #681
- feat: 🎸 increase resources by @severo in #682
- feat: 🎸 increase resources by @severo in #683
- fix: 🐛 fix memory specification + increase pods in /parquet by @severo in #684
- chore: 🤖 update resources by @severo in #686
- feat: 🎸 block more datasets, and allow more /first-rows per ns by @severo in #690
- feat: 🎸 add support for pdf2image by @severo in #691
- feat: 🎸 replace Queue.add_job with Queue.upsert_job by @severo in #694
- feat: 🎸 launch children jobs even when skipped by @severo in #695
- Add a new route: /cache-reports-with-content by @severo in #696
- feat: 🎸 reduce logs level from DEBUG to INFO by @severo in #697
- feat: 🎸 block more datasets in /parquet-and-dataset-info by @severo in #698
- refactor: 💡 set libcommon as an "editable" dependency by @severo in #699
- Update hfh by @severo in #700
- ci: 🎡 launch CI when libcommon has been modified by @severo in #703
- Configs and splits by @severo in #702
- Update index.mdx by @keleffew in #693
- feat: 🎸 make /first-rows depend on /split-names, not /splits by @severo in #706
- Add priority field to queue by @severo in #705
- fix: 🐛 fix migration script by @severo in #707
- feat: 🎸 add a /backfill admin endpoint by @severo in #708
- Update poetry lock file format to 2.0 by @albertvillanova in #714
- ci: 🎡 build the images before running the e2e tests by @severo in #716
- ci: 🎡 build and push the docker images only on push to main by @severo in #717
- Update datasets to 2.9.0 by @albertvillanova in #715
- fix: 🐛 don't check if dataset is supported when we know it is by @severo in #720
- Trigger CI by PRs from forks by @albertvillanova in #713
- feat: 🎸 update docker images by @severo in #723
- fix: 🐛 add a missing default value for org name in admin/ by @severo in #722
- Refactoring for Private hub by @rtrompier in #719
- feat: publish helm chart on HF internal registry by @rtrompier in #729
- fix: 🐛 fix two labels by @severo in #730
- feat: 🎸 adapt number of replicas to flush the queues by @severo in #733
- feat: 🎸 add indexes, based on recommendations from mongo cloud by @severo in #728
- fix: remove mongo migration job execution on pre-install hook by @rtrompier in #738
- Add gradio admin interface by @lhoestq in #732
- fix: 🐛 disable the mongodbMigration job for now by @severo in #743
- fix admin ui requirements.txt by @lhoestq in #742
- fix: 🐛 fix the migration scripts to be able to run on new base by @severo in #747
- Add HF_TOKEN env var for admin ui by @lhoestq in #746
- feat: 🎸 update docker images by @severo in #748
- remove docker-images.yaml, and fix dev.yaml by @severo in #752
- refactor: 💡 remove dead code by @severo in #757
- test: 💍 ensure the database is ready in the tests by @severo in #759
- ci: 🎡 only run on PR and on main by @severo in #758
- update the logic to skip a job by @severo in #761
- Adding custom exception when cache insert fails because of too many columns by @AndreaFrancis in #749
- Add refresh dataset ui by @lhoestq in #760
- Create doc for every PR by @lhoestq in #768
- Locally use volumes for workers code by @lhoestq in #766
- refactor: 💡 hard-code the value of the fallback by @severo in #773
- Use hub-ci locally by @lhoestq in #774
- Fix CI mypy error: "WorkerFactory" has no attribute "app_config" by @albertvillanova in #778
- Pass processing step to worker by @severo in #779
- Make workers' errors derive from WorkerError by @albertvillanova in #772
- ci: 🎡 the e2e tests must now be run on any code change by @severo in #775
- Updating docker image hash by @AndreaFrancis in #783
- remove first rows fallback variable by @JatinKumar001 in #771
- ci: 🎡 run e2e tests only once for a push or pull-request by @severo in #786
- Fix dockerfiles by @severo in #787
- feat: 🎸 add logs when an unexpected error occurs by @severo in #789
- feat: remove job after 5 minutes by @rtrompier in #788
- Allow to use http instead of https by @rtrompier in #798
- Use shared action to publish helm chart by @rtrompier in #799
- Dataset info big content error by @AndreaFrancis in #780
- feat: 🎸 add concept of Resource by @severo in #784
- feat: 🎸 ensure immutability of the configs by @severo in #790
- use classmethod for factories instead of staticmethod by @severo in #791
- Upgrade dependencies, fix kenlm by @severo in #803
- Check dataset connection before migration job (and other apps) by @severo in #792
- Add admin ui url by @lhoestq in #801
- Move workers/datasets_based to services/worker by @severo in #800
- Rename obsolete mentions to datasets_based by @severo in #805
- Generic worker by @severo in #802
- chore: 🤖 add VERSION file by @severo in #807
- Update chart by @severo in #808
- fix: 🐛 ensure all the workers have the same access to the disk by @severo in #811
- fix: 🐛 add missing volumes by @severo in #812
- fix: 🐛 add missing config by @severo in #813
New Contributors
- @mariosasko made their first contribution in #410
- @julien-c made their first contribution in #433
- @lhoestq made their first contribution in #481
- @LysandreJik made their first contribution in #517
- @coyotte508 made their first contribution in #535
- @albertvillanova made their first contribution in #583
- @mishig25 made their first contribution in #632
- @XciD made their first contribution in #635
- @n1t0 made their first contribution in #649
- @keleffew made their first contribution in #693
- @rtrompier made their first contribution in #719
- @AndreaFrancis made their first contribution in #749
- @JatinKumar001 made their first contribution in #771
Full Changelog: 0.20.2...0.21.0