forked from JohnSnowLabs/spark-nlp
CHANGELOG
========
3.3.2
========
----------------
New Features
----------------
* Comet.ml integration with Spark NLP
* Introducing BertForSequenceClassification annotator
----------------
Bug Fixes
----------------
* Fix EntityRulerApproach name from import
* Fix missing EntityRulerModel in ResourceDownloader
* Fix NerDLApproach logs format on Databricks
* Fix a missing batchSize param in NerDLModel that degraded GPU performance
========
3.3.1
========
----------------
New Features
----------------
* Introducing the EntityRuler annotator, which takes either a JSON or CSV ontology file that maps entities to patterns. You can implement a purely rule-based entity recognition system with EntityRuler; it can be saved as a Model and reused in other pipelines to annotate your documents against your knowledge base.
----------------
Bug Fixes
----------------
* Fix compatibility issue between NerOverwriter and AlbertForTokenClassification, BertForTokenClassification, DistilBertForTokenClassification, LongformerForTokenClassification, RoBertaForTokenClassification, XlmRoBertaForTokenClassification, XlnetForTokenClassification annotators
* Fix a bug in ContextSpellCheckerApproach annotator failing to find an appropriate TF graph
* Fix a bug in ContextSpellCheckerModel not being able to load a trained model
* Fix token alignment with token pieces in BertEmbeddings resulting in missing vectors with Unicode characters
* Add the missing pretrained NER models for the XlmRoBertaForTokenClassification annotator
* Add the missing pretrained NER models for the LongformerForTokenClassification annotator
----------------
Backward compatibility
----------------
* Renaming YakeModel to YakeKeywordExtraction to represent the actual purpose of this annotator more clearly.
========
3.3.0
========
----------------
Major features and improvements
----------------
* **NEW:** Starting with the Spark NLP 3.3.0 release, there is no size limit when you import TensorFlow models! You can now import TF Hub & HuggingFace models larger than 2 GB.
* **NEW:** Up to 50x faster when saving Spark NLP models and pipelines! 🚀 We have improved the way we package TensorFlow SavedModel while saving Spark NLP models & pipelines. For instance, it used to take up to 10 minutes to save the `xlm_roberta_base` model prior to Spark NLP 3.3.0, and now it takes at most 15 seconds!
* **NEW:** Introducing **AlbertForTokenClassification** annotator in Spark NLP 🚀. `AlbertForTokenClassification` can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `AlbertForTokenClassification` or `TFAlbertForTokenClassification` in HuggingFace 🤗
* **NEW:** Introducing **XlnetForTokenClassification** annotator in Spark NLP 🚀. `XlnetForTokenClassification` can load XLNet Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLNetForTokenClassification` or `TFXLNetForTokenClassification` in HuggingFace 🤗
* **NEW:** Introducing **RoBertaForTokenClassification** annotator in Spark NLP 🚀. `RoBertaForTokenClassification` can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `RobertaForTokenClassification` or `TFRobertaForTokenClassification` in HuggingFace 🤗
* **NEW:** Introducing **XlmRoBertaForTokenClassification** annotator in Spark NLP 🚀. `XlmRoBertaForTokenClassification` can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForTokenClassification` or `TFXLMRobertaForTokenClassification` in HuggingFace 🤗
* **NEW:** Introducing **LongformerForTokenClassification** annotator in Spark NLP 🚀. `LongformerForTokenClassification` can load Longformer Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `LongformerForTokenClassification` or `TFLongformerForTokenClassification` in HuggingFace 🤗
* **NEW:** Introducing new ResourceDownloader functions to easily look for pretrained models & pipelines inside Spark NLP (Python and Scala). You can filter models or pipelines via `language`, `version`, or the name of the `annotator`
* Welcoming [Databricks Runtime 9.1 LTS](https://docs.databricks.com/release-notes/runtime/9.1.html), 9.1 ML, and 9.1 ML with GPU
* Fix the wrong version being returned by sparknlp.version()
----------------
Bug Fixes
----------------
* Fix a bug in RoBertaEmbeddings when all special tokens were identical
* Fix a bug in RoBertaEmbeddings when special token contained valid regex
* Fix a bug that led to a memory leak inside the NorvigSweeting spell checker. This caused problems with pretrained pipelines such as `explain_document_ml` and `explain_document_dl` when some inputs
* Fix the wrong types being assigned to `minCount` and `classCount` in Python for `ContextSpellCheckerApproach` annotator
* Fix `explain_document_ml` pretrained pipeline for Spark NLP 3.x on Apache Spark 2.x
========
3.2.3
========
----------------
Bug Fixes & Enhancements
----------------
* Add delimiter feature to CoNLL() class to support other delimiters in CoNLL files https://github.com/JohnSnowLabs/spark-nlp/pull/5934
* Add support for IOB in addition to IOB2 format in GraphExtraction https://github.com/JohnSnowLabs/spark-nlp/pull/6101
* Change YakeModel output type from KEYWORD to CHUNK to have more available features after the YakeModel annotator such as Chunk2Doc or ChunkEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/6065
* Fix the default language for XlmRoBertaSentenceEmbeddings pretrained model in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6057
* Fix SentenceEmbeddings issue of concatenating sentences instead of handling each corresponding sentence https://github.com/JohnSnowLabs/spark-nlp/pull/6060
* Fix GraphExtraction usage in LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/6101
* Fix compatibility issue in `explain_document_ml` pipeline
* Better import process for corrupted merges file in Longformer tokenizer https://github.com/JohnSnowLabs/spark-nlp/pull/6083
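The difference between the IOB and IOB2 tagging schemes mentioned in this release can be illustrated with a small conversion routine. This is a minimal sketch in plain Python, not the code used inside the annotator; the function name `iob_to_iob2` is hypothetical. In IOB, a chunk may open with an `I-` tag and `B-` is only needed to separate adjacent chunks of the same type, whereas in IOB2 every chunk opens with `B-`.

```python
from typing import List


def iob_to_iob2(tags: List[str]) -> List[str]:
    """Convert IOB tags to IOB2 so that every chunk starts with a B- tag."""
    converted = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            # An I- tag opens a new chunk when the previous token is outside
            # any chunk or belongs to a chunk of a different entity type.
            if previous == "O" or previous[2:] != entity:
                tag = "B-" + entity
        converted.append(tag)
        previous = tag
    return converted


if __name__ == "__main__":
    print(iob_to_iob2(["I-PER", "I-PER", "O", "I-LOC", "B-LOC", "I-ORG"]))
```

Tags that are already valid IOB2 pass through unchanged, so the routine can be applied to input in either scheme.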
========
3.2.2
========
----------------
New Features
----------------
* A new RoBertaSentenceEmbeddings annotator for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
* A new XlmRoBertaSentenceEmbeddings annotator for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
* Add support for AWS MFA via Spark NLP configuration
* Add new AWS configs to Spark NLP configuration when using a private S3 bucket to store logs for training models or access TF graphs needed in NerDLApproach
* spark.jsl.settings.aws.credentials.access_key_id
* spark.jsl.settings.aws.credentials.secret_access_key
* spark.jsl.settings.aws.credentials.session_token
* spark.jsl.settings.aws.s3_bucket
* spark.jsl.settings.aws.region
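The configuration keys listed above could be collected in one place before a Spark session is created. The following is a minimal sketch in plain Python: the key names come from this changelog, while the helper names (`spark_nlp_aws_conf`, `as_submit_args`) and all values are hypothetical placeholders. Treating the session token as optional is an assumption based on it only being required for temporary AWS credentials.

```python
from typing import Dict, List, Optional


def spark_nlp_aws_conf(
    access_key_id: str,
    secret_access_key: str,
    s3_bucket: str,
    region: str,
    session_token: Optional[str] = None,
) -> Dict[str, str]:
    """Build a dict of the Spark NLP AWS settings named in this release."""
    conf = {
        "spark.jsl.settings.aws.credentials.access_key_id": access_key_id,
        "spark.jsl.settings.aws.credentials.secret_access_key": secret_access_key,
        "spark.jsl.settings.aws.s3_bucket": s3_bucket,
        "spark.jsl.settings.aws.region": region,
    }
    # Assumption: the session token is only set for temporary credentials.
    if session_token is not None:
        conf["spark.jsl.settings.aws.credentials.session_token"] = session_token
    return conf


def as_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the settings as `--conf key=value` pairs for spark-submit."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Placeholder values only; never hard-code real credentials.
    example = spark_nlp_aws_conf("KEY_ID", "SECRET", "my-bucket", "us-east-1")
    print(" ".join(as_submit_args(example)))
```

The same dictionary could instead be passed key by key to `SparkSession.builder.config(key, value)` when the session is built programmatically.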
----------------
Bug Fixes & Enhancements
----------------
* Improve loading merges file for RoBERTa tokenizer
* Remove batchSize param from broadcast in XlmRoBertaEmbeddings to be set after it is created
* Preserve previously generated metadata in BertSentenceEmbeddings annotator
* Set `elmo` as a default poolingLayer in ElmoEmbeddings
* Fix special tokens ids in XlmRoBertaEmbeddings annotator
* Fix distilbert_base_token_classifier_ontonotes model
* Fix distilbert_base_token_classifier_conll03 model
* Fix distilbert_base_token_classifier_few_nerd model
* Fix distilbert_token_classifier_persian_ner model
* Fix ner_conll_longformer_base_4096 model
========
3.2.1
========
----------------
Patch release
----------------
* Fix "unsupported model" error in pretrained function for LongformerEmbeddings, BertForTokenClassification, and DistilBertForTokenClassification
========
3.2.0
========
----------------
Major features and improvements
----------------
* **NEW:** Introducing **LongformerEmbeddings** annotator
* **NEW:** Introducing **BertForTokenClassification** annotator
* **NEW:** Introducing **DistilBertForTokenClassification** annotator
* **NEW:** Introducing **GraphExtraction** and **GraphFinisher** annotators.
* **NEW:** Introducing support for multilingual **DateMatcher** and **MultiDateMatcher** annotators. These two annotators will support **English**, **French**, **Italian**, **Spanish**, **German**, and **Portuguese** languages
* **NEW:** Introducing new **Python APIs** and fully documented **Pydoc**
* **NEW:** Introducing new **Spark NLP configurations** via spark.conf() by deprecating `application.conf` usage
* Add support for S3 to `log_folder` Spark NLP config and `outputLogsPath` param in `NerDLApproach`, `ClassifierDLApproach`, `MultiClassifierDLApproach`, and `SentimentDLApproach` annotators
* Added examples to all Spark NLP Scaladoc
* Added examples to all Spark NLP Pydoc
* Welcoming new Databricks runtimes to our Spark NLP family:
* Databricks 8.4 ML & GPU
* Fix the wrong version being returned by sparknlp.version()
========
3.1.3
========
----------------
Bug Fixes & Enhancements
----------------
* Fix serialization issue in NorvigSweetingModel
* Fix the issue with BertSentenceEmbeddings model in TF v2
* Update ArrayType structure to fix Finisher failing to clean up some annotators
========
3.1.2
========
----------------
New Features
----------------
* Migrate XlnetEmbeddings to TensorFlow v2. This allows the importing of HuggingFace XLNet models to Spark NLP
* Migrate XlnetEmbeddings to BatchAnnotate to allow better performance on accelerated hardware such as GPU
* Dynamically extract special tokens from SentencePiece model in XlmRoBertaEmbeddings
* Add setIncludeAllConfidenceScores param in NerDLModel to merge confidence scores per label to only predicted label
* Sync Python params with Scala params in ContextSpellCheckerApproach, WordSegmenterApproach, RegexMatcher, and ViveknSentimentApproach
----------------
Bug Fixes & Enhancements
----------------
* Fix issue with SymmetricDeleteModel
* Fix issue with encoding unknown bytes in RoBertaEmbeddings
* Fix issue with multi-lingual UniversalSentenceEncoder models
----------------
Backward compatibility
----------------
We have migrated XlnetEmbeddings to TensorFlow v2, so models from releases prior to 3.1.2 won't work after this release.
We have already updated the models and uploaded them on Models Hub. You can use `pretrained()` that takes care of it automatically or please make sure you download the new models manually.
========
3.1.1
========
----------------
New Features
----------------
* Migrate AlbertEmbeddings to TensorFlow v2. This allows the importing of HuggingFace ALBERT models to Spark NLP
* Migrate AlbertEmbeddings to BatchAnnotate to allow better performance on accelerated hardware such as GPU
* Enable stdout/stderr in real-time for child processes `sparknlp.start()`. Thanks to PySpark 3.x, this is now possible with `sparknlp.start(real_time_output=True)` to have the outputs of Spark NLP (such as metrics during training) right in your Jupyter, Colab, and Kaggle notebooks.
* Complete examples for all annotators in Scaladoc APIs https://github.com/JohnSnowLabs/spark-nlp/pull/5668
----------------
Bug Fixes & Enhancements
----------------
* Fix YakeModel issue with empty token https://github.com/JohnSnowLabs/spark-nlp/pull/5683 thanks to @shaddoxac
* Fix getAnchorDateMonth method in DateMatcher and MultiDateMatcher https://github.com/JohnSnowLabs/spark-nlp/pull/5693
* Fix the broken PubTator class in Python https://github.com/JohnSnowLabs/spark-nlp/pull/5702
* Fix relative dates in DateMatcher and MultiDateMatcher such as `day after tomorrow` or `day before yesterday` https://github.com/JohnSnowLabs/spark-nlp/pull/5706
* Add isPaddedToken param to PubTator https://github.com/JohnSnowLabs/spark-nlp/pull/5702
* Fix issue with `logger` inside session on some setup https://github.com/JohnSnowLabs/spark-nlp/pull/5715
* Add signatures to TF session to handle inputs/outputs more dynamically in BertEmbeddings, DistilBertEmbeddings, RoBertaEmbeddings, and XlmRoBertaEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/5715
* Fix XlmRoBertaEmbeddings issue with `init_all_tables` https://github.com/JohnSnowLabs/spark-nlp/pull/5715
* Add missing random seed param to ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach https://github.com/JohnSnowLabs/spark-nlp/pull/5697
* Make the Java Exceptions appear before Py4J exceptions for ease of debugging in Python https://github.com/JohnSnowLabs/spark-nlp/pull/5709
* Make sure batchSize set in NerDLModel is the same internally to feed TensorFlow https://github.com/JohnSnowLabs/spark-nlp/pull/5716
----------------
Backward compatibility
----------------
We have migrated AlbertEmbeddings to TensorFlow v2, so models from releases prior to 3.1.1 won't work after this release.
We have already updated the models and uploaded them on Models Hub. You can use `pretrained()` that takes care of it automatically or please make sure you download the new models manually.
========
3.1.0
========
----------------
New Features
----------------
* **NEW:** Introducing DistilBertEmbeddings annotator. DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving over 95% of BERT’s performance
* **NEW:** Introducing RoBertaEmbeddings annotator. RoBERTa (Robustly Optimized BERT-Pretraining Approach) models deliver state-of-the-art performance on NLP/NLU tasks and a sizable performance improvement on the GLUE benchmark. With a score of 88.5, RoBERTa reached the top position on the GLUE leaderboard
* **NEW:** Introducing XlmRoBertaEmbeddings annotator. XLM-RoBERTa (Unsupervised Cross-lingual Representation Learning at Scale) is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data with 100 different languages. It also outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model
* **NEW:** Introducing support for HuggingFace exported models in equivalent Spark NLP annotators. Starting with this release, you can use the `saved_model` feature in HuggingFace within a few lines of code and import any BERT, DistilBERT, RoBERTa, and XLM-RoBERTa models to Spark NLP. We will work on the remaining annotators and extend this support to the rest with each release. For more information please visit [this discussion](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669)
* **NEW:** Migrate MarianTransformer to BatchAnnotate to control the throughput when you are on accelerated hardware such as GPU to fully utilize it
* Upgrade to TensorFlow v2.4.1 with native support for Java to take advantage of many optimizations for CPU/GPU and new features/models introduced in TF v2.x
* Update to CUDA11 and cuDNN 8.0.2 for GPU support
* Implement ModelSignatureManager to automatically detect inputs, outputs, save and restore tensors from SavedModel in TF v2. This allows Spark NLP 3.1.x to extend support for external Encoders such as HuggingFace and TF Hub (coming soon!)
* Implement a new BPE tokenizer for RoBERTa and XLM models. This tokenizer will use the custom tokens from `Tokenizer` or `RegexTokenizer` and generates token pieces, encodes, and decodes the results
* Welcoming new Databricks runtimes to our Spark NLP family:
* Databricks 8.1 ML & GPU
* Databricks 8.2 ML & GPU
* Databricks 8.3 ML & GPU
* Welcoming a new EMR 6.x series to our Spark NLP family:
* EMR 6.3.0 (Apache Spark 3.1.1 / Hadoop 3.2.1)
----------------
Backward compatibility
----------------
* We have updated our MarianTransformer annotator to be compatible with TF v2 models. This change is not compatible with previous models/pipelines. However, we have updated and uploaded all the models and pipelines for `3.1.x` release. You can either use `MarianTransformer.pretrained(MODEL_NAME)` and it will automatically download the compatible model or you can visit [Models Hub](https://nlp.johnsnowlabs.com/models) to download the compatible models for offline use via `MarianTransformer.load(PATH)`
========
3.0.3
========
----------------
New Features
----------------
* Add new functionalities for text generation in T5Transformer
----------------
Bug Fixes
----------------
* Fix ChunkEmbeddings Array out of bounds exception
* Fix pretrained tfhub_use_multi and tfhub_use_multi_lg models in UniversalSentenceEncoder
* Fix anchorDateMonth in Python and case sensitivity in relative dates
========
3.0.2
========
----------------
New Features and Enhancements
----------------
* Experimental support for community models and pipelines https://github.com/JohnSnowLabs/spark-nlp/pull/2743
* Add proper conversions for Scala 2.11/2.12 in ContextSpellChecker to use models from Spark 2.x in Spark 3.x https://github.com/JohnSnowLabs/spark-nlp/pull/2758
* Provide confidence scores for all available tags in NerDLModel and NerCrfModel https://github.com/JohnSnowLabs/spark-nlp/pull/2760
```
# Previously in NerDLModel and NerCrfModel
[[named_entity, 0, 4, B-LOC, [word -> Japan, confidence -> 0.9998], []]
```
```
# In Spark NLP 3.0.2
[[named_entity, 0, 4, B-LOC, [B-LOC -> 0.9998, I-ORG -> 0.0, I-MISC -> 0.0, I-LOC -> 0.0, I-PER -> 0.0, B-MISC -> 0.0, B-ORG -> 1.0E-4, word -> Japan, O -> 0.0, B-PER -> 0.0], []]
```
* Add confidence score to NerConverter metadata https://github.com/JohnSnowLabs/spark-nlp/pull/2784
```
[chunk, 30, 37, john, [entity -> PERSON, sentence -> 0, chunk -> 0, confidence -> 0.44035]
```
* Refactoring SentencePiece encoding in AlbertEmbeddings and XlnetEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/2777
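The two metadata layouts shown above can be consumed with ordinary dictionary handling once the annotations are collected. The sketch below is plain Python and makes a few assumptions: metadata values are represented as strings, the only non-label key in the token metadata is `word` (as in the sample), and the helper names `predicted_label` and `confident_chunks` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def predicted_label(
    metadata: Dict[str, str], ignore: Iterable[str] = ("word",)
) -> Tuple[str, float]:
    """Return the tag with the highest confidence from a token's metadata."""
    skipped = set(ignore)
    scores = {k: float(v) for k, v in metadata.items() if k not in skipped}
    label = max(scores, key=scores.get)
    return label, scores[label]


def confident_chunks(chunks: List[dict], threshold: float) -> List[dict]:
    """Keep only chunks whose metadata confidence reaches the threshold."""
    return [c for c in chunks if float(c["metadata"]["confidence"]) >= threshold]


# Token metadata copied from the 3.0.2 sample output above.
token_metadata = {
    "B-LOC": "0.9998", "I-ORG": "0.0", "I-MISC": "0.0", "I-LOC": "0.0",
    "I-PER": "0.0", "B-MISC": "0.0", "B-ORG": "1.0E-4", "word": "Japan",
    "O": "0.0", "B-PER": "0.0",
}

# Chunk copied from the NerConverter sample output above.
chunks = [
    {
        "result": "john",
        "metadata": {"entity": "PERSON", "sentence": "0", "chunk": "0",
                     "confidence": "0.44035"},
    }
]

if __name__ == "__main__":
    print(predicted_label(token_metadata))
    print(confident_chunks(chunks, threshold=0.5))
```

Filtering on the chunk confidence is one possible use of the new metadata; the threshold of 0.5 is an arbitrary illustration.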
----------------
Bug Fixes
----------------
* Fix an exception in NerConverter when the documents/sentences don't carry the used tokens in NerDLModel https://github.com/JohnSnowLabs/spark-nlp/pull/2784
* Fix an exception in AlbertEmbeddings when the original tokens are longer than the piece tokens https://github.com/JohnSnowLabs/spark-nlp/pull/2777
========
3.0.1
========
----------------
New Features
----------------
* Add minLength and maxLength parameters to Normalizer annotator https://github.com/JohnSnowLabs/spark-nlp/pull/2614
* 1 line to set up [Google Colab](https://github.com/JohnSnowLabs/spark-nlp#google-colab-notebook)
* 1 line to set up [Kaggle Kernel](https://github.com/JohnSnowLabs/spark-nlp#kaggle-kernel)
----------------
Enhancements
----------------
* Adjust shading rule for Amazon AWS to support sub-projects from Spark NLP Fat JAR https://github.com/JohnSnowLabs/spark-nlp/pull/2613
* Fix the missing variables in BertSentenceEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/2615
* Restrict loading Sentencepiece ops only to supported models https://github.com/JohnSnowLabs/spark-nlp/pull/2623
* Improve dependency management and resolvers https://github.com/JohnSnowLabs/spark-nlp/pull/2479
========
3.0.0
========
----------------
New Features
----------------
* Support for Apache Spark and PySpark 3.0.x on Scala 2.12
* Support for Apache Spark and PySpark 3.1.x on Scala 2.12
* Migrate to TensorFlow v2.3.1 with native support for Java to take advantage of many optimizations for CPU/GPU and new features/models introduced in TF v2.x
* Welcoming 9x new Databricks runtimes to our Spark NLP family:
* Databricks 7.3
* Databricks 7.3 ML GPU
* Databricks 7.4
* Databricks 7.4 ML GPU
* Databricks 7.5
* Databricks 7.5 ML GPU
* Databricks 7.6
* Databricks 7.6 ML GPU
* Databricks 8.0
* Databricks 8.0 ML (there is no GPU in 8.0)
* Databricks 8.1 Beta
* Welcoming 2x new EMR 6.x series to our Spark NLP family:
* EMR 6.1.0 (Apache Spark 3.0.0 / Hadoop 3.2.1)
* EMR 6.2.0 (Apache Spark 3.0.1 / Hadoop 3.2.1)
* Starting with Spark NLP 3.0.0, the default packages for CPU and GPU will be based on Apache Spark 3.x and Scala 2.12 (`spark-nlp` and `spark-nlp-gpu` will be compatible only with Apache Spark 3.x and Scala 2.12)
* Starting with Spark NLP 3.0.0, we have two new packages to support Apache Spark 2.4.x and Scala 2.11 (`spark-nlp-spark24` and `spark-nlp-gpu-spark24`)
* Spark NLP 3.0.0 still is and will be compatible with Apache Spark 2.3.x and Scala 2.11 (`spark-nlp-spark23` and `spark-nlp-gpu-spark23`)
* Adding a new param to sparknlp.start() function in Python for Apache Spark 2.4.x (`spark24=True`)
* Adding a new param to adjust Driver memory in sparknlp.start() function (`memory="16G"`)
----------------
Performance Improvements
----------------
Introducing a new batch annotation technique implemented in Spark NLP 3.0.0 for NerDLModel, BertEmbeddings, and BertSentenceEmbeddings annotators to radically improve prediction/inferencing performance.
From now on the `batchSize` for these annotators means the number of rows that can be fed into the models for prediction instead of sentences per row.
You can control the throughput when you are on accelerated hardware such as GPU to fully utilize it.
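To make the new meaning of `batchSize` concrete, the sketch below groups input rows into batches, each of which would be handed to the model in one call. It is plain Python and only illustrates the semantics described above, not how Spark NLP implements batching; the function name `batch_rows` and the sample rows are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batch_rows(rows: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `batch_size` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


if __name__ == "__main__":
    # Each row may hold several sentences; the batch counts rows, not sentences.
    rows = [
        ["First sentence.", "Second sentence."],
        ["A single sentence."],
        ["One.", "Two.", "Three."],
        ["Another row."],
        ["Last row."],
    ]
    for batch in batch_rows(rows, batch_size=2):
        print(len(batch), "rows in this batch")
```

A larger batch size means fewer, bigger calls into the model, which is the knob for keeping a GPU busy; the right value depends on available memory.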
----------------
Breaking changes
----------------
There are only 5 annotators that are not compatible with both Scala 2.11 (Apache Spark 2.3 and Apache Spark 2.4) and Scala 2.12 (Apache Spark 3.x).
You can either train and use them on Apache Spark 2.3.x/2.4.x or train and use them on Apache Spark 3.x. The rest of our models/pipelines can be used on all Apache Spark and Scala major versions.
- TokenizerModel
- PerceptronApproach (POS Tagger)
- WordSegmenter
- DependencyParser
- TypedDependencyParser
========
2.7.5
========
----------------
Bugfixes
----------------
* Fix BigDecimal error in NerDL when includeConfidence is true
----------------
Enhancements
----------------
* Shade Hadoop AWS and AWS Java SDK dependencies
========
2.7.4
========
----------------
Bugfixes
----------------
* Fix Tensors with a 0 dimension issue in ClassifierDL and SentimentDL
* Fix index error in TokenAssembler
* Fix MatchError in DateMatcher and MultiDateMatcher annotators
* Fix setOutputAsArray and its default value for valueSplitSymbol in Finisher annotator
----------------
Enhancements
----------------
* Implement missing frequencyThreshold and ambiguityThreshold params in WordSegmenterApproach annotator
* Downgrade Hadoop from 3.2 to 2.7 which caused an issue with S3
* Update Apache HTTP Client
========
2.7.3
========
---------------
New Features
---------------
* Add anchorDateYear, anchorDateMonth, and anchorDateDay to DateMatcher and MultiDateMatcher to be used for relative dates extraction
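The idea of an anchor date for relative date extraction can be sketched in plain Python. This is an illustration under stated assumptions, not the annotator's implementation: it assumes the anchor simply stands in for "today", the function name `resolve_relative` is hypothetical, and the phrase table is a small hand-picked sample.

```python
from datetime import date, timedelta

# Hypothetical phrase table: offset in days relative to the anchor date.
RELATIVE_OFFSETS = {
    "day before yesterday": -2,
    "yesterday": -1,
    "today": 0,
    "tomorrow": 1,
    "day after tomorrow": 2,
}


def resolve_relative(phrase: str, anchor_year: int, anchor_month: int,
                     anchor_day: int) -> date:
    """Resolve a relative date phrase against an explicit anchor date."""
    key = " ".join(phrase.lower().split())  # normalize case and whitespace
    if key not in RELATIVE_OFFSETS:
        raise KeyError(f"unsupported relative date phrase: {phrase!r}")
    anchor = date(anchor_year, anchor_month, anchor_day)
    return anchor + timedelta(days=RELATIVE_OFFSETS[key])


if __name__ == "__main__":
    print(resolve_relative("Day after tomorrow", 2021, 1, 30))
```

Fixing the anchor makes results reproducible, since the same text resolves to the same date regardless of when the pipeline runs.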
----------------
Bugfixes
----------------
* Fix the default value for action parameter in Python wrapper for DocumentNormalizer annotator
* Fix Lemmatizer pretrained models published in 2021
----------------
Enhancements
----------------
* Improve T5Transformer performance on documents with many sentences
========
2.7.2
========
----------------
Bugfixes
----------------
* Fix causal mask calculations resulting in bad translation in MarianTransformer
* Fix Serialization issue in the cluster while training ContextSpellChecker
* Fix calculating CHUNK spans based on the sentences' boundaries in RegexMatcher
----------------
Enhancements
----------------
* Add GPU support for training ContextSpellChecker
* Adding Scalatest ability to control tests by tags
========
2.7.1
========
----------------
Bugfixes
----------------
* Fix default pretrained model T5Transformer
* Fix default pretrained model WordSegmenter
* Fix missing reference to WordSegmenter in ResourceDownloader
* Fix T5Transformer models crashing due to unknown task
* Fix the issue of saving and reading ClassifierDL, SentimentDL, and MultiClassifierDL models introduced in the 2.7.0 release
----------------
Enhancements
----------------
* Export new T5 models with optimized Encoder/Decoder
* Add support for alternative tagging with the positional parser in RegexTokenizer
* Refactor AssertAnnotations
----------------
Backward compatibility
----------------
* In order to fix the issue of Classifiers in the clusters, we had to export new TF models and change the read/write functions of these annotators. This caused any model trained prior to the 2.7.0 release not to be compatible with 2.7.1 and require retraining including pre-trained models. (we are re-training all the existing text classification models with 2.7.1)
========
2.7.0
========
------------------------------
Major features and improvements
------------------------------
* Introducing MarianTransformer annotator for machine translation based on MarianNMT models. Marian is an efficient, free Neural Machine Translation framework mainly being developed by the Microsoft Translator team (646+ pretrained models & pipelines in 192+ languages)
* Introducing T5Transformer annotator for Text-To-Text Transfer Transformer (Google T5) models to achieve state-of-the-art results on multiple NLP tasks such as Translation, Summarization, Question Answering, Sentence Similarity, and so on
* Introducing brand new and refactored language detection and identification models. The new LanguageDetectorDL is faster, more accurate, and supports up to 375 languages
* Introducing WordSegmenter annotator, a trainable annotator for word segmentation of languages without any rule-based tokenization such as Chinese, Japanese, or Korean
* Introducing DocumentNormalizer annotator for cleaning content from HTML or XML documents, applying either data cleansing using an arbitrary number of custom regular expressions or data extraction following the different parameters
* [Spark NLP Display](https://github.com/JohnSnowLabs/spark-nlp-display) for visualization of different types of annotations
* Add support for new multi-lingual models in UniversalSentenceEncoder annotator
* Add support to Lemmatizer to be trained directly from a DataFrame instead of a text file
* Add training helper to transform CoNLL-U into Spark NLP annotator type columns
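The regex-based cleansing described for DocumentNormalizer above can be pictured with a short plain-Python sketch. It is not the annotator's code: the function name `clean_document`, its parameters, and the tag-stripping pattern are assumptions chosen for illustration.

```python
import re
from typing import Iterable


def clean_document(text: str, patterns: Iterable[str], replacement: str = " ") -> str:
    """Apply each regular expression in turn, then tidy up whitespace."""
    for pattern in patterns:
        text = re.sub(pattern, replacement, text)
    # Collapse the runs of whitespace left behind by the removed matches.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    html = "<div><p>Spark   NLP <b>2.7.0</b> released</p></div>"
    # A simple pattern that matches any markup tag.
    print(clean_document(html, patterns=[r"<[^>]+>"]))
```

A single tag pattern is enough for simple markup; real HTML with scripts, comments, or entities would need additional patterns or a proper parser.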
----------------
Bugfixes and Enhancements
----------------
* Fix all the known issues in ClassifierDL, SentimentDL, and MultiClassifierDL annotators in a Cluster
* NerDL enhancements for memory optimization and logging during the training with the test dataset
* SentenceEmbeddings annotator now reuses the storageRef of any embeddings used prior to it
* Fix dropout in SentenceDetectorDL models for more deterministic results. Both English and Multi-lingual models are retrained for the 2.7.0 release
* Fix Python dataType Annotation
* Upgrade to Apache Spark 2.4.7
========
2.6.5
========
----------------
Bugfixes
----------------
* Fix a bug in batching sentences in BertSentenceEmbeddings
* Fix AttributeError when trying to load a saved EmbeddingsFinisher in Python
----------------
Enhancements
----------------
* Improve exception handling in DocumentAssembler when the user passes a corrupted DataFrame
========
2.6.4
========
----------------
Bugfixes
----------------
* Fix loading from a local folder with no access to the cache folder
* Fix NullPointerException in DocumentAssembler when there are nulls in the rows
* Fix dynamic padding in BertSentenceEmbeddings
========
2.6.3
========
---------------
New Features
---------------
* Add enableMemoryOptimizer to allow training NerDLApproach on a dataset larger than the memory
* Add option to explode sentences in SentenceDetectorDL
----------------
Enhancements
----------------
* Improve POS (AveragedPerceptron) performance
* Improve Norvig Spell Checker performance
----------------
Bugfixes
----------------
* Fix SentenceDetectorDL unsupported model error in pretrained function
* Fix a race condition in Lru that can cause NullPointerException during LightPipeline operations with embeddings
* Fix max sequence length calculation in BertEmbeddings and BertSentenceEmbeddings
* Fix threshold in YakeModel on Python side
========
2.6.2
========
---------------
New Features
---------------
* Introducing a new SentenceDetectorDL
----------------
Enhancements
----------------
* Improved BioBERT models quality for BertEmbeddings (it achieves higher accuracy in sequence classification)
* Improved Sentence BioBERT models quality for BertSentenceEmbeddings (it achieves higher accuracy in text classification)
* Add unit test to MultiClassifierDL annotator
* Better error handling in SentimentDLApproach
* Improve loadSavedModel in BertEmbeddings and BertSentenceEmbeddings
----------------
Bugfixes
----------------
* Fix BERT LaBSE model for BertSentenceEmbeddings
* Fix loadSavedModel for BertSentenceEmbeddings in Python
---------------
Deprecations
---------------
* DeepSentenceDetector is deprecated in favor of SentenceDetectorDL
========
2.6.1
========
----------------
Bugfixes
----------------
* Fix a bug in ClassifierDL that resulted in low accuracy during the training
========
2.6.0
========
------------------------------
Major features and improvements
------------------------------
* **NEW:** A new MultiClassifierDL annotator for multi-label text classification
* **NEW:** A new BertSentenceEmbeddings annotator with 41 available pre-trained models for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
* **NEW:** A new YakeModel annotator for an unsupervised, corpus-independent, domain- and language-independent, single-document keyword extraction algorithm
* Integrate 24 new Small BERT models where the smallest model is 24x smaller and 28x faster compared to BERT base models
* Add 3 new ELECTRA small, base, and large models
* Add 4 new Finnish BERT models for BertEmbeddings and BertSentenceEmbeddings
* Improve BertEmbeddings memory consumption by 30%
* Improve BertEmbeddings performance by more than 70% with new built-in dynamic shape inputs
* Remove the poolingLayer parameter in BertEmbeddings in favor of sequence_output that is provided by TF Hub models for new BERT models
* Add validation loss, validation accuracy, validation F1, and validation True Positive Rate during the training in MultiClassifierDL
* Add parameter to enable/disable list detection in SentenceDetector
* Unify the loggings in ClassifierDL and SentimentDL during training
----------------
Bugfixes
----------------
* Fix Tokenization bug with Bigrams in the exception list
* Fix the versioning error in the second SBT projects that caused models not to be found via the pretrained function
* Fix logging to file in NerDLApproach, ClassifierDL, SentimentDL, and MultiClassifierDL on HDFS
* Fix ignored modified tokens in BertEmbeddings, now it will consider modified tokens instead of originals
========
2.5.5
========
---------------
New Features
---------------
- Add getClasses() function to NerDLModel
- Add getClasses() function to ClassifierDLModel
- Add getClasses() function to SentimentDLModel
---------------------
Enhancements
---------------------
- Improve max sequence length calculation in BertEmbeddings and XlnetEmbeddings
----------------
Bugfixes
----------------
- Fix a bug in RegexTokenizer in Python
- Fix StopWordsCleaner exception in Python when pretrained() is used
- Fix max sequence length issue in AlbertEmbeddings and SentencePiece generation
- Fix HDFS support for setGraphFolder param in NerDLApproach
========
2.5.4
========
---------------
New Features
---------------
* Add support for Apache Spark 2.3.x including new Maven artifacts and full support of all pre-trained models/pipelines
* Add 43 new pre-trained models in 43 languages to StopWordsCleaner annotator
* Introduce a new RegexTokenizer to split text by regex pattern
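As a rough illustration of what splitting text by a regex pattern means, the following plain-Python sketch uses the standard `re` module. It is not the RegexTokenizer implementation, and the function name `split_by_pattern` is hypothetical, chosen only for this example.

```python
import re


def split_by_pattern(text, pattern=r"\s+"):
    """Split text on every match of a regex pattern and drop empty pieces.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    return [piece for piece in re.split(pattern, text) if piece]


# Split on runs of whitespace or commas
tokens = split_by_pattern("Spark NLP,open source", r"[\s,]+")
```

Changing the pattern changes what counts as a token boundary, which is the main point of a regex-driven tokenizer.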
---------------------
Enhancements
---------------------
* Retrained 6 new BioBERT and ClinicalBERT models
* Add a new param to `start()` function to start the session for Apache Spark 2.3.x
----------------
Bugfixes
----------------
* Add missing library for SentencePiece used by AlbertEmbeddings and XlnetEmbeddings on Windows
* Fix ModuleNotFoundError in LanguageDetectorDL pipelines in Python
========
2.5.3
========
---------------
New Features
---------------
* TextMatcher now can construct the chunks from tokens instead of the original documents via buildFromTokens param
* CoNLLGenerator now is accessible in Python
----------------
Bugfixes
----------------
* Fix a bug in ContextSpellChecker resulting in IllegalArgumentException
---------------------
Enhancements
---------------------
* Improve RocksDB connection to support different storage capabilities
* Improve parameters naming convention in ContextSpellChecker
----------------
Documentation
----------------
* Add NerConverter to documentation
* Fix multi-language tabs in documentation
========
2.5.2
========
---------------
New Features
---------------
* Introducing a new LanguageDetectorDL state-of-the-art annotator to detect and identify languages in documents and sentences
* Add a new param entityValue to TextMatcher to add custom value inside metadata. Useful in post-processing when there are multiple TextMatcher annotators with multiple dictionaries https://github.com/JohnSnowLabs/spark-nlp/issues/920
----------------
Bugfixes
----------------
* Add missing TensorFlow graphs to train ContextSpellChecker annotator https://github.com/JohnSnowLabs/spark-nlp/issues/912
* Fix the misspelled classThreshold param in ContextSpellChecker annotator https://github.com/JohnSnowLabs/spark-nlp/issues/911
* Fix a bug where setGraphFolder in NerDLApproach annotator couldn't find a graph on Databricks (DBFS) https://github.com/JohnSnowLabs/spark-nlp/issues/739
* Fix a bug in NerDLApproach when includeConfidence was set to true https://github.com/JohnSnowLabs/spark-nlp/issues/917
* Fix a bug in BertEmbeddings https://github.com/JohnSnowLabs/spark-nlp/issues/906 https://github.com/JohnSnowLabs/spark-nlp/issues/918
---------------------
Enhancements
---------------------
* Improve TF backend in ContextSpellChecker annotator
========
2.5.1
========
---------------
New Features
---------------
* Add Python support for PubTator reader to convert automatic annotations of the biomedical datasets into DataFrame
* Add 6 new pre-trained BERT models from BioBERT and ClinicalBERT
---------------------
Enhancements
---------------------
* Add unit tests for XlnetEmbeddings
* Add unit tests for AlbertEmbeddings
* Add unit tests for ContextSpellChecker
========
2.5.0
========
---------------
New Features
---------------
* A new AlbertEmbeddings annotator with 4 available pre-trained models
* A new XlnetEmbeddings annotator with 2 available pre-trained models
* A new ContextSpellChecker annotator, the state-of-the-art annotator for spell checking
* A new SentimentDL annotator for multi-class sentiment analysis. This annotator comes with 2 available pre-trained models trained on IMDB and Twitter datasets
* Add new PubTator reader to convert automatic annotations of the biomedical datasets into DataFrame
* Introducing a new outputLogsPath param for NerDLApproach, ClassifierDLApproach and SentimentDLApproach annotators
* Refactored CoNLLGenerator to actually use NER labels from the DataFrame
* Unified params in NerDLModel in both Scala and Python
* Extend and complete Scaladoc APIs for all the annotators
----------------
Bugfixes
----------------
* Fix position of tokens in Normalizer
* Fix Lemmatizer exception on a bad input
* Fix annotator logs failing on object storage file systems like DBFS
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.5.x
* Update the entire [spark-nlp-workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop) notebooks for Spark NLP 2.5.x
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.5
========
---------------
Overview
---------------
We are very excited to extend Spark NLP support to 6 new Databricks runtimes and add support for Cloudera and EMR YARN cluster-mode.
As always, we thank our community for their feedback and questions in our Slack channel.
---------------
New Features
---------------
* Extend Spark NLP support for Databricks runtimes:
  * 6.2
  * 6.2 ML
  * 6.3
  * 6.3 ML
  * 6.4
  * 6.4 ML
  * 6.5
  * 6.5 ML
* Add support for cluster-mode in Cloudera and EMR YARN clusters
* New splitPattern param in Tokenizer to split tokens by regex rules
----------------
Bugfixes
----------------
* Fix ClassifierDLModel save and load in Python
* Fix ClassifierDL TensorFlow session reuse
* Fix Normalizer positions of new tokens
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.4.x
* Update the entire [spark-nlp-workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop) notebooks for Spark NLP 2.4.x
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.4
========
---------------
Overview
---------------
* We are very excited to release the very first multi-class text classifier in Spark NLP v2.4.4! We have built a generic ClassifierDL annotator that uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 50 classes.
* We are also happy to announce the support of yet another language: Russian! We have trained and prepared 5 pre-trained models and 6 pre-trained pipelines in Russian.
**NOTE**: ClassifierDL is an experimental feature in 2.4.4 release. We have worked hard to aim for simplicity and we are looking forward to your feedback as always.
---------------
New Features
---------------
* Introducing `ClassifierDL`, an experimental multi-class text classifier built on a DNN model in TensorFlow. This annotator can be trained on any dataset with 2 to 50 classes.
* 5 new pretrained Russian models (Lemma, POS, 3x NER)
* 6 new pretrained Russian pipelines
---------------
Enhancements
---------------
* Add param to NerConverter to override modified tokens instead of original tokens
----------------
Bugfixes
----------------
* Fix TokenAssembler
* Fix NerConverter exception when NerDL is trained with different tagging style than IOB/IOB2
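For readers unfamiliar with the IOB/IOB2 tagging styles mentioned above, the following plain-Python sketch shows how tags of the form `B-X`, `I-X`, and `O` can be grouped into labelled chunks. It is only an illustration under simplifying assumptions, not the NerConverter implementation; the function name `iob_to_chunks` is hypothetical, and tags from other schemes are not handled.

```python
def iob_to_chunks(tokens, tags):
    """Group tokens tagged B-X / I-X into (label, text) chunks.

    Hypothetical helper for illustration only; handles B-, I-, and O tags.
    """
    chunks = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # A chunk starts on B-X, or on I-X whose label differs from the open chunk
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix == "O" or starts_new:
            # Close the chunk that was open, if any
            if current_tokens:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
        if prefix in ("B", "I"):
            current_label = label
            current_tokens.append(token)
    # Close a chunk that runs to the end of the sentence
    if current_tokens:
        chunks.append((current_label, " ".join(current_tokens)))
    return chunks


chunks = iob_to_chunks(
    ["John", "Smith", "visited", "New", "York"],
    ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"],
)
```

A model trained with a different tagging scheme emits prefixes this grouping logic does not expect, which is the kind of mismatch a converter has to guard against.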
========
2.4.3
========
---------------
Overview
---------------
This minor release fixes a bug on our Python side that was introduced in 2.4.2 release.
As always, we thank our community for their feedback and questions in our Slack channel.
----------------
Bugfixes
----------------
* Fix Python imports which resulted in AttributeError: module 'sparknlp' has no attribute
========
2.4.2
========
---------------
Overview
---------------
This minor release fixes a few bugs in some of our annotators reported by our community.
As always, we thank our community for their feedback and questions in our Slack channel.
----------------
Bugfixes
----------------
* Fix UniversalSentenceEncoder.pretrained() that failed in Python
* Fix ElmoEmbeddings.pretrained() that failed in Python
* Fix ElmoEmbeddings poolingLayer param to be a string as expected
* Fix ChunkEmbeddings to preserve chunk's index
* Fix NGramGenerator and missing chunk metadata
---------------
New Features
---------------
* Add GPU support param in Spark NLP start function: sparknlp.start(gpu=True)
* Improve create_model.py to create custom TF graph for NerDLApproach
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.4.x
* Update the entire [spark-nlp-workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop) notebooks for Spark NLP 2.4.x
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.1
========
---------------
Overview
---------------
This minor release fixes a few bugs in some of our annotators reported by our community.
As always, we thank our community for their feedback and questions in our Slack channel.
----------------
Bugfixes
----------------
* Improve ChunkEmbeddings annotator and fix the empty chunk result
* Fix UniversalSentenceEncoder crashing on empty Tensor
* Fix NorvigSweetingModel missing sentenceId that results in NGramsGenerator crashing
* Fix missing storageRef in embeddings' column for ElmoEmbeddings annotator
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.4.x
* Add new features such as ElmoEmbeddings and UniversalSentenceEncoder
* Add multiple programming languages for demos and examples
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.0
========
---------------
Overview
---------------
We are very excited to finally release Spark NLP v2.4.0! This has been one of the largest releases we have ever made since the inception of the library!
The new release of Spark NLP `2.4.0` has been migrated to TensorFlow `1.15.0` which takes advantage of the latest deep learning technologies and pre-trained models.
As always, thanks to the community for the feedback and questions in our Slack channel.
Please be aware that this release breaks backward compatibility with previously saved models, particularly on TensorFlow and Embeddings, aside from code-breaking changes in the API.
We will be working on our documentation to ease the learning curve.
---------------
New Features
---------------
* TensorFlow 1.15.0 now works behind Spark NLP. This brings implicit improvements in performance, accuracy and functionalities
* New Annotator UniversalSentenceEncoder with 2 pre-trained models from TF Hub. Check our spark-nlp-models repo for updates
* New Annotator MultiDateMatcher capable of matching more than one date per sentence (Extends DateMatcher algorithm)
* New Annotator NGramGenerator with Param tweaks for customization
* New Annotator BigTextMatcher works best with large amounts of input data
* New Annotator ElmoEmbeddings with a pre-trained model from TF Hub. Check our spark-nlp-models repo for updates
* BertEmbeddings improvements with 5 new models from TF Hub
* RecursivePipelineModel as an enhanced PipelineModel allows Annotators to access previous annotators in the pipeline for more ML strategies
* LazyAnnotators: A new Param in Annotators allows them to stand idle in the Pipeline and do nothing. Can be called by other Annotators in a RecursivePipeline
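To make the n-gram generation mentioned in the list above concrete, here is a minimal plain-Python sketch of producing n-grams from a token list. It is not the NGramGenerator implementation; the function name `ngrams` and the choice of a space as the joining character are assumptions made for this example.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a space.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # A list shorter than n yields no n-grams
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams(["spark", "nlp", "is", "fast"], n=2)
```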
---------------
Enhancements
---------------
* RocksDB is now available as a flexible API called `Storage`. It allows any annotator to have its own distributed local index database
* Our TensorFlow pre-trained models are now cross-platform, enabling multi-language models and other improvements for Windows users.
* Improved IO performance in general for handling embeddings
* Improved cache cleanup and GC by liberating open files utilized in RocksDB (to be improved further)
* Tokenizer and SentenceDetector Params minLength and maxLength to filter out annotations outside these bounds
* Tokenizer improvements in splitChars and simplified rules
* DateMatcher improvements
* TextMatcher improvements: preload algorithm information within the model for faster prediction
* Annotators that utilize embeddings now have strict validation to ensure they use exactly the embeddings they were trained with
* Improvements in the API allow Annotators with Storage to save and load their RocksDB database independently and let it be shared across Annotators
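The minimum and maximum length bounds for Tokenizer and SentenceDetector mentioned in the list above can be pictured with the following plain-Python sketch. It assumes inclusive bounds and measures length in characters; both are assumptions for illustration, and `filter_by_length` is a hypothetical name, not a Spark NLP function.

```python
def filter_by_length(items, min_length=0, max_length=None):
    """Keep strings whose character length lies within [min_length, max_length].

    Hypothetical helper for illustration only; bounds are assumed inclusive,
    and max_length=None means no upper bound.
    """
    return [
        item for item in items
        if len(item) >= min_length
        and (max_length is None or len(item) <= max_length)
    ]


kept = filter_by_length(["a", "to", "spark", "tokenization"], min_length=2, max_length=5)
```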
----------------
Bugfixes
----------------
* Fixes in Chunk and SentenceEmbeddings to better deal with empty cleaned-up Annotations
* Fixed PretrainedPipeline in Python to allow accessing the inner PipelineModel in the instance
* Probably a bunch of uncommented bugfixes along the way :)
========
2.3.6
========
---------------
Overview
---------------
This minor release fixes a bug in ChunkEmbeddings causing an out-of-bounds exception in some scenarios. We
also switched to Maven coordinates as the default source for the start() function since spark-packages has not been responsive
on their package approval process. Thank you all for your consistent feedback.
---------------
Bugfixes
---------------
* Fixed a bug in ChunkEmbeddings that caused an out-of-bounds exception in some scenarios
---------------
Other
---------------
* start() function switched to use Maven coordinates instead
========
2.3.5
========
---------------
Overview
---------------
We would like to thank you all for your valuable feedback via our Slack channels and our GitHub repositories.
Spark NLP `2.3.4` is a very stable and rock-solid release. However, we wanted to fix the few remaining minor bugs before moving to our bigger release `2.4.0`!
---------------
Bugfixes
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/702 DateMatcher fixes for flexible dates
* https://github.com/JohnSnowLabs/spark-nlp/pull/718 Fixed a bug in a pragmatic sentence detector where a sub matched group contained a dollar sign.
* https://github.com/JohnSnowLabs/spark-nlp/pull/719 Move import to top-level to avoid import fail in Spark NLP functions
* https://github.com/JohnSnowLabs/spark-nlp/pull/709 https://github.com/JohnSnowLabs/spark-nlp/pull/716 Some improvements in our documentation thanks to @marcinic @howmuchcomputer
========
2.3.4
========
---------------
Overview
---------------
Thank you, as always, for the feedback given on Slack and in our repos. The most important part of this release
is how we internally started organizing models. We'll be publishing our model news at
https://github.com/JohnSnowLabs/spark-nlp-models . The models repo will be kept up to date.
As for this release, it improves various internal API functionalities, allowing for positive side-effects across
the library. As an important enhancement, we have added user UDFs and functions for both Scala and Python users
to be able to easily manipulate annotations on DataFrames. Finally, we have fixed various bugs in embeddings
metadata to make sure we provide accurate offsetting information for other annotators to consume it successfully.
---------------
Enhancements
---------------
* Revamped functions in Scala and Python to help users deal with annotations from DataFrames or in UDF form, such as `map_annotations` and `filter_by_annotations`
---------------
Bugfixes
---------------
* Fixed bugs in ChunkEmbeddings and SentenceEmbeddings causing them to report wrong metadata and offset values
* Fixed a nested import issue in Python causing LightPipelines not to work in some environments
---------------
Developer API
---------------
* downloadModel is now flexible as to which inner downloader class is being used to access AnnotatorModel reference
* pretrained API now deals with defaultModelName as an Option to allow non default pretrained models