Releases · salesforce/TransmogrifAI

11 Jun 23:58

nicodv

0.7.0

036d1fc

0.7.0 Latest

Bug fixes:

Fix flaky ModelInsight tests #407
Remove logging of tokens of text fields #420, #438, #447, #474
Add validation prepare call before model selection when no DAG is passed #424, #429
Fix Days.daysBetween int overflow #471
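
The notes do not say what caused the Days.daysBetween overflow. As a generic illustration only (not the actual fix in #471), the sketch below shows how a day count goes wrong when a millisecond span is narrowed to Int, and how staying in Long avoids it. The object and method names are hypothetical.

```scala
object DayDiffSketch {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Problematic variant (hypothetical): the millisecond span is narrowed to Int,
  // which wraps around once the span exceeds roughly 24.8 days.
  def daysBetweenNarrowed(startMillis: Long, endMillis: Long): Int =
    ((endMillis - startMillis).toInt / MillisPerDay).toInt

  // Safer variant: keep all arithmetic in Long.
  def daysBetweenLong(startMillis: Long, endMillis: Long): Long =
    (endMillis - startMillis) / MillisPerDay
}

val thirtyDays = 30L * DayDiffSketch.MillisPerDay
println(DayDiffSketch.daysBetweenLong(0L, thirtyDays))     // 30
println(DayDiffSketch.daysBetweenNarrowed(0L, thirtyDays)) // a wrong, negative value
```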

New features / updates:

Downsample the number of training samples to maxTrainingSample for regression #413 and multi-class classification #414
Refactor InsightLOCOTest #412
Enable more loss types for OpLinearRegression #421
Add property-based tests for regression model selection #427
Add option to calculate LOCO for dates/texts by leaving out their entire vector #418
Add Chinese and Korean examples to TextTokenizerTest #442
Add support for ignoring text that looks like IDs in SmartTextVectorizer #448, #455
Add a unary estimator for detecting names in text fields and transforming to likely gender #445
Allow result features to be removed by raw feature filter #458
Metadata changes for sensitive feature information #457
Add MinVarianceFilter which checks that computed features have a minimum variance #463, #465
Allow TextStats length distribution to be token-based and refactor for testability #464
Use Spark job grouping to distinguish steps of the machine learning flow #467, #468, #470
Add categorical detection to be coverage based in addition to unique count based #473
Remove duplicate features using sanity checker feature to feature correlations #476, #479
Lift the upper bound on number of hash features #477
Enable Html stripping on text-like features #478
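
To make the MinVarianceFilter item above more concrete, here is a minimal sketch of the general idea of a minimum-variance check: compute the variance of each computed column and keep only the columns at or above a threshold. It is plain Scala, not the TransmogrifAI stage. The names `MinVarianceSketch`, `variance` and `keep`, and the use of population variance, are assumptions made for illustration.

```scala
object MinVarianceSketch {
  /** Population variance of a column; 0.0 for an empty column. */
  def variance(xs: Seq[Double]): Double =
    if (xs.isEmpty) 0.0
    else {
      val mean = xs.sum / xs.size
      xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    }

  /** Names of the columns whose variance is at least `minVariance`. */
  def keep(columns: Map[String, Seq[Double]], minVariance: Double): Set[String] =
    columns.collect { case (name, values) if variance(values) >= minVariance => name }.toSet
}

val columns = Map(
  "constant" -> Seq(1.0, 1.0, 1.0), // zero variance, carries no signal
  "varied"   -> Seq(0.0, 1.0, 2.0)
)
println(MinVarianceSketch.keep(columns, minVariance = 1e-5)) // Set(varied)
```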

Dependency updates (#402, #466):

Update Apache Spark version to 2.4.5
Avro is a built-in data source in Spark 2.4, so no longer using the spark-avro package
Avro to 1.8.2
XGBoost to 0.90
MLeap to 0.14.0
json4s to 3.5.3
JUnit to 4.12
chill to 0.9.3
gradle-avro-plugin to 0.16.0

Miscellaneous:

Add ROADMAP.md #394

12 Sep 00:30

gerashegalov

0.6.1

f4b6af3

0.6.1

Bug fixes:

Ensure correct metrics despite model failures on some CV folds #404
Fix flaky ModelInsight tests #395
Avoid creating SparseVectors for LOCO #377

New features / updates:

Model combiner #385
Added new sample for HousingPrices #365
Test to verify that custom metrics appear in model insight metrics #387
Add FeatureDistribution to SerializationFormats #383
Add metadata to OpScalarStandardScaler to allow for descaling #378
Improve json serde error in evalMetFromJson #380
Track mean & standard deviation as metrics for numeric features and for text length of text features #354
Making model selectors robust to failing models #372
Use compact and compressed model json by default #375
Descale feature contribution for Linear Regression & Logistic Regression #345
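
Several items above concern descaling. The sketch below illustrates why storing the scaling parameters alongside a scaled feature makes descaling possible: with the mean and standard deviation at hand, the transform can be inverted exactly. This is a standalone illustration in plain Scala. `ScalerArgs` and the function names are hypothetical and are not the library's metadata format.

```scala
object DescaleSketch {
  /** Scaling parameters that would be kept as metadata (hypothetical shape). */
  final case class ScalerArgs(mean: Double, std: Double)

  /** Estimates mean and population standard deviation; falls back to std = 1.0 for constant input. */
  def fit(xs: Seq[Double]): ScalerArgs = {
    val mean = xs.sum / xs.size
    val std = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
    ScalerArgs(mean, if (std == 0.0) 1.0 else std)
  }

  def scale(x: Double, args: ScalerArgs): Double = (x - args.mean) / args.std

  /** Inverse of `scale`, only possible because `args` was retained. */
  def descale(z: Double, args: ScalerArgs): Double = z * args.std + args.mean
}

val args = DescaleSketch.fit(Seq(2.0, 4.0, 6.0))
val scaled = DescaleSketch.scale(6.0, args)
println(DescaleSketch.descale(scaled, args)) // approximately 6.0
```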

Dependency updates:

Update tika version #382

12 Jul 22:57

michaelweilsalesforce

0.6.0

028bf81

0.6.0

Bug fixes:

Quick Fix Alias Type Names #346
Forecast Evaluator - fixes SMAPE, adds MASE and Seasonal Error metrics #342
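
For readers unfamiliar with the forecast metrics named above, the following sketch implements one common textbook definition of each. Definitions of SMAPE vary (this one ranges over [0, 2]), and MASE is normally scaled by the seasonal error of the training series; here the same series is reused for brevity. None of this is taken from the Forecast Evaluator source, and all names are illustrative.

```scala
object ForecastMetricsSketch {
  /** SMAPE: mean of 2|y - f| / (|y| + |f|), treating 0/0 terms as 0. */
  def smape(actual: Seq[Double], forecast: Seq[Double]): Double = {
    require(actual.size == forecast.size && actual.nonEmpty, "series must be non-empty and aligned")
    val terms = actual.zip(forecast).map { case (y, f) =>
      val denom = math.abs(y) + math.abs(f)
      if (denom == 0.0) 0.0 else 2.0 * math.abs(y - f) / denom
    }
    terms.sum / terms.size
  }

  /** Seasonal error: mean |y(t) - y(t - period)|, the error of a seasonal naive forecast. */
  def seasonalError(actual: Seq[Double], period: Int): Double = {
    require(period > 0 && actual.size > period, "need more points than the seasonal period")
    val diffs = actual.drop(period).zip(actual).map { case (later, earlier) => math.abs(later - earlier) }
    diffs.sum / diffs.size
  }

  /** MASE: mean absolute error divided by the seasonal error (not finite if that error is 0). */
  def mase(actual: Seq[Double], forecast: Seq[Double], period: Int): Double = {
    require(actual.size == forecast.size, "series must be aligned")
    val mae = actual.zip(forecast).map { case (y, f) => math.abs(y - f) }.sum / actual.size
    mae / seasonalError(actual, period)
  }
}
```

A MASE below 1 means the forecast beats the seasonal naive baseline on average; a value above 1 means it does worse.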

New features / updates:

Aggregate LOCOs of DateToUnitCircleTransformer #349
Convert lambda functions into concrete classes to allow compatibility with Scala 2.12 #357
Replace mapValues with immutable Map where applicable #363
Aggregate spark metrics during run time instead of post processing by default #358
Allow customizing serialization for FeatureGenerator extract function #352
Update helloworld examples to be simple #351
Adding key ctor field in all RawFeatureFilter results #348
Forecast evaluator + SMAPE metric #337
Local scoring for model with features of all types #340
Remove local runner + update docs #335
Added missing test for java conversions #334
Get rid of scalaj-collections #333
Workflow independent model loading #274
Aggregated LOCOs of SmartTextVectorizer outputs #308
Added community projects docs section #326
Add FeatureBuilder.fromSchema #325
Improve WeekOfMonth in date transformers #323
Improved datetime unit transformer shortcuts - Part 2 #319
Correctly pass main class for CLI sub project #321
Serialize blacklisted map keys with the model + updated access on workflow/model members #320
Improved datetime unit transformer shortcuts #316
Improved OpScalarStandardScalerTest #317
Improved PercentileCalibratorTest #318
Added concrete wrappers for HashingTF, NGram and StopWordsRemover #314
Avoid singleton random generators #312
Remove free function aggregation with feature builders #311
Added util methods to create class/object by name + retrieve type tag by type name #310
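
As background for the item on converting lambda functions into concrete classes: Scala 2.12 compiles lambdas through the JVM's lambda machinery, so the resulting classes are synthetic and their names are not something code can rely on, whereas a named class has a stable, addressable identity. The sketch below contrasts the two forms. It only illustrates the pattern; the class name is hypothetical and the exact motivation in #357 is not spelled out in these notes.

```scala
object ConcreteFunctionSketch {
  // Anonymous lambda: convenient, but its runtime class is synthetic.
  val toLowerLambda: String => String = _.toLowerCase

  // Concrete function class: a stable name and an explicit Serializable marker.
  class ToLowerCase extends (String => String) with Serializable {
    def apply(s: String): String = s.toLowerCase
  }
}

val fn = new ConcreteFunctionSketch.ToLowerCase
println(fn("TransmogrifAI")) // transmogrifai
```

Both forms behave identically when applied; the difference only matters to code that needs to refer to the function by class.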

Dependency updates:

Bump shadowjar plugin to 5.0.0 #306
Bump Apache Tika to 1.21 #331
Enable CircleCI version 2.1 #353

08 May 21:16

Jauntbox

0.5.3

8d2e819

0.5.3

Bug fixes:

Threshold metrics calculation fix when unseen labels are present #293
DataCutter-related fixes for multiclass #263
Fixed onSetInput so is always called with new input #280

New features / updates:

Improved test SmartTextMapVectorizerTest #296
Add check to raw feature filter for removing all features #303
Spec-ifying ngram similarity tests #299
Add random test feature generator to generate datasets with features of all types #298
Spec-ifying NGramTest #297
Added base spec for testing Spark wrapping transformers #295
Add/upgrade string indexing tests #294
Improved multi pick list map vectorizer test #292
Improvements of Vectorizer tests #291
Updated TextMapPivotVectorizerTest to use OpEstimatorSpec #290
Update TextTokenizerTest to use OpTransformerSpec #289
Add test for RealNNVectorizer #288
Improved OPCollectionHashingVectorizerTest test #286
Created new tests for OpCollection #285
Update names of transformer tests and files to match class names #284
Improved test by extending OpTransformerSpec #283
Skip writing empty stages & skip loading stages without uid-s #282
Skip serializing estimators + fix test + added empty data transform test #281

Dependency updates:

N/A

11 Apr 03:05

tovbinm

0.5.2

b040762

0.5.2

Bug fixes:

Fixed local scoring with multipicklist features #243
Fixed error messages in DataCutter and DataBalancer #256
Fixed bug in model selector fit method #251
Fixed some Transmogrifier defaults to be modifiable / exposed #232
Fixed bug in OpXGBoostClassificationModel #229
Minor fixes / cleanup on notebooks, Helloworld examples, and developer guide #226, #230, #240, #259

New features / updates:

Added transformer classes for common math operations #255, #257
Added string transformers for substring search and valid email #265
Added scaler and descaler transformers #223
Added Raw Feature Filter results e.g., metrics, exclusion reasons to serialization and to ModelInsights #237, #252, #258, #276
Changed OpBinScoreEvaluator to allow for lift analysis #233
Added random param builder for random hyperparameter search in model selectors #238
Added option to return the top K positives and top K negatives for LOCO #264
Added a max cardinality percentage that can be set for pivot #241
Added minimum rows for scoring set in RawFeatureFilter #250
Allowed copying model instances across multiple threads #270
Added stub to allow loading models without workflow #269, #272
Made decision tree numeric bucketizer tests less flaky #225
Added Jupyter notebooks for samples #231
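
The random hyperparameter search item above can be pictured with a small standalone sketch: instead of enumerating a fixed grid, candidate values are sampled from a range. The example draws regularization values log-uniformly, a common choice for parameters that span orders of magnitude. It does not use the library's random param builder; `RandomSearchSketch` and `logUniform` are hypothetical names.

```scala
import scala.util.Random

object RandomSearchSketch {
  /** Draws `n` values log-uniformly from [min, max] using the supplied generator. */
  def logUniform(min: Double, max: Double, n: Int, rng: Random): Seq[Double] = {
    require(min > 0 && max >= min, "bounds must satisfy 0 < min <= max")
    val (lo, hi) = (math.log(min), math.log(max))
    Seq.fill(n)(math.exp(lo + rng.nextDouble() * (hi - lo)))
  }
}

// A fixed seed keeps the sampled candidates reproducible between runs.
val candidates = RandomSearchSketch.logUniform(1e-4, 1.0, n = 5, rng = new Random(42L))
candidates.foreach(println)
```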

Dependency updates:

Switched to MLeap runtime from Aardpfark for local scoring #249, #261

09 Feb 04:42

Jauntbox

0.5.1

26bacc9

0.5.1

Bug fixes:

Fix indices in LOCO for record-level insights and add more robust tests #216
Fix sorting in Prediction type for multiclass classification and add stronger tests #213
Fixing code generation bug with underscores in names #208
Correct some syntax/compilation errors in Titanic Binary Classification Docs Example #202

New features / updates:

Make some tests a little less flaky #221
Integrate helloworld project with Travis CI #210, #212
Use ParamGridBuilder in model selector grids to allow modifications #206
Use class.getName & update splitter meta parsing #204
Export model selector defaults + metadata fixes #199
Use OS specific path separator #193
Add transformer / estimator for text length calculation and options for using this as default behavior #190, #195
Allow conversion from Date and Timestamp Spark types to Date and DateTime TransmogrifAI types #188

Dependency updates:

Upgrade to Gradle 5.2 #218
Upgrade shadowjar plugin to 4.0.4 #220

22 Nov 21:31

tovbinm

0.5.0

078c8a0

0.5.0

New features and bug fixes:

XGBoost classification & regression models - EXPERIMENTAL #44
Add default param grid for xgboost #175
Fix ModelInsights for xgboost #170
Added Parquet reader #169
Added aggregate & conditional readers for Parquet #172
Evaluators check for empty data #178
Refactored splitter tests #176
Return scoring feature distributions from RawFeatureFilter #171
Using MapReduce Api for Avro Read Write #150
Improve test coverage for VectorsCombiner and make vector aggregator efficient #168
Time based aggregators #167
Ignore null values in meta + support floats #166
CLI command name fix + bump shadow plugin version + cleanup #164
Fix build.sbt example in readme #165
Removed an old test I added to check if Spark ran out of memory when calculating a correlation matrix (this is unnecessary and unhelpful) #160
Replace assert with require #159
Streaming histogram implementation #152
Added test and removed dead code for Sanity Checker dealing with map with same key #153
Fixed model insights exception when features are excluded from sanity checker correlation calculations #147
Added logging of response distribution to RFF #146
Use proper test ranges in feature converter test #143
Added support for DateType and TimestampType primitive spark types #135
Standardizing timezone to UTC #138
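
On the "Replace assert with require" item: in Scala, `assert` is marked elidable and can be compiled out (for example with `-Xdisable-assertions`), and it signals failure with an `AssertionError`. `require` is never elided and throws an `IllegalArgumentException`, which makes it the conventional choice for validating arguments. The sketch below shows the difference; the function names are illustrative only.

```scala
import scala.util.Try

object RequireSketch {
  // May be removed entirely when assertions are disabled at compile time.
  def withAssert(n: Int): Int = { assert(n > 0, "n must be positive"); n }

  // Always checked; failure surfaces as IllegalArgumentException.
  def withRequire(n: Int): Int = { require(n > 0, "n must be positive"); n }
}

println(Try(RequireSketch.withRequire(0)).failed.get.getMessage)
// requirement failed: n must be positive
```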

Dependency upgrades & misc:

XGBoost 0.81 #180
Spark 2.3.2 #44
Gradle 4.10.2 #142
Use OpenJDK8 for CircleCI builds + refactor build config #140

23 Sep 06:35

tovbinm

0.4.0

62aed6e

0.4.0

New features and bug fixes:

Allow specifying the formula used to compute the text features bin size for RawFeatureFilter (see RawFeatureFilter.textBinsFormula argument) #99
Fixed metadata on Geolocation and GeolocationMap so that they keep the name of the column in descriptorValue #100
Local scoring (aka Sparkless) using Aardpfark. This enables loading and scoring models without Spark context but locally using Aardpfark (PFA for Spark) and Hadrian libraries instead. This allows orders of magnitude faster scoring times compared to Spark. #41
Add distributions calculated in RawFeatureFilter to ModelInsights #103
Added binary sequence transformer & estimator: BinarySequenceTransformer and BinarySequenceEstimator, plus the associated base traits #84
Added StringIndexerHandleInvalid.Keep option into OpStringIndexer (same as in underlying Spark estimator) #93
Allow numbers and underscores in feature names #92
Stable key order for map vectorizers #88
Keep raw feature distributions calculated in raw feature filter #76
Transmogrify to use smart text vectorizer for text types: Text, TextArea, TextMap and TextAreaMap #63
Transmogrify circular date representations for date feature types: Date, DateTime, DateMap and DateTimeMap #100
Improved test coverage for utils and other modules #50, #53, #67, #69, #70, #71, #72, #73
Match feature type map hierarchy with regular feature types #49
Redundant and deadlock-prone end listener removal #52
OS-neutral filesystem path creation #51
Make Feature class public and hide its ctor instead #45
Specify categorical variables in metadata #120
Fix fill geo location vectorizer values #132
Adding feature importance for new model types #128
Adding binaryclassification bin score evaluator #119
Apply DateToUnitCircleTransformer logic in raw feature filter transformations #130
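
The circular date representation mentioned in this list can be sketched as follows: a value within a repeating cycle (such as the hour of the day) is mapped to a point on the unit circle, so that the end of one cycle sits next to the start of the next. This is a generic illustration in plain Scala. The (sin, cos) ordering and all names are assumptions, not the behavior of DateToUnitCircleTransformer.

```scala
object UnitCircleSketch {
  /** Maps `value` within a cycle of length `period` to (sin, cos) on the unit circle. */
  def toUnitCircle(value: Double, period: Double): (Double, Double) = {
    val angle = 2.0 * math.Pi * value / period
    (math.sin(angle), math.cos(angle))
  }

  /** Euclidean distance between two encoded points. */
  def distance(a: (Double, Double), b: (Double, Double)): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)
}

val hour0  = UnitCircleSketch.toUnitCircle(0.0, 24.0)
val hour12 = UnitCircleSketch.toUnitCircle(12.0, 24.0)
val hour23 = UnitCircleSketch.toUnitCircle(23.0, 24.0)

// 23:00 and 00:00 are neighbours on the circle even though 23 and 0 are far apart as numbers.
println(UnitCircleSketch.distance(hour23, hour0) < UnitCircleSketch.distance(hour12, hour0)) // true
```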

Breaking changes:

Made case class to deal with model selector metadata #39
Made FileOutputCommitter the default and got rid of DirectMapreduceOutputCommitter and DirectOutputCommitter #86
Refactored OpVectorColumnMetadata to allow numeric column descriptors #89
Renaming JaccardDistance to JaccardSimilarity #80
New model selector interface #55. The breaking changes concern the return type and the way parameters are passed into model selectors. Starting with this version, model selectors return a single result feature of type Prediction (instead of a variable number of features: (pred, raw, prob)). Example:

val (pred, raw, prob) = MultiClassificationModelSelector() // won't compile anymore
val prediction = MultiClassificationModelSelector() // ok!

Another change is the way parameters are passed into model selectors. Example:

BinaryClassificationModelSelector
  .withCrossValidation()
  .setLogisticRegressionRegParam(0.05, 0.1) // won't compile anymore

Instead one should do:

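import com.salesforce.op.stages.impl.classification.{BinaryClassificationModelSelector, OpLogisticRegression}
import org.apache.spark.ml.tuning.ParamGridBuilder

// Pair each model with its own hyperparameter grid
val lr = new OpLogisticRegression()
val models = Seq(lr -> new ParamGridBuilder().addGrid(lr.regParam, Array(0.05, 0.1)).build())
BinaryClassificationModelSelector
  .withCrossValidation(modelsAndParameters = models)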

For more examples of how to use the new model selectors, please refer to our documentation and helloworld examples.

Dependency upgrades & misc:

CI/CD runtime improvements for CircleCI and TravisCI
Updated Gradle to 4.10
Updated scala-graph to 1.12.5
Updated scalafmt to 1.5.1
New transmogrifai-local subproject #41 introduces aardpfark and hadrian dependencies.

22 Aug 23:50

tovbinm

0.3.4

4171930

0.3.4

Performance improvements:

Added featureLabelCorrOnly parameter in SanityChecker to only compute correlations between features and label (defaults to false)
Added ignoreHashCorrelations parameter in SanityChecker that ignores correlations from hashed text features (defaults to false)
Parallelize OP cross validation and set default validation parallelism to 8
Added warmup in concurrent checks
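
The cross-validation parallelism item above can be illustrated with a minimal sketch that evaluates folds concurrently on a bounded thread pool and then averages the per-fold metric. It uses only the Scala standard library and stands in for the idea, not for the library's validator. `ParallelCvSketch`, `meanMetric` and the toy `evaluate` function are hypothetical.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelCvSketch {
  /** Evaluates every fold on a fixed-size pool and returns the mean metric. */
  def meanMetric(folds: Seq[Seq[Double]], parallelism: Int)(evaluate: Seq[Double] => Double): Double = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // One Future per fold; at most `parallelism` of them run at the same time.
      val results = Future.sequence(folds.map(fold => Future(evaluate(fold))))
      val metrics = Await.result(results, 1.minute)
      metrics.sum / metrics.size
    } finally pool.shutdown()
  }
}

// Toy "metric": the mean of each fold's values.
val folds = Seq(Seq(1.0, 3.0), Seq(5.0))
println(ParallelCvSketch.meanMetric(folds, parallelism = 8)(fold => fold.sum / fold.size)) // 3.5
```

Bounding the pool size keeps the number of concurrently running fits under control, which is the role a parallelism setting such as the default of 8 plays.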

New features and bug fixes:

Replace deprecated 'forceSharedHashSpace' param with HashingStrategy
Added explicit annotations for all classes with generic collections that use JsonUtils
Added .transmogrify shortcut for arrays of features
Removed referencing UID from a case object
DecisionTree & DropIndices stages tests now use the OP spec base classes
Added map features removed by RFF to model insights
Pretty print model summaries
Ensure OP Models are portable across environments
Ignore _ in simple streaming avro file reader
Updated evaluators so they can work with either a Prediction type feature or three input features
Added Algebird kryo registrar
Make sure that SmartTextVectorizerModel can be serialized to/from json

Dependency upgrades:

Upgraded to Scala 2.11.12
Updated Gradle to 4.9 & bump Scalastyle plugin to 1.0.1

Released to Bintray - https://bintray.com/salesforce/maven/TransmogrifAI
