Consolidates NLP and AutoML, adds support for PyKX 2.5.3 with Python 3.11. (#109)

* added link to documentation

* added link to documentation

* don't print load messages in quiet mode

* docker image with nlp dependencies installed

* build docker image on travisCI

* update docker output

* slack notification

* updated README

* updated README

* updated README

* removed finding years as there are too many false positives in findDates

* adding tests

* added tests

* updated travis

* updated paths

* added tests

* Squashed commit of the following:

commit 86dde8886648f4c0199ca25696a68d97fecf30a7
Author: Fionnuala Carr <[email protected]>
Date:   Mon Jul 9 11:38:13 2018 +0100

    cleaned up tests

commit 2c3c612e70a92cd0d7c23a331198d9f351868c35
Author: Fionnuala Carr <[email protected]>
Date:   Mon Jul 9 11:11:48 2018 +0100

    changed ranges of dates,q error

commit 3828ac2c55c3ce0e89788362db527fb5305b1f60
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 11:54:08 2018 +0100

    modified test

commit 32ed5525c697200cbd011826bcbeab0b3e9b9e8a
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 11:46:08 2018 +0100

    moved tests

commit aa56fbc3ba6459ee1d6df788c9f3890a6830fdb2
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 11:36:58 2018 +0100

    changed path

commit 1b006a946bc00e195a216d317a60e3ec2d9f92fd
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 11:25:20 2018 +0100

    cd back

commit c83255288dfc0dfac3eca39bb636bc76015b1b5e
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 11:22:13 2018 +0100

    test embedPy runs

commit 1ad7c10a651eb64da0e00aa12de3f4101b269229
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 10:50:21 2018 +0100

    testing

commit f9327754052c95f26414c25d232c9703f316206d
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 10:48:38 2018 +0100

    embedpy

commit a036ac2e51b9f42a63c8faf8e86c2691b38ea424
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 10:46:34 2018 +0100

    embed

commit 2fcc4402bc6b6993b4e4735fe6415def84ecc368
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 10:38:34 2018 +0100

    embedPy

commit 36cdbef8c469750bb0448688e7a767647b29d423
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 10:33:48 2018 +0100

    embedPy

commit 2b84902e70ee84b863a76ac39844c6d3e0017e9d
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 10:30:28 2018 +0100

    embedPy

commit 66d3ab44d97b7f12ea2921fd15efb1b1c69df434
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 10:25:04 2018 +0100

    embedPy

commit 58f6ce0413992c78ce05e250860b9eac2de1ae13
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 10:20:45 2018 +0100

    embedPy

commit 905511d833ad29ffc5bb159c40b75c74a7788057
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 10:17:59 2018 +0100

    embedPy

commit 50ec1d4d5189c703f9cb11c918a1cc19b85c0c62
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 10:06:36 2018 +0100

    test

commit b9b4ec17333f59082fceee57d32914dd7f3217f9
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 09:58:00 2018 +0100

    test

commit 7d6fcf7582c8681aaa0664d192b009d91262781b
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 09:48:15 2018 +0100

    test

commit bd6c713e62db8bfc0494a4b990a9ca91ff7e7d16
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 09:16:32 2018 +0100

    tests

commit f701c088e83ed7b2243eef4c99618ce32ffd8626
Author: Fionnuala Carr <[email protected]>
Date:   Fri Jun 29 09:09:31 2018 +0100

    removed builds

commit 26d056251cc79a2c38356324dcfd3af345c2af55
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 18:50:10 2018 +0100

    add test

commit 66ab09c98eeb142c7ea1b4a1f6a410a41a67dfba
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 18:38:51 2018 +0100

    tests

commit e23b746cf1aa96256429fc252aea7036e52a02d2
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 18:29:01 2018 +0100

    test

commit b3e889493228411e6d1485dfefd25681e31b8899
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 18:28:34 2018 +0100

    test

commit aecb40447e48957dcaaf8e102673915d2968e7ef
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 18:24:45 2018 +0100

    tests

commit 18f4417e0c0123981ea613d9ba4ffff61e9211b1
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 18:18:22 2018 +0100

    add tests

commit 79c35c63e4bf647242258b3613e7d573b31883a8
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 17:14:10 2018 +0100

    non conda

commit a6a22020ad545306509f926c965d96c7f092e82e
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 16:47:17 2018 +0100

    testing

commit 0cb7b2ddd0326b3504d5e2478b3e478c39582d94
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 16:21:54 2018 +0100

    testing

commit 6e2d2493d3c07a9371a1c830e9038414b85976ad
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 16:19:02 2018 +0100

    testing

commit 867d432c080d3ae78b5fa6a90a050e883627c72e
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 16:17:35 2018 +0100

    testing

commit 8411f070ed3a791e3c92c3b1cb133fe24a14e03d
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 16:16:26 2018 +0100

    testing

commit 8325b17e1168781eddabd663cc255e9be8168d7d
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 16:14:39 2018 +0100

    testing

commit 23d17da19d06d4b0068f89a8e27c9e1747b907d1
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 16:08:03 2018 +0100

    testing

commit 162387ee28a74144aee8e69a9253bc98e34a9b18
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 16:05:19 2018 +0100

    testing

commit 14db1b68d22b4e86906022029b11fdf3e78cc1af
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 16:02:15 2018 +0100

    testing

commit 9e88a0c35980f232b1d5ab3098e484ee98617aa8
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 15:52:46 2018 +0100

    testing

commit c423e96384c69277516be8a16880f9081beda129
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 15:45:58 2018 +0100

    testing

commit 4be9539013403c0e35d67a17c4aee37040d99033
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 15:44:35 2018 +0100

    testing

commit 4bfc0f1955cd4ff068c7cb71c6cf13e431622c6c
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 15:41:33 2018 +0100

    testing

commit 7687973cf4fce5101578458049cb2cf80788b54e
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 15:35:24 2018 +0100

    testing

commit 2892e37f735b5ea7e71807c10f2359247319f27e
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 15:30:05 2018 +0100

    testing

commit 12b25e0d4d702b38e80a24ed06f405c944faadf9
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 13:38:14 2018 +0100

    testing

commit 7c609469090784d78bcc450f2fb15bc1ccf8c055
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 12:25:36 2018 +0100

    testing

commit d162961f17a530bb795127d4523a8062b288ce5b
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 12:13:33 2018 +0100

    testing

commit d6018f7d3626fc94e9160833fe27b40ca3a9ae9f
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 11:49:42 2018 +0100

    testing

commit a67fa1348c811659aa63f8a33acebb007b8426af
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 09:59:38 2018 +0100

    testing

commit a48dc73c939c02950cadff7f9bcdadc57d6e7974
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 09:40:29 2018 +0100

    testing

commit 3d728ebfc799dd8655f7cde8e149f0287dbececc
Author: Fionnuala Carr <[email protected]>
Date:   Thu Jun 28 09:36:57 2018 +0100

    testing

commit a593150467bae8385a9c4cb2d7a4a458fb3fa504
Author: Fionnuala Carr <[email protected]>
Date:   Wed Jun 27 18:23:28 2018 +0100

    testing

commit 4258fdbbd342cc342f8021a313a872b95589a30b
Author: Fionnuala Carr <[email protected]>
Date:   Wed Jun 27 17:17:50 2018 +0100

    testing

commit d2e2312dc17a69421952ad59af68ef29d84f9c34
Author: Fionnuala Carr <[email protected]>
Date:   Wed Jun 27 17:09:45 2018 +0100

    testing

commit c2d442d4bb1cc74f37507f3bf5f8662edb7948f5
Author: fionncarr <[email protected]>
Date:   Tue Jun 26 11:42:06 2018 +0000

    tests

commit 8c7cd25b3b46d779fbbfcdcdd6a7df89e06348e9
Author: fionncarr <[email protected]>
Date:   Tue Jun 26 11:16:08 2018 +0000

    tests

commit 254c6411183fb39074d8af603b6b28eef09581b5
Author: fionncarr <[email protected]>
Date:   Tue Jun 26 10:58:11 2018 +0000

    tests

commit 4423021a7dea2696a12f02e0dd9ecc74cd8bbbae
Author: fionncarr <[email protected]>
Date:   Tue Jun 26 10:37:17 2018 +0000

    test

commit 3e63b5fc7d9d59b205b7554b35095598559c7fcf
Author: fionncarr <[email protected]>
Date:   Tue Jun 26 10:04:56 2018 +0000

    tests

commit 8a34ed99cc31e397d8cc8be2af79b04754cc3662
Author: fionncarr <[email protected]>
Date:   Tue Jun 26 09:26:40 2018 +0000

    updated tests

commit 382ccd710860920a25df413b4e7b05321cc8bc9e
Author: fionncarr <[email protected]>
Date:   Tue Jun 26 08:42:03 2018 +0000

    checking what test fails

commit 1859636d67a1ef35b8e798e4fe879f0e0019ed53
Author: fionncarr <[email protected]>
Date:   Mon Jun 25 17:33:10 2018 +0000

    downloading spacy en model

commit e6713fa0c3fc81a3391ad200a6f63b19c41eee86
Author: fionncarr <[email protected]>
Date:   Mon Jun 25 17:10:45 2018 +0000

    changed email

commit 200d7c9bba262cc5654028cbdabf189a721f13a8
Author: Fionnuala Carr <[email protected]>
Date:   Mon Jun 25 17:20:06 2018 +0100

    deleted email

commit 8915a135a9e176af8f8deb0811bdf5c40cab1e26
Author: Fionnuala Carr <[email protected]>
Date:   Mon Jun 25 17:07:13 2018 +0100

    added email

commit d0613bdc6d5ae1e1934809b0e379cbe440d057c7
Author: Fionnuala Carr <[email protected]>
Date:   Mon Jun 25 16:35:57 2018 +0100

    updated path

commit 3c10c52eca07e880f6c4a2ce791a8c54cd4ab28d
Author: Fionnuala Carr <[email protected]>
Date:   Mon Jun 25 16:13:25 2018 +0100

    pip not pip3

commit 0a5ce933611647663d0f0b01d472b02b963aa28a
Author: Fionnuala Carr <[email protected]>
Date:   Mon Jun 25 16:10:37 2018 +0100

    remove comments

commit 5cd62ff1eabc7ae37c07f9c0c29bc674438191d4
Author: Fionnuala Carr <[email protected]>
Date:   Mon Jun 25 16:06:41 2018 +0100

    test with kdb

commit a560aef96e1747a746ac1bf12bd6b2d271d4972b
Author: Fionnuala Carr <[email protected]>
Date:   Mon Jun 25 15:34:37 2018 +0100

    osx and linux

* decoded env vars

* updated path for vader

* modified load emails function, added tests for the function and added scripts to spread expensive computations

* checked where it was breaking

* removed test that it breaks on

* adding scripts to make a release of the code and fixed bug in the parser with lemmas

* corrected typo

* recommitting because docker doesn't work

* fixed bug that was causing findDates to crash; lemmas now return the correct result and jaroWinkler works for words of 3 letters or fewer. Created tests for these functions

* delete files

* reversed some of the files so it would be able to merge to master

* added new functionality to extract rtf text in an email

* changed tests to accommodate new format in emails

* fixed .nlp.loadfile function to allow it to load in files in windows

* unsilence tests

* silenced tests

* fixed tfidf to match python

* added alternative apostrophe to stopword contractions

* updated TFIDF code

* added ability for alpha languages

* deleted stopwords

* tests

* changed test files

* test.q now the same as embedpy test.q

* testing

* alpha languages support, detect languages function added and TFIDF update

* buildtest

* buildtest

* buildtest

* buildtest

* test

* test

* test

* test

* test

* test

* test

* added conda cmd

* added conda cmd

* added conda cmd

* a

* changed back to original test.q

* test

* test

* test

* test

* test

* test

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update2

* update2

* update3

* update3

* update3

* update3

* update3

* update3

* update3

* update3

* update3

* update3

* update3

* update3

* update3

* update3

* update3

* update3

* update4

* update5

* update5

* update5

* update6

* update6

* update6

* update6

* update6

* update7

* update8

* update9

* update9

* update10

* update11

* update12

* update13

* update14

* update15

* update16

* update17

* update18

* update18

* update19

* update19

* update

* added user: to curl function

* added user: to curl function

* commit

* getToFrom function now looks for multiple senders if payload is a table

* change appveyor settings

* no changes

* calls embedpy tests

* calls embedpy tests

* added in PennPOS to catch symbols

* loading init.q added to test scripts

* delete test.q

* testing

* testing

* testing

* loading init.q added to test scripts

* dev (#15)

* tests

* changed test files

* test.q now the same as embedpy test.q

* testing

* alpha languages support, detect languages function added and TFIDF update

* buildtest

* buildtest

* buildtest

* buildtest

* test

* test

* test

* test

* test

* test

* test

* added conda cmd

* added conda cmd

* added conda cmd

* a

* changed back to original test.q

* update

* added user: to curl function

* added user: to curl function

* commit

* getToFrom function now looks for multiple senders if payload is a table

* no changes

* calls embedpy tests

* calls embedpy tests

* added in PennPOS to catch symbols

* loading init.q added to test scripts

* added slack notification

* fixed tests for spacy update

* fixed tests for spacy update

* dev

* fixed tests for spacy update

* Dianedev (#18) (#19)

* tests

* changed test files

* test.q now the same as embedpy test.q

* testing

* alpha languages support, detect languages function added and TFIDF update

* buildtest

* buildtest

* buildtest

* buildtest

* test

* test

* test

* test

* test

* test

* test

* added conda cmd

* added conda cmd

* added conda cmd

* a

* changed back to original test.q

* update

* added user: to curl function

* added user: to curl function

* commit

* getToFrom function now looks for multiple senders if payload is a table

* change appveyor settings

* no changes

* calls embedpy tests

* calls embedpy tests

* added in PennPOS to catch symbols

* loading init.q added to test scripts

* delete test.q

* testing

* testing

* testing

* loading init.q added to test scripts

* added slack notification

* fixed tests for spacy update

* fixed tests for spacy update

* dev

* fixed tests for spacy update

* added spacy hunspell

* removed .i. from funcs in docs

* removed .i. from funcs in docs

* removed .i. from funcs in docs

* removed .i. from funcs in docs

* removed .i. from funcs in docs

* removed .i. from funcs in docs

* embedPy

* code.kx link updates

* updated spell check

* updated spell check

* cleaned up format

* update file path for init

* updated init.q

* fixed spacy_hunspell

* install spacy_hunspell error

* fixed travis and appveyor

* added regex funcs

* added tensorflow funcs

* added alpha lang instructions, fixed parser for single opts input

* added tf tests

* clean up and added tests

* removed tensorflow, updated langid

* added instructions for spacy_hunspell

* updated travis file

* add conda-forge to docker

* changed docker instructions

* added pip -q to travis file w

* run tests on spacy version 2.2.1

* fix merging (#22)

* Initial commit

* This commit includes the initial beta version of the automated machine learning framework.
* Can be used for Normal/FRESH tasks in regression or classification problem.
* Designed to be flexible in nature to kdb devs and ml engineers
* Includes testing procedures via travis and appveyor

* Addition of hidden travis yml file and issue templates

* Update to code commenting, removal of unneeded plotting functions, minor readme mod

* Code refactor to move to dict input and clean up aml.q (#2)

* Update README.md

* Update README.md

* Fix to 'locals error and addition of target check (#4)

* reduction in local variable to stop 'locals error in kdb+<3.6

* Update to number of locals in aml.q, addition of target check and pandas requirement

* Change to target encoding location for symbols

* Removal of 3.6 requirement

* Train-test-split naming and check for existence of save-default file (#5)

* reduction in local variable to stop 'locals error in kdb+<3.6

* Update to number of locals in aml.q, addition of target check and pandas requirement

* Change to target encoding location for symbols

* Removal of 3.6 requirement

* Update to conventions for train-validate-test, check to see if default file already exists

* Fix to link in contributing.md

* Addition of travis and release tags

* Update .travis.yml

* Update package.bat

* Update getkdb.bat

* Update getkdb.bat

* Update package.bat

* Update to docker image (#6)

* Docker image had not been initialising correctly and was missing the ml toolkit

* wording update (#7)

* Update (#23)

* update merge

* update function name

* Update to findDates func to check for the word "of" or "in" between dates, months or years. Tests also added to account for this change (#24)

* update merge

* update function name

* fixed findDates function to account for the word of or in between dates, months and years

* update to infinity replace logic (#8)

* Explicit closing of figures to reduce process memory usage (#9)

* Upd infreplace (#10)

* update to infinity replace logic

* fix to bug in infinity replace for float

* Update travis/appveyor files. Removed sys argv statement due to embedPy update (#25)

* update merge

* update function name

* fixed findDates function to account for the word of or in between dates, months and years

* removed sys argv statement

* fix appveyor and travis files

* fix copy error

* removed getembedpy

* v0.2.0 additions (#11)

* addition of latex support and torch functionality

* merged nlp into new version

* update to import and checking functionality

* removal of hack for save paths

* commenting to init,no longer defining the .automml.p namespace, functions won't be callable unless they're all available anyway

* removal of unnecessary type check, more readable choice of first element, checknlp -> validnlp, or not and for check

* space between separate columns, rename of util to prep. ...

* refactoring of nlp preprocessing execution

* splitting of preprocessing functions into sub folders, models folder now splits into sections

* null and constant drop, simplification of percentage calculation in stop tab function

* wording update

* Explicit closing of figures to reduce process memory usage

* random/sobol hyperparam search

* grid/random hyperparam files

* old hyperparam file deleted

* kneighbors fix

* update to hyperparameter generation functionality for automl

* random search moved to ml

* fix for reading in pdict with new hp key

* tests + fix to proc.hp.psearch for sobol

* clearer install instructions for additional modules

* latex requirements

* readme updates

* report feature extraction

* python checks

* fixed nlp terminal printouts

* nlp refactor

* added regex search and user config option for w2v

* word2vec type change. Fixed string error

* added NLP tests

* installed extra requirements for tests

* moved requirements

* updated scripts to support pytorch models without tensorflow

* fix to error in non latex report generation

* updated travis/appveyor files

* saving with failing latex generation results in movement to report directory without rectification

* typo in second system command for current directory

* updated for mproc loading to work with both keras and torch

* V2nlp review (#10)

* pytorch updates

* pytorch updates

* refactor to ensure file movement is correct, failing tests are caught and torch tests can be run

* pip install for appveyor tests

Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* added namespace (#13)

* added namespace

* added warning for w2v randomization. Added w2vitem function that was previously redundant

* update to pip install rather than conda in docker images

Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: dmorgankx <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* Testing minor changes (#12)

* addition of latex support and torch functionality

* merged nlp into new version

* update to import and checking functionality

* removal of hack for save paths

* commenting to init,no longer defining the .automml.p namespace, functions won't be callable unless they're all available anyway

* removal of unnecessary type check, more readable choice of first element, checknlp -> validnlp, or not and for check

* space between separate columns, rename of util to prep. ...

* refactoring of nlp preprocessing execution

* splitting of preprocessing functions into sub folders, models folder now splits into sections

* null and constant drop, simplification of percentage calculation in stop tab function

* wording update

* Explicit closing of figures to reduce process memory usage

* random/sobol hyperparam search

* grid/random hyperparam files

* old hyperparam file deleted

* kneighbors fix

* update to hyperparameter generation functionality for automl

* random search moved to ml

* fix for reading in pdict with new hp key

* tests + fix to proc.hp.psearch for sobol

* clearer install instructions for additional modules

* latex requirements

* readme updates

* report feature extraction

* python checks

* fixed nlp terminal printouts

* nlp refactor

* added regex search and user config option for w2v

* word2vec type change. Fixed string error

* added NLP tests

* installed extra requirements for tests

* moved requirements

* updated scripts to support pytorch models without tensorflow

* fix to error in non latex report generation

* updated travis/appveyor files

* saving with failing latex generation results in movement to report directory without rectification

* typo in second system command for current directory

* updated for mproc loading to work with both keras and torch

* V2nlp review (#10)

* pytorch updates

* pytorch updates

* refactor to ensure file movement is correct, failing tests are caught and torch tests can be run

* pip install for appveyor tests

Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* added namespace (#13)

* added namespace

* added warning for w2v randomization. Added w2vitem function that was previously redundant

* update to pip install rather than conda in docker images

* addition of smaller run of trials for sobol/random

* reduction in number of rows for nlp tests (timeout)

Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: dmorgankx <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* NLP dataset to allow reduced rows

* AutoML Refactor Version 0.3.0 (#14)

* Version 0.3.0 update (#13)

* addition of latex support and torch functionality

* merged nlp into new version

* update to import and checking functionality

* removal of hack for save paths

* commenting to init,no longer defining the .automml.p namespace, functions won't be callable unless they're all available anyway

* removal of unnecessary type check, more readable choice of first element, checknlp -> validnlp, or not and for check

* space between separate columns, rename of util to prep. ...

* refactoring of nlp preprocessing execution

* splitting of preprocessing functions into sub folders, models folder now splits into sections

* null and constant drop, simplification of percentage calculation in stop tab function

* First pass commit at automl code structure with new graphing mechanism

* full graph in new format which can run 'basic' .automl.run

* Addition of stub files for AutoML graph testing

* update to travis test code

* first pass at data ingestion, configuration creation and data checking

* addition of save path to config, additional checking for NLP

* update to config retrieval to support flat files, coinciding refactor of function

* renaming of nlp checks to be clearer

* removal of overwritten date/time and update to structure/commenting

* change to camelCase, addition of image graphs, change to structure

* First pass update to include new coding standard definitions

* addition of a common location for general use utilities

* removal of unnecessary hidden files

* Addition of tests for target data functionality

* variable -> variant

* tests for process based retrieval of feature data

* addition of appropriate tests for the dataCheck node

* minor updates

* update to graph, inclusion of label encoding symbol mapping to graph both code and images

* addition of tests for remaining function in dataCheck node

* review of targetData, featureData and dataCheck

* added labelEncode functionality and corresponding tests

* Initial addition of now renamed featureDescription node, update to graph images

* change from modification to description in node naming

* removal of unneeded param to dataDescription function, update to tests to cover all expected behaviour

* update to automl graph to use new label encode function from the toolkit

* added node for modelGeneration. Added customization folder for models, scoring funcs etc

* addition of funcs.q

* updated comments from review of code

* Minor improvements to code

* changes to keras functions to make it more adaptable for the addition of other models

* windows fix for updateConfig (#11)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* Addition of dataPreprocessing node (#12)

* addition of dataPreprocessing node

* cleaned up commenting

* updated review changes

* minor updates to dataPreprocessing functionality, models definitions updated for keras.q

Co-authored-by: Conor McCarthy <[email protected]>

* New commenting style required for featureDescription node (#19)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* new commenting style

* Feature creation node (#16)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* addition of dataPreprocessing node

* cleaned up commenting

* updated review changes

* minor updates to dataPreprocessing functionality, models definitions updated for keras.q

* added featurecreation functionality

* cleaned up nlp functions

* cleaned up code

* updated graph for feat create model, added test print statements. Added travis/appveyor PYTHONHASHSEED

* updated appveyor build scripts to install embedpy via conda

* pythonhashseed env

* code review and test changes

* removal of old testing data

* added tests and error trap for NLP and pulled down review

* changed NLP tests

* updated NLP tests to use spacy 2.3.2

* updated code in line with comments and added tests for ml.df2tab addition

Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* Selectmodels node (#18)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* added selectModels node

* included funcs.q

* test fixes

* updated any comments in PR. Pulled down latest version

Co-authored-by: Deanna Morgan <[email protected]>

* addition of predictParams node (#21)

* addition of predictParams node

* graph updates

Co-authored-by: Conor McCarthy <[email protected]>

* Created pathConstruct node (#23)

* addition of paramConsolidate node

* created pathConstruct node

* Automl graph tts (#15)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* addition of featureSignificance node

* addition of trainTestSplit node

* addition of featureSignificance node

* sigFeat fixes

* sigFeat error trapping

* sigFeat tests

* change of train test split output type

* train test split tests

* correction to featSig tests

* correction to featSig tests

* correction to featSig tests

* test updates

* test updates

* correlated columns

* review of tts

* review of sigfeat

* correction to sigFeats functions to include one of correlation columns

* addition of tests for funcs.q

* addition of q/python func check

* review of tts, moved qpyFuncSearch to dataCheck

* reviewed featSig tests

* utils moved to funcs.q for TTS + sz check added

* utils moved to funcs.q for TTS + sz check added

* removed pythonTTS.p - already in dataCheck

* PR changes

Co-authored-by: Dianeod <[email protected]>

* Addition of saveGraph node (#24)

* addition of saveGraph node

* addition of saveGraph node

* addition of extra plots

* removed folders created in tests

* review of comments made

* updated Graph

* moved plt to utils, changed marker size

Co-authored-by: cmccarthy1 <[email protected]>

* Graph runmodels (#17)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* addition of featureSignificance node

* addition of trainTestSplit node

* addition of featureSignificance node

* sigFeat fixes

* sigFeat error trapping

* sigFeat tests

* change of train test split output type

* train test split tests

* correction to featSig tests

* correction to featSig tests

* correction to featSig tests

* test updates

* test updates

* correlated columns

* review of tts

* review of sigfeat

* updated

* addition of runmodels node

* updated graph

* runModels review

* added number of reps for gs/xv

* updated dataCheck test

* updated comments made in PR

* addition of information for metadata

* resolved all comments

Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: dmorgankx <[email protected]>

* Automl graph preproc params (#26)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* Update Automl_Graph.drawio

* graph, test and code format updates

* test print statements and updated graph

* Update Automl_Graph.drawio

* graph updates

* Connection and workflow clarity for graph images

Co-authored-by: Conor McCarthy <[email protected]>

* Addition of saveMeta node (#25)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* new commenting style

* addition of saveMeta node

* addition of saveopt check and tests

* node review - moved mdlMeta to funcs, removed repeated code

* added print statements. Removed pathDict created in pathConstruct

* updated modelMeta lib

* Addition of tests to check paths/metadata is created

Co-authored-by: Deanna Morgan <[email protected]>

* OptimizeModels Node (#20)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* new commenting style

* added optimization node

* update optimization node

* updated graph

* Update to include confusion matrix and impact dictionary

* added regression calculation

* node review

Co-authored-by: Deanna Morgan <[email protected]>

* Fixed any bugs found (#27)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* new commenting style

* added optimization node

* update optimization node

* updated graph

* Update to include confusion matrix and impact dictionary

* added regression calculation

* node review

* fixed any bugs found, now runs through for all nodes

Co-authored-by: Deanna Morgan <[email protected]>

* Moved testing functions to separate file (#29)

* moved passing/failing test to separate file

* added test/utils.q

* Addition of saveModel node (#28)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* new commenting style

* addition of saveModels node

* addition of savemodels node

* addition of saveModels node

* cleaned up if statement

* addition of saveModels node

* added tests for NLP

Co-authored-by: Deanna Morgan <[email protected]>

* Updating any bugs so that `.automl.run` works (#30)

* removed duplicated function

* fixed any errors to make sure automl.run runs through

* updated travis to include spacy english model and keras

* Automl graph savereport (#31)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* new commenting style

* addition of saveModels node

* addition of savemodels node

* addition of saveModels node

* cleaned up if statement

* minor code changes

* addition of fpdf report gen

* addition of fpdf report gen

* updated saveGraph to run latex

* report tests

* report tests

* updated tests

* updated image size

* latex formatting

* Updated latex checking and change to code organization

* pdflatex naming typo

* Fix to force absolute location of generated reports

* update to reportlab generation, new page removes custom font for headers

Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* removed duplicated function (#33)

* move hyperparams to json files (#34)

* removed duplicated function

* move hyperparams to json files

* Update to some descriptions of functions and indenting of json

Co-authored-by: Conor McCarthy <[email protected]>

* Add functionality to add custom save path (#35)

* removed duplicated function

* added functionality to add custom model save path

* move hyperparams to json files (#34)

* removed duplicated function

* move hyperparams to json files

* Update to some descriptions of functions and indenting of json

Co-authored-by: Conor McCarthy <[email protected]>

* Update to be more strongly typed

Co-authored-by: Conor McCarthy <[email protected]>

* Introduction of command line interface api for automl (#36)

* Initial pass at json driven command line interface

* Major update to command line interface to support new input naming and allow first pass at fire and forget

* update to allow data retrieval via ipc/csv in command line case

* update to json format and command line input structure

* addition of code commenting for new command line version

* Final change to facilitate appropriate model naming convention

* typo fix

* Review of code (#37)

* removed duplicated function

* review of code

* refactor default layout

* Reintroduction of prediction mechanism for automl (#38)

* first pass at addition of prediction functionality

* Working pass at retrieval of models from disk

* update to remove multiple paths to generate predict function

* revert to pre cli_testing merge

* Minor updates to clean up NLP and correctly retrieve saved model. Update to feature creation for FRESH to support tabular input

* minor fixes to issues with retrieving named models and using the correct save option name

* Update tests (#40)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* new commenting style

* addition of saveModels node

* addition of savemodels node

* addition of saveModels node

* cleaned up if statement

* minor code changes

* update to tests to be in line with new dictionary input structure

* updates in line with requirements for tests with dataPreprocessing node

* featureExtractionType across the board change

* Update to tests for featureCreation node

* update to featureExtractionType name

* update to featureSignificance node tests to account for new config structure

* update to configuration retrieval to ensure full config retrieved

* renaming of config parameters for model optimization and update to feature extraction naming for preprocParams node

* update to runModels test config

* updates to configurations for train test split node

* Fix to bug in data split function and renaming of configuration in testing in line with new functionality

* update to saveMeta testing to align with revised structure for prediction functionality

* update to saveoption and feature extraction naming in line with new config for saveModels node

* Update to configuration for testing of saveReport node

* removal of old config definition

* Fix to bug introduced with change to hyperparameter function retrieval, update to configuration keys

* path error fix and model meta check

* reintroduction of test utilities needed for passing/failing test logic

* Review of updateTests branch (#42)

* removed duplicated function

* review of code

* refactor default layout

* review of testing code

* Reintroduction of travis testing (#43)

* initial update to reintroduce tests

* reintroduction of tensorflow install requirement

* Change to number of features and minor change to model paths

* update to FRESH data to align with correct representation

Co-authored-by: Conor McCarthy <[email protected]>

Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Dianeod <[email protected]>

* Automl scoringmodels (#46)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* new commenting style

* addition of saveModels node

* addition of savemodels node

* addition of saveModels node

* cleaned up if statement

* minor code changes

* update to tests to be in line with new dictionary input structure

* updates in line with requirements for tests with dataPreprocessing node

* featureExtractionType across the board change

* Update to tests for featureCreation node

* update to featureExtractionType name

* update to featureSignificance node tests to account for new config structure

* update to configuration retrieval to ensure full config retrieved

* renaming of config parameters for model optimization and update to feature extraction naming for preprocParams node

* update to runModels test config

* updates to configurations for train test split node

* Fix to bug in data split function and renaming of configuration in testing in line with new functionality

* update to saveMeta testing to align with revised structure for prediction functionality

* update to saveoption and feature extraction naming in line with new config for saveModels node

* Update to configuration for testing of saveReport node

* removal of old config definition

* Fix to bug introduced with change to hyperparameter function retrieval, update to configuration keys

* move to json structure for models

* model text files not needed

* json additions for models and scoring

* scoring json file

* apply flag/boolean seed/scoring fixes/docs link

* test fixes

* test fixes

Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* review of predict node (#44)

* removed duplicated function

* review of code

* refactor default layout

* review of predict function

* addition of warning if model comes from an unsupported library

Co-authored-by: Conor McCarthy <[email protected]>

* reintroduction of load for test utils in saveModels

* Adding printing Functionality (#45)

* removed duplicated function

* added functionality to add custom model save path

* Update to be more strongly typed

* added print statements

* removed file

* moved remaining print statements to new format. Added print python warning option

* cleaned up printing dict

* fixed naming convention

* updated naming convention. Adding additional logging parameter

* moved api functionality to utils

* Updates to clean up ordering of printing, allow logging directories/files to be modified in json definitions, update graph as required and add print for graph file locations

Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: dmorgankx <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: dmorgankx <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* Graph warning (#47)

* removed duplicated function

* Initial pass at json driven command line interface

* Major update to command line interface to support new input naming and allow first pass at fire and forget

* update to allow data retrieval via ipc/csv in command line case

* update to json format and command line input structure

* addition of code commenting for new command line version

* Final change to facilitate appropriate model naming convention

* typo fix

* review of code

* refactor default layout

* first pass at addition of prediction functionality

* Review of code (#37)

* removed duplicated function

* review of code

* refactor default layout

* Working pass at retrieval of models from disk

* update to remove multiple paths to generate predict function

* revert to pre cli_testing merge

* Minor updates to clean up NLP and correctly retrieve saved model. Update to feature creation for FRESH to support tabular input

* minor fixes to issues with retrieving named models and using the correct save option name

* review of predict function

* Update tests (#40)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* new commenting style

* addition of saveModels node

* addition of savemodels node

* addition of saveModels node

* cleaned up if statement

* minor code changes

* update to tests to be in line with new dictionary input structure

* updates in line with requirements for tests with dataPreprocessing node

* featureExtractionType across the board change

* Update to tests for featureCreation node

* update to featureExtractionType name

* update to featureSignificance node tests to account for new config structure

* update to configuration retrieval to ensure full config retrieved

* renaming of config parameters for model optimization and update to feature extraction naming for preprocParams node

* update to runModels test config

* updates to configurations for train test split node

* Fix to bug in data split function and renaming of configuration in testing in line with new functionality

* update to saveMeta testing to align with revised structure for prediction functionality

* update to saveoption and feature extraction naming in line with new config for saveModels node

* Update to configuration for testing of saveReport node

* removal of old config definition

* Fix to bug introduced with change to hyperparameter function retrieval, update to configuration keys

* path error fix and model meta check

* reintroduction of test utilities needed for passing/failing test logic

* Review of updateTests branch (#42)

* removed duplicated function

* review of code

* refactor default layout

* review of testing code

* Reintroduction of travis testing (#43)

* initial update to reintroduce tests

* reintroduction of tensorflow install requirement

* Change to number of features and minor change to model paths

* update to FRESH data to align with correct representation

Co-authored-by: Conor McCarthy <[email protected]>

Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Dianeod <[email protected]>

* add capability to ignore warnings/error statements

* Automl scoringmodels (#46)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* new commenting style

* addition of saveModels node

* addition of savemodels node

* addition of saveModels node

* cleaned up if statement

* minor code changes

* update to tests to be in line with new dictionary input structure

* updates in line with requirements for tests with dataPreprocessing node

* featureExtractionType across the board change

* Update to tests for featureCreation node

* update to featureExtractionType name

* update to featureSignificance node tests to account for new config structure

* update to configuration retrieval to ensure full config retrieved

* renaming of config parameters for model optimization and update to feature extraction naming for preprocParams node

* update to runModels test config

* updates to configurations for train test split node

* Fix to bug in data split function and renaming of configuration in testing in line with new functionality

* update to saveMeta testing to align with revised structure for prediction functionality

* update to saveoption and feature extraction naming in line with new config for saveModels node

* Update to configuration for testing of saveReport node

* removal of old config definition

* Fix to bug introduced with change to hyperparameter function retrieval, update to configuration keys

* move to json structure for models

* model text files not needed

* json additions for models and scoring

* scoring json file

* apply flag/boolean seed/scoring fixes/docs link

* test fixes

* test fixes

Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* review of predict node (#44)

* removed duplicated function

* review of code

* refactor default layout

* review of predict function

* addition of warning if model comes from an unsupported library

Co-authored-by: Conor McCarthy <[email protected]>

* cleaned up code

* added more verbose warnings. Changed the location of removal of previous savePaths

* Update to reverse ordering of warning levels, fixes to deletion logic for tests, cfg->config

Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: cmccarthy1 <[email protected]>
Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: dmorgankx <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* addition of fit/predict functionality and fix to retrieval of models based on name

* Revert "addition of fit/predict functionality and fix to retrieval of models based on name"

This reverts commit 0e70c17b00aa48af545d892190e235daf46d6af4.

* Addition of Theano capability (#48)

* removed duplicated function

* added capability for adding a Theano model

* Update to Theano model support to remove models and allow run to continue if theano not installed

* added theano model check. Cleaned up printWarnings dict. Fixed print to screen check if saveOpt is 0

Co-authored-by: Conor McCarthy <[email protected]>

* Reintroduction of fit-predict tests and fix to named model retrieval (#49)

* addition of fit-predict tests and fix to retrieval of named models

* Graph testing upd (#50)

* removed duplicated function

* added print statements, included all test files in a bat file

* Changed txt file to bat file in travis

Co-authored-by: cmccarthy1 <[email protected]>

Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Dianeod <[email protected]>

* Overall code review (#51)

* windows fix for updateConfig - no longer overwrites dir

* code tidy up

* variable declared as global broke tests - changed to local

* new commenting style

* addition of saveModels node

* addition of savemodels node

* addition of saveModels node

* cleaned up if statement

* minor code changes

* code review

* code review

* code review

* code review

* code review

* code review

* code review

* code review

* fixes to dataCheck tests

* test updates for windows

* test fixes

* conflict fixes

* review of changes to overall codebase

* minor change to selectModels test

Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* Graph log warning tests (#52)

* removed duplicated function

* added logging tests

* addition of warning/theano/torch tests

* fix for appveyor and travis tests

* fixed appveyor  build

* removed swp file

* updated ignorewarnings print statement

* minor updates, torch change required for non gpu install torch

Co-authored-by: Conor McCarthy <[email protected]>

* Addition of retrieval logic to get nearest model based on start date  (#53)

* removed duplicated function

* Initial pass at retrieval of closest model

* added capability for adding a Theano model

* Update to Theano model support to remove models and allow run to continue if theano not installed

* Addition of model deletion functionality

* removal of code duplication

* Graph delete models (#55)

* added logging tests

* addition of warning/theano/torch tests

* fix for appveyor and travis tests

* fixed appveyor  build

* removed swp file

* updated ignorewarnings print statement

* minor updates, torch change required for non gpu install torch

* review of code

* fix delete models

* fix for getModels using time

Co-authored-by: Conor McCarthy <[email protected]>

* addition of command line interface test and addition of test flag for running cli automl (#54)

Co-authored-by: Conor McCarthy <[email protected]>

* Graph fix misc (#57)

* fixed logging and warning tests; Updated README; Check for wrong input

* revert changed to TF print

* changed date/time to original format

* cleaned up code

* cleaned up code

* Graph tests (#58)

* reduced tests for appveyor timeout

* reduced number of iterations for Theano

Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Dianeod <[email protected]>

Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: dmorgankx <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* Update requirements.txt

Co-authored-by: Dianeod <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: dmorgankx <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>

* Update README.md

* minor change to binary search for model retrieval by date-time, update to allow symbol saveModelName, minor path printing issue

* Addition of recursive deleted function (#15)

* added recursive deleted function

* removed keyPath

* fixed travis issue for mac

* added check for deleting relevant dates

* Change order of remove constant and add nulls (#16)

* Change order of remove constant and add nulls

* swap order of function constant values and null values

Co-authored-by: unknown <Andrew Morrison>

* Refactor of NLP library (#26)

* update merge

* update function name

* fixed findDates function to account for the word of or in between dates, months and years

* removed sys argv statement

* fix appveyor and travis files

* fix copy error

* removed getembedpy

* cluster refactor

* fixed indentations

* fixed @

* Update of date_time.q to new format

* update email to new format and commenting style

* Fix commenting error

* review of parser

* fix email error

* fixed bug

* updated comments

* update commenting

* updated comments

* review of parser code

* Updates to move utils to .i, removal of duplicate email function definitions

* moved callable functions to the end

* moved callable functions to the end

* Minor consistency update

* moved python funcs

* review of regex function

* Updates to parser functionality

* Minor updates to regex string matching refactor

* review of sent

* fix indentation

* fixed length of line to be <80 in regex

* review of utils functions

* fixed indentation

* initial review of nlp_code

* moved functions to nlp_code.q

* Minor changes to sentiment analysis functionality

* renamed files

* minor description updates for nlp utilities

* reintroduction of embedPy load

* updated removeMain and added filelength.t

* minor updates to coincide with docs

* update to coincide with docs

* changed input names

* update comments

* nlp code review qdocs and headers

* updates following comments

* adding dictionaries kind and type

* two small changes

Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: andrewmorrison1 <[email protected]>

* Update to reflect ML Toolkit refactor (#17)

* update to new ml format

* update ml functions for new refactor

* feb 3rd automl code review

* march 3rd code review

* update to infreplace

* response to comments on automl

* update test to run on windows

* changes after comments part 2

* second review of comments

* reply to latest comments

* review

* predict -> transform

* resolved comments

Co-authored-by: andrewmorrison1 <[email protected]>

* add sharpe ratio (#19)

* imported documentation from code.kx.com (#20)

* import documentation from code.kx.com, adapt links; converted to GFM

* minor edits and fixes

* move ml to subfolder

* Support for pykx 2.5.3 & python 3.11

* restructure cleanup

* support pykx & embedpy loaded before shim.q

* Support loading mproc before .ml namespace

* Fix automl output graph labeling

* Add examples & rework READMEs

* Add links to main README

* Update old links

* more links

* used pinned requirements in dockerfile

* remove .gitlab-ci, move shim to ml & defer docker content

* avoid shim.q dependency

---------

Co-authored-by: Fionnuala Carr <[email protected]>
Co-authored-by: James Hanna <[email protected]>
Co-authored-by: jhanna-kx <[email protected]>
Co-authored-by: fionncarr <[email protected]>
Co-authored-by: Dianeod <[email protected]>
Co-authored-by: diane <[email protected]>
Co-authored-by: awilson-kx <[email protected]>
Co-authored-by: cmccarthy1 <[email protected]>
Co-authored-by: awilson-kx <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: dmorgankx <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: andrewmorrison1 <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: andrewmorrison1 <[email protected]>
Co-authored-by: Stephen Taylor <[email protected]>
23 people authored Oct 9, 2024
1 parent 3d4c7a1 commit 7657d99
Showing 443 changed files with 75,120 additions and 1,039 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -1 +1,2 @@
**/.DS_Store
automl/outputs
data
83 changes: 0 additions & 83 deletions .travis.yml

This file was deleted.

8 changes: 4 additions & 4 deletions CONTRIBUTING.md
@@ -2,7 +2,7 @@

Thanks for choosing to contribute to this project.

If you haven't already, please view the [README](README.md) for an introduction to the aims and intended use of this project.
If you haven't already, please view the [README](README.md) for an introduction to the aims and intended use of this project.

## Contributing as a user (non-development)

@@ -32,15 +32,15 @@ It can also be worth considering if your changes may also require changes to doc

### Submitting Changes

When committing changes, please provide a descriptive commit comment of why the change was made (e.g. 'fixed bug' is not a suitable comment as it doesn't describe which bug).
When committing changes, please provide a descriptive commit comment of why the change was made (e.g. 'fixed bug' is not a suitable comment as it doesn't describe which bug).

You can link to a relevant issue in a commit comment by referencing the issue number prefixed with a '#'.

After pushing to your fork, submit a pull request against the project's main branch.
After pushing to your fork, submit a pull request against the project's main branch.

In order to have your pull request approved in a timely manner, please provide comments on the pull request that detail what was changed and why. The complexity or size of the change indicates how much description is required for the reviewer to get up to speed as quickly as possible.

### Additional Resources

[Machine Learning Documentation](https://code.kx.com/ml/)
[Machine Learning Documentation](https://code.kx.com/q/ml/)

File renamed without changes.
@@ -8,7 +8,7 @@ assignees: ''
---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. For example "It would be useful if the a metric for ... was available to help with ..."
A clear and concise description of what the problem is. For example "It would be useful if a metric for ... was available to help with ..."

**Describe the solution you'd like**
A clear and concise description of what you want to happen.
151 changes: 129 additions & 22 deletions README.md
@@ -1,21 +1,47 @@
# Machine Learning Toolkit

[![GitHub release (latest by date)](https://img.shields.io/github/v/release/kxsystems/ml?include_prereleases)](https://github.com/kxsystems/ml/releases) [![Build Status](https://travis-ci.com/KxSystems/ml.svg?branch=master)](https://travis-ci.com/KxSystems/ml)
The Machine Learning Toolkit is a comprehensive suite designed to empower kdb+/q users with advanced machine learning capabilities. It offers a robust and flexible framework for addressing a wide range of tasks, including time series analysis, natural language processing, and automated machine learning. By integrating seamlessly with kdb+/q, the toolkit facilitates efficient data handling and processing, leveraging both traditional machine learning techniques and modern NLP models.

The Machine Learning Toolkit is at the core of kdb+/q-centered machine-learning functionality. This library contains functions that cover the following areas:
* An implementation of the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm for use in the extraction of features from time series data and the reduction in the number of features through statistical testing.
* Cross-validation and grid-search functions allowing for testing of the stability of models to changes in the volume of data or the specific subsets of data used in training.
* Clustering algorithms used to group data points and to identify patterns in their distributions. The algorithms make use of a k-dimensional tree to store points and scoring functions to analyze how well they performed.
* Statistical timeseries models and feature-extraction techniques used for the application of machine learning to timeseries problems. These models allow for the forecasting of the future behavior of a system under various conditions.
* Numerical techniques for calculating the optimal parameters for an objective function.
* A graphing and pipeline library for the creation of modularized executable workflow based on a structure described by a mathematical directed graph.
* Utility functions relating to areas including statistical analysis, data preprocessing and array manipulation.
The repository is structured as three modules: ml and nlp can each be used independently for their respective feature sets [as further described below](#components); automl builds upon ml and nlp to deliver automated machine learning capabilities.

These sections are explained in greater depth within the [FRESH](docs/fresh.md), [cross validation](docs/xval.md), [clustering](docs/clustering/algos.md), [timeseries](docs/timeseries/README.md), [optimization](docs/optimize.md), [graph/pipeline](docs/graph/README.md) and [utilities](docs/utilities/metric.md) documentation.
<!-- ## Getting started
## Requirements
To get up and running quickly, start by pulling the Docker image, which comes pre-installed with all dependencies specified in requirements_pinned.txt. This allows you to dive straight into trying out our [examples](examples/) and exploring the toolkit's capabilities without the need for additional setup.
- embedPy
```bash
git clone https://github.com/KxSystems/ml.git ml
docker pull <image>
docker run -itv ./ml:/ml -e QLIC_K4=$(cat $QHOME/k4.lic | base64 -w0) --entrypoint /bin/bash <image>
# Now within the container, source the initial environment setup script
cd /ml
source scripts/setup.sh
source scripts/pykx.sh # Switch from embedpy to pykx (optionally continue with embedpy)
source scripts/link.sh # Install the toolkit into your selected QHOME
# Now start q, then load and work with the desired components
rlwrap q
q)\l nlp/nlp.q
q).nlp.loadfile`:init.q
Loading init.q
Loading code/utils.q
Loading code/regex.q
Loading code/sent.q
Loading code/parser.q
Loading code/time.q
Loading code/date.q
Loading code/email.q
Loading code/cluster.q
Loading code/nlp_code.q
q).nlp.findTimes"I went to work at 9:00am and had a coffee at 10:20" # See examples/ for more advanced usage.
09:00:00.000 "9:00am" 18 24
10:20:00.000 "10:20" 45 50
q)
``` -->

### Requirements

- kdb+ >= 3.5 64-bit

The Python packages required to allow successful execution of all functions within the machine learning toolkit can be installed via:

@@ -29,18 +55,32 @@ or via conda:
conda install --file requirements.txt
```

Alternatively, use `requirements_pinned.txt` for a fully resolved, pinned, and known-working set of dependencies, or a module-specific `requirements.txt` (e.g. `ml/requirements.txt`) when only using a subset of the toolkit.

While the nlp framework may be used with other models, automl and the nlp tests use `en_core_web_sm`. You can download it after installing the Python requirements:
```bash
python -m spacy download en_core_web_sm
```

<!-- //! optional reqs for automl -->

## Installation

Place the `ml` folder in `$QHOME` and load into a q instance using `ml/ml.q`
### Installation

The following will load **all** functionality into the `.ml` namespace
To install, simply copy or link the desired components to your `$QHOME` directory, for example: `cp -r {ml,nlp,automl} $QHOME/`.

To load all functionality into the `.automl`, `.ml`, and `.nlp` namespaces, run the following from q:
```q
\l ml/ml.q
.ml.loadfile`:init.q
\l automl/automl.q
.automl.loadfile`:init.q
```

## Examples
* To load only specific modules, replace automl with ml or nlp in the commands above.

Once installed, you can explore the toolkit's capabilities by trying out our [examples](examples/).


<!-- ### Examples //! currently outdated
Examples showing implementations of several components of this toolkit can be found [here](https://github.com/KxSystems/mlnotebooks/). These notebooks include examples of the following sections of the toolkit.
@@ -49,13 +89,80 @@ Examples showing implementations of several components of this toolkit can be fo
* Cross validation and grid search capabilities
* Results Scoring functionality
* Clustering methods applied to datasets
* Timeseries modeling examples
* Timeseries modeling examples -->


## Components
### ml
This library contains functions that cover the following areas:
- An implementation of the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm for use in the extraction of features from time series data and the reduction in the number of features through statistical testing.
- Cross-validation and grid-search functions allowing for testing of the stability of models to changes in the volume of data or the specific subsets of data used in training.
- Clustering algorithms used to group data points and to identify patterns in their distributions. The algorithms make use of a k-dimensional tree to store points and scoring functions to analyze how well they performed.
- Statistical timeseries models and feature-extraction techniques used for the application of machine learning to timeseries problems. These models allow for the forecasting of the future behavior of a system under various conditions.
- Numerical techniques for calculating the optimal parameters for an objective function.
- A graphing and pipeline library for the creation of modularized executable workflows based on a structure described by a mathematical directed graph.
- Utility functions relating to areas including statistical analysis, data preprocessing and array manipulation.
- A multi-processing framework to parallelize work across many cores or nodes.
- Functions for integration with PyKX or embedPy, which ensure seamless interoperability between Python and kdb+/q in either environment.

These sections are explained in greater depth within the [FRESH](ml/docs/fresh.md), [cross validation](ml/docs/xval.md), [clustering](ml/docs/clustering/algos.md), [timeseries](ml/docs/timeseries/README.md), [optimization](ml/docs/optimize.md), [graph/pipeline](ml/docs/graph/README.md) and [utilities](ml/docs/utilities/metric.md) documentation.


### nlp

The natural language processing (NLP) module lets users parse datasets using spaCy models from Python, running tokenisation, sentence detection, part-of-speech tagging and lemmatization. In addition to parsing, users can cluster text documents using algorithms such as MCL, k-means and radix. You can also run sentiment analysis, which indicates whether a word has a positive or negative sentiment.

<!-- //! docs? old link is dead: Documentation is available on the [nlp](https://code.kx.com/v2/ml/nlp/) homepage.-->


## Documentation
### automl

Documentation for all sections of the Machine Learning Toolkit:
The automated machine learning library described here is built on top of ml and nlp. The purpose of this framework is to help you automate the process of applying machine learning techniques to real-world problems. In the absence of expert machine-learning engineers, it handles the following processes within a traditional workflow:

- Data preprocessing
- Feature engineering and feature selection
- Model selection
- Hyperparameter Tuning
- Report generation and model persistence

Each of these steps is outlined in depth within the [documentation](automl/docs).

<!--
## Building the docker images
### preflight
You will need [Docker installed](https://www.docker.com/community-edition) on your workstation; make sure it is a recent version.
Check out a copy of the project with `git clone https://github.com/KxSystems/ml.git`.
### building
To build the project locally:
```bash //! improve
docker build -t registry.gitlab.com/kxdev/kxinsights/data-science/ml-tools/automl:embedpy-gcc-deb12 -f docker/Dockerfile .
docker build -t myimage:mytag -f docker/Dockerfile .
``` -->

<!-- **N.B.** if you wish to use an alternative source for [embedPy](https://github.com/KxSystems/embedPy) then you can append `--build-arg embedpy_img=embedpy` to your argument list. -->

<!-- Other build arguments are supported and you should browse the `Dockerfile` to see what they are. -->

<!-- Once built, you should have a local image which you can run with as shown in the "Getting started" section above. -->

<!-- ### Deploy //! outdated
[travisCI](https://travis-ci.org/) is configured to monitor when tags of the format `/^[0-9]+\./` are added to the [GitHub hosted project](https://github.com/KxSystems/ml), a corresponding Docker image is generated and made available on [Docker Cloud](https://cloud.docker.com/)
This is all done server side as the resulting image is large.
To do a deploy, you simply tag and push your releases as usual:
```bash
git push
git tag 0.7
git push --tag
``` -->

:open_file_folder: [`docs`](docs)

## Status

26 changes: 0 additions & 26 deletions appveyor.yml

This file was deleted.

48 changes: 48 additions & 0 deletions automl/automl.q
@@ -0,0 +1,48 @@
// automl.q - Setup automl namespace
// Copyright (c) 2021 Kx Systems Inc
//
// Define version, path, and loadfile.
// Execute algo if run from cmd line.


\d .automl

if[not `e in key `.p;
@[{system"l ",x;.pykx.loaded:1b};"pykx.q";
{@[{system"l ",x;.pykx.loaded:0b};"p.q";
{'"Failed to load PyKX or embedPy with error: ",x}]}]];

if[not `loaded in key `.pykx;.pykx.loaded:`import in key `.pykx];
if[.pykx.loaded;.p,:.pykx];

// Coerce to string/sym
coerse:{$[11 10h[x]~t:type y;y;not[x]&-11h~t;y;0h~t;.z.s[x] each y;99h~t;.z.s[x] each y;t in -10 -11 10 11h;$[x;string;`$]y;y]}
cstring:coerse 1b;
csym:coerse 0b;

// Ensure plain python string (avoid b' & numpy arrays)
pydstr:$[.pykx.loaded;{.pykx.eval["lambda x:x.decode()"].pykx.topy x};::]

version:@[{AUTOMLVERSION};`;`development]
path:{string`automl^`$@[{"/"sv -1_"/"vs ssr[;"\\";"/"](-3#get .z.s)0};`;""]}`
loadfile:{$[.z.q;;-1]"Loading ",x:_[":"=x 0]x:$[10=type x;;string]x;system"l ",path,"/",x;}

// @kind description
// @name commandLineParameters
// @desc Retrieve command line parameters and convert to a kdb+ dictionary
commandLineInput:first each .Q.opt .z.x

// @kind description
// @name commandLineExecution
// @desc If a user has defined both config and run command line arguments, the
// interface will attempt to run the fully automated version of AutoML. The
// content of the JSON file provided will be parsed to retrieve data
// appropriately via ipc/from disk, then the q session will exit.
commandLineArguments:lower key commandLineInput
if[all`config`run in commandLineArguments;
loadfile`:init.q;
.ml.updDebug[];
testRun:`test in commandLineArguments;
runCommandLine[testRun];
exit 0]


0 comments on commit 7657d99
