This Node-RED extension pack contains a set of nodes that offer Spark DataFrame, SQL and machine learning functionality. All nodes have a Python/PySpark core.
It provides a visual interface for drag-and-drop machine learning with Spark.
This project is a work in progress. I plan to add more nodes, covering as many of the Spark Transformers and Estimators as possible.
- TF-IDF
- Word2Vec
- CountVectorizer
- FeatureHasher
- Tokenizer
- StopWordsRemover
- n-gram
- Binarizer
- PCA
- StringIndexer
- IndexToString
- OneHotEncoderEstimator
- VectorIndexer
- SQLTransformer
- VectorAssembler
- Decision Tree Classifier
- Logistic Regression
- Gradient-boosted Tree Classifier
- Multilayer Perceptron
- Random Forest Classifier
- Support Vector Machines
- k-Nearest Neighbour Classifier
- K-Means Clustering
- Latent Dirichlet Allocation (LDA)
Be sure to have a working installation of Node-RED.
Install Python and the following libraries:
To install the latest version, use the Menu - Manage palette option and search for node-red-contrib-sparkml, or run the following command in your Node-RED user directory (typically ~/.node-red):
npm i node-red-contrib-sparkml
These flows create a dataset, train a model and then evaluate it. After training, models can be used in real scenarios to make predictions.
There is an example flow and a test dataset available in the 'test' folder.
Tip: You can run 'node-red' (or 'sudo node-red' on Linux/macOS) from the folder '.node-red/node_modules/node-red-contrib-sparkml' to avoid confusion.
I am looking for contributors! Feel free to open issues directly on GitHub or email me with questions, feature suggestions or general feedback!