Abstract

This year my proposal got selected under GSoC-2021 by mlpack. mlpack is a fast, flexible and amazing C++ Machine Learning Library with awesome community. Over the years we have seen huge growth the application of ML. Doesn't matter for which application you are using ML, one thing that you will always need is Fast Data Loading.

Overview of the project

The project revolves around reducing the binary footprint of mlpack by replacing the functionality of boost::spirits to handle a more diverse range of data in mlpack. This project is a part of the bigger goal of removing boost dependencies. The project aims to implement a parser by taking inspiration from armadillo's parser. A major goal will be to re-implement mlpack’s custom CSV parser that is currently being utilized to handle non-numeric data.

Summary of Project

We were able to nearly acheive all the targets that were set in the proposal but there are many more things to do. We successfully implemented the parse for numeric data and also successfully removed boost::spirit and adapted the parse for non-numeric data as well.

Here's my code

load_csv.hpp as base for the parser
- Declarations of all the fucntions used for parsing.
- Implementation of fucntions which are common to both types of parser.
load_numeric_csv.hpp for numeric parser
- Functions for loading numeric data.
load_categorical_csv.hpp for cateforical data
- Functions for loading categorical data using DatasetMapper.
Implemented FileTyp in mlpack to replace arma::file_type.
Implemented a set of string algorithms.
- trim()
- trim_if()
Example on how to use DatasetMapper
Adding DatasetMapper example to docs.

All of my work in wrapped in one PR only. You can have a look at the PR.

Go through the PR to follow the discussions. You can also jump to IRC. There are also mlpack IRC logs which Ryan has worked very hard to maintain so you can have a look there as well.

Best way to clear your querries? Ping me!!!. I would be more than happy to hear your views and clear any doubts that you have.

Results

I will add more details about this soon.

Blog

I also logged my weekly progress and updates in form of blog. You can have a look here.

Future Scope

I will keep working on mlpack since I have realized that the best way to learn in to contribute. There is still soo much to do with the data loading part in mlpack. Once this is merged into the master I will start working on restructing the load.hpp. You can see more about it on my blog. I also have plans to work on creating a DataFrame class for mlpack.

Note of Thanks

I would really like to thank mlpack for this awesome summer. My mentor Omar Shrit guided me at each and every step of the way. Not only my mentor but the whole community was very helpful. Thank you Ryan, Marcus and everyone who were there for those aweomse discussions on IRC and weekly meetups.

At last I would like to thank Google for organising such an event and working towards the growth on open-source community.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Organisation: mlpack

Project: Replacing boost spirits

Mentor: Omar Shrit

Abstract

Overview of the project

Summary of Project

Here's my code

Results

Blog

Future Scope

Note of Thanks

Files

README.md

Latest commit

History

README.md

File metadata and controls

Organisation: mlpack

Project: Replacing boost spirits

Mentor: Omar Shrit

Abstract

Overview of the project

Summary of Project

Here's my code

Results

Blog

Future Scope

Note of Thanks