Create spam classification tutorial #112

Merged Jan 5, 2022 (48 commits)
Commits (48):
- 3bf2dd2: initial creation of spam tutorial and update of data download script (bkmgit, Aug 24, 2020)
- aa5cec5: Update spam/tutorial.md (bkmgit, Aug 28, 2020)
- 0236200: Update spam/tutorial.md (bkmgit, Aug 28, 2020)
- 65d1df2: Update spam/tutorial.md (bkmgit, Aug 28, 2020)
- 9b04377: Update spam/tutorial.md (bkmgit, Aug 28, 2020)
- 943266c: Update spam/tutorial.md (bkmgit, Aug 28, 2020)
- ea562e7: Update spam/tutorial.md (bkmgit, Aug 28, 2020)
- 1586e9b: Update spam/tutorial.md (bkmgit, Aug 28, 2020)
- 0c6ba9c: Update spam/tutorial.md (bkmgit, Sep 17, 2020)
- 11a07b7: Update spam/tutorial.md (bkmgit, Sep 17, 2020)
- d7c4891: Update spam/tutorial.md (bkmgit, Sep 17, 2020)
- 1ed293c: Update spam/tutorial.md (bkmgit, Sep 17, 2020)
- b4df918: Update spam/tutorial.md (bkmgit, Sep 17, 2020)
- 1e57f69: Update spam/tutorial.md (bkmgit, Sep 17, 2020)
- da82a46: Update spam/tutorial.md (bkmgit, Sep 17, 2020)
- 7e5de02: Update tutorial.md (bkmgit, Sep 17, 2020)
- 3884cfc: add tutorial as a bash script (bkmgit, Sep 17, 2020)
- a2ce18a: Merge branch 'mlpack:master' into master (bkmgit, May 23, 2021)
- 5c0ff28: update example lists in README (bkmgit, May 23, 2021)
- 4cd27f5: Merge branch 'master' of https://github.com/bkmgit/examples-1 (bkmgit, May 23, 2021)
- 486eab5: minor update of spam classification tutorial (bkmgit, May 23, 2021)
- 3dba2c7: update to download spam dataset (bkmgit, May 23, 2021)
- 9013136: Update spam/spam_classification.sh (bkmgit, Jul 16, 2021)
- e7b27b2: Update spam/spam_classification.sh (bkmgit, Jul 16, 2021)
- 6f4e7f3: Update spam/spam_classification.sh (bkmgit, Jul 16, 2021)
- 5b1205d: Update spam/spam_classification.sh (bkmgit, Jul 16, 2021)
- 62e1044: Update spam/tutorial.md (bkmgit, Jul 16, 2021)
- 710c017: Update spam/spam_classification.sh (bkmgit, Jul 16, 2021)
- 19f3803: fix conflict (bkmgit, Jul 16, 2021)
- 6cc39bd: Merge branch 'mlpack-master' (bkmgit, Jul 16, 2021)
- c495d4b: remove tutorial.sh (bkmgit, Jul 16, 2021)
- fd0d1e4: Merge branch 'mlpack:master' into master (bkmgit, Jul 30, 2021)
- a1219a1: Test whether SPAM example runs (bkmgit, Jul 30, 2021)
- a38bce1: implement @zoqs' suggestion (bkmgit, Jul 30, 2021)
- 8132776: Improve comment formatting (bkmgit, Jul 30, 2021)
- 0677b0a: Update version of Ubuntu (bkmgit, Jul 30, 2021)
- a58a6f4: check if build will work without build script (bkmgit, Jul 30, 2021)
- 9d9c401: update script permissions (bkmgit, Jul 30, 2021)
- 72460e6: remove spam pre-processing in CI (bkmgit, Aug 1, 2021)
- 745e953: fix error in ordering of commands (bkmgit, Aug 1, 2021)
- fef1858: enable building of command line executables (bkmgit, Aug 1, 2021)
- 4285f0e: remove example builds due to time constraint (bkmgit, Aug 1, 2021)
- a518cb2: temporarily disable dataset download, travis (bkmgit, Aug 1, 2021)
- 2db8885: update data files (bkmgit, Aug 1, 2021)
- 1ced319: Merge branch 'master' into master (bkmgit, Nov 17, 2021)
- 32fec35: remove file as it can be pre-processed (bkmgit, Nov 17, 2021)
- cbd31c3: remove file as it can be pre-processed (bkmgit, Nov 17, 2021)
- f483f4e: skip processing of spam (bkmgit, Nov 17, 2021)
7 changes: 5 additions & 2 deletions .travis.yml
@@ -1,5 +1,5 @@
sudo: required
-dist: xenial
+dist: focal
language: cpp

before_install:
@@ -107,7 +107,10 @@ script:
cat faillog;
exit 1;
fi

# Make sure the spam tutorial runs correctly.
- cd spam/
- ./spam_classification.sh
- cd ../

notifications:
email:
3 changes: 3 additions & 0 deletions README.md
@@ -61,6 +61,9 @@ description (just a little bit more than the title):
- `lstm_stock_prediction`: predict Google's historical stock price (daily high
_and_ low) using an LSTM-based recurrent neural network

- `spam`: predict whether a mobile phone text message in Indonesian is spam
or not using logistic regression

- `mnist_batch_norm`: use batch normalization in a simple feedforward neural
network to recognize the MNIST digits

322 changes: 322 additions & 0 deletions spam/spam_classification.sh
@@ -0,0 +1,322 @@
#!/usr/bin/env bash
# Spam Classification with mlpack on the command line

## Introduction

: <<'COMMENT'
In this tutorial, the mlpack command line interface will
be used to train a machine learning model to classify
SMS spam. It is assumed that mlpack has been
successfully installed on your machine. The tutorial has
been tested in a Linux environment. It is written in
Bash - https://www.gnu.org/software/bash/

Download the dataset using the `download_data_set.py` python script in the
tools directory.

Then run this example script using

bash spam_classification.sh

If you are using a low-powered computer, you may want to run
the script using

nice -20 bash spam_classification.sh

so that it runs at a lower priority and interferes less with
other work on the machine.
COMMENT
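Before running the pipeline, it can help to confirm that the mlpack command-line tools used later in this script are actually on the PATH. This is only a sketch of such a check, not part of the original tutorial:

```shell
# Sketch: report whether the mlpack CLI tools used below are installed.
report=""
for tool in mlpack_preprocess_split mlpack_logistic_regression; do
  if command -v "$tool" >/dev/null 2>&1; then
    status="found"
  else
    status="missing (install mlpack first)"
  fi
  report="$report$tool: $status"$'\n'
done
printf '%s' "$report"
```

If either tool reports missing, install mlpack before continuing.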

## Example

: <<'COMMENT'
As an example, we will train a machine learning model to classify
spam SMS messages. We will use an example spam dataset in Indonesian
provided by Yudi Wibisono.

We will try to classify a message as spam or ham based on the number
of occurrences of each word in the message. We first change the file
line endings, remove line 243, which is missing a label, and then
remove the header from the dataset. Then we split our data into two
files, labels and messages. Since the label is at the end of each
message, the message is reversed, the label is removed and placed in
one file, and the remaining message is placed in another file.
COMMENT

tr '\r' '\n' < ../data/dataset_sms_spam_bhs_indonesia_v1/dataset_sms_spam_v1.csv > dataset.txt
Review comment (Member): Is this necessary on all systems? I found that the output I got here for dataset.txt when I ran locally was like this:

$ head dataset.txt 
Teks,label

"[PROMO] Beli paket Flash mulai 1GB di MY TELKOMSEL APP dpt EXTRA kuota 2GB 4G LTE dan EXTRA nelpon hingga 100mnt/1hr. Buruan, cek  di tsel.me/mytsel1 S&K",2

2.5 GB/30 hari hanya Rp 35 Ribu Spesial buat Anda yang terpilih. Aktifkan sekarang juga di *550*905#. Promo sd 30 Nov 2015.Buruan aktifkan sekarang. S&K,2

"2016-07-08 11:47:11.Plg Yth, sisa kuota Flash Anda 478KB. Download MyTelkomsel apps di http://tsel.me/tsel utk cek kuota&beli paket Flash atau hub *363#",2

"2016-08-07 11:29:47.Plg Yth, sisa kuota Flash Anda 7160KB. Download MyTelkomsel apps di http://tsel.me/tsel utk cek kuota&beli paket Flash atau hub *363#",2

What's the goal of the line? Maybe we can use dos2unix or something instead?

Reply (Contributor, author): Added sed '/^$/d' dataset2.csv > dataset.csv to remove the extra lines.

Reply (Contributor, author): Regenerating data files. mlpack_preprocess_split fails with an error if one of the labels is a .

Reply (Member): Yeah, I encountered the same thing. I see that labels.csv has one line that's just a ., which can't be parsed:

$ cat labels.csv  | sort | uniq -c
      1 .
    569 0
    574 1

Maybe there is a bug or an extra case that needs to be handled in the preprocessing script?

Reply (Contributor, author): Fixed this; incorrect lines were joined. Checking build.

Reply (Member): Thanks! :) I'm checking to see if it works on my system too. 👍

sed '243d' dataset.txt > dataset1.csv
sed '1d' dataset1.csv > dataset.csv
rev dataset.csv | cut -c1 | rev > labels.txt
rev dataset.csv | cut -c2- | rev > messages.txt
rm dataset.csv
rm dataset1.csv
rm dataset.txt
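It is worth checking that the label and message files produced by the rev/cut split stay line-for-line aligned. The sketch below demonstrates the same split on a tiny hypothetical two-line sample (the file names and contents are made up for illustration):

```shell
# Demo of the rev/cut split on a hypothetical two-line labeled file.
printf '"pesan pertama",1\n"pesan kedua",0\n' > demo.csv
rev demo.csv | cut -c1 | rev > demo_labels.txt   # last character of each line
rev demo.csv | cut -c2- | rev > demo_msgs.txt    # everything before it
demo_label_lines=$(wc -l < demo_labels.txt)
demo_msg_lines=$(wc -l < demo_msgs.txt)
demo_labels=$(tr '\n' ' ' < demo_labels.txt)
echo "labels: $demo_labels(lines: $demo_label_lines/$demo_msg_lines)"
rm demo.csv demo_labels.txt demo_msgs.txt
```

Note that, as in the script above, the message file keeps the trailing comma; the two line counts should always match.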

: <<'COMMENT'
Machine learning works on numeric data, so we will use labels of
0 for ham and 1 for spam. The dataset contains three labels: 0,
normal SMS (ham); 1, fraud (spam); and 2, promotion (spam). We will
label all spam as 1, so promotions and fraud will both be labelled 1.
COMMENT

tr '2' '1' < labels.txt > labels.csv
rm labels.txt
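After the merge, only the labels 0 and 1 should remain. A quick way to confirm this is to list the distinct values; the sketch below does so on a hypothetical five-line label file rather than the real one:

```shell
# Demo: merge label 2 (promotion) into 1 (spam) and list distinct labels.
printf '0\n1\n2\n1\n0\n' > demo_labels.txt
tr '2' '1' < demo_labels.txt > demo_labels.csv
demo_distinct=$(sort -u demo_labels.csv | tr '\n' ' ')
echo "distinct labels: $demo_distinct"
rm demo_labels.txt demo_labels.csv
```

Anything other than 0 and 1 in the output (such as the stray "." discussed in the review comments above) indicates a malformed line in the dataset.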

: <<'COMMENT'
The next step is to convert all text in the messages to lower case
and, for simplicity, remove punctuation and any symbols that are not
spaces, line endings or in the range a-z (one would need to expand
this range of symbols for production use).
COMMENT

tr '[:upper:]' '[:lower:]' < messages.txt > messagesLower.txt
tr -Cd 'abcdefghijklmnopqrstuvwxyz \n' < messagesLower.txt > messagesLetters.txt
rm messagesLower.txt

: <<'COMMENT'
We now obtain a sorted list of unique words used (this step may take
a few minutes).
COMMENT

xargs -n1 < messagesLetters.txt > temp.txt
sort temp.txt > temp2.txt
uniq temp2.txt > words.txt
rm temp.txt
rm temp2.txt
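The same word list can also be built in a single pipeline, which tends to be faster than xargs on large inputs. This is a sketch on hypothetical sample messages, not a replacement for the commands above:

```shell
# Demo: build a sorted list of unique words in one pipeline.
printf 'beli paket promo\npromo beli gratis\n' > demo_messages.txt
demo_wordlist=$(tr -s ' ' '\n' < demo_messages.txt | sort -u)
echo "$demo_wordlist"
rm demo_messages.txt
```

Here tr turns each run of spaces into a newline, and sort -u combines the sort and uniq steps.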

: <<'COMMENT'
We then create a matrix, where for each message, the frequency of word
occurrences is counted (more on this on Wikipedia,
https://en.wikipedia.org/wiki/Tf–idf and
https://en.wikipedia.org/wiki/Document-term_matrix ).
COMMENT


declare -a words=()
declare -a labels=()
declare -a letterstartind=()
declare -a letterstart=()
letter=" "
i=0
lettercount=0
# Read the labels into an array.
while IFS= read -r line; do
  labels[$((i))]=$line
  let "i++"
done < labels.csv
i=0
# Read the sorted word list, recording the index at which each new
# starting letter first appears, so that later lookups only need to
# scan one letter's section of the list.
while IFS= read -r line; do
  words[$((i))]=$line
  firstletter="$( echo "$line" | head -c 1 )"
  if [ "$firstletter" != "$letter" ]
  then
    letterstartind[$((lettercount))]=$((i))
    letterstart[$((lettercount))]=$firstletter
    letter=$firstletter
    let "lettercount++"
  fi
  let "i++"
done < words.txt
letterstartind[$((lettercount))]=$((i))
echo "Created list of letters"

# (Re)create the output file for the word-frequency matrix.
rm -f wordfrequency.txt
touch wordfrequency.txt
messagecount=0
messagenum=0
messages="$( wc -l < messages.txt )"
i=0
while IFS= read -r line; do
  let "messagenum++"
  declare -a wordcount=()
  declare -a wordarray=()
  read -r -a wordarray <<< "$line"
  let "messagecount++"
  # Use a separate variable for the message length so that the
  # vocabulary array 'words' is not overwritten.
  numwords=${#wordarray[@]}
  for word in "${wordarray[@]}"; do
    startletter="$( echo "$word" | head -c 1 )"
    # Find the section of the sorted word list beginning with this letter.
    j=-1
    while [ $((j)) -lt $((lettercount)) ]; do
      let "j++"
      if [ "$startletter" == "${letterstart[$((j))]}" ]
      then
        mystart=$((j))
      fi
    done
    myend=$((mystart + 1))
    j=${letterstartind[$((mystart))]}
    jend=${letterstartind[$((myend))]}
    # Count occurrences of this word within its letter's section.
    while [ $((j)) -le $((jend)) ]; do
      wordcount[$((j))]=${wordcount[$((j))]:-0}   # keep counts set by earlier words
      if [ "$word" == "${words[$((j))]}" ]
      then
        wordcount[$((j))]="$( echo "$line" | grep -o "$word" | wc -l )"
      fi
      let "j++"
    done
  done
  # Normalize the counts by the number of words in the message.
  for j in "${!wordcount[@]}"; do
    wordcount[$((j))]=$(echo " scale=4; ${wordcount[$((j))]} / $((numwords))" | bc)
  done
  wordcount[$((numwords + 1))]=$((numwords))
  echo "${wordcount[*]}" >> wordfrequency.txt
  echo "Processed message ""$messagenum"
  let "i++"
done < messagesLetters.txt

# Create csv file
tr ' ' ',' < wordfrequency.txt > data.csv

: <<'COMMENT'
Since Bash is an interpreted language, this simple implementation can
take up to 30 minutes to complete.
COMMENT
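If the Bash loop above is too slow, roughly the same per-message relative word frequencies can be computed in a single awk pass. This is only a sketch on hypothetical demo files (the real script would pass words.txt and messagesLetters.txt), not a drop-in replacement:

```shell
# Demo: per-message relative word frequencies with awk.
printf 'beli\ngratis\npromo\n' > demo_words.txt            # hypothetical vocabulary
printf 'beli promo promo\ngratis beli\n' > demo_msgs.txt   # hypothetical messages
demo_freq=$(awk '
  NR == FNR { vocab[$1] = ++n; next }   # first file: number each vocabulary word
  {
    split("", count)                    # reset per-message counts
    for (i = 1; i <= NF; i++) if ($i in vocab) count[vocab[$i]]++
    line = ""
    for (j = 1; j <= n; j++)
      line = line (j > 1 ? "," : "") ((j in count) ? count[j] / NF : 0)
    print line
  }' demo_words.txt demo_msgs.txt)
echo "$demo_freq"
rm demo_words.txt demo_msgs.txt
```

Each output row is one message; each column is a vocabulary word's count divided by the message length, already comma-separated for mlpack.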

: <<'COMMENT'
Once the script has finished running, split the data into testing (30%)
and training (70%) sets:
COMMENT

mlpack_preprocess_split \
--input_file data.csv \
--input_labels_file labels.csv \
--training_file train.data.csv \
--training_labels_file train.labels.csv \
--test_file test.data.csv \
--test_labels_file test.labels.csv \
--test_ratio 0.3 \
--verbose

: <<'COMMENT'
Now train a Logistic regression model
(https://mlpack.org/doc/stable/cli_documentation.html#logistic_regression):
COMMENT

mlpack_logistic_regression --training_file train.data.csv \
--labels_file train.labels.csv \
--lambda 0.1 \
--output_model_file lr_model.bin

: <<'COMMENT'
Finally we test our model by producing predictions,
COMMENT

mlpack_logistic_regression --input_model_file lr_model.bin \
--test_file test.data.csv \
--output_file lr_predictions.csv

: <<'COMMENT'
and comparing the predictions with the known labels,
COMMENT

incorrect=$(diff -U 0 lr_predictions.csv test.labels.csv | grep -c '^@@')
tests=$(wc -l < lr_predictions.csv)
echo "scale=2; 100 * ( 1 - $incorrect / $tests )" | bc
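Note that diff -U 0 counts mismatching hunks, so adjacent misclassified lines collapse into a single hunk. An exact line-by-line accuracy can be sketched with paste and awk; the files here are hypothetical demo data, not the real predictions:

```shell
# Demo: exact line-by-line accuracy on hypothetical predictions/labels.
printf '1\n0\n1\n1\n' > demo_pred.csv
printf '1\n0\n0\n1\n' > demo_true.csv
demo_acc=$(paste -d, demo_pred.csv demo_true.csv |
  awk -F, '{ total++; if ($1 == $2) correct++ }
           END { printf "%.1f", 100 * correct / total }')
echo "accuracy: $demo_acc%"
rm demo_pred.csv demo_true.csv
```

With the real files, replace the demo names with lr_predictions.csv and test.labels.csv.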

: <<'COMMENT'
This gives approximately 90% accuracy, similar to that
obtained at
https://towardsdatascience.com/spam-detection-with-logistic-regression-23e3709e522

The dataset is composed of approximately 50% spam messages,
so this accuracy is quite good without much parameter tuning.
In typical cases, datasets are unbalanced, with many more entries
in some categories than in others. In such cases a good overall
accuracy can be obtained even while mispredicting the classes with
few entries. To better evaluate these models, one can compare the
number of misclassifications of spam with the number of
misclassifications of ham. Of particular importance in applications
is the number of false positives (ham classified as spam), as such
messages are typically not delivered. The next portion of the script
creates a confusion matrix.
COMMENT


declare -a labels
declare -a lr
i=0
while IFS= read -r line; do
labels[i]=$line
let "i++"
done < test.labels.csv
i=0
while IFS= read -r line; do
lr[i]=$line
let "i++"
done < lr_predictions.csv
TruePositiveLR=0
FalsePositiveLR=0
TrueZeroLR=0
FalseZeroLR=0
Positive=0
Zero=0
for i in "${!labels[@]}"; do
if [ "${labels[$i]}" == "1" ]
then
let "Positive++"
if [ "${lr[$i]}" == "1" ]
then
let "TruePositiveLR++"
else
let "FalseZeroLR++"
fi
fi
if [ "${labels[$i]}" == "0" ]
then
let "Zero++"
if [ "${lr[$i]}" == "0" ]
then
let "TrueZeroLR++"
else
let "FalsePositiveLR++"
fi
fi

done
echo "Logistic Regression"
echo "Total spam" $Positive
echo "Total ham" $Zero
echo "Confusion matrix"
echo "                 Predicted class"
echo "                  Ham | Spam"
echo "                ---------------"
echo " Actual | Ham  |" $TrueZeroLR " |" $FalsePositiveLR
echo " class  | Spam |" $FalseZeroLR "  |" $TruePositiveLR
echo ""

: <<'COMMENT'
You should get output similar to

Logistic Regression
Total spam 183
Total ham 159
Confusion matrix
                 Predicted class
                  Ham | Spam
                ---------------
 Actual | Ham  | 128  | 31
 class  | Spam | 26   | 157

which indicates a reasonable level of classification.
Other methods you can try in mlpack for this problem include:
* Naive Bayes
https://mlpack.org/doc/stable/cli_documentation.html#nbc
* Random forest
https://mlpack.org/doc/stable/cli_documentation.html#random_forest
* Decision tree
https://mlpack.org/doc/stable/cli_documentation.html#decision_tree
* AdaBoost
https://mlpack.org/doc/stable/cli_documentation.html#adaboost
* Perceptron
https://mlpack.org/doc/stable/cli_documentation.html#perceptron

To improve the error rate, you can try other pre-processing methods
on the initial dataset. Neural networks can give up to 99.95%
validation accuracy; see for example:

https://thesai.org/Downloads/Volume11No1/Paper_67-The_Impact_of_Deep_Learning_Techniques.pdf
https://www.kaggle.com/kredy10/simple-lstm-for-text-classification
https://www.kaggle.com/xiu0714/sms-spam-detection-bert-acc-0-993

However, using these techniques with mlpack is best covered in another tutorial.

This tutorial is an adaptation of one that first appeared in the Fedora Magazine
https://fedoramagazine.org/spam-classification-with-ml-pack/
COMMENT