Chi-Squared Test

This project provides a text feature selection method based on the chi-squared test.
The script can run in stand-alone mode or in cluster mode via Hadoop Streaming.
https://github.com/kn45/Chi-Square

|  | cat | non-cat | sum over cats |
| --- | --- | --- | --- |
| with word | A[] | B[] | A+B |
| without word | C[] | D[] | C+D |
| sum | A+C[] | B+D[] | N |

$\chi^2 = \frac{N(AD-BC)^2}{(A+C)(A+B)(B+D)(C+D)}$
$\chi^2 \propto \frac{(AD-BC)^2}{(A+B)(C+D)}$ (abbreviated form for ranking words within one category, where N, A+C, and B+D are the same for every word)
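
The two formulas above can be sketched in Python as follows. This is a minimal illustration, not code taken from mapred_chi2.py; the function names `chi2` and `chi2_in_cat` are hypothetical, and returning 0.0 for a table with a zero marginal is an assumption made here to avoid division by zero.

```python
def chi2(a, b, c, d):
    """Full chi-squared statistic for a 2x2 word/category table."""
    n = a + b + c + d
    denom = (a + c) * (a + b) * (b + d) * (c + d)
    if denom == 0:
        # Assumption: a degenerate table (some marginal is zero) scores 0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def chi2_in_cat(a, b, c, d):
    """Abbreviated score: drops the factors that are constant within a category."""
    denom = (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return (a * d - b * c) ** 2 / denom
```

The two scores differ only by the factor N / ((A+C)(B+D)), so within a single category they rank words identically.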

Input Format

cat[TAB]segments
cat is the class label as a string, and segments are the space-separated words of a passage.
e.g.:
sport[TAB]well done MSN congrats to Barcelona
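
As a rough illustration of this format, one record could be parsed as below. The helper name `parse_input_line` is hypothetical, and the sketch assumes that [TAB] stands for a literal tab character.

```python
def parse_input_line(line):
    """Split one 'cat<TAB>segments' record into (cat, list of words)."""
    # Only the first tab separates the label from the passage text.
    cat, _, segments = line.rstrip("\n").partition("\t")
    return cat, segments.split()


cat, words = parse_input_line("sport\twell done MSN congrats to Barcelona\n")
```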

Output Format

cat[TAB]word[TAB]chi2_value[TAB]A[TAB]B[TAB]C[TAB]D[TAB]pos
pos indicates whether the word is positively (1) or negatively (-1) correlated with the category.
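
A downstream consumer might read one output line like this. The sketch assumes that A, B, C, D and pos are written as integers and chi2_value as a decimal number; the names `Chi2Record` and `parse_output_line` and the sample values are made up for illustration.

```python
from collections import namedtuple

# Field order follows the output format described above.
Chi2Record = namedtuple("Chi2Record", "cat word chi2 a b c d pos")


def parse_output_line(line):
    """Parse one tab-separated output record into a Chi2Record."""
    cat, word, chi2_value, a, b, c, d, pos = line.rstrip("\n").split("\t")
    return Chi2Record(cat, word, float(chi2_value),
                      int(a), int(b), int(c), int(d), int(pos))


# Illustrative line only; these numbers are not real output.
record = parse_output_line("sport\tBarcelona\t12.5\t30\t5\t70\t95\t1\n")
```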

Dict Format

A file that records the pre-computed number of passages in each category, in the format:
cat[TAB]count
e.g.:
fashion[TAB]347882
sport[TAB]2443297
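
A file in this format could be loaded into a dictionary roughly as follows. `load_cat_counts` is a hypothetical helper, not part of the project, and it again assumes that [TAB] is a literal tab. It accepts any iterable of lines, so an open file object works as well as a list.

```python
def load_cat_counts(lines):
    """Build {cat: passage count} from 'cat<TAB>count' lines."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        cat, count = line.split("\t")
        counts[cat] = int(count)
    return counts


counts = load_cat_counts(["fashion\t347882\n", "sport\t2443297\n"])
total = sum(counts.values())  # N in the contingency table
```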

Usage

stand-alone:

cat input_passage.tst | ./mapred_chi2.py m | sort | ./mapred_chi2.py r passage_cnt_file > output_chi2.tst
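
The command above chains a map step, a sort, and a reduce step that also reads the passage count file. The sketch below is not the code of mapred_chi2.py; it is a small in-memory illustration of how A, B, C, and D could be derived from the two inputs. It assumes that a word is counted once per passage, that pos is the sign of AD-BC, and that only words seen at least once in a category get a row. The function name `chi2_table` is hypothetical.

```python
from collections import Counter


def chi2_table(records, cat_counts):
    """records: iterable of (cat, words); cat_counts: {cat: passage count}."""
    n = sum(cat_counts.values())
    a_counts = Counter()  # passages of `cat` that contain `word` (A)
    word_df = Counter()   # passages of any category that contain `word` (A+B)
    for cat, words in records:
        for word in set(words):  # assumption: count each word once per passage
            a_counts[(cat, word)] += 1
            word_df[word] += 1

    rows = []
    for (cat, word), a in sorted(a_counts.items()):
        b = word_df[word] - a      # other categories, with word
        c = cat_counts[cat] - a    # this category, without word
        d = n - a - b - c          # other categories, without word
        denom = (a + c) * (a + b) * (b + d) * (c + d)
        score = n * (a * d - b * c) ** 2 / denom if denom else 0.0
        pos = 1 if a * d - b * c >= 0 else -1  # assumption: sign of AD-BC
        rows.append((cat, word, score, a, b, c, d, pos))
    return rows


records = [
    ("sport", ["goal", "win"]),
    ("sport", ["goal"]),
    ("fashion", ["dress", "win"]),
    ("fashion", ["dress"]),
]
rows = chi2_table(records, {"sport": 2, "fashion": 2})
```

Here the category counts play the role of the passage count file: they supply A+C for each category and N overall, so the per-word pass only has to count A and A+B.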

cluster:

Refer to run_chi2_uni.sh

