SCODE Word Embeddings using Substitute Words

This repository provides a tool to induce word vectors using substitute words. Using this repository you can:

Generate word-type embeddings described in [1,3,4]
Generate word-token embeddings described in [2,5]

[1]. Learning Syntactic Categories Using Paradigmatic Representations of Word Context, In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing (EMNLP-CONLL 2012), Jeju, Korea, July. Association for Computational Linguistics, Paper, Presentation & Code, bib.

[2]. Word Context and Token Representations from Paradigmatic Relations and Their Application to Part-of-Speech Induction, Paper & Presentation

[3]. The AI-KU System at the SPMRL 2013 Shared Task: Unsupervised Features for Dependency Parsing, In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pp. 78-85, Seattle, Washington, USA, October. Association for Computational Linguistics, Paper, Word Embeddings, bib.

[4]. Substitute Based SCODE Word Embeddings in Supervised NLP Tasks, Word Vectors For 7 Languages, bib.

Other Word Vectors

Here is a list of other word vectors:

Turian, Word representations: A simple and general method for semi-supervised learning
Dhillon, Multi-View Learning of Word Embeddings via CCA
Mikolov, Recurrent neural network based language model
Mikolov, Efficient Estimation of Word Representations in Vector Space
Huang, Improving Word Representations Via Global Context And Multiple Word Prototypes
Stratos, A Spectral Algorithm for Learning Class-Based n-gram Models of Natural Language
Yogatama, Learning Word Representations with Hierarchical Sparse Coding
Faruqui, Improving Vector Space Word Representations Using Multilingual Correlation
Lebret, Word Embeddings through Hellinger PCA

Semi-Supervised Word Vectors

Below is a list of word vectors that use dependency pairs to induce representations.

Murphy, Learning Effective and Interpretable Semantic Models using Non-Negative Sparse Embedding
Levy, Dependency-Based Word Embeddings
Bansal, Tailoring Continuous Word Representations for Dependency Parsing

0 - Setting-Up The Environment

After cloning the repository, first initialize the binary directory. Simply go to the run/ directory and run this command:

 make bin

This clones the repositories needed for a successful run and installs them. It also copies the script files into the bin/ directory.

Then, to generate word substitutes, you need a language model (LM) in SRILM format. Install SRILM and set the SRILM_PATH variable in the Makefile.

The following one-liner produces word vectors; just change the parts in capital letters accordingly:

zcat ../data/YOUR_CORPUS.tok.gz | fastsubs-omp -n 100  -p 1.0 YOUR_LM.lm.gz | grep -v '^</s>' | wordsub-n -s 1 -n 100  | scode-online -v2 -d NUMBER_OF_DIMENSION -s 1  | perl -ne 'print if s/^0://' | cut -f1,3- | tr '\t' ' ' | gzip > YOUR_WVEC.gz

This target is at the end of run/Makefile. The sections below describe each step in more detail.

1 - Generate Type & Token Vectors - A Shortcut:

This section shows how to generate word embeddings right away. If you want to know how they are generated, skip this section. We provide test files named mini-{train|test}.tok.gz so that you can test your setup.

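Use an LM corpus (e.g., Wikipedia) and a target corpus (e.g., a CoNLL task) to generate 25-dimension word embeddings. Note that the -n flag makes this a dry run in which make only prints the commands it would execute; drop -n to actually build the vectors: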

make -n YOUR-TARGET-CORPUS.unk-type-25.gz LMFILE=YOUR-LM-CORPUS DIM=25

For word-token vectors, you can use four different methods:

make YOUR-TARGET-CORPUS.XsubX.gz LMFILE=YOUR-LM-CORPUS DIM=25 X1=YOUR-TARGET-CORPUS.unk-type-25.gz X2=YOUR-TARGET-CORPUS.unk-type-25.gz 

make YOUR-TARGET-CORPUS.XplusX.gz LMFILE=YOUR-LM-CORPUS DIM=25 X1=YOUR-TARGET-CORPUS.unk-type-25.gz X2=YOUR-TARGET-CORPUS.unk-type-25.gz 

make YOUR-TARGET-CORPUS.XplusY.gz LMFILE=YOUR-LM-CORPUS DIM=25 X1=YOUR-TARGET-CORPUS.unk-type-25.gz X2=YOUR-TARGET-CORPUS.unk-type-25.gz 

make YOUR-TARGET-CORPUS.knn.XY.gz LMFILE=YOUR-LM-CORPUS DIM=25 X1=YOUR-TARGET-CORPUS.unk-type-25.gz X2=YOUR-TARGET-CORPUS.unk-type-25.gz

or you can combine your favorite word embeddings:

make YOUR-TARGET-CORPUS.XmixX.gz LMFILE=YOUR-LM-CORPUS DIM=25 X1=YOUR-EMBEDDINGS1.gz X2=YOUR-EMBEDDINGS2.gz

2 - How to Generate Word-Type Embeddings?

First, train an LM on a large corpus. Tokenize your corpus into YOUR-LM-CORPUS.tok, gzip it into YOUR-LM-CORPUS.tok.gz, and place the gzipped file under the data/ directory.
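As a small illustration of the gzip-and-place step, here is a sketch of a helper that compresses a tokenized file into a data directory and checks the result. The function name stage_corpus is hypothetical and is not part of the repository.

```shell
# Hypothetical helper: gzip a tokenized corpus into a data directory.
# $1: path to the tokenized text file, $2: data directory.
stage_corpus() {
    src=$1
    data_dir=$2
    mkdir -p "$data_dir" || return 1
    dest="$data_dir/$(basename "$src").gz"
    # -c writes the compressed stream to stdout and leaves the original file in place.
    gzip -c "$src" > "$dest" || return 1
    # -t checks the integrity of the compressed file before reporting its path.
    gzip -t "$dest" && printf '%s\n' "$dest"
}
```

From the run/ directory, a call such as stage_corpus YOUR-LM-CORPUS.tok ../data would then produce ../data/YOUR-LM-CORPUS.tok.gz.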

Then run:

make YOUR-LM-CORPUS.lm.gz

You may want to change LM-related variables in the Makefile, such as LM_DISCOUNT and LM_NGRAM.

Then generate substitute words. First, put your substitute (target) corpus under the data/ directory as YOUR-SUB-CORPUS.tok.gz, the same way you did for the LM corpus. Your substitute corpus and LM corpus may differ, but they must use the same tokenization. Then run:

make YOUR-SUB-CORPUS.sub.gz LM=YOUR-LM-CORPUS

Again, you may want to change word substitute options such as FS_NSUB and FS_PSUB. Note that fastsubs-omp can run in parallel, so you can set the number of threads with the OMP_NUM_THREADS variable. We observed no gain beyond 24 cores.

Now you can generate <word,substitute word> pairs from the substitute distributions. You can change the number of substitutes and the random seed through the corresponding variables (i.e., WORDSUB and SEED). Just run:

 make YOUR-SUB-CORPUS.pairs.gz
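To give a feel for what drawing pairs from a substitute distribution means, here is a rough sketch in awk. It is not the repository's wordsub-n implementation. It assumes an input format of one line per token, holding the word followed by alternating substitute and log10-probability fields, and it samples with replacement; both are assumptions made for illustration. The function name sample_pairs is hypothetical.

```shell
# Hypothetical sketch, not the repository's wordsub-n implementation.
# stdin: lines "word sub1 logp1 sub2 logp2 ..." (assumed format, log10 probabilities)
# $1: number of substitutes to sample per line, $2: random seed
sample_pairs() {
    awk -v n="$1" -v seed="$2" '
        BEGIN { srand(seed); OFS = "\t" }
        {
            total = 0
            k = 0
            for (i = 2; i + 1 <= NF; i += 2) {
                k++
                sub_word[k] = $i
                prob[k] = exp($(i + 1) * log(10))   # 10^logp
                total += prob[k]
            }
            if (k == 0) next
            for (s = 1; s <= n; s++) {
                r = rand() * total
                acc = 0
                pick = k    # fall back to the last substitute on rounding error
                for (j = 1; j <= k; j++) {
                    acc += prob[j]
                    if (r < acc) { pick = j; break }
                }
                print $1, sub_word[pick]
            }
        }
    '
}
```

Each output line is one <word,substitute word> pair, so a token with n sampled substitutes contributes n lines. Changing the seed changes which substitutes are drawn.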

If you want to generate an embedding for unknown words (you probably do) using the unknown tag *UNKNOWN*:

 make YOUR-SUB-CORPUS.unk-pairs.gz

Now you can generate word-type embeddings (one embedding per word). Let's say you want 25-dimension word embeddings with the unknown word tag. Run:

make YOUR-SUB-CORPUS.unk-type-25.gz DIM=25

Or you can generate for all words (without unknown tag):

make YOUR-SUB-CORPUS.type-25.gz DIM=25

This runs SCODE and then extracts the word-type embeddings. Note that you can change a variety of SCODE parameters.
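A quick sanity check on the result can save time later. The sketch below assumes the generated file has the same layout as the output of the one-liner in section 0, that is, one word per line followed by DIM space-separated numbers. The function name check_dims is hypothetical.

```shell
# Hypothetical helper: verify that every line of a gzipped vector file
# has one word followed by exactly DIM numbers.
# $1: gzipped vector file, $2: expected dimension.
check_dims() {
    gzip -dc "$1" | awk -v dim="$2" '
        NF != dim + 1 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}
```

For example, check_dims YOUR-SUB-CORPUS.unk-type-25.gz 25 prints nothing and returns 0 when every line has 26 fields, and reports the offending lines otherwise.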

If you are only interested in word-type embeddings, you are good to go.

3 - How to Generate Word-Token (Context-Dependent) Embeddings?

After generating word-type embeddings, you can generate context-dependent word embeddings in a couple of different ways.

Enis Sert's KNN Based Vectors(2):

First find the k nearest neighbors (k=128):

make YOUR-SUB-CORPUS.knn128.gz

As always, there are variables that you can change here, such as KNN_METRIC and the number of nearest neighbors.

Run the following command to generate 50-dimension (25+25) word-token vectors:

make YOUR-SUB-CORPUS.knn.XY.gz DIM=25

Substitute Pairs Based Vectors:

Using Substitute Word Embeddings of SCODE Sphere

You can use <word,substitute word> pairs (<X,Y>, where X is the target word and Y is the substitute word) together with the substitute word embeddings of the SCODE sphere to generate word-token embeddings.

After generating unk-pairs.gz and 25-dimension word-type embeddings, just run:

make YOUR-SUB-CORPUS.XplusY.gz DIM=25

Or you can use <word,substitute word> pairs and look up the substitute words in the embeddings of the target words:

make YOUR-SUB-CORPUS.XplusX.gz DIM=25

These will generate 25+25-dimension word-token vectors.
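The 25+25 layout can be pictured as a concatenation: the vector of the target word X followed by the vector of the substitute word Y. The sketch below illustrates this for the XplusX case, where both halves come from the same word-type vector file; for XplusY, Y would be looked up in the substitute-side embeddings instead. It is not the repository's script. The function name concat_pair_vectors is hypothetical, the vector file is assumed to hold space-separated lines of a word and its vector, and the sketch emits one line per pair without covering how multiple substitutes of the same token are combined.

```shell
# Hypothetical sketch of the 25+25 concatenation idea; not the repository's script.
# $1: gzipped vector file with lines "word v1 ... vd" (space-separated, assumed format)
# stdin: "X<TAB>Y" pairs (target word, substitute word)
concat_pair_vectors() {
    awk -v cmd="gzip -dc '$1'" '
        BEGIN {
            # Load every vector into memory, keyed by its word.
            while ((cmd | getline line) > 0) {
                split(line, f, " ")
                word = f[1]
                sub(/^[^ ]+ /, "", line)
                vec[word] = line
            }
            close(cmd)
        }
        {
            # Words without a vector fall back to the *UNKNOWN* entry.
            x = ($1 in vec) ? $1 : "*UNKNOWN*"
            y = ($2 in vec) ? $2 : "*UNKNOWN*"
            print $1, vec[x], vec[y]
        }
    '
}
```

With 25-dimension input vectors, each output line holds the target word followed by 50 numbers.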

Using Any Word Embeddings You Like

You can combine two different word embeddings using <word,substitute word> pairs to generate word-token embeddings. First generate unk-pairs.gz, then gzip your two favorite word embeddings, each of which should contain the *UNKNOWN* tag for unknown words, and run:

make YOUR-SUB-CORPUS.XmixX.gz X1=YOUR-EMBEDDING1.gz X2=YOUR-EMBEDDING2.gz

Of course you can use the same word embedding for X1 and X2.

Substitute Distribution Based Vectors:

This method generalizes the previous one by using the substitute distribution instead of <word,substitute word> pairs:

make YOUR-SUB-CORPUS.XsubX.gz X1=YOUR-EMBEDDING1.gz X2=YOUR-EMBEDDING2.gz

4 - Using Morphologic and Orthographic Features

If you want to generate word embeddings with orthographic and morphological features using Morfessor, we suggest you use a big corpus (maybe your LM corpus) and run:

make YOUR-FEAT-CORPUS.feat.gz

This generates the features. Please take a look at the Morfessor parameters in the Makefile. Then simply append "+f" to your target file name and set FEATUREFLAG=+f and FEATFILE=YOUR-FEAT-CORPUS:

make YOUR-SUB-CORPUS.unk-type+f.gz DIM=25 FEATUREFLAG=+f FEATFILE=YOUR-FEAT-CORPUS

or

make YOUR-SUB-CORPUS.XplusY+f.gz DIM=25 FEATUREFLAG=+f FEATFILE=YOUR-FEAT-CORPUS

or

make YOUR-SUB-CORPUS.knn.XY+f.gz DIM=25 FEATUREFLAG=+f FEATFILE=YOUR-FEAT-CORPUS

TODO:

Write a better README.
Word Features
