GitHub - lifelongeeek/retrieved_collection_compression_densephrase: Compress retrieved documents collection with densephrase model

Compress retrieved collection with DensePhrase

This repository compresses a retrieved collection so that it can be used as the prompt for a retrieval-augmented language model.

Procedure

step 1. Clone repository and update submodules

git clone https://github.com/nota-github/retrieved_collection_compression_densephrase
cd retrieved_collection_compression_densephrase
git submodule update --init --recursive --remote

step 2. Set up docker environment

# in host
docker pull notadockerhub/collection_compression_densephrase:latest
docker run -v /path/to/parent_of_repository:/root --workdir /root --name {container_name} --shm-size=2gb -it --gpus GPU_INDICES notadockerhub/collection_compression_densephrase

# in container
cd retrieved_collection_compression_densephrase/DensePhrases
pip install -e . # editable mode install

step 3. Set up paths and variables

cd /root/retrieved_collection_compression_densephrase
./config.sh
source ~/.bashrc
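
Before moving on, it can help to confirm that the variables written by config.sh are visible in the current shell. The helper below is a minimal sketch, not part of this repository: `missing_env_vars` is a hypothetical name, and `SAVE_DIR` is the only variable checked because it is the only one this README references. Add whatever else config.sh exports in your checkout.

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

if __name__ == "__main__":
    # SAVE_DIR appears later in this README (step 7); the list is an assumption
    # and should be extended with the variables config.sh actually exports.
    missing = missing_env_vars(["SAVE_DIR"])
    if missing:
        print("Not set: " + ", ".join(missing))
    else:
        print("All checked variables are set.")
```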

step 4. Download prepared resources: data

cd DensePhrases
./download.sh
# download `data`, `index`, `wiki` with this script
cd ../

data: preprocessed datasets
- this project uses the open-domain QA (open-qa) data only
index: pre-built index of Wikipedia
- the passage encoder is not re-trained
wiki: pre-processed raw data used to build the index
The pre-trained query encoder is downloaded from the Hugging Face model hub.

step 5. Retrieve relevant sentences with varying #retrieve

fixed setting
- retrieval unit: sentence
  - other retrieval granularities (document, paragraph, phrase) are not allowed
- topK = 200
- test query, collection

python retrieve.py --query_encoder_name_or_dir princeton-nlp/densephrases-multi-query-multi --runfile_name run.tsv

output: runfile
assignment: modify the inference logic to improve the evaluation metric (mAR)
- modifiable parts
  - DensePhrases/densephrases/index.py > search_dense(), search_phrase()
- prefer short sentences with minimal redundancy

retrieved sentences example

Query: Where are mucosal associated lymphoid tissues present in the human body and why?
Answers: [oral passage, salivary glands, gastrointestinal tract, breast, skin, thyroid, lung, nasopharyngeal tract, eye]
Retrieved "sentences" by DensePhrases:
- 'In the gastrointestinal tract, the term "mucosa" or "mucous membrane" refers to the combination of epithelium, lamina propria, and (where it occurs) muscularis mucosae.'
- 'Another type of relatively undifferentiated connective tissue is mucous connective tissue, found inside the umbilical cord.'
- 'Lymph nodes or "glands" or "nodes" or "lymphoid tissue" are nodular bodies located throughout the body but clustering in certain areas such as the armpit, back of the neck and the groin.'
- 'The mucosa-associated lymphoid tissue (MALT), also called mucosa-associated lymphatic tissue, is a diffuse system of small concentrations of lymphoid tissue found in various submucosal membrane sites of the body, such as the gastrointestinal tract, oral passage, nasopharyngeal tract, thyroid, breast, lung, salivary glands, eye, and skin.'
- ...
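
One way to act on the "short sentences with minimal redundancy" preference is a greedy post-filter over the ranked sentences. The sketch below is illustrative only and does not touch search_dense() or search_phrase(): the function names, the Jaccard token-overlap measure, and the `max_overlap` and `max_tokens` thresholds are all assumptions chosen for the example, not DensePhrases options.

```python
import re

def tokenize(sentence):
    """Distinct lowercase word tokens; a simple stand-in for a real tokenizer."""
    return set(re.findall(r"\w+", sentence.lower()))

def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_sentences(ranked, max_overlap=0.6, max_tokens=40):
    """Greedily keep sentences in rank order, skipping long or near-duplicate ones.

    ranked: list of (sentence, score) pairs, best first.
    max_overlap and max_tokens are hypothetical knobs for this sketch.
    """
    kept, kept_tokens = [], []
    for sentence, score in ranked:
        tokens = tokenize(sentence)
        if len(tokens) > max_tokens:
            continue  # prefer short sentences
        if any(jaccard(tokens, seen) > max_overlap for seen in kept_tokens):
            continue  # too similar to a sentence that was already kept
        kept.append((sentence, score))
        kept_tokens.append(tokens)
    return kept
```

Because the filter runs in rank order, a higher-scored sentence always wins over a later near-duplicate. Whether this helps mAR depends on how eval.py counts answers, so treat the thresholds as values to tune.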

step 6. Calculate mean average recall (mAR)

python eval.py --runfile_name run.tsv

output: mAR
baseline result
- retrieval_unit = sentence: mAR = 64.57 (the baseline to start from)
- retrieval_unit = paragraph: mAR = 59.72
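
The exact definition used by eval.py is not given in this README. As a rough mental model only, the sketch below assumes that recall for one query is the fraction of gold answers found in the top-k retrieved sentences, that "average" means averaging this recall over several cutoffs k (the "varying #retrieve" of step 5), and that "mean" means averaging over queries. The function names, the substring matching, and the default cutoffs are assumptions for illustration.

```python
def recall_at_k(sentences, answers, k):
    """Fraction of gold answers that appear in at least one of the top-k sentences."""
    if not answers:
        return 0.0
    top = [s.lower() for s in sentences[:k]]
    hits = sum(1 for a in answers if any(a.lower() in s for s in top))
    return hits / len(answers)

def mean_average_recall(runs, cutoffs=(1, 5, 10)):
    """Average recall over cutoffs per query, then mean over queries, as a percentage.

    runs: list of (ranked_sentences, gold_answers) pairs, one per query.
    cutoffs: assumed values of #retrieve; the ones used by eval.py may differ.
    """
    per_query = []
    for sentences, answers in runs:
        recalls = [recall_at_k(sentences, answers, k) for k in cutoffs]
        per_query.append(sum(recalls) / len(recalls))
    return 100.0 * sum(per_query) / len(per_query) if per_query else 0.0
```

Under this reading, a sentence that covers several answers at a low rank raises recall at every cutoff, which is consistent with the sentence unit scoring higher than the paragraph unit in the baseline.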

step 7. Query-side fine-tuning

make train-query MODEL_NAME=NEW_MODEL_SAVE_DIR DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-20181220/dump/ LOAD_DIR_OR_PRETRAINED_HF_NAME=princeton-nlp/densephrases-multi-query-nq

assignment: adapt DensePhrases to a sentence-like retrieval unit
- modifiable parts
  - DensePhrases/train_query.py > get_top_phrase(), annotate_phrase_vecs()
  - DensePhrases/densephrases/encoder.py > train_query()
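
Query-side fine-tuning in DensePhrases updates only the query encoder so that, among the top-k retrieved candidates, more score mass falls on the correct ones. The sketch below shows that objective for sentence-level candidates using plain Python. It is not the code in train_query.py or encoder.py: `mark_targets` and `marginal_nll` are hypothetical names, and marking a sentence as a target when it contains a gold answer string is an assumption about how the objective could be adapted to sentences.

```python
import math

def mark_targets(sentences, answers):
    """1 if a retrieved sentence contains any gold answer string, else 0."""
    lowered = [a.lower() for a in answers]
    return [int(any(a in s.lower() for a in lowered)) for s in sentences]

def marginal_nll(scores, targets):
    """Negative log of the softmax mass assigned to target candidates.

    scores: query-candidate similarity scores for the top-k candidates.
    Returns None when no candidate is a target, so the caller can skip the query.
    """
    if not any(targets):
        return None
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    target_mass = sum(e for e, t in zip(exps, targets) if t)
    return -math.log(target_mass / sum(exps))
```

In a real training loop the scores would be differentiable tensors produced by the query encoder, and the loss would be backpropagated through that encoder only, leaving the pre-built index unchanged.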

Acknowledgement

Most of the code comes from princeton-nlp/DensePhrases, which is included as a submodule of this repository.

