Repository for the paper "LangBridge: Multilingual Reasoning Without Multilingual Supervision".
🤔 LMs that are good at reasoning are mostly English-centric (MetaMath, Orca 2, etc.).
😃Let’s adapt them to solve multilingual tasks. BUT without using multilingual data!
LangBridge “bridges” the mT5 encoder and the target LM together while utilizing only English data. At test time, LangBridge models can solve multilingual reasoning tasks effectively.
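Conceptually, the encoder's multilingual representations are projected into the target LM's input space through a learned mapping, and the LM reasons over them directly. The snippet below is a minimal, self-contained sketch of that idea using small stand-in Hugging Face models (`google/mt5-small`, `gpt2`) and an untrained linear bridge; it illustrates the mechanism only and is not the repository's actual implementation.

```python
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer, MT5EncoderModel

enc_name = "google/mt5-small"  # stand-in for the mT5-XL encoder used in the paper
lm_name = "gpt2"               # stand-in for the target reasoning LM

enc_tok = AutoTokenizer.from_pretrained(enc_name)
lm_tok = AutoTokenizer.from_pretrained(lm_name)
encoder = MT5EncoderModel.from_pretrained(enc_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name)

# The "bridge": a projection from the encoder's hidden size to the LM's embedding size
# (in LangBridge this is trained on English data only; here it is left untrained).
bridge = nn.Linear(encoder.config.d_model, lm.config.hidden_size)

with torch.no_grad():
    enc_states = encoder(**enc_tok("2 + 2 = ?", return_tensors="pt")).last_hidden_state
    soft_prompts = bridge(enc_states)  # (1, seq_len, lm_hidden_size)
    # The LM consumes the projected states as input embeddings and generates a continuation.
    generated = lm.generate(inputs_embeds=soft_prompts, max_new_tokens=16)

print(lm_tok.decode(generated[0], skip_special_tokens=True))
```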
pip install -e .
pip install -e bigcode-evaluation-harness
pip install -e evaluation-harness
from transformers import AutoTokenizer
from langbridge import LangBridgeModel
# all of our pretrained LangBridge models use this encoder tokenizer
enc_tokenizer = AutoTokenizer.from_pretrained('kaist-ai/langbridge_encoder_tokenizer')
lm_tokenizer = AutoTokenizer.from_pretrained('kaist-ai/metamath-langbridge-9b')
model = LangBridgeModel.from_pretrained('kaist-ai/metamath-langbridge-9b').to('cuda')
metamath_template = (
"Below is an instruction that describes a task. "
"Write a response that appropriately completes the request.\n\n"
"### Instruction:\n{instruction}\n\n### Response:\n"
)
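# The question below is in Korean: "Problem: Jimmy has 2 dollars more than twice the money
# Ethel has. If Ethel has 8 dollars, how much money does Jimmy have? Answer: "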
question = "문제: Jimmy는 Ethel이 가진 돈의 두배보다 2달러가 더 많습니다. Ethel이 8달러가 있다고하면, Jimmy는 얼마를 갖고 있나요? 정답: "
prefix = metamath_template.format(instruction=question)
output = model.generate_from_prefix(enc_tokenizer, lm_tokenizer, prefix=prefix)
print(output)
If Ethel has 8 dollars, then Jimmy has 2 * 8 + 2 = 18 dollars.
Therefore, Jimmy has 18 dollars.
#### 18
The answer is: 18
- Set the prefixes as if you were prompting the original LMs. For example, for Orca 2-LangBridge, use the Orca 2 template. For pretrained models (Llama 2, Llemma, and Code Llama), you may need to use few-shot examples.
- The encoder tokenizer is simply an mT5 tokenizer with added whitespace tokens; the reason for adding them is explained in section D.1 of the paper. You can inspect the added tokens with the snippet below.
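A quick way to compare the encoder tokenizer against a base mT5 tokenizer (the base checkpoint name `google/mt5-xl` is an assumption; any mT5 tokenizer works for the comparison, and `get_added_vocab()` lists the whitespace tokens only if they were registered as added tokens):

```python
from transformers import AutoTokenizer

lb_tok = AutoTokenizer.from_pretrained("kaist-ai/langbridge_encoder_tokenizer")
mt5_tok = AutoTokenizer.from_pretrained("google/mt5-xl")  # base mT5 tokenizer for comparison

print(len(lb_tok), len(mt5_tok))   # vocabulary size with vs. without the added tokens
print(lb_tok.get_added_vocab())    # tokens registered on top of the base vocabulary
```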
cd python_scripts
bash scripts/train_lb/metamath.sh
- For optimal performance, keep `freeze_encoder=False` for pretrained LMs (trained on unlabeled corpora), and `freeze_encoder=True` for finetuned LMs (trained on labeled corpora). This is explained in section D.1 of the paper.
- The training and validation data should have two columns: `input` and `output`. For unlabeled corpora the `output` should be empty; in this case pass `output_exists=False`, and the code will dynamically create the label (output) by splitting the input. For labeled corpora the `output` shouldn't be empty; in this case pass `output_exists=True`. See the sketch after this list for an example of the expected columns.
- When training with `output_exists=False`, set `use_dynamic_enc_length=True`. See section 4.1. The `use_dynamic_enc_length` flag won't have an effect when `output_exists=True`.
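A minimal sketch of the two-column layout described above, built with Hugging Face `datasets` purely for illustration (the actual file format and paths are whatever the training scripts under `python_scripts` expect):

```python
from datasets import Dataset

# Labeled corpus (pass output_exists=True): every row carries a non-empty output.
labeled = Dataset.from_dict({
    "input": ["Solve: 2 + 2 = ?"],
    "output": ["2 + 2 = 4. The answer is: 4"],
})

# Unlabeled corpus (pass output_exists=False): output stays empty and the label
# is created dynamically at training time by splitting the input.
unlabeled = Dataset.from_dict({
    "input": ["The quick brown fox jumps over the lazy dog."],
    "output": [""],
})

labeled.to_json("train_labeled.json")
unlabeled.to_json("train_unlabeled.json")
```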
cd python_scripts
bash scripts/eval/mgsm/metamath-lb-9b.sh
LangBridge mainly helps with low-resource languages. If the language model is already proficient in a given language, LangBridge may lower performance in that language. Please refer to the paper for detailed evaluation results.