diff --git a/README.md b/README.md index 5611b0c..a1d85c7 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,15 @@ # ML Papers Explained -Explanations to key concepts in ML +Explanations of key concepts in ML ## Language Models | Paper | Date | Description | |---|---|---| | [Transformer](https://ritvik19.medium.com/papers-explained-01-transformer-474bb60a33f7) | June 2017 | An Encoder Decoder model, that introduced multihead attention mechanism for language translation task. | -| [Elmo](https://ritvik19.medium.com/papers-explained-33-elmo-76362a43e4) | February 2018 | Deep contextualized word representations that captures both intricate aspects of word usage and contextual variations across language contexts. | +| [Elmo](https://ritvik19.medium.com/papers-explained-33-elmo-76362a43e4) | February 2018 | Deep contextualized word representations that capture both intricate aspects of word usage and contextual variations across language contexts. | | [Marian MT](https://ritvik19.medium.com/papers-explained-150-marianmt-1b44479b0fd9) | April 2018 | A Neural Machine Translation framework written entirely in C++ with minimal dependencies, designed for high training and translation speed. | -| [GPT](https://ritvik19.medium.com/papers-explained-43-gpt-30b6f1e6d226) | June 2018 | A Decoder only transformer which is autoregressively pretrained and then finetuned for specific downstream tasks using task-aware input transformations. | +| [GPT](https://ritvik19.medium.com/papers-explained-43-gpt-30b6f1e6d226) | June 2018 | A Decoder-only transformer which is autoregressively pretrained and then finetuned for specific downstream tasks using task-aware input transformations. | | [BERT](https://ritvik19.medium.com/papers-explained-02-bert-31e59abc0615) | October 2018 | Introduced pre-training for Encoder Transformers. Uses unified architecture across different tasks. | | [Transformer XL](https://ritvik19.medium.com/papers-explained-34-transformerxl-2e407e780e8) | January 2019 | Extends the original Transformer model to handle longer sequences of text by introducing recurrence into the self-attention mechanism. | | [XLM](https://ritvik19.medium.com/papers-explained-158-xlm-42a175e93caf) | January 2019 | Proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. | @@ -17,20 +17,20 @@ Explanations to key concepts in ML | [Sparse Transformer](https://ritvik19.medium.com/papers-explained-122-sparse-transformer-906a0be1e4e7) | April 2019 | Introduced sparse factorizations of the attention matrix to reduce the time and memory consumption to O(n√ n) in terms of sequence lengths. | | [UniLM](https://ritvik19.medium.com/papers-explained-72-unilm-672f0ecc6a4a) | May 2019 | Utilizes a shared Transformer network and specific self-attention masks to excel in both language understanding and generation tasks. | | [XLNet](https://ritvik19.medium.com/papers-explained-35-xlnet-ea0c3af96d49) | June 2019 | Extension of the Transformer-XL, pre-trained using a new method that combines ideas from AR and AE objectives. | -| [RoBERTa](https://ritvik19.medium.com/papers-explained-03-roberta-81db014e35b9) | July 2019 | Built upon BERT, by carefully optimizing hyperparameters and training data size to improve performance on various language tasks . 
| +| [RoBERTa](https://ritvik19.medium.com/papers-explained-03-roberta-81db014e35b9) | July 2019 | Built upon BERT by carefully optimizing hyperparameters and training data size to improve performance on various language tasks. | | [Sentence BERT](https://ritvik19.medium.com/papers-explained-04-sentence-bert-5159b8e07f21) | August 2019 | A modification of BERT that uses siamese and triplet network structures to derive sentence embeddings that can be compared using cosine-similarity. | | [CTRL](https://ritvik19.medium.com/papers-explained-153-ctrl-146fcd18a566) | September 2019 | A 1.63B language model that can generate text conditioned on control codes that govern style, content, and task-specific behavior, allowing for more explicit control over text generation. | -| [Tiny BERT](https://ritvik19.medium.com/papers-explained-05-tiny-bert-5e36fe0ee173) | September 2019 | Uses attention transfer, and task specific distillation for distilling BERT. | +| [Tiny BERT](https://ritvik19.medium.com/papers-explained-05-tiny-bert-5e36fe0ee173) | September 2019 | Uses attention transfer and task-specific distillation for distilling BERT. | | [ALBERT](https://ritvik19.medium.com/papers-explained-07-albert-46a2a0563693) | September 2019 | Presents certain parameter reduction techniques to lower memory consumption and increase the training speed of BERT. | | [Distil BERT](https://ritvik19.medium.com/papers-explained-06-distil-bert-6f138849f871) | October 2019 | Distills BERT on very large batches leveraging gradient accumulation, using dynamic masking and without the next sentence prediction objective. | | [T5](https://ritvik19.medium.com/papers-explained-44-t5-9d974a3b7957) | October 2019 | A unified encoder-decoder framework that converts all text-based language problems into a text-to-text format. | -| [BART](https://ritvik19.medium.com/papers-explained-09-bart-7f56138175bd) | October 2019 | An Encoder-Decoder pretrained to reconstruct the original text from corrupted versions of it. | -| [XLM-Roberta](https://ritvik19.medium.com/papers-explained-159-xlm-roberta-2da91fc24059) | November 2019 | A multilingual masked language model pre-trained on text in 100 languages, shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of crosslingual transfer tasks. | -| [XLM-Roberta](https://ritvik19.medium.com/papers-explained-159-xlm-roberta-2da91fc24059) | November 2019 | A multilingual masked language model pre-trained on text in 100 languages, shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of crosslingual transfer tasks. | +| [BART](https://ritvik19.medium.com/papers-explained-09-bart-7f56138175bd) | October 2019 | An Encoder-Decoder pretrained to reconstruct the original text from corrupted versions. | +| [XLM-Roberta](https://ritvik19.medium.com/papers-explained-159-xlm-roberta-2da91fc24059) | November 2019 | A multilingual masked language model pre-trained on text in 100 languages, shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. | +| [XLM-Roberta](https://ritvik19.medium.com/papers-explained-159-xlm-roberta-2da91fc24059) | November 2019 | A multilingual masked language model pre-trained on text in 100 languages, shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. 
| | [Pegasus](https://ritvik19.medium.com/papers-explained-162-pegasus-1cb16f572553) | December 2019 | A self-supervised pre-training objective for abstractive text summarization, proposes removing/masking important sentences from an input document and generating them together as one output sequence. | | [Reformer](https://ritvik19.medium.com/papers-explained-165-reformer-4445ad305191) | January 2020 | Improves the efficiency of Transformers by replacing dot-product attention with locality-sensitive hashing (O(Llog L) complexity), using reversible residual layers to store activations only once, and splitting feed-forward layer activations into chunks, allowing it to perform on par with Transformer models while being much more memory-efficient and faster on long sequences. | | [mBART](https://ritvik19.medium.com/papers-explained-169-mbart-98432ef6fec) | January 2020 | A multilingual sequence-to-sequence denoising auto-encoder that pre-trains a complete autoregressive model on large-scale monolingual corpora across many languages using the BART objective, achieving significant performance gains in machine translation tasks. | -| [UniLMv2](https://ritvik19.medium.com/papers-explained-unilmv2-5a044ca7c525) | February 2020 | Utilizes a pseudo-masked language model (PMLM) for both autoencoding and partially autoregressive language modeling tasks,significantly advancing the capabilities of language models in diverse NLP tasks. | +| [UniLMv2](https://ritvik19.medium.com/papers-explained-unilmv2-5a044ca7c525) | February 2020 | Utilizes a pseudo-masked language model (PMLM) for both autoencoding and partially autoregressive language modeling tasks, significantly advancing the capabilities of language models in diverse NLP tasks. | | [ELECTRA](https://ritvik19.medium.com/papers-explained-173-electra-501c175ae9d8) | March 2020 | Proposes a sample-efficient pre-training task called replaced token detection, which corrupts input by replacing some tokens with plausible alternatives and trains a discriminative model to predict whether each token was replaced or no. | | [FastBERT](https://ritvik19.medium.com/papers-explained-37-fastbert-5bd246c1b432) | April 2020 | A speed-tunable encoder with adaptive inference time having branches at each transformer output to enable early outputs. | | [MobileBERT](https://ritvik19.medium.com/papers-explained-36-mobilebert-933abbd5aaf1) | April 2020 | Compressed and faster version of the BERT, featuring bottleneck structures, optimized attention mechanisms, and knowledge transfer. | @@ -42,21 +42,21 @@ Explanations to key concepts in ML | [mT5](https://ritvik19.medium.com/papers-explained-113-mt5-c61e03bc9218) | October 2020 | A multilingual variant of T5 based on T5 v1.1, pre-trained on a new Common Crawl-based dataset covering 101 languages (mC4). | | [Codex](https://ritvik19.medium.com/papers-explained-45-codex-caca940feb31) | July 2021 | A GPT language model finetuned on publicly available code from GitHub. | | [FLAN](https://ritvik19.medium.com/papers-explained-46-flan-1c5e0d5db7c9) | September 2021 | An instruction-tuned language model developed through finetuning on various NLP datasets described by natural language instructions. | -| [T0](https://ritvik19.medium.com/papers-explained-74-t0-643a53079fe) | October 2021 | A fine tuned encoder-decoder model on a multitask mixture covering a wide variety of tasks, attaining strong zero-shot performance on several standard datasets. 
| +| [T0](https://ritvik19.medium.com/papers-explained-74-t0-643a53079fe) | October 2021 | A fine-tuned encoder-decoder model on a multitask mixture covering a wide variety of tasks, attaining strong zero-shot performance on several standard datasets. | | [DeBERTa V3](https://ritvik19.medium.com/papers-explained-182-deberta-v3-65347208ce03) | November 2021 | Enhances the DeBERTa architecture by introducing replaced token detection (RTD) instead of mask language modeling (MLM), along with a novel gradient-disentangled embedding sharing method, exhibiting superior performance across various natural language understanding tasks. | | [WebGPT](https://ritvik19.medium.com/papers-explained-123-webgpt-5bb0dd646b32) | December 2021 | A fine-tuned GPT-3 model utilizing text-based web browsing, trained via imitation learning and human feedback, enhancing its ability to answer long-form questions with factual accuracy. | | [Gopher](https://ritvik19.medium.com/papers-explained-47-gopher-2e71bbef9e87) | December 2021 | Provides a comprehensive analysis of the performance of various Transformer models across different scales upto 280B on 152 tasks. | -| [LaMDA](https://ritvik19.medium.com/papers-explained-76-lamda-a580ebba1ca2) | January 2022 | Transformer based models specialized for dialog, which are pre-trained on public dialog data and web text. | +| [LaMDA](https://ritvik19.medium.com/papers-explained-76-lamda-a580ebba1ca2) | January 2022 | Transformer-based models specialized for dialog, which are pre-trained on public dialog data and web text. | | [BERTopic](https://ritvik19.medium.com/papers-explained-193-bertopic-f9aec10cd5a6) | March 20222 | Utilizes Sentence-BERT for document embeddings, UMAP, HDBSCAN (soft-clustering), and an adjusted class-based TF-IDF, addressing multiple topics per document and dynamic topics' linear evolution. | | [Instruct GPT](https://ritvik19.medium.com/papers-explained-48-instructgpt-e9bcd51f03ec) | March 2022 | Fine-tuned GPT using supervised learning (instruction tuning) and reinforcement learning from human feedback to align with user intent. | | [CodeGen](https://ritvik19.medium.com/papers-explained-125-codegen-a6bae5c1f7b5) | March 2022 | An LLM trained for program synthesis using input-output examples and natural language descriptions. | | [Chinchilla](https://ritvik19.medium.com/papers-explained-49-chinchilla-a7ad826d945e) | March 2022 | Investigated the optimal model size and number of tokens for training a transformer LLM within a given compute budget (Scaling Laws). | | [PaLM](https://ritvik19.medium.com/papers-explained-50-palm-480e72fa3fd5) | April 2022 | A 540-B parameter, densely activated, Transformer, trained using Pathways, (ML system that enables highly efficient training across multiple TPU Pods). | | [GPT-NeoX-20B](https://ritvik19.medium.com/papers-explained-78-gpt-neox-20b-fe39b6d5aa5b) | April 2022 | An autoregressive LLM trained on the Pile, and the largest dense model that had publicly available weights at the time of submission. | -| [OPT](https://ritvik19.medium.com/papers-explained-51-opt-dacd9406e2bd) | May 2022 | A suite of decoder-only pre-trained transformers with parameter ranges from 125M to 175B. OPT-175B being comparable to GPT-3. | -| [Flan T5, Flan PaLM](https://ritvik19.medium.com/papers-explained-75-flan-t5-flan-palm-caf168b6f76) | October 2022 | Explores instruction fine tuning with a particular focus on scaling the number of tasks, scaling the model size, and fine tuning on chain-of-thought data. 
| +| [OPT](https://ritvik19.medium.com/papers-explained-51-opt-dacd9406e2bd) | May 2022 | A suite of decoder-only pre-trained transformers with parameter ranges from 125M to 175B. OPT-175B is comparable to GPT-3. | +| [Flan T5, Flan PaLM](https://ritvik19.medium.com/papers-explained-75-flan-t5-flan-palm-caf168b6f76) | October 2022 | Explores instruction fine tuning with a particular focus on scaling the number of tasks, scaling the model size, and fine-tuning on chain-of-thought data. | | [BLOOM](https://ritvik19.medium.com/papers-explained-52-bloom-9654c56cd2) | November 2022 | A 176B-parameter open-access decoder-only transformer, collaboratively developed by hundreds of researchers, aiming to democratize LLM technology. | -| [BLOOMZ, mT0](https://ritvik19.medium.com/papers-explained-99-bloomz-mt0-8932577dcd1d) | November 2022 | Applies Multitask prompted fine tuning to the pretrained multilingual models on English tasks with English prompts to attain task generalization to non-English languages that appear only in the pretraining corpus. | +| [BLOOMZ, mT0](https://ritvik19.medium.com/papers-explained-99-bloomz-mt0-8932577dcd1d) | November 2022 | Applies Multitask prompted fine-tuning to the pretrained multilingual models on English tasks with English prompts to attain task generalization to non-English languages that appear only in the pretraining corpus. | | [Galactica](https://ritvik19.medium.com/papers-explained-53-galactica-1308dbd318dc) | November 2022 | An LLM trained on scientific data thus specializing in scientific knowledge. | | [ChatGPT](https://ritvik19.medium.com/papers-explained-54-chatgpt-78387333268f) | November 2022 | An interactive model designed to engage in conversations, built on top of GPT 3.5. | | [Self Instruct](https://ritvik19.medium.com/papers-explained-112-self-instruct-5c192580103a) | December 2022 | A framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. | @@ -64,17 +64,17 @@ Explanations to key concepts in ML | [Toolformer](https://ritvik19.medium.com/papers-explained-140-toolformer-d21d496b6812) | February 2023 | An LLM trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. | | [Alpaca](https://ritvik19.medium.com/papers-explained-56-alpaca-933c4d9855e5) | March 2023 | A fine-tuned LLaMA 7B model, trained on instruction-following demonstrations generated in the style of self-instruct using text-davinci-003. | | [GPT 4](https://ritvik19.medium.com/papers-explained-67-gpt-4-fc77069b613e) | March 2023 | A multimodal transformer model pre-trained to predict the next token in a document, which can accept image and text inputs and produce text outputs. | -| [Vicuna](https://ritvik19.medium.com/papers-explained-101-vicuna-daed99725c7e) | March 2023 | A 13B LLaMA chatbot fine tuned on user-shared conversations collected from ShareGPT, capable of generating more detailed and well-structured answers compared to Alpaca. | -| [BloombergGPT](https://ritvik19.medium.com/papers-explained-120-bloomberggpt-4bedd52ef54b) | March 2023 | A 50B language model train on general purpose and domain specific data to support a wide range of tasks within the financial industry. 
| +| [Vicuna](https://ritvik19.medium.com/papers-explained-101-vicuna-daed99725c7e) | March 2023 | A 13B LLaMA chatbot fine-tuned on user-shared conversations collected from ShareGPT, capable of generating more detailed and well-structured answers compared to Alpaca. | +| [BloombergGPT](https://ritvik19.medium.com/papers-explained-120-bloomberggpt-4bedd52ef54b) | March 2023 | A 50B language model trained on general-purpose and domain-specific data to support a wide range of tasks within the financial industry. | | [Pythia](https://ritvik19.medium.com/papers-explained-121-pythia-708284c32964) | April 2023 | A suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. | -| [WizardLM](https://ritvik19.medium.com/papers-explained-127-wizardlm-65099705dfa3) | April 2023 | Introduces Evol-Instruct, a method to generate large amounts of instruction data with varying levels of complexity using LLM instead of humans to fine tune a Llama model | +| [WizardLM](https://ritvik19.medium.com/papers-explained-127-wizardlm-65099705dfa3) | April 2023 | Introduces Evol-Instruct, a method to generate large amounts of instruction data with varying levels of complexity using an LLM instead of humans to fine-tune a Llama model. | | [CodeGen2](https://ritvik19.medium.com/papers-explained-codegen2-d2690d7eb831) | May 2023 | Proposes an approach to make the training of LLMs for program synthesis more efficient by unifying key components of model architectures, learning methods, infill sampling, and data distributions | | [PaLM 2](https://ritvik19.medium.com/papers-explained-58-palm-2-1a9a23f20d6c) | May 2023 | Successor of PALM, trained on a mixture of different pre-training objectives in order to understand different aspects of language. | | [LIMA](https://ritvik19.medium.com/papers-explained-57-lima-f9401a5760c3) | May 2023 | A LLaMa model fine-tuned on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. | | [Gorilla](https://ritvik19.medium.com/papers-explained-139-gorilla-79f4730913e9) | May 2023 | A retrieve-aware finetuned LLaMA-7B model, specifically for API calls. | | [Orca](https://ritvik19.medium.com/papers-explained-160-orca-928eff06e7f9) | June 2023 | Presents a novel approach that addresses the limitations of instruction tuning by leveraging richer imitation signals, scaling tasks and instructions, and utilizing a teacher assistant to help with progressive learning. | | [Falcon](https://ritvik19.medium.com/papers-explained-59-falcon-26831087247f) | June 2023 | An Open Source LLM trained on properly filtered and deduplicated web data alone. | -| [Phi-1](https://ritvik19.medium.com/papers-explained-114-phi-1-14a8dcc77ce5) | June 2023 | An LLM for code, trained using a textbook quality data from the web and synthetically generated textbooks and exercises with GPT-3.5. | +| [Phi-1](https://ritvik19.medium.com/papers-explained-114-phi-1-14a8dcc77ce5) | June 2023 | An LLM for code, trained using textbook quality data from the web and synthetically generated textbooks and exercises with GPT-3.5. | | [WizardCoder](https://ritvik19.medium.com/papers-explained-wizardcoder-a12ecb5b93b6) | June 2023 | Enhances the performance of the open-source Code LLM, StarCoder, through the application of Code Evol-Instruct. | | [LLaMA 2](https://ritvik19.medium.com/papers-explained-60-llama-v2-3e415c5b9b17) | July 2023 | Successor of LLaMA. LLaMA 2-Chat is optimized for dialogue use cases. 
| [Tool LLM](https://ritvik19.medium.com/papers-explained-141-tool-llm-856f99e79f55) | July 2023 | A LLaMA model finetuned on an instruction-tuning dataset for tool use, automatically created using ChatGPT. | @@ -95,9 +95,9 @@ Explanations to key concepts in ML | [H2O Danube 1.8B](https://ritvik19.medium.com/papers-explained-111-h2o-danube-1-8b-b790c073d257) | January 2024 | A language model trained on 1T tokens following the core principles of LLama 2 and Mistral, leveraging and refining various techniques for pre-training large language models. | | [OLMo](https://ritvik19.medium.com/papers-explained-98-olmo-fdc358326f9b) | February 2024 | A state-of-the-art, truly open language model and framework that includes training data, code, and tools for building, studying, and advancing language models. | | [MobileLLM](https://ritvik19.medium.com/papers-explained-216-mobilellm-2d7fdd5acd86) | February 2024 | Leverages various architectures and attention mechanisms to achieve a strong baseline network, which is then improved upon by introducing an immediate block-wise weight-sharing approach, resulting in a further accuracy boost. | -| [Orca Math](https://ritvik19.medium.com/papers-explained-163-orca-math-ae6a157ce48d) | February 2024 | A fine tuned Mistral-7B that excels at math problems without external tools, utilizing a high-quality synthetic dataset of 200K problems created through multi-agent collaboration and an iterative learning process that involves practicing problem-solving, receiving feedback, and learning from preference pairs incorporating the model's solutions and feedback. | +| [Orca Math](https://ritvik19.medium.com/papers-explained-163-orca-math-ae6a157ce48d) | February 2024 | A fine-tuned Mistral-7B that excels at math problems without external tools, utilizing a high-quality synthetic dataset of 200K problems created through multi-agent collaboration and an iterative learning process that involves practicing problem-solving, receiving feedback, and learning from preference pairs incorporating the model's solutions and feedback. | | [Gemma](https://ritvik19.medium.com/papers-explained-106-gemma-ca2b449321ac) | February 2024 | A family of 2B and 7B, state-of-the-art language models based on Google's Gemini models, offering advancements in language understanding, reasoning, and safety. | -| [Aya 101](https://ritvik19.medium.com/papers-explained-aya-101-d813ba17b83a) | Februray 2024 | A massively multilingual generative language model that follows instructions in 101 languages,trained by finetuning mT5. | +| [Aya 101](https://ritvik19.medium.com/papers-explained-aya-101-d813ba17b83a) | February 2024 | A massively multilingual generative language model that follows instructions in 101 languages, trained by finetuning mT5. | | [Nemotron-4 15B](https://ritvik19.medium.com/papers-explained-206-nemotron-4-15b-7d895fb56134) | February 2024 | A 15B multilingual language model trained on 8T text tokens by Nvidia. | | [Hawk, Griffin](https://ritvik19.medium.com/papers-explained-131-hawk-griffin-dfc8c77f5dcd) | February 2024 | Introduces Real Gated Linear Recurrent Unit Layer that forms the core of the new recurrent block, replacing Multi-Query Attention for better efficiency and scalability | | [WRAP](https://ritvik19.medium.com/papers-explained-118-wrap-e563e009fe56) | March 2024 | Uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles to jointly pre-train LLMs on real and synthetic rephrases. 
| @@ -106,7 +106,7 @@ Explanations to key concepts in ML | [Grok 1.5](https://ritvik19.medium.com/papers-explained-186-grok-0d9f1aef69be) | March 2024 | An advancement over grok, capable of long context understanding up to 128k tokens and advanced reasoning. | | [Command R+](https://ritvik19.medium.com/papers-explained-166-command-r-models-94ba068ebd2b#c2b5) | April 2024 | Successor of Command R+ with improved performance for retrieval-augmented generation and tool use, across multiple languages. | | [Llama 3](https://ritvik19.medium.com/papers-explained-187a-llama-3-51e2b90f63bb) | April 2024 | A family of 8B and 70B parameter models trained on 15T tokens with a focus on data quality, demonstrating state-of-the-art performance on various benchmarks, improved reasoning capabilities. | -| [Mixtral 8x22B](https://ritvik19.medium.com/papers-explained-95-mixtral-8x7b-9e9f40ebb745#20f3) | April 2024 | A open-weight AI model optimised for performance and efficiency, with capabilities such as fluency in multiple languages, strong mathematics and coding abilities, and precise information recall from large documents. | +| [Mixtral 8x22B](https://ritvik19.medium.com/papers-explained-95-mixtral-8x7b-9e9f40ebb745#20f3) | April 2024 | An open-weight AI model optimised for performance and efficiency, with capabilities such as fluency in multiple languages, strong mathematics and coding abilities, and precise information recall from large documents. | | [CodeGemma](https://ritvik19.medium.com/papers-explained-124-codegemma-85faa98af20d) | April 2024 | Open code models based on Gemma models by further training on over 500 billion tokens of primarily code. | | [RecurrentGemma](https://ritvik19.medium.com/papers-explained-132-recurrentgemma-52732d0f4273) | April 2024 | Based on Griffin, uses a combination of linear recurrences and local attention instead of global attention to model long sequences efficiently. | | [Rho-1](https://ritvik19.medium.com/papers-explained-132-rho-1-788125e42241) | April 2024 | Introduces Selective Language Modelling that optimizes the loss only on tokens that align with a desired distribution, utilizing a reference model to score and select tokens. | @@ -117,7 +117,7 @@ Explanations to key concepts in ML | [Codestral 22B](https://medium.com/dair-ai/papers-explained-mistral-7b-b9632dedf580#057b) | May 2024 | An open-weight model designed for code generation tasks, trained on over 80 programming languages, and licensed under the Mistral AI Non-Production License, allowing developers to use it for research and testing purposes. | | [Aya 23](https://ritvik19.medium.com/papers-explained-151-aya-23-d01605c3ee80) | May 2024 | A family of multilingual language models supporting 23 languages, designed to balance breadth and depth by allocating more capacity to fewer languages during pre-training. | | [Gemma 2](https://ritvik19.medium.com/papers-explained-157-gemma-2-f1b75b56b9f2) | June 2024 | Utilizes interleaving local-global attentions and group-query attention, trained with knowledge distillation instead of next token prediction to achieve competitive performance comparable with larger models. | -| [Orca 3 (Agent Instruct)](https://ritvik19.medium.com/papers-explained-164-orca-3-agent-instruct-41340505af36) | June 2024 | A fine tuned Mistral-7B through Generative Teaching via synthetic data generated through the proposed AgentInstruct framework, which generates both the prompts and responses, using only raw data sources like text documents and code files as seeds. 
| +| [Orca 3 (Agent Instruct)](https://ritvik19.medium.com/papers-explained-164-orca-3-agent-instruct-41340505af36) | June 2024 | A Mistral-7B fine-tuned through Generative Teaching on synthetic data generated by the proposed AgentInstruct framework, which generates both the prompts and responses, using only raw data sources like text documents and code files as seeds. | | [Nemotron-4 340B](https://ritvik19.medium.com/papers-explained-207-nemotron-4-340b-4cfe268439f8) | June 2024 | 340B models, along with a reward model by Nvidia, suitable for generating synthetic data to train smaller language models, with over 98% of the data used in model alignment being synthetically generated. | | [Mathstral](https://medium.com/dair-ai/papers-explained-mistral-7b-b9632dedf580#0fbe) | July 2024 | a 7B model designed for math reasoning and scientific discovery based on Mistral 7B specializing in STEM subjects. | | [Mistral Nemo](https://medium.com/dair-ai/papers-explained-mistral-7b-b9632dedf580#37cd) | July 2024 | A 12B Language Model built in collaboration between Mistral and NVIDIA, featuring a context window of 128K, an efficient tokenizer and trained with quantization awareness, enabling FP8 inference without any performance loss. | @@ -179,17 +179,17 @@ Explanations to key concepts in ML | Paper | Date | Description | |---|---|---| -| [SimCLR](https://ritvik19.medium.com/papers-explained-200-simclr-191ecf19d2fc) | February 2020 | A simplified framework for contrastive learning that optimizes data augmentation composition, introduces learnable nonlinear transformations, and leverages larger batch sizes and more training steps. | +| [SimCLR](https://ritvik19.medium.com/papers-explained-200-simclr-191ecf19d2fc) | February 2020 | A simplified framework for contrastive learning that optimizes data augmentation composition, introduces learnable nonlinear transformations and leverages larger batch sizes and more training steps. | | [Dense Passage Retriever](https://ritvik19.medium.com/papers-explained-86-dense-passage-retriever-c4742fdf27ed) | April 2020 | Shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual encoder framework. | | [ColBERT](https://medium.com/@ritvik19/papers-explained-88-colbert-fe2fd0509649) | April 2020 | Introduces a late interaction architecture that adapts deep LMs (in particular, BERT) for efficient retrieval. | -| [SimCLRv2](https://ritvik19.medium.com/papers-explained-201-simclrv2-bc3fe72b8b48) | June 2020 | A Semi-supervised learning framework which uses unsupervised pre training followed by supervised fine-tuning and distillation with unlabeled examples. | +| [SimCLRv2](https://ritvik19.medium.com/papers-explained-201-simclrv2-bc3fe72b8b48) | June 2020 | A semi-supervised learning framework which uses unsupervised pre-training followed by supervised fine-tuning and distillation with unlabeled examples. | | [CLIP](https://ritvik19.medium.com/papers-explained-100-clip-f9873c65134) | February 2021 | A vision system that learns image representations from raw text-image pairs through pre-training, enabling zero-shot transfer to various downstream tasks. | | [ColBERTv2](https://ritvik19.medium.com/papers-explained-89-colbertv2-7d921ee6e0d9) | December 2021 | Couples an aggressive residual compression mechanism with a denoised supervision strategy to simultaneously improve the quality and space footprint of late interaction. 
| | [Matryoshka Representation Learning](https://ritvik19.medium.com/papers-explained-matryoshka-representation-learning-e7a139f6ad27) | May 2022 | Encodes information at different granularities and allows a flexible representation that can adapt to multiple downstream tasks with varying computational resources using a single embedding. | | [E5](https://ritvik19.medium.com/papers-explained-90-e5-75ea1519efad) | December 2022 | A family of text embeddings trained in a contrastive manner with weak supervision signals from a curated large-scale text pair dataset CCPairs. | | [SigLip](https://ritvik19.medium.com/papers-explained-152-siglip-011c48f9d448) | March 2023 | A simple pairwise Sigmoid loss function for Language-Image Pre-training that operates solely on image-text pairs, allowing for larger batch sizes and better performance at smaller batch sizes. | | [SynCLR](https://ritvik19.medium.com/papers-explained-202-synclr-85b50ef0081b) | December 2023 | A visual representation learning method that leverages generative models to synthesize large-scale curated datasets without relying on any real data. | -| [E5 Mistral 7B](https://ritvik19.medium.com/papers-explained-91-e5-mistral-7b-23890f40f83a) | December 2023 | Leverages proprietary LLMs to generate diverse synthetic data to fine tune open-source decoder-only LLMs for hundreds of thousands of text embedding tasks. | +| [E5 Mistral 7B](https://ritvik19.medium.com/papers-explained-91-e5-mistral-7b-23890f40f83a) | December 2023 | Leverages proprietary LLMs to generate diverse synthetic data to fine-tune open-source decoder-only LLMs for hundreds of thousands of text embedding tasks. | | [Nomic Embed Text v1](https://ritvik19.medium.com/papers-explained-110-nomic-embed-8ccae819dac2) | February 2024 | A 137M parameter, open-source English text embedding model with an 8192 context length that outperforms OpenAI's models on both short and long-context tasks. | | [Nomic Embed Text v1.5](https://ritvik19.medium.com/papers-explained-110-nomic-embed-8ccae819dac2#2119) | February 2024 | An advanced text embedding model that utilizes Matryoshka Representation Learning to offer flexible embedding sizes with minimal performance trade-offs | | [Gecko](https://ritvik19.medium.com/papers-explained-203-gecko-8889158b17e6) | March 2024 | A 1.2B versatile text embedding model achieving strong retrieval performance by distilling knowledge from LLMs into a retriever. | @@ -222,7 +222,7 @@ Explanations to key concepts in ML |---|---|---| | [LLMLingua](https://ritvik19.medium.com/papers-explained-136-llmlingua-f9b2f53f5f9b) | October 2023 | A novel coarse-to-fine prompt compression method, incorporating a budget controller, an iterative token-level compression algorithm, and distribution alignment, achieving up to 20x compression with minimal performance loss. | | [LongLLMLingua](https://ritvik19.medium.com/papers-explained-137-longllmlingua-45961fa703dd) | October 2023 | A novel approach for prompt compression to enhance performance in long context scenarios using question-aware compression and document reordering. | -| [LLMLingua2](https://ritvik19.medium.com/papers-explained-138-llmlingua-2-510c752368a8) | March 2024 | A novel approach to task-agnostic prompt compression, aiming to enhance generalizability, using data distillation and leveraging a Transformer encoder for token classification. 
| +| [LLMLingua2](https://ritvik19.medium.com/papers-explained-138-llmlingua-2-510c752368a8) | March 2024 | A novel approach to task-agnostic prompt compression, aiming to enhance generalizability, using data distillation and leveraging a Transformer encoder for token classification. | ## Vision Transformers @@ -230,10 +230,10 @@ Explanations to key concepts in ML |---|---|---| | [Vision Transformer](https://ritvik19.medium.com/papers-explained-25-vision-transformers-e286ee8bc06b) | October 2020 | Images are segmented into patches, which are treated as tokens and a sequence of linear embeddings of these patches are input to a Transformer | | [DeiT](https://ritvik19.medium.com/papers-explained-39-deit-3d78dd98c8ec) | December 2020 | A convolution-free vision transformer that uses a teacher-student strategy with attention-based distillation tokens. | -| [Swin Transformer](https://ritvik19.medium.com/papers-explained-26-swin-transformer-39cf88b00e3e) | March 2021 | A hierarchical vision transformer that uses shifted windows to addresses the challenges of adapting the transformer model to computer vision. | +| [Swin Transformer](https://ritvik19.medium.com/papers-explained-26-swin-transformer-39cf88b00e3e) | March 2021 | A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the transformer model to computer vision. | | [Convolutional Vision Transformer](https://ritvik19.medium.com/papers-explained-199-cvt-fb4a5c05882e) | March 2021 | Improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions, to yield the best of both designs. | | [LeViT](https://ritvik19.medium.com/papers-explained-205-levit-89a2defc2d18) | April 2021 | A hybrid neural network built upon the ViT architecture and DeiT training method, for fast inference image classification. | -| [BEiT](https://ritvik19.medium.com/papers-explained-27-beit-b8c225496c01) | June 2021 | Utilizes a masked image modeling task inspired by BERT in, involving image patches and visual tokens to pretrain vision Transformers. | +| [BEiT](https://ritvik19.medium.com/papers-explained-27-beit-b8c225496c01) | June 2021 | Utilizes a masked image modeling task inspired by BERT, involving image patches and visual tokens to pretrain vision Transformers. | | [MobileViT](https://ritvik19.medium.com/papers-explained-40-mobilevit-4793f149c434) | October 2021 | A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs. | | [Masked AutoEncoder](https://ritvik19.medium.com/papers-explained-28-masked-autoencoder-38cb0dbed4af) | November 2021 | An encoder-decoder architecture that reconstructs input images by masking random patches and leveraging a high proportion of masking for self-supervision. | | [MaxViT](https://ritvik19.medium.com/papers-explained-210-maxvit-6c68cc515413) | April 2022 | Introduces multi-axis attention, allowing global-local spatial interactions on arbitrary input resolutions with only linear complexity. | @@ -258,7 +258,7 @@ Explanations to key concepts in ML | [Mobile Net V1](https://ritvik19.medium.com/papers-explained-review-01-convolutional-neural-networks-78aeff61dcb3#3cb5) | April 2017 | Uses depthwise separable convolutions to reduce the number of parameters and computation required. | | [Mobile Net V2](https://ritvik19.medium.com/papers-explained-review-01-convolutional-neural-networks-78aeff61dcb3#4440) | January 2018 | Built upon the MobileNetv1 architecture, uses inverted residuals and linear bottlenecks. 
| | [Mobile Net V3](https://ritvik19.medium.com/papers-explained-review-01-convolutional-neural-networks-78aeff61dcb3#8eb6) | May 2019 | Uses AutoML to find the best possible neural network architecture for a given problem. | -| [Efficient Net](https://ritvik19.medium.com/papers-explained-review-01-convolutional-neural-networks-78aeff61dcb3#560a) | May 2019 | Uses a compound scaling method to scale the network's depth, width, and resolution to achieve a high accuracy with a relatively low computational cost. | +| [Efficient Net](https://ritvik19.medium.com/papers-explained-review-01-convolutional-neural-networks-78aeff61dcb3#560a) | May 2019 | Uses a compound scaling method to scale the network's depth, width, and resolution to achieve high accuracy with a relatively low computational cost. | | [NF Net](https://ritvik19.medium.com/papers-explained-84-nf-net-b8efa03d6b26) | February 2021 | An improved class of Normalizer-Free ResNets that implement batch-normalized networks, offer faster training times, and introduce an adaptive gradient clipping technique to overcome instabilities associated with deep ResNets. | | [Conv Mixer](https://ritvik19.medium.com/papers-explained-29-convmixer-f073f0356526) | January 2022 | Processes image patches using standard convolutions for mixing spatial and channel dimensions. | | [ConvNeXt](https://ritvik19.medium.com/papers-explained-92-convnext-d13385d9177d) | January 2022 | A pure ConvNet model, evolved from standard ResNet design, that competes well with Transformers in accuracy and scalability. | @@ -288,7 +288,7 @@ Explanations to key concepts in ML | Paper | Date | Description | |---|---|---| | [Table Net](https://ritvik19.medium.com/papers-explained-18-tablenet-3d4c62269bb3) | January 2020 | An end-to-end deep learning model designed for both table detection and structure recognition. | -| [Donut](https://ritvik19.medium.com/papers-explained-20-donut-cb1523bf3281) | November 2021 | An OCR-free Encoder-Decoder Transformer model. The encoder takes in images, decoder takes in prompts & encoded images to generate the required text. | +| [Donut](https://ritvik19.medium.com/papers-explained-20-donut-cb1523bf3281) | November 2021 | An OCR-free Encoder-Decoder Transformer model. The encoder takes in images, and the decoder takes in prompts & encoded images to generate the required text. | | [DiT](https://ritvik19.medium.com/papers-explained-19-dit-b6d6eccd8c4e) | March 2022 | An Image Transformer pre-trained (self-supervised) on document images | | [UDoP](https://ritvik19.medium.com/papers-explained-42-udop-719732358ab4) | December 2022 | Integrates text, image, and layout information through a Vision-Text-Layout Transformer, enabling unified representation. | | [DocLLM](https://ritvik19.medium.com/papers-explained-87-docllm-93c188edfaef) | January 2024 | A lightweight extension to traditional LLMs that focuses on reasoning over visual documents, by incorporating textual semantics and spatial layout without expensive image encoders. | @@ -324,7 +324,7 @@ Explanations to key concepts in ML |---|---|---| | [Entity Embeddings](https://ritvik19.medium.com/papers-explained-review-04-tabular-deep-learning-776db04f965b#932e) | April 2016 | Maps categorical variables into continuous vector spaces through neural network learning, revealing intrinsic properties. 
| [Wide and Deep Learning](https://ritvik19.medium.com/papers-explained-review-04-tabular-deep-learning-776db04f965b#bfdc) | June 2016 | Combines memorization of specific patterns with generalization of similarities. | -| [Deep and Cross Network](https://ritvik19.medium.com/papers-explained-review-04-tabular-deep-learning-776db04f965b#0017) | August 2017 | Combines the a novel cross network with deep neural networks (DNNs) to efficiently learn feature interactions without manual feature engineering. | +| [Deep and Cross Network](https://ritvik19.medium.com/papers-explained-review-04-tabular-deep-learning-776db04f965b#0017) | August 2017 | Combines a novel cross network with deep neural networks (DNNs) to efficiently learn feature interactions without manual feature engineering. | | [Tab Transformer](https://ritvik19.medium.com/papers-explained-review-04-tabular-deep-learning-776db04f965b#48c4) | December 2020 | Employs multi-head attention-based Transformer layers to convert categorical feature embeddings into robust contextual embeddings. | | [Tabular ResNet](https://ritvik19.medium.com/papers-explained-review-04-tabular-deep-learning-776db04f965b#46af) | June 2021 | An MLP with skip connections. | | [Feature Tokenizer Transformer](https://ritvik19.medium.com/papers-explained-review-04-tabular-deep-learning-776db04f965b#1ab8) | June 2021 | Transforms all features (categorical and numerical) to embeddings and applies a stack of Transformer layers to the embeddings. | @@ -351,7 +351,7 @@ Explanations to key concepts in ML | [Scaling Data-Constrained Language Models](https://ritvik19.medium.com/papers-explained-85-scaling-data-constrained-language-models-2a4c18bcc7d3) | May 2023 | This study investigates scaling language models in data-constrained regimes. | | [An In-depth Look at Gemini's Language Abilities](https://ritvik19.medium.com/papers-explained-81-an-in-depth-look-at-geminis-language-abilities-540ca9046d8e) | December 2023 | A third-party, objective comparison of the abilities of the OpenAI GPT and Google Gemini models with reproducible code and fully transparent results. | | [DSPy](https://ritvik19.medium.com/papers-explained-135-dspy-fe8af7e35091) | October 2023 | A programming model that abstracts LM pipelines as text transformation graphs, i.e. imperative computation graphs where LMs are invoked through declarative modules, optimizing their use through a structured framework of signatures, modules, and teleprompters to automate and enhance text transformation tasks. | -| [Direct Preference Optimization](https://ritvik19.medium.com/papers-explained-148-direct-preference-optimization-d3e031a41be1) | December 2023 | A stable, performant, and computationally lightweight algorithm that fine-tunes llms to align with human preferences without the need for reinforcement learning, by directly optimizing for the policy best satisfying the preferences with a simple classification objective. | +| [Direct Preference Optimization](https://ritvik19.medium.com/papers-explained-148-direct-preference-optimization-d3e031a41be1) | December 2023 | A stable, performant, and computationally lightweight algorithm that fine-tunes LLMs to align with human preferences without the need for reinforcement learning, by directly optimizing for the policy best satisfying the preferences with a simple classification objective. 
| | [RLHF Workflow](https://ritvik19.medium.com/papers-explained-149-rlhf-workflow-56b4e00019ed) | May 2024 | Provides a detailed recipe for online iterative RLHF and achieves state-of-the-art performance on various benchmarks using fully open-source datasets. | | [Monte Carlo Tree Self-refine](https://ritvik19.medium.com/papers-explained-167-monte-carlo-tree-self-refine-79bffb070c1a) | June 2024 | Integrates LLMs with Monte Carlo Tree Search to enhance performance in complex mathematical reasoning tasks, leveraging systematic exploration and heuristic self-refine mechanisms to improve decision-making frameworks. | | [Magpie](https://ritvik19.medium.com/papers-explained-183-magpie-0603cbdc69c3) | June 2024 | A self-synthesis method that extracts high-quality instruction data at scale by prompting an aligned LLM with left-side templates, generating 4M instructions and their corresponding responses. | @@ -396,11 +396,6 @@ Explanations to key concepts in ML --- Reach out to [Ritvik](https://twitter.com/RitvikRastogi19) or [Elvis](https://twitter.com/omarsar0) if you have any questions. -If you are interested to contribute, feel free to open a PR. - -[Join our Discord](https://discord.gg/SKgkVT8BGJ) - - - - +If you are interested in contributing, feel free to open a PR. +[Join our Discord](https://discord.gg/SKgkVT8BGJ) \ No newline at end of file