Survey and Reference

Survey on Large Language Models


Business use cases

Build an LLM from scratch: picoGPT and lit-gpt

  • picoGPT: An unnecessarily tiny implementation of GPT-2 in NumPy (Transformer decoder). [Jan 2023] GitHub Repo stars
q = x @ w_q # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]
k = x @ w_k # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]
v = x @ w_v # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]

# In picoGPT, w_q, w_k and w_v are combined into a single matrix w_fc,
# and the result is split back into q, k, v (see the attention sketch after this list)
x = x @ w_fc # [n_seq, n_embd] @ [n_embd, 3*n_embd] -> [n_seq, 3*n_embd]
  • lit-gpt: Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed. git [Mar 2023] GitHub Repo stars
  • pix2code: Generating Code from a Graphical User Interface Screenshot. Trained on a dataset of paired screenshots and a simplified intermediate script for HTML, using a CNN for image embeddings and an LSTM for text embeddings in an encoder-decoder model. An early example of image-to-code generation. [May 2017] GitHub Repo stars
  • Screenshot to code: Turning Design Mockups Into Code With Deep Learning [Oct 2017] ref GitHub Repo stars
  • Build a Large Language Model (From Scratch):🏆Implementing a ChatGPT-like LLM from scratch, step by step GitHub Repo stars
  • Spreadsheets-are-all-you-need: Implements the forward pass of GPT-2 entirely in Excel using standard spreadsheet functions. [Sep 2023] GitHub Repo stars
  • llm.c: LLM training in simple, raw C/CUDA [Apr 2024] GitHub Repo stars | Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 ref
  • llama3-from-scratch: Implementing Llama3 from scratch [May 2024] GitHub Repo stars
  • Umar Jamil github:💡LLM model explanations / building a model from scratch 📺
  • Andrej Karpathy📺: Reproducing GPT-2 (124M) from scratch. [June 2024] / Sebastian Raschka📺: Developing an LLM: Building, Training, Finetuning [June 2024]
  • Transformer Explainer: an open-source interactive tool to learn about the inner workings of a Transformer model (GPT-2) git [8 Aug 2024]
  • Beam Search [1977] in Transformers is an inference algorithm that maintains the beam_size most probable sequences until the end token appears or the maximum sequence length is reached. If beam_size (k) is 1, it's a Greedy Search. If k equals the vocabulary size, it's an Exhaustive Search. (A decoding sketch follows this list.) ref [Mar 2022]
  • Einsum is All you Need: Einstein Summation [5 Feb 2018] (the attention sketch after this list uses np.einsum)
  • You could have designed state-of-the-art positional encoding: binary position encoding, sinusoidal positional encoding, absolute vs. relative position encoding, and rotary positional encoding (RoPE). [17 Nov 2024] (A sinusoidal-encoding sketch follows this list.)
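
As referenced in the picoGPT and einsum items above, the sketch below shows how a fused w_fc projection can be split back into q, k and v and used for single-head scaled dot-product self-attention with np.einsum. This is a minimal NumPy illustration under assumed shape conventions (no batching, no multi-head, random weights), not picoGPT's actual code.

```python
import numpy as np

def self_attention(x, w_fc, causal=True):
    """Minimal single-head self-attention with a fused QKV projection.
    x: [n_seq, n_embd], w_fc: [n_embd, 3*n_embd]."""
    n_seq, n_embd = x.shape
    qkv = x @ w_fc                       # [n_seq, 3*n_embd]
    q, k, v = np.split(qkv, 3, axis=-1)  # each [n_seq, n_embd]

    # scaled dot-product scores via einsum: scores[i, j] = q_i . k_j / sqrt(d)
    scores = np.einsum("id,jd->ij", q, k) / np.sqrt(n_embd)

    if causal:  # mask future positions for autoregressive decoding
        future = np.triu(np.ones((n_seq, n_seq), dtype=bool), k=1)
        scores = np.where(future, -1e10, scores)

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v                   # [n_seq, n_embd]

# usage with random weights
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
w_fc = rng.normal(size=(16, 48))
print(self_attention(x, w_fc).shape)  # (5, 16)
```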
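
The following is a minimal sketch of the beam-search decoding loop described above. The next_token_logprobs callable, bos_id and eos_id are hypothetical stand-ins for a model's scoring function and special tokens, not any specific library API.

```python
import heapq
import numpy as np

def beam_search(next_token_logprobs, bos_id, eos_id, beam_size=3, max_len=20):
    """Keep the beam_size most probable sequences; beam_size=1 is greedy search.
    next_token_logprobs(seq) -> np.ndarray of log-probs over the vocabulary (hypothetical)."""
    beams = [(0.0, [bos_id])]                      # (cumulative log-prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos_id:                  # finished beams are carried over unchanged
                candidates.append((score, seq))
                continue
            logprobs = next_token_logprobs(seq)
            # expand only the top beam_size tokens for each beam
            for tok in np.argsort(logprobs)[-beam_size:]:
                candidates.append((score + float(logprobs[tok]), seq + [int(tok)]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        if all(seq[-1] == eos_id for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])          # best (score, sequence)
```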
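
And a minimal sketch of the sinusoidal positional encoding from the positional-encoding item, assuming an even d_model and the standard sin/cos formulation from "Attention Is All You Need".

```python
import numpy as np

def sinusoidal_positional_encoding(n_seq, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(n_seq)[:, None]                        # [n_seq, 1]
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # [d_model/2]
    pe = np.zeros((n_seq, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe  # added to token embeddings before the first Transformer block

print(sinusoidal_positional_encoding(4, 8).shape)  # (4, 8)
```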

Classification of Attention

  • ref: Must-Read Starter Guide to Mastering Attention Mechanisms in Machine Learning [12 Jun 2023]

    • Soft Attention: Assigns continuous weights to all input elements. Used in neural machine translation.
    • Hard Attention: Selects a subset of input elements to focus on while ignoring the rest. Requires specialized training (e.g., reinforcement learning). Used in image captioning.
    • Global Attention: Attends to all input elements, capturing long-range dependencies. Suitable for tasks involving small to medium-length sequences.
    • Local Attention: Focuses on a localized input region, balancing efficiency and context. Used in time series analysis.
    • Self-Attention: Attends to parts of the input sequence itself, capturing dependencies. Core to models like BERT.
    • Multi-head Self-Attention: Performs multiple self-attentions in parallel, capturing diverse features. Essential for transformers.
    • Sparse Attention: Reduces computation by focusing on a limited selection of similarity scores in a sequence, resulting in a sparse matrix. Includes implementations such as "strided" and "fixed" attention and is critical for scaling to very long sequences. ref [23 Oct 2020]
    • Cross-Attention: Mixes two different embedding sequences, allowing the model to attend to information from both. In a Transformer, the step where encoder outputs are passed to the decoder is known as cross-attention. Plays a vital role in tasks like machine translation.
      ref / ref [9 Feb 2023]
    • Sliding Window Attention (SWA): Used in Longformer. It uses a fixed-size attention window around each token, allowing the model to scale efficiently to long inputs. Each token attends to half the window size on each side, significantly reducing memory overhead. (A mask-construction sketch follows this list.) ref
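
As referenced in the Sliding Window Attention item, the sketch below builds a symmetric sliding-window attention mask in NumPy, matching the half-window-on-each-side description above. It is an illustration only; Longformer's actual implementation additionally uses dilation and global attention tokens.

```python
import numpy as np

def sliding_window_mask(n_seq, window_size):
    """Boolean mask [n_seq, n_seq]: True where attention is allowed.
    Each token attends to window_size // 2 tokens on each side, plus itself."""
    half = window_size // 2
    idx = np.arange(n_seq)
    return np.abs(idx[:, None] - idx[None, :]) <= half

mask = sliding_window_mask(8, window_size=4)
# apply before softmax: scores = np.where(mask, scores, -1e10)
print(mask.astype(int))
```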

LLM Materials for East Asian Languages

Japanese

Korean

Learning and Supplementary Materials