Scale ML

▸ We are a cross-lab MIT AI graduate student collective focusing on Algorithms That Learn and Scale.
▸ The group is open to all MIT affiliates; to participate, contact the organizers. We currently host bi-weekly seminars, with hands-on sessions and research socials planned for the future.
▸ Our coffee ☕ + baked goods 🍰 are currently funded by generous donations from Phillip Isola and Yoon Kim.
▸ We are looking for sponsors to increase our seminar snack capacity, fund research socials, and reimburse speaker travel. Please contact the organizers if interested.

▸ Join our next seminar on Zoom (currently open to MIT students only): Click here to join

Discussion Schedule

  • TBD 1B-parameter model training (hands-on session) Aniruddha Nrusimha (MIT)
  • TBD How to scale models with Modula (hands-on session) Jeremy Bernstein (MIT)
  • 07/24 FineWeb: Creating a large dataset for pretraining LLMs Guilherme Penedo (Hugging Face)
  • 07/17 Hardware-aware Algorithms for Language Modeling Tri Dao (Princeton)
  • 07/10 LLM360: Towards Fully Transparent Open-Source LLMs Hongyi Wang (CMU)
  • 07/03 DeciMamba: Exploring the Length Extrapolation Potential of Mamba Assaf Ben-Kish (Tel-Aviv)
  • 04/17 Adapting LLMs with Reinforcement Learning Idan Shenfeld
  • 04/03 The Quest to build an (O)pen (L)anguage (Mo)del Luca Soldaini (AI2)
  • 03/20 Efficient Deep Learning with Sparsity: Algorithms, Systems, and Applications Zhijian Liu
  • 03/12 Building and Deploying Large Language Model Applications Efficiently and Verifiably Ying Sheng (Stanford)
  • 03/06 In-Context Language Learning and N-gram Heads Ekin Akyürek
  • 02/21 Neurons, norms and number systems Jeremy Bernstein
  • 11/28 Sparsity in Transformers Shobhita Sundaram
  • 11/01 Critical batch-size in deep learning Minyoung Huh (Jacob)
  • 10/18 Large-Scale RNNs in the era of Transformers Bailin Wang
  • 10/18 Tensor Program Synthesis Han Guo
  • 10/04 Mixture of Experts (MoEs) Jyo Pari
  • 09/13 Speculative Decoding Aniruddha Nrusimha

Critical batch-size in deep learning

What batch-size should you use for your model? What does the batch-size tell you about your task? This post discusses one of the main aspects of scaling laws...

November 1, 2023 · 23 min · Author: Minyoung Huh | Editor: N/A
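The full argument is in the post linked above. As a rough illustration only (not taken from the post itself), a common heuristic estimates the critical batch-size from the gradient noise scale, which can be measured by comparing gradient norms at a small and a large batch-size (McCandlish et al., 2018). The PyTorch sketch below assumes placeholder `model`, `loss_fn(model, batch)`, and batch objects.

```python
import torch

def grad_norm_sq(model, loss_fn, batch):
    """Squared L2 norm of the loss gradient measured on a single batch."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    return sum(p.grad.pow(2).sum().item()
               for p in model.parameters() if p.grad is not None)

def simple_noise_scale(model, loss_fn, small_batch, big_batch, b_small, b_big):
    """Estimate the gradient noise scale B_simple = tr(Sigma) / |G|^2.

    Uses E[|G_B|^2] = |G|^2 + tr(Sigma) / B at two batch sizes. In practice
    both measurements should be averaged over many batches.
    """
    g_small = grad_norm_sq(model, loss_fn, small_batch)   # |G_{b_small}|^2
    g_big = grad_norm_sq(model, loss_fn, big_batch)       # |G_{b_big}|^2
    g_true_sq = (b_big * g_big - b_small * g_small) / (b_big - b_small)
    trace_sigma = (g_small - g_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / g_true_sq   # roughly the critical batch-size
```

Roughly, batch-sizes below this estimate trade steps for compute almost one-for-one, while batch-sizes far above it see diminishing returns per additional example.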

Mixture of Experts (MoE)

MoEs are rumored to be a critical component in scaling up to trillion-parameter models. By routing tokens to specialized modular functions, they give a model the representational power of a much larger model than the one actually used for each prediction. We will discuss how MoE works and recent advances in the literature...

October 4, 2023 · 11 min · Author: Jyo Pari | Editor: Minyoung Huh
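As a minimal sketch of the routing idea described above (a generic top-k router with illustrative names, not the implementation of any particular paper), each token is scored by a small linear router, sent to its k highest-scoring expert MLPs, and the expert outputs are combined with the renormalized router weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer: a linear router scores each
    token, only the top-k experts run on it, and their outputs are mixed
    using the renormalized router weights."""

    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)        # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (n_tokens, d_model)
        scores = self.router(x)                            # (n_tokens, n_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)    # k experts per token
        gates = F.softmax(top_vals, dim=-1)                # weights over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e               # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 16 tokens of width 64, each handled by its 2 highest-scoring experts.
moe = TopKMoE(d_model=64)
y = moe(torch.randn(16, 64))                               # (16, 64)
```

Only k of the n_experts MLPs run for any given token, which is how such a layer can hold far more parameters than it spends compute on per prediction.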

Speculative decoding

A brief overview of speculative decoding, detailing the roots of LLM inference slowdowns and how algorithm-level changes can improve generation speed...

September 13, 2023 · 14 min · Author: Ani Nrusimha | Editor: Minyoung Huh
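As a minimal sketch of the idea (a greedy-decoding variant with placeholder `draft_model` and `target_model` that map token ids to logits, not the exact algorithm covered in the post), a small draft model proposes several tokens, the large target model verifies the whole proposal in a single forward pass, and only the agreeing prefix plus one corrected token is kept:

```python
import torch

@torch.no_grad()
def speculative_greedy_step(target_model, draft_model, tokens, k=4):
    """One step of greedy speculative decoding on a (1, seq_len) LongTensor.

    The cheap draft model proposes k tokens autoregressively; the expensive
    target model scores the whole continuation in one forward pass, and draft
    tokens are kept only while they match the target's greedy choice.
    """
    # 1) Draft: propose k tokens one at a time with the small model.
    draft = tokens
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)

    # 2) Verify: a single target forward pass over prompt + all k proposals.
    target_next = target_model(draft).argmax(-1)      # target's greedy pick at each position

    # 3) Accept the longest prefix of draft tokens the target agrees with,
    #    then append the target's own next token (so we always gain >= 1 token).
    n_prompt = tokens.shape[1]
    accepted = tokens
    for i in range(k):
        proposed = draft[:, n_prompt + i]
        expected = target_next[:, n_prompt + i - 1]   # logits at position p predict token p + 1
        if not torch.equal(proposed, expected):
            break
        accepted = torch.cat([accepted, proposed.unsqueeze(-1)], dim=-1)
    correction = target_next[:, accepted.shape[1] - 1].unsqueeze(-1)
    return torch.cat([accepted, correction], dim=-1)
```

Because the target model runs once per k proposed tokens instead of once per token, the average accepted-prefix length sets the speedup, and with greedy verification the output matches what the target model alone would have generated.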