Papers - Custom Layers
Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning
Paper • 2310.20587 • Published • 18

JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
Paper • 2310.00535 • Published • 2

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Paper • 2307.09458 • Published • 12

The Impact of Depth and Width on Transformer Language Model Generalization
Paper • 2310.19956 • Published • 10

Veagle: Advancements in Multimodal Representation Learning
Paper • 2403.08773 • Published • 10

Hash Layers For Large Sparse Models
Paper • 2106.04426 • Published • 2

Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
Paper • 2311.10642 • Published • 25

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Paper • 2402.02622 • Published • 3

The Unreasonable Ineffectiveness of the Deeper Layers
Paper • 2403.17887 • Published • 82

Lumiere: A Space-Time Diffusion Model for Video Generation
Paper • 2401.12945 • Published • 86

RWKV: Reinventing RNNs for the Transformer Era
Paper • 2305.13048 • Published • 21

Condition-Aware Neural Network for Controlled Image Generation
Paper • 2404.01143 • Published • 13

Locating and Editing Factual Associations in GPT
Paper • 2202.05262 • Published • 1

MLP Can Be A Good Transformer Learner
Paper • 2404.05657 • Published • 1

Toward a Better Understanding of Fourier Neural Operators: Analysis and Improvement from a Spectral Perspective
Paper • 2404.07200 • Published • 2

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Paper • 2402.15627 • Published • 36

Scaling MLPs: A Tale of Inductive Bias
Paper • 2306.13575 • Published • 17

GLIGEN: Open-Set Grounded Text-to-Image Generation
Paper • 2301.07093 • Published • 4

All you need is a good init
Paper • 1511.06422 • Published • 1

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper • 2404.16710 • Published • 80

Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3
Paper • 2405.00664 • Published • 20

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
Paper • 2403.07809 • Published • 1

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Paper • 2410.23168 • Published • 24

Augmenting Self-attention with Persistent Memory
Paper • 1907.01470 • Published • 1

Paper • 2412.09764 • Published • 5