Make Neural Networks Faster
Methods for compressing and accelerating deep learning models. Papers are grouped by topic:
- Applications
- Distillation
- Pruning
- Neural architecture search
- Benchmarking
- Quantization
- Accelerating training
- Multimodal
- Task-specific tricks
- Architecture-specific tricks
- Speech
- Carbon footprint and alternative power sources
- New papers
Applications
- Natural Language Processing with Small Feed-Forward Networks
- Machine Learning at Facebook: Understanding Inference at the Edge
- Recognizing People in Photos Through Private On-Device Machine Learning
- Knowledge Transfer for Efficient On-device False Trigger Mitigation
- Smart Reply: Automated Response Suggestion for Email
- Chat Smarter with Allo
Distillation
- Model Compression
- Distilling the Knowledge in a Neural Network
- TinyBERT: Distilling BERT for Natural Language Understanding
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
- Distilling Large Language Models into Tiny and Effective Students using pQRNN
- Sequence-Level Knowledge Distillation
- DynaBERT: Dynamic BERT with Adaptive Width and Depth
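A minimal sketch of the soft-target loss from "Distilling the Knowledge in a Neural Network", assuming PyTorch and a frozen `teacher` plus a smaller `student` (both hypothetical); the temperature `T` and mixing weight `alpha` follow the usual convention and are illustrative choices, not values from any specific paper above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the soft-target KL loss (scaled by T^2) with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Inside a training loop (teacher frozen, student being trained):
# with torch.no_grad():
#     teacher_logits = teacher(inputs)
# loss = distillation_loss(student(inputs), teacher_logits, labels)
```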
Pruning
- Optimal Brain Damage
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- The Lottery Ticket Hypothesis: A Survey (blog post)
- Bayesian Bits: Unifying Quantization and Pruning
- Structured Pruning of Neural Networks with Budget-Aware Regularization
- Block Pruning For Faster Transformers
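A minimal sketch of global magnitude pruning, the basic operation behind lottery-ticket-style experiments, using PyTorch's `torch.nn.utils.prune`; the toy model and the 80% sparsity level are illustrative choices.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

# Prune the 80% smallest-magnitude weights across all Linear layers at once.
parameters_to_prune = [
    (module, "weight") for module in model.modules() if isinstance(module, nn.Linear)
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.8,
)

# Make the pruning permanent (bakes the mask into the weight tensors).
for module, name in parameters_to_prune:
    prune.remove(module, name)
```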
Neural architecture search
- SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers
- FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable NAS
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
- High-Performance Large-Scale Image Recognition Without Normalization
- HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
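A toy sketch of the hardware-aware objective used by FBNet/HAT-style searches: trade accuracy off against measured latency on the target device. The search space, the proxy `evaluate` function, and the latency weight `lam` below are placeholders; real searches use supernets or differentiable relaxations rather than random sampling.

```python
import random

# Toy search space: depth, width multiplier, and kernel size per candidate.
SEARCH_SPACE = {
    "depth": [2, 3, 4],
    "width_mult": [0.5, 0.75, 1.0],
    "kernel": [3, 5, 7],
}

def sample_architecture():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def evaluate(arch):
    """Placeholder proxies: in a real search these come from training and on-device profiling."""
    accuracy = 0.6 + 0.1 * arch["depth"] * arch["width_mult"]            # fake proxy
    latency_ms = 5.0 * arch["depth"] * arch["width_mult"] * arch["kernel"] / 3
    return accuracy, latency_ms

def hardware_aware_score(accuracy, latency_ms, lam=0.01):
    # Reward accuracy, penalize latency measured on the target hardware.
    return accuracy - lam * latency_ms

best = max(
    (sample_architecture() for _ in range(100)),
    key=lambda a: hardware_aware_score(*evaluate(a)),
)
print(best)
```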
Benchmarking
- Show Your Work: Improved Reporting of Experimental Results
- Showing Your Work Doesn’t Always Work
- The Hardware Lottery
- HULK: An Energy Efficiency Benchmark Platform for Responsible Natural Language Processing
- An Analysis of Deep Neural Network Models for Practical Applications
- MLPerf Inference Benchmark
- MLPerf Training Benchmark
- Roofline: an insightful visual performance model for multicore architectures
- Evaluating the Energy Efficiency of Deep Convolutional Neural Networks on CPUs and GPUs
- Deep Learning Language Modeling Workloads: Where Time Goes on Graphics Processors
- Energy and Policy Considerations for Deep Learning in NLP
- IrEne: Interpretable Energy Prediction for Transformers
- Measuring the Carbon Intensity of AI in Cloud Instances
- Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning
- Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models
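Most of the benchmarking papers above stress careful measurement methodology. Below is a minimal latency-benchmarking sketch in PyTorch: warm-up iterations, GPU synchronization before reading the clock, and a median over many runs. `model` and `example_input` stand for whatever you want to profile.

```python
import time
import torch

def benchmark_latency(model, example_input, warmup=10, iters=100):
    """Median wall-clock latency per forward pass, in milliseconds."""
    model.eval()
    timings = []
    with torch.no_grad():
        for _ in range(warmup):                      # warm-up: caches, JIT, cuDNN autotune
            model(example_input)
        if example_input.is_cuda:
            torch.cuda.synchronize()
        for _ in range(iters):
            start = time.perf_counter()
            model(example_input)
            if example_input.is_cuda:
                torch.cuda.synchronize()             # wait for the GPU before stopping the clock
            timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    return timings[len(timings) // 2]
```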
Quantization
- Scalable Methods for 8-bit Training of Neural Networks
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
- Once-for-All: Train One Network and Specialize it for Efficient Deployment
- Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
- I-BERT: Integer-only BERT Quantization
- BinaryBERT: Pushing the Limit of BERT Quantization
- TernaryBERT: Distillation-aware Ultra-low Bit BERT
- Binarized Neural Networks
- Training Deep Neural Networks with 8-bit Floating Point Numbers
- HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision
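As a much simpler baseline than the QAT and Hessian-aware methods above, PyTorch's built-in post-training dynamic quantization converts Linear weights to int8 in one call; the toy model below is a placeholder standing in for, e.g., a fine-tuned BERT.

```python
import torch
import torch.nn as nn

# Placeholder float model; in practice this would be a pretrained/fine-tuned network.
float_model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

# Dynamic quantization: weights stored in int8, activations quantized on the fly.
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized_model(x).shape)    # same interface, smaller and faster Linear layers
```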
Accelerating training
- Prefix-Tuning: Optimizing Continuous Prompts for Generation
- Pre-Training Transformers as Energy-Based Cloze Models
- Parameter-Efficient Transfer Learning for NLP
- Accelerating Deep Learning by Focusing on the Biggest Losers
- Dataset Distillation
- Competence-based curriculum learning for neural machine translation
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
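A minimal sketch of parameter-efficient transfer learning in the spirit of "Parameter-Efficient Transfer Learning for NLP" above: freeze the pretrained backbone and train only small bottleneck modules. The `Adapter` module, its dimensions, and how it is wired into the backbone are simplified assumptions, not the paper's exact architecture.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter (simplified Houlsby-style): down-project, nonlinearity,
    up-project, plus a residual connection."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def freeze_backbone(backbone):
    """Freeze the pretrained weights so only adapter/head parameters get gradients."""
    for p in backbone.parameters():
        p.requires_grad = False
    return backbone

# Usage sketch: insert an Adapter after each transformer layer's output and pass only
# the adapter (and task head) parameters to the optimizer, e.g.
#   optimizer = torch.optim.AdamW(adapter_params, lr=1e-4)
```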
Task-specific tricks
- A Study of Non-autoregressive Model for Sequence Generation
- Mask-Predict: Parallel Decoding of Conditional Masked Language Models
- Non-Autoregressive Neural Machine Translation
- Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation
- Improving Low Compute Language Modeling with In-Domain Embedding Initialisation
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List
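To illustrate the speed-up the non-autoregressive papers above target, here is a toy sketch contrasting greedy autoregressive decoding (one forward pass per output token) with single-pass parallel decoding. The `decoder(tokens, memory)` interface and the all-zeros placeholder input are assumptions for illustration, not any specific paper's method.

```python
import torch

def autoregressive_decode(decoder, memory, bos_id, max_len):
    """Greedy left-to-right decoding: one forward pass per output token."""
    ys = torch.full((memory.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = decoder(ys, memory)                       # assumed signature
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)
    return ys[:, 1:]

def non_autoregressive_decode(decoder, memory, tgt_len):
    """Predict all positions in parallel from a length estimate: a single forward pass."""
    placeholder = torch.zeros(memory.size(0), tgt_len, dtype=torch.long)   # e.g. all <mask>
    logits = decoder(placeholder, memory)
    return logits.argmax(-1)
```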
Architecture-specific tricks
CNNs
- XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
- XOR-Net: An Efficient Computation Pipeline for Binary Neural Network Inference on Edge Devices
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
- Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
- FFT Convolutions are Faster than Winograd on Modern CPUs, Here’s Why
- Fast Algorithms for Convolutional Neural Networks
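The MobileNets paper above builds on depthwise separable convolutions; a minimal PyTorch block is sketched below (a per-channel depthwise 3x3 conv followed by a pointwise 1x1 conv), which the paper reports as roughly 8-9x cheaper in multiply-adds than a standard 3x3 convolution at the same channel counts.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style block: depthwise 3x3 conv (groups=in_ch) then pointwise 1x1 conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```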
Softmax
Embeddings/inputs
Transformers
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
- Do Transformer Modifications Transfer Across Implementations and Applications?
- Efficient Transformers: A Survey
- Consistent Accelerated Inference via Confident Adaptive Transformers
- PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination
- Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level
- Are Sixteen Heads Really Better Than One?
- Are Pre-trained Convolutions Better than Pre-trained Transformers?
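A toy sketch of the progressive word-vector elimination idea behind PoWER-BERT: score tokens by the attention they receive and keep only the top-k between layers. The scoring rule and tensor shapes here are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def eliminate_word_vectors(hidden_states, attention_probs, keep):
    """Keep the `keep` tokens that receive the most attention.

    hidden_states:   (batch, seq_len, dim)
    attention_probs: (batch, num_heads, seq_len, seq_len)
    """
    # Significance of token j = total attention paid to it across heads and queries.
    scores = attention_probs.sum(dim=(1, 2))                       # (batch, seq_len)
    top_idx = scores.topk(keep, dim=-1).indices.sort(-1).values    # preserve original order
    batch_idx = torch.arange(hidden_states.size(0)).unsqueeze(-1)
    return hidden_states[batch_idx, top_idx]                       # (batch, keep, dim)
```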
Carbon footprint and alternative power sources
- Tackling Climate Change with Machine Learning
- On the opportunities and risks of foundation models (Section 5.3)
- Quantifying the Carbon Emissions of Machine Learning
- AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning
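For tracking the footprint of a run in practice (in the spirit of Carbontracker above), a tool such as codecarbon can be dropped around the training loop; `train()` below is a placeholder for your own loop, and the reported figure is an estimate based on measured energy use and regional carbon intensity.

```python
# pip install codecarbon
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="model-training")   # project_name is optional
tracker.start()
try:
    train()                                   # placeholder for your training loop
finally:
    emissions_kg = tracker.stop()             # estimated kg CO2-eq for the run
    print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")
```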