Deep Learning · November 2024 · 1 min read

Transformer for News Summarization

Self-attention, multi-head attention, and an encoder-decoder architecture implemented from scratch. Trained on CNN/DailyMail, reaching 35.1 ROUGE-L and outperforming an LSTM baseline by 60%.

Python · PyTorch · Transformers · NLP

End-to-end Transformer for abstractive summarization: scaled dot-product attention, positional encoding, layer normalization, and an encoder-decoder stack. Beats an LSTM language-model baseline by 60% on ROUGE-L. An LSTM for headline generation was trained in parallel as a comparison.
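The core operation of the stack is scaled dot-product attention. A minimal PyTorch sketch (illustrative only, not the project's exact code; tensor shapes are assumptions):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k); scores are scaled by sqrt(d_k)
    # so softmax inputs stay in a reasonable range as d_k grows
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # masked positions get -inf so softmax assigns them zero weight
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

q = k = v = torch.randn(2, 5, 64)
out, attn = scaled_dot_product_attention(q, k, v)
```

With a causal mask on the decoder side, the same function covers both encoder self-attention and masked decoder self-attention.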

Implementation: Multi-head attention, positional encoding, encoder-decoder
Performance: ROUGE-L 35.1, BLEU-4 24.8 on CNN/DailyMail
Perplexity: 35.2 on language modeling task

Multi-Head Attention

Different heads learn different relationships — syntactic adjacency, coreference resolution, positional patterns, and semantic clustering. This functional specialization emerges without explicit supervision.
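The mechanics behind this specialization: each head gets its own slice of the model dimension and computes attention independently, so different heads can settle on different relationship patterns. A hedged sketch (hypothetical minimal module; `d_model=64` and `n_heads=8` are illustrative, not the project's configuration):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (b, n_heads, t, d_head); each head attends independently
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = scores.softmax(dim=-1)           # (b, n_heads, t, t): one map per head
        ctx = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out(ctx), attn

x = torch.randn(2, 10, 64)
y, attn = MultiHeadAttention()(x)
```

Inspecting the per-head attention maps in `attn` is how head specialization (adjacency, coreference, positional patterns) is typically visualized.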

Scaling Behavior

Ablations over depth, number of heads, and sequence length: performance saturates around 8 layers and 8 heads. The Transformer processes 3× more samples/sec than the LSTM thanks to parallelization; self-attention has no sequential dependency.
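The throughput gap comes down to where the loop lives. A toy comparison (illustrative shapes, not the project's benchmark): self-attention covers all timesteps in one batched matmul, while an LSTM's hidden state forces a step-by-step loop.

```python
import torch
import torch.nn as nn

b, t, d = 4, 128, 64
x = torch.randn(b, t, d)

# Transformer-style: every position attends at once -- one batched matmul,
# no dependency between timesteps, so the GPU parallelizes freely
scores = x @ x.transpose(-2, -1) / d ** 0.5   # (b, t, t)
ctx = scores.softmax(dim=-1) @ x              # (b, t, d)

# LSTM-style: h depends on the previous h, forcing a sequential loop
cell = nn.LSTMCell(d, d)
h = c = torch.zeros(b, d)
outs = []
for step in range(t):                         # t dependent steps, one at a time
    h, c = cell(x[:, step], (h, c))
    outs.append(h)
lstm_out = torch.stack(outs, dim=1)           # (b, t, d)
```

Both paths produce a `(batch, seq_len, d)` output, but only the second one is inherently serial in sequence length.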

#transformers #attention #summarization #nlp
