aiacademy: Natural Language Processing (NLP) 3.2 Sequence Encoding & Attention
Tags: aiacademy, attention, nlp, sequence-encoding
Sequence Encoding Basic Attention
Representations of Variable Length Data
- Input: word sequence, image pixels, audio signal, click logs
- Property: continuity, temporal, importance distribution
- Example
- Basic combination: average, sum
- Neural combination: network architectures should consider input domain properties
- CNN (convolutional neural network)
- RNN (recurrent neural network): temporal information
Network architectures should consider the input domain properties
Recurrent Neural Networks (RNN)
- Learning variable-length representations
- Fit for sentences and sequences of values
- Sequential computation makes parallelization difficult
- No explicit modeling of long- and short-range dependencies

Convolutional Neural Networks (CNN)
- Easy to parallelize
- Exploit local dependencies
- Long-distance dependencies require many layers

Attention
- Encoder-decoder model is important in NMT
- RNNs need an attention mechanism to handle long dependencies
- Attention allows us to access any state

Machine Translation with Attention
- query: the information z that is brought in to compute the match score
- key: the vectors that are matched against are called keys
- the query is matched against each key to get a match score
- the keys are used to compute the attention weights α
- value: the values can differ from the keys; the values are what the weighted sum is taken over
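
A minimal way to write this down (the softmax normalization is implied by the notes rather than stated; "match" stands for whatever scoring function is used):

$$
\alpha_i = \frac{\exp\big(\mathrm{match}(q, k_i)\big)}{\sum_j \exp\big(\mathrm{match}(q, k_j)\big)},
\qquad \text{output} = \sum_i \alpha_i\, v_i
$$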

Dot-Product Attention
- Input: a query q and a set of key-value (k-v) pairs
- Output: a weighted sum of the values

Dot-Product Attention in Matrix Form
- Input: multiple queries and a set of key-value (k-v) pairs
- Output: a set of weighted sums of the values
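
A minimal PyTorch sketch of the matrix form (shapes and names here are illustrative, not from the lecture):

```python
import torch
import torch.nn.functional as F

def dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v) -> (n_queries, d_v)."""
    scores = Q @ K.transpose(-2, -1)     # match score of every query against every key
    weights = F.softmax(scores, dim=-1)  # one attention distribution over the keys per query
    return weights @ V                   # each output row is a weighted sum of the values

Q = torch.randn(2, 4)                    # 2 queries
K = torch.randn(3, 4)                    # 3 keys
V = torch.randn(3, 6)                    # 3 values (value dim may differ from key dim)
out = dot_product_attention(Q, K, V)     # shape (2, 6)
```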

Sequence Encoding Self-Attention
Attention
- Encoder-decoder model is important in NMT
- RNNs need an attention mechanism to handle long dependencies
- Attention allows us to access any state
Using attention to replace recurrent architectures
Self-Attention
- Constant “path length” between two positions
- Easy to parallelize
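
A minimal self-attention sketch in PyTorch: queries, keys, and values all come from the same sequence via learned projections (the names w_q, w_k, w_v are mine, not from the lecture). Every position attends to every position in one matrix multiplication, which is why the path length is constant and the computation parallelizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention: Q, K, V are projections of the same input."""
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (seq_len, d_model)
        Q, K, V = self.w_q(x), self.w_k(x), self.w_v(x)
        weights = F.softmax(Q @ K.T, dim=-1)     # every position attends to every position
        return weights @ V                       # no sequential recurrence involved

x = torch.randn(5, 16)                           # a 5-token sequence
out = SelfAttention(16)(x)                       # shape (5, 16)
```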

Transformer Idea

Encoder Self-Attention

Decoder Self-Attention
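
The decoder's self-attention is normally masked so that position i can only attend to positions up to i, since the decoder must not peek at future tokens during auto-regressive generation. A minimal sketch of such a causal mask (toy scores, illustrative only):

```python
import torch
import torch.nn.functional as F

seq_len, d = 5, 16
x = torch.randn(seq_len, d)
scores = x @ x.T                                          # toy self-attention scores
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))          # block attention to future positions
weights = F.softmax(scores, dim=-1)                       # row i only covers positions 0..i
```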

Sequence Encoding Multi-Head Attention
Convolution

Self-Attention

Attention Head: who

Comparison

Sequence Encoding Transformer
Transformer Overview
- Non-recurrent encoder-decoder for MT
- PyTorch explanation by Sasha Rush
- Definitely go through it yourself once, it's fantastic~
- http://nlp.seas.harvard.edu/2018/04/03/attention.html


Multi-Head Attention

Scaled Dot-Product Attention
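
The Transformer scales the dot products by 1/√d_k before the softmax so that the scores do not grow with the key dimension; a minimal sketch (the optional mask argument is my addition):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # scale by 1/sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # e.g. causal mask in the decoder
    return F.softmax(scores, dim=-1) @ V
```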

Transformer Encoder Block
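
The encoder block stacks self-attention and a position-wise feed-forward network, each wrapped in a residual connection plus layer normalization (the training tips below mention dropout just before adding the residual). A sketch using PyTorch's built-in attention module; the dimensions follow the base Transformer but are illustrative here:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                               # x: (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x)                # self-attention over the sequence
        x = self.norm1(x + self.dropout(attn_out))      # residual + layer norm
        x = self.norm2(x + self.dropout(self.ff(x)))    # position-wise feed-forward
        return x

x = torch.randn(2, 7, 512)
y = EncoderBlock()(x)                                   # shape (2, 7, 512)
```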


Encoder Input
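
The encoder input is the token embedding plus a positional encoding (the concluding remarks note that positional encoding captures location information). A sketch of the sinusoidal encoding from the original Transformer paper:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Encoder input = token embeddings + positional encodings
emb = torch.randn(10, 512)                  # 10 tokens, d_model = 512
x = emb + sinusoidal_positional_encoding(10, 512)
```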


Multi-Head Attention Details
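
A minimal multi-head attention sketch, assuming h heads that each project the inputs down to d_model / h dimensions, attend independently, and are then concatenated and projected back (parameter names are illustrative):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)    # output projection after concatenation

    def forward(self, query, key, value):         # each: (batch, seq, d_model)
        b = query.size(0)
        def split(t, w):                          # -> (batch, heads, seq, d_head)
            return w(t).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        Q, K, V = split(query, self.w_q), split(key, self.w_k), split(value, self.w_v)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)
        out = F.softmax(scores, dim=-1) @ V       # each head attends to a different aspect
        out = out.transpose(1, 2).contiguous().view(b, -1, self.n_heads * self.d_head)
        return self.w_o(out)

x = torch.randn(2, 7, 512)                        # batch of 2, 7 tokens
mha = MultiHeadAttention(d_model=512, n_heads=8)
y = mha(x, x, x)                                  # self-attention use: shape (2, 7, 512)
```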

Training Tips
- Byte-pair encodings (BPE)
- Checkpoint averaging
- Adam optimizer with learning rate changes (see the schedule sketch after this list)
- Dropout during training at every layer, just before adding the residual
- Label smoothing
- Auto-regressive decoding with beam search and length penalties
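
The learning rate changes refer to the warmup-then-decay schedule used with Adam in the Transformer paper; a sketch of that schedule (warmup_steps = 4000 is the paper's base setting):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)                               # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises linearly during warmup, then decays as 1/sqrt(step)
print(transformer_lr(100), transformer_lr(4000), transformer_lr(100_000))
```
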
MT Experiments

Parsing Experiments

Concluding Remarks
- Non-recurrent model is easy to parallelize
- Multi-head attention captures different aspects by interacting between words
- Positional encoding captures location information
- Each transformer block can be applied to diverse tasks
