Sequence Encoding: Basic Attention

Representations of Variable Length Data

  • Input: word sequence, image pixels, audio signal, click logs
  • Property: continuity, temporal structure, importance distribution
  • Example
    • Basic combination: average, sum (see the sketch after this list)
    • Neural combination: network architectures should consider input domain properties
      • CNN (convolutional neural network)
      • RNN (recurrent neural network): temporal information
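A minimal sketch of the basic combinations, assuming each word has already been mapped to a fixed-size embedding (the embeddings below are random and purely illustrative):

```python
import numpy as np

# Toy embeddings: 4 words in a sentence, each a 5-dimensional vector (random for illustration).
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(4, 5))

# Basic combinations: both collapse a variable-length sequence into one fixed-size vector,
# ignoring word order and which words matter more.
sentence_sum = word_embeddings.sum(axis=0)
sentence_avg = word_embeddings.mean(axis=0)
print(sentence_sum.shape, sentence_avg.shape)  # (5,) (5,)
```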

Network architectures should consider the input domain properties

Recurrent Neural Networks (RNN)

  • Learning variable-length representations

    • Fit for sentences and sequences of values
  • Sequential computation makes parallelization difficult (see the sketch below)

  • No explicit modeling of long- and short-range dependencies

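A minimal sketch of the recurrence that causes this, assuming a vanilla RNN cell (all weights are random stand-ins); each hidden state depends on the previous one, so the time steps cannot be computed in parallel:

```python
import numpy as np

def rnn_encode(x_seq, W_x, W_h):
    """Vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1}). Purely illustrative."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_seq:                      # sequential loop: step t needs h_{t-1}
        h = np.tanh(W_x @ x_t + W_h @ h)
    return h                               # final state summarizes the whole sequence

rng = np.random.default_rng(0)
x_seq = rng.normal(size=(6, 4))            # 6 time steps, 4-dim inputs
h = rnn_encode(x_seq, rng.normal(size=(8, 4)), rng.normal(size=(8, 8)))
print(h.shape)  # (8,)
```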

Convolutional Neural Networks (CNN)

  • Easy to parallelize

  • Exploit local dependencies
    • Long-distance dependencies require many layers

Attention

  • The encoder-decoder model is important in NMT
  • RNNs need an attention mechanism to handle long-range dependencies
  • Attention allows us to access any state


Machine Translation with Attention

  • query: the information z that is brought in to compute the degree of match
  • key: the vectors that the query is matched against are called keys
    • the query is matched against each key to measure the degree of match
    • the keys are used to compute α
  • value: can be different from the key; the values are used to compute the weighted sum


Dot-Product Attention

  • Input: a query q and a set of key-value (k-v) pairs
  • Output: a weighted sum of the values, where each weight is the softmax-normalized dot product of the query with the corresponding key (see the sketch below)

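A sketch in plain numpy, assuming the query and keys share the same dimension; the function name and shapes are illustrative:

```python
import numpy as np

def dot_product_attention(q, K, V):
    """Single query q (d,), keys K (n, d), values V (n, d_v)."""
    scores = K @ q                                   # match: dot product of q with every key
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax over the n keys gives the weights
    return alpha @ V                                 # weighted sum of values, shape (d_v,)

rng = np.random.default_rng(0)
out = dot_product_attention(rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 3)))
print(out.shape)  # (3,)
```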

Dot-Product Attention in Matrix Form

  • Input: multiple queries stacked into a matrix Q and a set of key-value (k-v) pairs
  • Output: a matrix of weighted sums of the values, one row per query (see the sketch below)

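The same computation for all queries at once, written as the matrix product softmax(QKᵀ)V; a sketch with the queries stacked as the rows of Q:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention_matrix(Q, K, V):
    """Q (m, d), K (n, d), V (n, d_v) -> (m, d_v): softmax(Q K^T) V."""
    return softmax(Q @ K.T, axis=-1) @ V

rng = np.random.default_rng(0)
out = dot_product_attention_matrix(rng.normal(size=(5, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 3)))
print(out.shape)  # (5, 3)
```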

Sequence Encoding: Self-Attention

Attention

  • The encoder-decoder model is important in NMT
  • RNNs need an attention mechanism to handle long-range dependencies
  • Attention allows us to access any state

Using attention to replace recurrence architectures

Self-Attention

  • Constant “path length” between two positions

  • Easy to parallelize (see the sketch below)

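A sketch of self-attention, assuming the queries, keys, and values are all linear projections of the same input sequence X (the projection matrices here are random stand-ins for learned parameters); every position attends to every other position in a single matrix product, which is why the path length between positions is constant and the computation parallelizes:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X (n, d_model); queries, keys, and values are all derived from the same sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T                                  # every position matched against every position
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return alpha @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                           # 6 positions, d_model = 8
out = self_attention(X, *(rng.normal(size=(8, 8)) for _ in range(3)))
print(out.shape)  # (6, 8)
```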

Transformer Idea


Encoder Self-Attention


Decoder Self-Attention

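In the decoder, self-attention is masked so that each position can only attend to earlier positions (the model must not peek at the tokens it still has to generate); a self-contained sketch with an illustrative causal mask:

```python
import numpy as np

def masked_self_attention(X, W_q, W_k, W_v):
    """Decoder self-attention: position i may only attend to positions <= i."""
    n = X.shape[0]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -1e9)  # causal mask
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return alpha @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = masked_self_attention(X, *(rng.normal(size=(8, 8)) for _ in range(3)))
print(out.shape)  # (5, 8)
```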

Sequence Encoding: Multi-Head Attention

Convolution


Self-Attention


Attention Head: who


Comparison


Sequence Encoding: Transformer

Transformer Overview


Multi-Head Attention


Scaled Dot-Product Attention

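A sketch of the scaling step from the original paper: the dot products are divided by √d_k before the softmax so that large key dimensions do not push the softmax into a region with vanishing gradients.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with Q (m, d_k), K (n, d_k), V (n, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # scale before the softmax
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return alpha @ V

rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(5, 64)), rng.normal(size=(7, 64)), rng.normal(size=(7, 8)))
print(out.shape)  # (5, 8)
```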

Transformer Encoder Block

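A minimal sketch of one encoder block (single-head attention for brevity, all weights random stand-ins): self-attention, a residual connection plus layer norm, a position-wise feed-forward network, and another residual connection plus layer norm, following the post-norm arrangement of the original paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(X, W_q, W_k, W_v, W_o, W_1, b_1, W_2, b_2):
    """One Transformer encoder block (single-head here for brevity):
    self-attention -> residual + layer norm -> feed-forward -> residual + layer norm."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    X = layer_norm(X + attn @ W_o)                     # residual connection around attention
    ffn = np.maximum(0, X @ W_1 + b_1) @ W_2 + b_2     # position-wise FFN with ReLU
    return layer_norm(X + ffn)                         # residual connection around the FFN

rng = np.random.default_rng(0)
d, d_ff, n = 8, 16, 6
X = rng.normal(size=(n, d))
params = [rng.normal(size=s) * 0.1 for s in
          [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff,), (d_ff, d), (d,)]]
print(encoder_block(X, *params).shape)  # (6, 8)
```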

Encoder Input

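Self-attention by itself ignores word order, so the encoder input is the word embedding plus a positional encoding; a sketch of the sinusoidal encoding from the original paper (the toy embeddings are random):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(6, 8))              # 6 tokens, d_model = 8
encoder_input = word_embeddings + positional_encoding(6, 8)
print(encoder_input.shape)  # (6, 8)
```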

Multi-Head Attention Details

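A sketch of multi-head attention, assuming d_model is split evenly across the heads; each head runs scaled dot-product attention in its own projected subspace, so different heads can attend to different aspects, and the head outputs are concatenated and passed through a final projection:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X (n, d_model); each W_* is (d_model, d_model); d_model is split evenly across heads."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):                           # one scaled dot-product attention per head
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, sl])
    return np.concatenate(heads, axis=-1) @ W_o        # concatenate heads, then final projection

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W = [rng.normal(size=(8, 8)) * 0.1 for _ in range(4)]
print(multi_head_attention(X, *W, n_heads=2).shape)  # (6, 8)
```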

Training Tips

  • Byte-pair encodings (BPE)
  • Checkpoint averaging
  • ADAM optimizer with learning-rate changes (warm-up then decay; see the sketch after this list)
  • Dropout during training at every layer, just before adding the residual
  • Label smoothing
  • Auto-regressive decoding with beam search and length penalties
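A sketch of the learning-rate schedule used with ADAM in the original Transformer paper: a linear warm-up followed by inverse square-root decay (d_model and warmup_steps below are the paper's defaults):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))."""
    step = max(step, 1)                                # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises linearly for the first warmup_steps steps, then decays as 1/sqrt(step).
print(transformer_lr(100), transformer_lr(4000), transformer_lr(40000))
```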

MT Experiments


Parsing Experiments


Concluding Remarks

  • Non-recurrent models are easy to parallelize
  • Multi-head attention captures different aspects of the interactions between words
  • Positional encoding captures location information
  • Each transformer block can be applied to diverse tasks