paper: https://arxiv.org/pdf/2105.02358.pdf
F = query_linear(F)           # shape = (B, N, C)
attn = M_k(F)                 # shape = (B, N, M)
attn = softmax(attn, dim=1)
attn = l1_norm(attn, dim=2)
out = M_v(attn)               # shape = (B, N, C)
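The pseudocode above can be sketched as a minimal NumPy implementation, with plain weight matrices (`W_q`, `M_k`, `M_v` here are assumed stand-ins for the paper's linear layers) and the double normalization written out explicitly:

```python
import numpy as np

def external_attention(F, W_q, M_k, M_v):
    """Sketch of external attention.
    F:   input features, shape (B, N, C)
    W_q: query projection, shape (C, C)
    M_k: external key memory, shape (C, S)
    M_v: external value memory, shape (S, C)
    """
    F = F @ W_q                       # query linear, (B, N, C)
    attn = F @ M_k                    # similarity to S memory rows, (B, N, S)
    # double normalization: softmax over the pixel dim (N), then l1 over memory dim (S)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)            # softmax, dim=1
    attn = attn / (attn.sum(axis=2, keepdims=True) + 1e-9)   # l1 norm, dim=2
    return attn @ M_v                 # updated features, (B, N, C)
```

In a trained network `M_k` and `M_v` would be learned parameters shared across the whole dataset, which is what makes the attention map a dataset-level prior rather than a per-sample one.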
External attention computes attention between the input pixels and an external memory unit M ∈ R^(S×d):
A = (α_{i,j}) = Norm(F Mᵀ)
F_out = A M.
A is the attention map inferred from M, a learned dataset-level prior; the input features are then updated as a combination of the rows of M weighted by the similarities in A.
The computational complexity of external attention is O(dSN); since d and S are hyper-parameters, the algorithm is linear in the number of pixels N.