Dot Product in the Attention Mechanism

The dot product of two embedding vectors and with dimension is defined as Hardly the first thing that jumps to mind when thinking about a “similarity score”. Indeed, the result of a dot product is a single numbers (a scalar), with no predefined range (e.g. not between zero and one). So, it’s hard to quantify whether a particular score is high/low on its own. Still, deep learning Transformer family of models rely heavily on the dot product in the attention mechanism; to weigh the importance of...