Cross-Attention for Cross-Asset Applications: How to mash up features from multiple assets
Ernie Chan
In the previous blog post, we saw how to apply a self-attention transformer to a matrix of time series features for a single stock. The output of that transformer is a transformed feature vector r of dimension 768 × 1. Here 768 = 12 × 64: 12 is the number of lagged months, 64 is the dimension of the embedding space for the 52 features we constructed, and all the lagged embeddings are concatenated (flattened) into one vector. What if we have a portfolio of many stocks whose returns we
