[abstract] This approach significantly reduces the KV cache size relative to traditional multi-head attention
[3.3] For saving the KV cache, only the intermediate latent representations need to be stored, giving a cache of size T x r (rather than T x (n_h · d_h) for the full keys and values), where r is much smaller than n_h · d_h
[background] In traditional multi-head attention you must cache the full key and value matrices, each of size T x (n_h · d_h), where T is the sequence length, n_h is the number of attention heads, and d_h is the dimensionality of each individual head
sounds like a big win for memory-constrained environments like local inference
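The memory win above can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, using hypothetical shapes (T, n_h, d_h, r, and fp16 storage are all assumptions for illustration, not values from the paper):

```python
# Back-of-envelope KV-cache comparison: full keys/values vs a low-rank latent.
# All shapes below are hypothetical, chosen only to make the arithmetic concrete.

def mha_cache_bytes(T, n_h, d_h, bytes_per_elem=2):
    # Standard MHA caches keys AND values, each of size T x (n_h * d_h)
    return 2 * T * n_h * d_h * bytes_per_elem

def latent_cache_bytes(T, r, bytes_per_elem=2):
    # Latent caching stores a single width-r vector per token
    return T * r * bytes_per_elem

T, n_h, d_h, r = 4096, 32, 128, 512   # hypothetical configuration
mha = mha_cache_bytes(T, n_h, d_h)
lat = latent_cache_bytes(T, r)
print(f"full KV cache:  {mha / 2**20:.1f} MiB")  # 64.0 MiB
print(f"latent cache:   {lat / 2**20:.1f} MiB")  # 4.0 MiB
print(f"reduction:      {mha / lat:.0f}x")       # 16x
```

With these (made-up) numbers the per-token cache shrinks from n_h · d_h = 4096 elements to r = 512, a 16x reduction before any other optimizations.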