
Very cool.

Worth mentioning though that the highlighted figures (1.12 tok/s for OPT-175B for "FlexGen with Compression") are for inputs of 512 tokens and outputs of 32 tokens.

Since decoder-only transformer memory requirements scale with the square of sequence lengths, things would probably slow down significantly for very long sequences, which would be required for a back-and-forth conversation.
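To put rough numbers on the quadratic growth: here's a quick back-of-the-envelope calculation of the size of the full attention-score matrix if it were materialized naively. The head count of 96 is what I believe OPT-175B uses, and FP16 (2 bytes) is assumed; treat both as illustrative assumptions rather than exact figures.

```python
def attn_matrix_bytes(seq_len, n_heads=96, dtype_bytes=2):
    """Bytes for one layer's seq_len x seq_len attention-score matrix,
    across all heads, if materialized naively in FP16.
    (96 heads / FP16 are assumptions for illustration.)"""
    return n_heads * seq_len * seq_len * dtype_bytes

# Quadrupling the sequence length (512 -> 2048) multiplies the
# attention matrix size by 16, not 4:
short = attn_matrix_bytes(512)
long = attn_matrix_bytes(2048)
print(short, long, long // short)
```

So a long back-and-forth conversation doesn't just grow the cost linearly; each doubling of context quadruples the attention-matrix footprint per layer (absent tricks like the one discussed below in the thread).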

Still though, until reading this I had no idea that running such a model on-device was remotely feasible!



> transformer memory requirements scale with the square of sequence lengths

Not true, see: Flash Attention. You can losslessly calculate the attention in blocks using a little math trick. Essentially each subsequent block "corrects" the denominator of the last block's softmax calculation. At the end you have a perfectly* accurate softmax. Since you don't need to keep the whole sequence in memory to perform the softmax, your memory now scales linearly with respect to sequence length, and due to the lower memory bandwidth requirements and increased kernel fusion the operation also tends to be faster.

* While mathematically the calculation ends up exactly the same, in practice the result is slightly different due to the whims of FP32 and FP16 inaccuracies, and because the "max" used to compute the softmax in a numerically stable way is calculated on a per-block basis. It doesn't significantly affect training or validation loss though.
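The denominator-correction trick can be sketched in plain NumPy. This is a minimal 1-D illustration of the online-softmax idea (running max and running denominator, rescaled as each block arrives), not the actual fused-kernel FlashAttention implementation:

```python
import numpy as np

def blockwise_softmax_weighted_sum(scores, values, block=4):
    """Compute softmax(scores) @ values one block at a time.
    Each block 'corrects' the previous blocks' contributions by
    rescaling with exp(old_max - new_max), so the full score vector
    never needs to be held in memory at once."""
    m = -np.inf                                       # running max
    d = 0.0                                           # running softmax denominator
    acc = np.zeros(values.shape[1])                   # running weighted sum
    for start in range(0, len(scores), block):
        s = scores[start:start + block]
        v = values[start:start + block]
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)                # rescale earlier blocks
        d = d * correction + np.exp(s - m_new).sum()
        acc = acc * correction + np.exp(s - m_new) @ v
        m = m_new
    return acc / d
```

You can check it against the one-shot softmax: the blockwise result matches up to floating-point noise, which is exactly the footnote's caveat.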


What's the best way to get started learning this? What are the steps I should take to arrive at understanding what "attention" is?


> Since decoder-only transformer memory requirements scale with the square of sequence lengths, things would probably slow down significantly for very long sequences, which would be required for a back-and-forth conversation.

You can use tricks to keep the sequence length down even if the conversation goes on for a long time. For example, you can use the model to summarize the first n-1 lines of the conversation and append the last line to the summary as is.
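As a sketch of what that prompt construction might look like (the `summarize_stub` here is a placeholder for a real LLM summarization call, which this snippet does not make):

```python
def summarize_stub(text, max_chars=80):
    # Placeholder: a real system would call an LLM to summarize here.
    # Truncation just stands in so the sketch is runnable.
    return text[:max_chars]

def build_prompt(history_lines, max_chars=80):
    """Summarize all but the latest line of the conversation,
    then append the latest line verbatim so the model can
    respond to it directly."""
    summary = summarize_stub("\n".join(history_lines[:-1]), max_chars)
    return summary + "\n" + history_lines[-1]
```

The point is that the prompt length stays roughly bounded (summary budget plus one line) no matter how long the conversation gets, at the cost of lossy compression of the older turns.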


This is very interesting. Could you please elaborate and maybe share links to articles if you know of any?


I don't have any sources to refer to, but "text summarization" is one of the common NLP tasks that LLMs are often benchmarked on. All of these general-purpose LLMs can do a decent job at it (some, such as ChatGPT, can do zero-shot summarization at high quality, whereas others need to be fine-tuned for the task). If your problem is that feeding a large amount of text to the model is slow/expensive, then summarization will obviously remediate that issue.

After summarizing most of the input text you still need to feed in the latest input without summarization, so that, for example, if the user asks a question, the LLM can accurately answer it. (If all of the input goes into the summarization, that last question may not even appear in the summary, so the results will be crap.)



