
It's really interesting that these models are written in Python. Anyone know how much of a speed up using a faster language here would have? Maybe it's already off-loading a lot of the computation to C (I know many Python libraries do this), but I'd love to know.


Python is just the glue language. All the heavy lifting happens in CUDA, cuBLAS, cuDNN, and the like.

Most optimizations for saving memory rely on lower-precision numbers (float16 or less), quantization (int8 or int4), sparsification, etc. But this is all handled by the underlying framework, such as PyTorch.
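As a hedged sketch of what those memory-saving techniques look like from the user side, assuming PyTorch is installed (the layer sizes here are arbitrary illustrative values):

```python
import copy

import torch
import torch.nn as nn

# Arbitrary toy model for illustration.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# 1. Lower precision: store weights as float16, halving memory.
fp16_model = copy.deepcopy(model).half()

# 2. Quantization: store Linear weights as int8, dequantized on the fly.
int8_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```

Note that the Python code only picks the representation; the actual low-precision kernels run in the framework's native backends.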

There are C++ implementations, but they optimize different aspects. For example: https://github.com/OpenNMT/CTranslate2/


For large models, there are two main ways folks have been optimizing machine learning execution:

1. lowering precision of the operations (reducing compute "width" and increasing parallelization)

2. fusing operations into the same GPU code (reducing memory-bandwidth usage)
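As a minimal sketch of (2), assuming PyTorch 2.x, here is the kind of elementwise chain a fusing compiler such as torch.compile can merge into a single kernel (the tanh-based GELU approximation is just a convenient example):

```python
import torch

def gelu_tanh(x):
    # A chain of elementwise ops (mul, add, pow, tanh) that a fusing
    # compiler can merge into one kernel, so intermediate results never
    # round-trip through GPU memory.
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608 * (x + 0.044715 * x**3)))

# torch.compile traces the function and fuses the elementwise ops.
fused_gelu = torch.compile(gelu_tanh)
```

Eager mode would launch one kernel per operation and write every intermediate tensor to memory; the fused version reads the input once and writes the output once.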

Neither of those optimizations would benefit from swapping to a faster language.

Why? The typical "large" neural network operation runs on the order of a dozen microseconds to milliseconds. Models are usually composed of hundreds if not thousands of these. The overhead of using Python is around 0.5 microseconds per operation (best case on Intel, worst case on Apple ARM). So that's maybe a 5% net loss if things were running synchronously. But they're not! When you call GPU code, you actually do it asynchronously, so the language latency can be completely hidden.
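A hedged illustration of that hiding effect, assuming PyTorch on a CUDA-capable machine (the loop count and matrix size are arbitrary):

```python
import time

import torch

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    torch.cuda.synchronize()  # make sure the GPU is idle before timing

    t0 = time.perf_counter()
    y = x
    for _ in range(100):
        y = y @ x  # each call only *queues* a kernel and returns
    launch_time = time.perf_counter() - t0

    torch.cuda.synchronize()  # block until the queued work actually finishes
    total_time = time.perf_counter() - t0
    # launch_time (the Python-side overhead) is a tiny fraction of
    # total_time, so the interpreter cost is hidden behind the GPU work.
```

The matmuls pile up in the GPU's work queue while the Python loop races ahead; the interpreter only waits at the explicit synchronize.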

So really, all you want in an ML language is the ability to 1. change the type of the underlying data on the fly (Python is really good at this) and 2. rewrite which operations get dispatched on the fly (Python is also really good at this).
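A hedged sketch of both kinds of flexibility in PyTorch (the logging wrapper is a made-up example, not a library API):

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 8)

# 1. Change the type of the underlying data on the fly: one method call.
layer = layer.to(torch.float64)

# 2. Rewrite the operation being dispatched on the fly: monkey-patch the
# instance's forward pass, without touching the library at all.
original_forward = layer.forward
def logged_forward(x):
    print("linear layer called with shape", tuple(x.shape))
    return original_forward(x)
layer.forward = logged_forward

out = layer(torch.randn(2, 8, dtype=torch.float64))
```

Doing either of these in a statically typed, ahead-of-time-compiled language is far more ceremony, which is much of why Python won this niche.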

For smaller models (i.e. things that run in sub-microsecond world), Python is not the right choice for training or deploying.


Your view of "offloading" things to a faster language is backwards. It's already written in a fast language (C++ or CUDA); Python is just an easy-to-use way of invoking the various libraries. Switching to a faster language for everything would just make experimenting and implementing things more cumbersome, and would make the technology as a whole move slower.


Python is mostly just glue code nowadays: all data loading, processing, and computation are handled by low-level languages (C/C++). Python is there just to instruct those low-level libraries how to compose into one final computation.


The model is not written in a programming language at all. The model is in the neural network weights.



