Hacker News | schopra909's comments

Honestly, it's really hard to shorten the feedback loop in this space. For this, we really did just run one experiment at a time and visually inspect the results everywhere. When you're going 0 -> 1, you're looking for "signs of life" to make sure the basic thing is working. When it comes to testing which (of the infinite) levers to pull, a lot of it comes from intuition (which I know isn't the most fun answer).

We spent a week or so just running experiments on the amount of compression we could squeeze out of the VAE without significant degradation in the final results. In hindsight, spending a week on that seems like a waste, since we got the 8x spatial, 4x compression within the first 1-2 days. But in the moment, you're often unsure WHAT will be the key unlock. So when you're in the middle of the storm, you're running a quick Bayesian process in your head, weighing what you might learn from the outcome of an experiment against the time/money it would take to run it. And you hope that your intuitions become stronger over time as you take more repetitions.

More money might help (e.g. parallel experiments, more detailed explorations). But I don't think money is a cure-all. At some point, you get lost in the sauce trying to tie the threads between all the empirical findings you have at your fingertips. Maybe one day AI models could help here by integrating all these results. As it stands, they still struggle to reason about this stuff in the context of other research papers and findings (likely because the context on arXiv is so noisy; you can't trust any particular finding, and verifying findings is so hard that it's hard to meta-reason about your experiments correctly).

Hadn’t seen that before! Seems very in line with the broader points about regularization. In table 4 they show faster convergence in 200 epochs when used alongside REPA. I’d be curious to see if it ended up beating REPA by itself with the full 800 epochs of training, or if something about this new latent space leads to plateauing (learns faster but caps out on expressivity). We’ve seen that phenomenon before in other situations (e.g. a U-Net learns faster than a DiT because of convolutions, but stops learning beyond a certain point).

yep, Apache 2.0! so anyone's welcome to download and hack away

Hi HN, I’m one of the two authors of the post and the Linum v2 text-to-video model (https://news.ycombinator.com/item?id=46721488). We're releasing our Image-Video VAE (open weights) and a deep dive on how we built it. Happy to answer questions about the work!

Great work! I have been wondering what it would take to train with higher image bit depth (10 or 12 bit) and/or using camera footage only, rather than already heavily processed images. The usefulness of video generation in most professional use cases is limited because models are too end-to-end and completely contaminated with stock footage. Maybe the quantity of training material needed is simply not there?

Not blaming you, but asking as I don’t usually have access to professionals working with video training.


It’s a great question. In terms of pre-training, even if there was enough data at that quality, storing it and either demuxing it into raw frames OR compressing it with a sufficiently powerful encoder would likely cost a lot of $. But there’s a case for using a much smaller subset of that data to dial in aesthetics towards the end of training. The gotcha there would be data diversity. Often you see that models will adapt to the new distribution and forget patterns from the old data. It’s hard to disentangle a model learning clarity of detail from learning concepts, so you might forget key ideas when picking up these details. Nevertheless, maybe there is a way to use small amounts of this data in an RL finetuning setup? In our experience, RL post-training changes very little in the underlying model weights, so it might be a “light” enough touch to elicit the desired details.

No questions but I appreciate the write-up! Thank you for sharing.

This is very cool. Side note, I really dig the JavaScript animations on the causal block diffusion blog post. Made the concept immediately clear


I think YC just released a video on the basics of diffusion, but honestly I don’t have a good end-to-end guide.

We’re going to write up going 0->1 on a video model (all the steps) over the coming months. But it likely won’t be a class or anything like that.

https://www.linum.ai/field-notes

We want to share our learnings with folks who are curious about the space - but don’t have time to make it a full class experience.

Hopefully karpathy does that with his courses in the future!


Not public yet. We’re going to clean it up so it’s readable and release it as blog posts. The first one will be everything you need to know about building a VAE for image and video. It should be out in a few weeks. We’re figuring out the right balance between spending time writing and all the work we have on our plate for the next model.

If you’re interested in this stuff, keep an eye on field notes (our blog).



Oh damn! Thanks for catching that -- going to ping the HF folks to see what they can do to fix the collection link.

In the meantime, here are the individual links to the models:

https://huggingface.co/Linum-AI/linum-v2-720p

https://huggingface.co/Linum-AI/linum-v2-360p


Looks like 20GB VRAM isn't enough for the 360p demo :( need to bump my specs :sweat_smile:


Should be fixed now! Thanks again for the heads up


All good, cheers!


Per the RAM comment, you may be able to get it to run locally with two tweaks:

https://github.com/Linum-AI/linum-v2/blob/298b1bb9186b5b9ff6...

1) Free up the T5 as soon as the text is encoded, so you reclaim GPU RAM

2) Manual Layer Offloading; move layers off GPU once they're done being used to free up space for the remaining layers + activations
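
To see why those two tweaks matter, here is a rough bookkeeping sketch of peak GPU memory. It is not the code from the repo linked above. The function name and all the sizes are hypothetical, and it assumes that once the text encoder is freed it never shares the GPU with the video model, and that with offloading only one layer is resident at a time.

```python
def peak_gpu_gb(text_encoder_gb, layer_gb, activation_gb,
                free_encoder=False, offload_layers=False):
    """Estimate peak GPU memory (GB) for one generation run.

    text_encoder_gb: size of the text encoder weights
    layer_gb:        list with the size of each video-model layer
    activation_gb:   activation footprint while denoising
    """
    # With offloading, only the largest single layer must fit at once.
    resident_layers = max(layer_gb) if offload_layers else sum(layer_gb)
    denoise_gb = resident_layers + activation_gb
    if free_encoder:
        # Assumption: encoder and video model never share the GPU.
        return max(text_encoder_gb, denoise_gb)
    return text_encoder_gb + denoise_gb

layers = [0.5] * 8   # hypothetical: 8 layers of 0.5 GB each
baseline = peak_gpu_gb(10.0, layers, 8.0)
freed = peak_gpu_gb(10.0, layers, 8.0, free_encoder=True)
both = peak_gpu_gb(10.0, layers, 8.0, free_encoder=True, offload_layers=True)
print(baseline, freed, both)
```

Offloading trades speed for memory, since layers have to be copied back and forth between CPU and GPU.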


Any idea on the minimum VRAM footprint with those tweaks? 20GB seems high for a 2B model. I guess the T5 encoder is responsible for that.


The T5 encoder is ~5B parameters, so back of the envelope it would take ~10 GB of VRAM (it's in bfloat16). So 360p should take ~15 GB of RAM (+/- a few GB based on the duration of the video generated).

We can update the code over the next day or two to provide an option to delete the T5 after the text encoding is computed (to save on RAM). Then we'll report back on GitHub the GB consumed for 360p and 720p at 2-5 seconds, so there are more accurate numbers.

Beyond the 10 GB from the T5, there's just a lot of VRAM taken up by the context window of 720p video (even though the model itself is 2B parameters).
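
The back-of-the-envelope weight numbers are just parameter count times bytes per parameter. A minimal sketch, assuming bfloat16 (2 bytes per parameter) and decimal gigabytes; the function name is made up for illustration, and activations are deliberately not counted:

```python
def param_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); ignores activations."""
    return num_params * bytes_per_param / 1e9

t5_gb = param_memory_gb(5e9)      # ~5B-parameter T5 encoder in bfloat16
model_gb = param_memory_gb(2e9)   # 2B-parameter video model in bfloat16
print(t5_gb, model_gb)
```

Whatever is left over in the totals quoted here is activations, which grow with resolution and duration.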


The 5B text encoder feels disproportionate for a 2B video model. If the text portion is dominating your VRAM usage it really hurts the inference economics.

Have you tried quantizing the T5? In my experience you can usually run these encoders in 8-bit or even 4-bit with negligible quality loss. Dropping that memory footprint would make this much more viable for consumer hardware.
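
For scale: at one byte per weight, a ~5B-parameter encoder would be ~5 GB, and at 4 bits ~2.5 GB, before the small overhead of storing scales. Below is a toy sketch of the simplest scheme, symmetric absmax int8 quantization with one scale for the whole tensor. It is not what any particular library does (real ones typically use per-channel or per-block scales), and the weights are made-up values, not T5 weights.

```python
def quantize_absmax_int8(weights):
    """Symmetric int8 quantization: the largest |w| maps to 127."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input; any scale works
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.033, 0.9, -0.27]   # toy values for illustration
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is at most half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err)
```

Whether that rounding error is negligible for the final video is an empirical question that would need checking on the actual encoder.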


That all being said, you can just delete the T5 from memory after encoding the text to save on memory.

The 2B parameters will take up 4 GB of memory, but activations will take a lot more given the size of the context windows for video.

A 720p 5 second video is roughly 100K tokens of context
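
A rough way to sanity-check that figure. This sketch assumes 24 fps, 2x2 patchification of the latents, and that the "8x spatial, 4x" compression mentioned elsewhere in the thread means 8x per spatial side and 4x in time. The fps, patch size, and the function itself are assumptions for illustration, not confirmed details of the model.

```python
import math

def video_tokens(width, height, seconds, fps=24,
                 spatial=8, temporal=4, patch=2):
    """Estimate transformer sequence length for a latent video."""
    latent_w = width // spatial            # VAE spatial downsampling
    latent_h = height // spatial
    tokens_per_frame = (latent_w // patch) * (latent_h // patch)
    latent_frames = math.ceil(seconds * fps / temporal)  # VAE temporal downsampling
    return tokens_per_frame * latent_frames

tokens_720p = video_tokens(1280, 720, 5)
print(tokens_720p)
```

Under those assumptions the count lands close to 100K, and since attention cost grows with sequence length, this is where the activation memory goes.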


Great idea! We haven’t tried it but def interested to see if that works as well.

When we started down this path, T5 was the standard (back in 2024).

Likely won’t be the text encoder for subsequent models, given its size (per your point) and age


I think you nailed it.

For us it’s classifiers that we train for very specific domains.

You’d think it’d be better to just finetune a smaller non-LLM model, but empirically we find the LLM finetunes (like 7B) perform better.


I think it's no surprise that any model that has a more general understanding of text performs better than some tiny ad-hoc classifier that blindly learns a couple of patterns and has no clue what it's looking at. The small classifier is going to fail in much weirder ways that make no sense, like old CNN-based vision models.


It’s not clear to me what the bottleneck is for OCR to “100%” work with LLMs.

In my work we do a lot of stuff with image understanding and captioning (not OCR). There, object identification and description work great, since all the models are using a CLIP-like visual backbone. But it falls apart when you ask about nuances like left/right or counting (reasoning kind of improves the latter, but it’s too expensive to matter IMO).

For our tasks, it’s clear that there’s more fundamental research that needs to be done on vision understanding to push past CLIP. That would really improve LLMs for our use cases.

Curious if there’s something similar going on for OCR in the vision encoder that’s fundamentally holding it back.

