The images get converted to tokens by the vision encoder, but those tokens are just embedding vectors. So the model should be able to handle them if you train it on them.
CNNs and LLMs are not that different. You can train an LLM architecture to do the same thing that CNNs do with a few modifications; see Vision Transformers.
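To make the "images become tokens" point concrete, here's a minimal sketch of ViT-style patch embedding with NumPy. The patch size, embedding dimension, and the random projection standing in for a learned linear layer are all illustrative assumptions, not any particular model's actual configuration:

```python
import numpy as np

def patch_embed(image, patch_size=16, embed_dim=64, rng=np.random.default_rng(0)):
    """Turn an (H, W, C) image into a sequence of token embeddings, ViT-style."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Cut the image into non-overlapping patches and flatten each one.
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch_size * patch_size * c)
    )
    # A linear projection (random here; learned in a real model) maps each
    # flattened patch to an embedding vector -- one "token" per patch.
    proj = rng.standard_normal((patches.shape[1], embed_dim))
    return patches @ proj  # shape: (num_patches, embed_dim)

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 64): a 14x14 grid of patches, each a 64-dim token
```

From the transformer's point of view these 196 vectors are no different from 196 text-token embeddings, which is why the same architecture works on both.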