LLaVA (Large Language and Vision Assistant) is a large multimodal model designed for general-purpose visual and language understanding. It connects a vision encoder to a large language model (LLM), Vicuna, and the combined model is trained end-to-end.
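
To make the architecture concrete, here is a minimal PyTorch sketch of the general recipe the paragraph describes: image patch features from a vision encoder are projected into the LLM's embedding space and consumed alongside text tokens. The class name, dimensions, and module choices are illustrative assumptions, not LLaVA's actual implementation.

```python
import torch
import torch.nn as nn


class MiniLlavaSketch(nn.Module):
    """Simplified illustration (not LLaVA's real code): a vision
    encoder, a trainable projection, and a causal language model."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a CLIP-style ViT
        # Projection that maps image features into the LLM token space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.language_model = language_model        # e.g. a Vicuna-like LLM

    def forward(self, pixel_values: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode the image into per-patch features: (B, N_patches, vision_dim).
        patch_feats = self.vision_encoder(pixel_values)
        # Project patches into the same space as text embeddings.
        visual_tokens = self.projector(patch_feats)  # (B, N_patches, llm_dim)
        # Prepend visual tokens to the text embeddings so the LLM
        # attends over both modalities in one sequence.
        fused = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(fused)
```

Because the projector and both backbone modules are ordinary `nn.Module`s in one graph, gradients can flow through the whole stack, which is what makes end-to-end training of this kind of model possible.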