ViperGPT is a framework that leverages code-generation models to compose vision-and-language models into subroutines to answer complex visual queries. Unlike end-to-end models, ViperGPT explicitly differentiates between visual processing and reasoning, making it more interpretable and generalizable. It achieves state-of-the-art results across various complex visual tasks without requiring further training.