An empirical study of when model souping (parameter averaging of independently fine-tuned models) improves performance. Over 5,000 ResNet-50 soups on CIFAR-100 reveal a tradeoff between model similarity and diversity, evidence that souping is moderately transitive, and a strong correlation between in-distribution and out-of-distribution gains. Accepted to the ICLR 2026 Test-Time Updates (TTU) Workshop.
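
For context, "souping" here means a uniform parameter-wise average of fine-tuned checkpoints. Below is a minimal sketch in PyTorch, assuming all ingredient models share the same ResNet-50 architecture and state-dict layout; the function name and checkpoint paths are illustrative, not part of the study's codebase.

```python
import torch
from torchvision.models import resnet50

def soup(state_dicts):
    """Uniformly average a list of state_dicts with identical keys/shapes."""
    avg = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        # Preserve the original dtype (e.g. int64 batch-norm counters).
        avg[key] = stacked.mean(dim=0).to(state_dicts[0][key].dtype)
    return avg

# Illustrative usage: average two fine-tuned checkpoints and load the soup.
# ingredients = [torch.load(p, map_location="cpu") for p in ("ft_a.pt", "ft_b.pt")]
# model = resnet50(num_classes=100)  # CIFAR-100 classification head
# model.load_state_dict(soup(ingredients))
```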