The cached accuracy snapshot gives a fast baseline for how well each language model recovers microbial phenotypes. With those reference points in mind, the next visualization explores how balanced accuracy shifts across publication years, highlighting both the pace of improvement in newer releases and the older models that remain competitive in high-signal phenotype categories.
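For readers unfamiliar with the metric, balanced accuracy is the mean of per-class recall, so a model cannot score well simply by predicting the majority phenotype. The sketch below is illustrative only: the `records` list, its labels, and the helper name `balanced_accuracy` are hypothetical and are not taken from the cached snapshot or the plotting code. It shows one way the per-year values behind such a plot could be computed.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical records: (publication year, true phenotype, predicted phenotype).
records = [
    (2022, "aerobe", "aerobe"),
    (2022, "aerobe", "aerobe"),
    (2022, "anaerobe", "aerobe"),
    (2024, "aerobe", "aerobe"),
    (2024, "anaerobe", "anaerobe"),
    (2024, "anaerobe", "aerobe"),
]

# Group true and predicted labels by year, then score each group.
by_year = defaultdict(lambda: ([], []))
for year, truth, pred in records:
    by_year[year][0].append(truth)
    by_year[year][1].append(pred)

scores = {
    year: balanced_accuracy(truths, preds)
    for year, (truths, preds) in sorted(by_year.items())
}
print(scores)
```

In this toy data, plain accuracy is 2/3 in both years, while balanced accuracy separates them because the 2022 predictions never recover the minority class.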