In silico methodologies, such as (quantitative) structure-activity relationships ([Q]SARs), are available to predict a wide variety of toxicological properties and biological activities for structurally diverse substances. To gain insight into the scientific value of these predictions, the capacity of the prediction models to generate (sufficiently) reliable results for a particular class of compounds needs to be evaluated. In the current study, performance parameters for predicting the endpoint "bacterial mutagenicity" were calculated for a battery of common (Q)SAR tools, namely Toxtree, Derek Nexus, VEGA Consensus, and Sarah Nexus. Printed paper and board food contact material (FCM) constituents were chosen as study substances because many of these lack experimental data, making them an interesting group for in silico screening. Accuracy, sensitivity, specificity, positive predictivity, negative predictivity, and Matthews correlation coefficient were determined and compared for the individual models and for the combination of VEGA Consensus and Sarah Nexus. Our results demonstrate that performance varies among the four models but can be improved by applying a combination strategy. Furthermore, the importance of the applicability domain is illustrated. Limited performance in predicting the mutagenic potential of substances that are new to the model (ie, not included in the training set) is reported. In this context, the generally poor sensitivity for these new substances is also addressed.
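The six performance parameters named above can all be derived from a binary confusion matrix of predicted versus experimental mutagenicity calls. The following is a minimal sketch using the standard textbook definitions; it is not taken from the study itself. The function names `performance` and `_ratio` are hypothetical, and the counts in the usage example are purely illustrative, not results from the study.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def performance(tp, tn, fp, fn):
    """Compute standard binary classification metrics.

    tp: experimentally mutagenic substances predicted positive
    tn: experimentally non-mutagenic substances predicted negative
    fp: experimentally non-mutagenic substances predicted positive
    fn: experimentally mutagenic substances predicted negative
    """
    # Denominator of the Matthews correlation coefficient (MCC).
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "positive_predictivity": _ratio(tp, tp + fp),
        "negative_predictivity": _ratio(tn, tn + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not data from the study).
    for name, value in performance(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Returning NaN for an undefined ratio (for example, sensitivity when a test set contains no experimental positives) is one possible design choice; it keeps an undefined metric from being mistaken for a score of zero.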