³ We initially included a second RF approach that used probability-based predictions with ROC-optimized thresholds (matching the IRT approach). However, the RF probability estimates performed substantially worse on all metrics than RF’s predicted classes with a Bayes classifier. This is consistent with known limitations of RF probability calibration, particularly when class sizes are unbalanced. To streamline the presentation, we omit the RF probability results from the main text and report them in Supplementary Table 1.
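
To make the contrast in this footnote concrete, the following minimal sketch compares a fixed 0.5 cutoff on predicted probabilities with a cutoff chosen from the ROC curve. It is illustrative only and is not the analysis code: the selection criterion (Youden's J, sensitivity + specificity - 1) is an assumption, since the footnote does not state which ROC criterion was used, and the scores, labels, and function names are hypothetical.

```python
def youden_threshold(scores, labels):
    """Return (cutoff, J) for the cutoff that maximizes Youden's J.

    Assumption: "ROC-optimized" is read here as maximizing
    sensitivity + specificity - 1; other criteria exist.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best_j, best_cut = -1.0, None
    # Every observed score is a candidate cutoff (predict 1 if score >= cutoff).
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        j = tp / positives - fp / negatives  # TPR - FPR
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut, best_j


def classify(scores, cut):
    """Assign class 1 to every score at or above the cutoff."""
    return [1 if s >= cut else 0 for s in scores]


# Made-up, unbalanced toy data: 2 positives among 8 cases, with
# predicted probabilities that never reach 0.5.
scores = [0.05, 0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40]
labels = [0, 0, 0, 0, 0, 1, 0, 1]

cut, j = youden_threshold(scores, labels)
default_pred = classify(scores, 0.5)  # fixed 0.5 cutoff
roc_pred = classify(scores, cut)      # ROC-selected cutoff
```

In this toy case the fixed 0.5 cutoff assigns every case to the majority class, while the ROC-selected cutoff flags both positives at the cost of one false positive. The example shows only how the two decision rules differ mechanically; it says nothing about which rule performs better on real data.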