When DIF Goes Unmodeled: Assessing the Viability of Random Forest for Diagnostic Classification

DOI:

https://doi.org/10.35566/jbds/bainmmbg

Keywords:

Classification, Differential item functioning, Item response theory, Random forest, Diagnostic assessment, Bias, Fairness, Machine learning, Psychometrics

Abstract

Psychological assessments are often used to make classification decisions (i.e., to identify whether an individual meets a set of criteria for a psychological diagnosis). A psychometric approach, such as item response theory (IRT), first estimates a latent trait score from item responses and then compares that score to a cut-point to determine class membership. In contrast, a machine learning approach, such as random forest (RF), predicts class membership directly from the item responses. While both methods show promise for diagnostic classification, their relative robustness to differential item functioning (DIF) is unclear. This study used a Monte Carlo simulation to compare the classification performance of an IRT-based and an RF-based approach under conditions varying the presence and severity of DIF, along with other sample and scale characteristics. Single-group IRT served as a baseline representing standard psychometric practice, as it is widely used for diagnostic classification but assumes item invariance across groups. Results indicated that when DIF was absent or minimal, both approaches yielded comparable classification metrics. As DIF severity increased, however, IRT-based classification performance declined, whereas RF maintained robust performance across conditions. These findings suggest that RF may maintain more stable classification performance than IRT-based classification when DIF is present but not explicitly accounted for in the model, making RF a viable alternative for diagnostic classification when DIF is suspected but its source or structure is unknown, unmeasured, or complex. Strengths and limitations of each approach are discussed, with particular attention to the trade-off between interpretability and classification robustness in applied assessment contexts.
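The contrast between the two pipelines in the abstract can be sketched in a few lines of code. The sketch below is purely illustrative and is not the paper's simulation design: it uses a simple Rasch-type generating model, a total score with a median cut-point as a crude stand-in for single-group IRT scoring, and scikit-learn's `RandomForestClassifier` for the RF approach. All quantities (sample size, number of items, which items carry DIF, the size of the difficulty shift, the cut-point rule) are assumptions chosen for the example.

```python
# Illustrative sketch (not the paper's actual simulation): contrast a
# score-plus-cut-point classifier with a random forest trained directly
# on item responses when some items carry uniform DIF across two groups.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, n_items = 2000, 10

theta = rng.normal(size=n)                 # latent trait
group = rng.integers(0, 2, size=n)         # reference (0) vs. focal (1) group
true_class = (theta > 0.5).astype(int)     # "diagnosis" defined on the true trait

# Item difficulties; the last four items are harder for the focal group (uniform DIF)
b = np.linspace(-1.5, 1.5, n_items)
dif_shift = np.zeros(n_items)
dif_shift[-4:] = 1.0
b_per_person = b[None, :] + dif_shift[None, :] * group[:, None]

# Rasch-type response probabilities and simulated binary item responses
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b_per_person)))
X = (rng.random((n, n_items)) < p).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, true_class, random_state=0)

# (a) Total score + cut-point, a crude stand-in for single-group IRT scoring:
# the score ignores group membership, so the DIF leaks into classification.
cut = np.median(X_tr.sum(axis=1))
acc_cut = accuracy_score(y_te, (X_te.sum(axis=1) > cut).astype(int))

# (b) Random forest on the raw item responses, free to weight items unevenly.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
acc_rf = accuracy_score(y_te, rf.predict(X_te))

print(f"cut-point accuracy: {acc_cut:.3f}, random forest accuracy: {acc_rf:.3f}")
```

In this toy setup both classifiers ignore group membership; the point is only that the cut-point approach collapses all items into one score before classifying, whereas the forest can treat the DIF-affected items differently from the clean ones.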


Published

2026-04-16

Issue

Vol. 6 No. 2 (2026)

Section

Theory and Methods

How to Cite

Bain, C., Manapat, P. D., Manapat, D., Brenna, K., & Grimm, K. (2026). When DIF Goes Unmodeled: Assessing the Viability of Random Forest for Diagnostic Classification. Journal of Behavioral Data Science, 6(2), 1-39. https://doi.org/10.35566/jbds/bainmmbg
