Model-based algorithms to ascertain smoking in administrative health data: a registry-based validation study

Accurately measuring smoking in population-based administrative health data (AHD) is challenging because smoking-related information is collected indirectly. Most studies use rule-based algorithms (RBAs) built on diagnosis codes, but model-based algorithms (MBAs) that apply machine learning (ML) to diverse data features might offer better sensitivity and accuracy. We developed ML-based MBAs for ascertaining smoking in AHD and compared them with RBAs. We conducted