Neural text sanitization with privacy risk indicators: an empirical analysis

Abstract

Text sanitization is the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred to in it. In this paper, we consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance on two recently published datasets: the Text Anonymization Benchmark (Pilán et al., 2022) and a collection of Wikipedia biographies (Papadopoulou et al., 2022a).