Language Resources and Evaluation
This paper presents a case study on motion-related metaphors that demonstrates the viability of the MetaNet computational metaphor identification system in a corpus-based analysis of the expression of conceptual metaphor. The rich annotation that is produced by the MetaNet system supports many types of linguistic analysis, such as the examination of the relative frequencies within a corpus of the…
Under-resourced languages (and musics) pose a challenge to machine translation (MT). The challenge is greater when the content of the collected dataset is a varied sample taken from a data population that is even more diverse and dynamic. This is the challenge of Arab music vocal improvisation (mawwal). Here, we present the development of AMICOR, a parallel dataset consisting of vocal improvisato…
Social bias in language models continues to create fairness risks in multilingual and multicultural environments. Existing datasets provide limited cultural diversity, insufficient support for overlapping bias categories, and minimal availability of human-interpretable reasoning, which reduces the transparency and reliability of bias detection. The ToxicBias-Reasoning dataset addresses these gaps…
In this work, we introduce the construction of a machine translation (MT) assisted and human-in-the-loop multilingual parallel corpus with annotations of multi-word expressions (MWEs), named AlphaMWE. The MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manuall…
In this paper, we propose SEMCAT (Semantic Evaluation Metric Conforms to AMR Theory), a novel similarity measuring method for Abstract Meaning Representation (AMR). AMR is a semantic structure used to explicitly express the truth-conditional meaning aspect of a natural language sentence. Our evaluation strategy is mainly designed to reflect the theoretical basis of AMR. Specifically, based on the…
Text sanitization is the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred to in it. In this paper, we consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance on two recently published datasets: the Text Anonymization …
Millions of people worldwide face barriers in accessing and understanding complex written information due to limited literacy. Automatic text simplification (ATS) addresses this challenge by transforming complex texts into simpler, more accessible versions. However, most existing ATS research focuses on English, leaving Spanish, a language spoken by over 500 million people, underrepresented. This…
This paper introduces JurisTCU, a Brazilian Portuguese dataset for legal information retrieval (LIR). The dataset is freely available (https://huggingface.co/datasets/LeandroRibeiro/JurisTCU) and consists of 16,045 jurisprudential documents from the Brazilian Federal Court of Accounts, along with 150 queries annotated with relevance judgments. It addresses the scarcity of Portuguese-language LI…