The trade-off between robustness and reliability in Chinese legal large language models: an empirical study

Legal large language models (LLMs) deployed in high-stakes judicial settings must exhibit robustness against non-substantive linguistic variations while preserving acute sensitivity to legally determinative facts and norms. This study investigates this robustness–reliability trade-off within the context of Chinese legal tasks. We curate a dataset of 5,000 Chinese judicial question–answer pairs and generate semantic-preserving adversarial rewrites, retaining only those validated by an embedding-b