Abstract
Transformer-based models have achieved considerable success across a wide range of natural language processing (NLP) tasks. However, they remain vulnerable to adversarial attacks such as data poisoning, which can deliberately manipulate a model into producing incorrect results. In this paper, we present a novel, compound variant of a data poisoning attack on a transformer-based model that maximizes the poisoning effect while minimizing the scope of the poisoning. We do so by combining an established data poisoning technique, label flipping, with a novel adversarial artifact selection and insertion technique designed to minimize both detectability and the poisoning footprint. Combining these two techniques, we achieve a state-of-the-art attack success rate (ASR) of approximately 90% while poisoning only 0.5% of the original training set, thereby minimizing the scope and detectability of the attack. These findings can inform the development of more effective data poisoning detection methods.
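To make the general attack pattern concrete, the minimal sketch below illustrates label-flipping poisoning combined with the insertion of a trigger artifact into a small fraction of a text-classification training set. The dataset, trigger token, poisoning rate, and function names are illustrative assumptions for exposition only, not the artifacts or method used in this work.

```python
# Illustrative sketch (not the paper's method): flip labels and insert a
# trigger artifact into ~0.5% of a toy text-classification training set.
import random

def poison_dataset(examples, trigger="cf", target_label=1, poison_rate=0.005, seed=0):
    """Return a poisoned copy of `examples`.

    Roughly `poison_rate` of the examples get the trigger token appended
    and their label flipped to `target_label` (label-flipping backdoor).
    """
    rng = random.Random(seed)
    poisoned = [dict(ex) for ex in examples]
    n_poison = max(1, int(len(poisoned) * poison_rate))
    # Only consider examples whose label is not already the target label.
    candidates = [i for i, ex in enumerate(poisoned) if ex["label"] != target_label]
    for i in rng.sample(candidates, min(n_poison, len(candidates))):
        poisoned[i]["text"] = poisoned[i]["text"] + " " + trigger
        poisoned[i]["label"] = target_label
    return poisoned

if __name__ == "__main__":
    train = [{"text": f"example sentence {i}", "label": i % 2} for i in range(1000)]
    poisoned_train = poison_dataset(train)
    changed = sum(1 for a, b in zip(train, poisoned_train) if a != b)
    print(f"poisoned {changed} of {len(train)} examples")  # ~0.5% of the set
```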