Institutional Digital Repository
Shreenivas Deshpande Library, IIT (BHU), Varanasi

BPE beyond Word Boundary: How NOT to use Multi Word Expressions in Neural Machine Translation

dc.contributor.author: Kumar D.; Thawani A.
dc.date.accessioned: 2025-05-23T11:24:19Z
dc.description.abstract: BPE tokenization merges characters into longer tokens by finding frequently occurring contiguous patterns within the word boundary. An intuitive relaxation would be to extend a BPE vocabulary with multi-word expressions (MWEs): bigrams (in_a), trigrams (out_of_the), and skip-grams (he·his). In the context of Neural Machine Translation (NMT), we replace the least frequent subword/whole-word tokens with the most frequent MWEs. We find that these modifications to BPE end up hurting the model, resulting in a net drop of BLEU and chrF scores across two language pairs. We observe that naively extending BPE beyond word boundaries results in incoherent tokens which are themselves better represented as individual words. Moreover, we find that Pointwise Mutual Information (PMI) instead of frequency finds better MWEs (e.g., New_York, Statue_of_Liberty, neither·nor) which consistently improves translation performance. We release all code at https://github.com/pegasus-lynx/mwe-bpe. © 2022 Association for Computational Linguistics.
dc.identifier.doi: https://doi.org/10.18653/v1/2022.insights-1.24
dc.identifier.uri: http://172.23.0.11:4000/handle/123456789/9950
dc.relation.ispartofseries: Insights 2022 - 3rd Workshop on Insights from Negative Results in NLP, Proceedings of the Workshop
dc.title: BPE beyond Word Boundary: How NOT to use Multi Word Expressions in Neural Machine Translation
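The abstract's key finding is that ranking candidate MWEs by Pointwise Mutual Information, PMI(x, y) = log [p(x, y) / (p(x) p(y))], yields more coherent units than raw frequency. A minimal sketch of PMI scoring for adjacent word pairs is given below; the function name and toy corpus are illustrative assumptions, not taken from the paper's released code.

```python
import math
from collections import Counter

def pmi_bigrams(sentences):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ).  High-PMI pairs are
    candidate multi-word expressions (e.g. New_York); frequency-based
    ranking would instead favour pairs of common function words.
    Illustrative sketch only, not the paper's implementation.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = sent.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

# Hypothetical toy corpus for demonstration.
corpus = [
    "the statue of liberty is in new york",
    "new york is a big city",
    "the city has a statue",
]
scores = pmi_bigrams(corpus)
```

On this toy corpus, the collocation ("new", "york") scores higher than a loose function-word pair like ("is", "a"), mirroring the abstract's observation that PMI surfaces better MWEs than frequency alone.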
