NLP researchers Nikita Martynov and Irina Krotova talk about RuPAWS, a unique dataset designed to identify paraphrases.
The dataset raises language model accuracy on difficult examples to 79%
Nikita Martynov and Irina Krotova, NLP researchers from MTS AI, spoke at LREC 2022 – one of the major international conferences on language resources and natural language processing held in Marseille on 20–25 June 2022.
“LREC is an international conference with a focus on linguistic data. It takes place once every two years with the assistance of the European Language Resources Association, and this year it was held for the 13th time. In machine learning, the first essential step is gathering high-quality data to train and evaluate models. No wonder research papers on these topics are widely cited: according to Google Scholar, this conference ranks sixth among computational linguistics venues by h5-index (a research paper citation index),” said Irina Krotova, Senior Developer at MTS AI’s NLP Team.
This platform traditionally brings together scientists, developers and business representatives to promote natural language processing technologies, products and services. The LREC participants present cutting-edge R&D projects in the field of NLP technologies, language resources and datasets for various domains, including not only conventional text and audio formats, but also sign language, for example. They discuss future avenues of research in linguistics and machine learning, their application in products, and the development of new standards, as well as the existing options for international collaboration.
At LREC 2022, Nikita Martynov and Irina Krotova presented a paper titled RuPAWS: A Russian Adversarial Dataset for Paraphrase Identification. The paper, published with the support of Skoltech, describes a unique dataset for the Russian language.
RuPAWS is an open dataset that can be used to train and test paraphrase identification models. It was designed and tested at the MTS AI – Skoltech joint laboratory.
RuPAWS incorporates 17,346 pairs of paraphrases, i.e., rephrased sentences that have identical meanings but are made up of different words. The dataset also contains approximately 3,000 sentences that are very similar in terms of vocabulary but are not paraphrases. These are sentences like “Which airline offers a cheap flight from Amsterdam to Jakarta?” and “Which airline offers cheap flights from Jakarta to Amsterdam?” Unlike humans, ML models trained on classical datasets may not see the difference between these phrases.
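To make "very similar in terms of vocabulary" concrete, here is a minimal sketch that measures word overlap between the two example sentences. The Jaccard measure and the function names `tokens` and `jaccard_overlap` are illustrative assumptions; the article does not say how overlap was measured for RuPAWS.

```python
import re


def tokens(sentence: str) -> set:
    """Lowercase the sentence and return its set of distinct word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard_overlap(a: str, b: str) -> float:
    """Share of distinct words that the two sentences have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty sentences are trivially identical
    return len(ta & tb) / len(ta | tb)


first = "Which airline offers a cheap flight from Amsterdam to Jakarta?"
second = "Which airline offers cheap flights from Jakarta to Amsterdam?"

# Word order is ignored, so swapping the two cities does not lower the score.
overlap = jaccard_overlap(first, second)
print(f"word overlap: {overlap:.2f}")  # prints "word overlap: 0.73"
```

The two sentences share 8 of their 11 distinct words even though they describe opposite routes. A model that leans on vocabulary overlap has little signal to tell them apart.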
The Russian-language examples in existing datasets are not sufficient for training models to identify paraphrases reliably. For example, a state-of-the-art model, a fine-tuned multilingual version of BERT, labels sentences with high word overlap as paraphrases even when they are not. RuPAWS helps address this problem. If the dataset is added when training a language model, accuracy on difficult examples almost doubles, reaching 79%.
In the future, RuPAWS can be used to train search engines, language assistants, and voice and text bots. This will allow these services to identify paraphrases effectively and respond correctly to user queries, no longer offering “Kineshma–Moscow bus” in response to the query “Tickets from Moscow to Kineshma.” You can read more about the dataset in the article.
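The bus example can be sketched in a few lines. This is a toy illustration, not the approach used in RuPAWS or in any MTS AI product. The `PLACES` list and the helpers `place_bag` and `route` are hypothetical, and the rule "the first place mentioned is the origin" is an assumption that holds only for these two phrasings.

```python
import re

# Hypothetical list of known place names, lowercased.
PLACES = {"moscow", "kineshma"}


def place_bag(text: str, places: set) -> set:
    """Unordered set of known place names mentioned in the text."""
    words = re.findall(r"\w+", text.lower())
    return {w for w in words if w in places}


def route(text: str, places: set):
    """Return (origin, destination), or None if fewer than two places are found.

    Toy rule: the first place mentioned is the origin, the second is the
    destination. This fits "from A to B" and "A–B" but not "to B from A".
    """
    words = re.findall(r"\w+", text.lower())
    found = [w for w in words if w in places]
    return (found[0], found[1]) if len(found) >= 2 else None


query = "Tickets from Moscow to Kineshma"
result = "Kineshma–Moscow bus"

# Order-insensitive matching sees the same two cities and calls it a match.
print(place_bag(query, PLACES) == place_bag(result, PLACES))  # True

# Order-aware comparison shows that the directions are opposite.
print(route(query, PLACES) == route(result, PLACES))  # False
```

A hand-written rule like this breaks as soon as the phrasing changes. That is why the article points to training models on adversarial pairs rather than on keyword overlap.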