RuPAWS Dataset Introduced at LREC 2022 Conference

NLP researchers Nikita Martynov and Irina Krotova talk about RuPAWS, a unique dataset designed to identify paraphrases. 

The dataset nearly doubles a language model's accuracy on difficult examples, to 79%

Nikita Martynov and Irina Krotova, NLP researchers from MTS AI, spoke at LREC 2022 – one of the major international conferences on language resources and natural language processing held in Marseille on 20–25 June 2022.

“LREC is an international conference with a focus on linguistic data. It takes place once every two years with the assistance of the European Language Resources Association, and this year it was held for the 13th time. In machine learning, the first essential step is gathering high-quality data to train and evaluate models. No wonder research papers on these topics are widely cited, and Google Scholar ranks this conference sixth among computational linguistics venues by h5-index (a citation-based metric),” said Irina Krotova, Senior Developer at MTS AI’s NLP Team.

This platform traditionally brings together scientists, developers and business representatives to promote natural language processing technologies, products and services. The LREC participants present cutting-edge R&D projects in the field of NLP technologies, language resources and datasets for various domains, including not only conventional text and audio formats, but also sign language, for example. They discuss future avenues of research in linguistics and machine learning, their application in products, and the development of new standards, as well as the existing options for international collaboration.

At LREC 2022, Nikita Martynov and Irina Krotova presented a paper titled “RuPAWS: A Russian Adversarial Dataset for Paraphrase Identification.” The paper, published with the support of Skoltech, describes a unique dataset for the Russian language.

RuPAWS is an open dataset that can be used to train and test paraphrase identification models. It was designed and tested at the MTS AI – Skoltech joint laboratory.

RuPAWS incorporates 17,346 pairs of paraphrases, i.e., rephrased sentences that have identical meanings but are made up of different words. The dataset also contains approximately 3,000 sentences that are very similar in vocabulary but are not paraphrases. These are sentences like “Which airline offers a cheap flight from Amsterdam to Jakarta?” and “Which airline offers cheap flights from Jakarta to Amsterdam?” Unlike humans, ML models trained on classical datasets may not see the difference between such sentences.
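To illustrate why such pairs are hard, here is a minimal sketch of a word-overlap baseline applied to the example above. It is not the method from the paper: the helper names (`tokens`, `jaccard`, `naive_is_paraphrase`) and the 0.7 threshold are assumptions chosen only for illustration.

```python
import re


def tokens(sentence: str) -> list[str]:
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def jaccard(a: str, b: str) -> float:
    """Share of distinct words the two sentences have in common."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    return len(set_a & set_b) / len(set_a | set_b)


def naive_is_paraphrase(a: str, b: str, threshold: float = 0.7) -> bool:
    """Hypothetical baseline: call a pair a paraphrase if word overlap is high.

    The threshold is an arbitrary illustrative value.
    """
    return jaccard(a, b) >= threshold


first = "Which airline offers a cheap flight from Amsterdam to Jakarta?"
second = "Which airline offers cheap flights from Jakarta to Amsterdam?"

overlap = jaccard(first, second)             # 8 shared words out of 11 distinct
verdict = naive_is_paraphrase(first, second)  # True, although the routes differ

print(f"overlap = {overlap:.2f}, labelled paraphrase: {verdict}")
```

Because the measure ignores word order, swapping the origin and the destination barely changes the score, so the baseline labels the pair a paraphrase even though the two questions ask about opposite routes.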

Existing datasets do not contain enough Russian-language examples for reliable paraphrase identification. For example, a state-of-the-art (SoTA) customized multilingual version of BERT labels sentences with a high degree of word overlap as paraphrases even when they are not. RuPAWS helps address this problem. If this dataset is added when training a language model, accuracy on such difficult examples almost doubles, reaching 79%.

In the future, RuPAWS can be used to train search engines, language assistants, and voice and text bots. This will allow these services to identify paraphrases effectively and respond correctly to user queries, no longer offering “Kineshma–Moscow bus” in response to the query “Tickets from Moscow to Kineshma.” You can read more about the dataset in the article.
