RuPAWS Dataset Introduced at LREC 2022 Conference

NLP researchers Nikita Martynov and Irina Krotova talk about RuPAWS, a unique dataset designed to identify paraphrases. 

The dataset raises a language model's accuracy on hard examples to 79%

Nikita Martynov and Irina Krotova, NLP researchers from MTS AI, spoke at LREC 2022, one of the major international conferences on language resources and natural language processing, held in Marseille on 20–25 June 2022.

“LREC is an international conference with a focus on linguistic data. It takes place once every two years with the support of the European Language Resources Association, and this year it was held for the 13th time. In machine learning, the first essential step is gathering high-quality data to train and evaluate models. No wonder that research papers on these topics are widely cited, and Google Scholar ranks this conference as the 6th major venue in computational linguistics based on the h5-index (a research paper citation index),” said Irina Krotova, Senior Developer at MTS AI’s NLP Team.

The conference traditionally brings together scientists, developers and business representatives to promote natural language processing technologies, products and services. LREC participants present cutting-edge R&D projects in NLP technologies, language resources and datasets for various domains, covering not only conventional text and audio formats but also, for example, sign language. They discuss future avenues of research in linguistics and machine learning, their application in products, the development of new standards, and existing options for international collaboration.

At LREC 2022, Nikita Martynov and Irina Krotova presented the paper “RuPAWS: A Russian Adversarial Dataset for Paraphrase Identification.” The publication, released with the support of Skoltech, describes a unique dataset for the Russian language.

RuPAWS is an open dataset that can be used to train and test paraphrase identification models. It was designed and tested at the MTS AI – Skoltech joint laboratory.

RuPAWS incorporates 17,346 pairs of paraphrases, i.e., rephrased sentences that have identical meanings but are made up of different words. The dataset also contains approximately 3,000 sentences that are very similar in vocabulary but are not paraphrases. These are sentences like “Which airline offers a cheap flight from Amsterdam to Jakarta?” and “Which airline offers cheap flights from Jakarta to Amsterdam?” Unlike humans, ML models trained on classical datasets may not see the difference between these sentences.
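To illustrate why such pairs are hard, here is a minimal Python sketch of a purely lexical similarity check applied to the two example sentences above. It is not the method from the paper or any MTS AI model: the function names and the 0.7 threshold are hypothetical, chosen only to show that a measure which ignores word order rates the pair as highly similar even though the meanings differ.

```python
import re


def tokens(sentence: str) -> list[str]:
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def jaccard_overlap(a: str, b: str) -> float:
    """Share of distinct words common to both sentences (word order is ignored)."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    return len(set_a & set_b) / len(set_a | set_b)


def naive_is_paraphrase(a: str, b: str, threshold: float = 0.7) -> bool:
    """Hypothetical baseline: call a pair a paraphrase if word overlap is high.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return jaccard_overlap(a, b) >= threshold


s1 = "Which airline offers a cheap flight from Amsterdam to Jakarta?"
s2 = "Which airline offers cheap flights from Jakarta to Amsterdam?"

score = jaccard_overlap(s1, s2)
print(f"word overlap: {score:.2f}")
print("naive verdict:", naive_is_paraphrase(s1, s2))
```

The two sentences share 8 of their 11 distinct words, so the naive baseline labels them a paraphrase although the direction of travel is reversed. Examples of this kind are what an adversarial dataset is meant to expose.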

The Russian-language examples in existing datasets are not sufficient for reliable paraphrase identification. For example, a state-of-the-art multilingual fine-tuned version of BERT labels sentences with a high degree of word overlap as paraphrases even when they are not. RuPAWS helps address this problem. If the dataset is added when training a language model, accuracy on such difficult examples almost doubles, reaching 79%.

In the future, RuPAWS can be used to train search engines, voice assistants, and voice and text bots. This will allow these services to identify paraphrases reliably and respond correctly to user queries, no longer offering “Kineshma–Moscow bus” in response to the query “Tickets from Moscow to Kineshma.” You can read more about the dataset in the article.
