RuPAWS Dataset Introduced at LREC 2022 Conference


NLP researchers Nikita Martynov and Irina Krotova talk about RuPAWS, a unique dataset designed to identify paraphrases. 

The dataset raises a language model's accuracy on hard examples to 79%

Nikita Martynov and Irina Krotova, NLP researchers from MTS AI, spoke at LREC 2022 – one of the major international conferences on language resources and natural language processing held in Marseille on 20–25 June 2022.

“LREC is an international conference with a focus on linguistic data. It takes place once every two years with the support of the European Language Resources Association, and this year it was held for the 13th time. In machine learning, the first essential step is gathering high-quality data to train and evaluate models. No wonder research papers on these topics are widely cited: according to Google Scholar, this conference ranks 6th in computational linguistics by h5-index (a research paper citation index),” said Irina Krotova, Senior Developer at MTS AI’s NLP Team.

This platform traditionally brings together scientists, developers and business representatives to promote natural language processing technologies, products and services. The LREC participants present cutting-edge R&D projects in the field of NLP technologies, language resources and datasets for various domains, including not only conventional text and audio formats, but also sign language, for example. They discuss future avenues of research in linguistics and machine learning, their application in products, and the development of new standards, as well as the existing options for international collaboration.

At LREC 2022, Nikita Martynov and Irina Krotova presented a paper titled RuPAWS: A Russian Adversarial Dataset for Paraphrase Identification. The paper, published with the support of Skoltech, describes a unique dataset for the Russian language.

RuPAWS is an open dataset that can be used to train and test paraphrase identification models. It was designed and tested at the MTS AI – Skoltech joint laboratory.

RuPAWS incorporates 17,346 pairs of paraphrases, i.e., rephrased sentences that have identical meanings but are made up of different words. The dataset also contains approximately 3,000 sentences that are very similar in vocabulary but are not paraphrases. These are sentences like “Which airline offers a cheap flight from Amsterdam to Jakarta?” and “Which airline offers cheap flights from Jakarta to Amsterdam?” Unlike humans, ML models trained on conventional datasets may not see the difference between these phrases.
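To see why such pairs are hard, consider a rough sketch of a purely lexical check applied to the airline example. It is only an illustration, not the method used in the RuPAWS paper: the function names and the 0.7 threshold are assumptions chosen for this example. The two sentences share most of their words, so a check that ignores word order would wrongly label them as paraphrases.

```python
import re


def tokens(sentence: str) -> list[str]:
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def jaccard(a: str, b: str) -> float:
    """Share of distinct words that two sentences have in common."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    return len(set_a & set_b) / len(set_a | set_b)


def naive_is_paraphrase(a: str, b: str, threshold: float = 0.7) -> bool:
    """Hypothetical word-overlap classifier; the threshold is an assumption."""
    return jaccard(a, b) >= threshold


s1 = "Which airline offers a cheap flight from Amsterdam to Jakarta?"
s2 = "Which airline offers cheap flights from Jakarta to Amsterdam?"

overlap = jaccard(s1, s2)                # 8 shared words out of 11 distinct ones
predicted = naive_is_paraphrase(s1, s2)  # True, although the routes are opposite

print(f"word overlap: {overlap:.2f}, predicted paraphrase: {predicted}")
```

The overlap score cannot change when “Amsterdam” and “Jakarta” swap places, because a set of words carries no order. Adversarial pairs of this kind are meant to expose exactly that blind spot.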

The Russian-language examples in existing datasets are not sufficient for reliable paraphrase identification. For example, a state-of-the-art (SoTA) model, a customized multilingual version of BERT, labels sentences with a high degree of word overlap as paraphrases even when they are not. RuPAWS helps address this problem. If this dataset is added when training a language model, accuracy on such difficult cases almost doubles, reaching 79%.

In the future, RuPAWS can be used to train search engines, language assistants, and voice and text bots. This will allow these services to identify paraphrases effectively and respond correctly to user queries, no longer offering “Kineshma–Moscow bus” in response to the query “Tickets from Moscow to Kineshma.” You can read more about the dataset in the article.
