RuPAWS Dataset Introduced at LREC 2022 Conference

NLP researchers Nikita Martynov and Irina Krotova talk about RuPAWS, a unique dataset designed to identify paraphrases. 

The dataset almost doubles a language model's accuracy on complex examples, to 79%

Nikita Martynov and Irina Krotova, NLP researchers from MTS AI, spoke at LREC 2022 – one of the major international conferences on language resources and natural language processing held in Marseille on 20–25 June 2022.

“LREC is an international conference with a focus on linguistic data. It takes place once every two years with the support of the European Language Resources Association, and this year it was held for the 13th time. In machine learning, the first essential step is gathering high-quality data to train and evaluate models. No wonder research papers on these topics are widely cited, and Google Scholar ranks this conference sixth among computational linguistics venues by H5-index (a citation index for research papers),” said Irina Krotova, Senior Developer at MTS AI’s NLP Team.

This platform traditionally brings together scientists, developers and business representatives to promote natural language processing technologies, products and services. The LREC participants present cutting-edge R&D projects in the field of NLP technologies, language resources and datasets for various domains, including not only conventional text and audio formats, but also sign language, for example. They discuss future avenues of research in linguistics and machine learning, their application in products, and the development of new standards, as well as the existing options for international collaboration.

At LREC 2022, Nikita Martynov and Irina Krotova presented the paper titled RuPAWS: A Russian Adversarial Dataset for Paraphrase Identification. The publication, released with the support of Skoltech, describes a unique dataset for the Russian language.

RuPAWS is an open dataset that can be used to train and test paraphrase identification models. It was designed and tested at the MTS AI – Skoltech joint laboratory.

RuPAWS incorporates 17,346 pairs of paraphrases, i.e., rephrased sentences that have identical meanings but are made up of different words. The dataset also contains approximately 3,000 sentences that are very similar in vocabulary but are not paraphrases. These are sentences like “Which airline offers a cheap flight from Amsterdam to Jakarta?” and “Which airline offers cheap flights from Jakarta to Amsterdam?” Unlike humans, ML models trained on classical datasets may not see the difference between these phrases.

Existing datasets do not contain enough Russian-language examples for reliable paraphrase identification. For example, a state-of-the-art (SoTA) model, a customized multilingual version of BERT, labels sentences with a high word overlap as paraphrases even when they are not. RuPAWS helps address this problem. If you add this dataset when training a language model, its accuracy on complex examples almost doubles, reaching 79%.
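The word-overlap problem can be illustrated with a minimal sketch. It is not the method from the RuPAWS paper or the BERT model mentioned here. It only computes a Jaccard score over word sets, a deliberately naive similarity measure chosen for illustration, to show why the airline sentences quoted earlier look nearly identical to any approach that ignores word order. The function names `tokens` and `word_overlap` are hypothetical.

```python
import re


def tokens(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def word_overlap(a, b):
    """Jaccard similarity of two sentences' word sets (illustrative only)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty sentences are treated as identical
    return len(ta & tb) / len(ta | tb)


# The non-paraphrase pair quoted in the article.
first = "Which airline offers a cheap flight from Amsterdam to Jakarta?"
second = "Which airline offers cheap flights from Jakarta to Amsterdam?"
score = word_overlap(first, second)
print(f"overlap: {score:.2f}")

# Swapping only the two cities leaves the word set unchanged,
# so the score is maximal although the route is reversed.
swapped = word_overlap("from Amsterdam to Jakarta", "from Jakarta to Amsterdam")
print(f"swapped-cities overlap: {swapped:.2f}")
```

Because the score depends only on which words appear, reversing the direction of travel does not lower it. Adversarial pairs of this kind are what the dataset adds to training and test data.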

In the future, RuPAWS can be used to train search engines, language assistants, and voice and text bots. This will allow these services to identify paraphrases effectively and respond correctly to user queries, no longer offering “Kineshma–Moscow bus” in response to the query “Tickets from Moscow to Kineshma.” You can read more about the dataset in the article.
