NLP Researchers Create Paraphrase Identification Dataset


RuPAWS will help train search engines, AI assistants, chatbots and voice bots to correctly interpret user queries.

What makes RuPAWS special

Working together with Skoltech, NLP researchers at MTS AI created RuPAWS, a unique dataset that can be used to train and test paraphrase identification models.

A paraphrase is a sentence that has the same meaning as the original but is put in different words. Identifying paraphrases accurately, and using appropriate datasets to do so, is essential when training search engines, voice assistants, chatbots and voice bots. With accurate paraphrase recognition, AI assistants can respond correctly to users of apps and web services and provide information that fully answers their queries.

The RuPAWS dataset includes 17,346 pairs of sentences, among them many pairs that share most of their words but have different meanings. Examples are sentences like “Can a bad man turn good?” and “Can a good man turn bad?”.

It is clear to people that these sentences are not paraphrases, but ML models trained on traditional datasets can get it wrong.

“RuPAWS differs from other Russian-language datasets by focusing on rare examples of paraphrases that are extremely hard to classify – this approach is known as adversarial attacks against machine learning systems,” said MTS AI’s engineer Nikita Martynov.

The idea for this kind of dataset first came from the authors of PAWS, a similar dataset for the English language. PAWS is based on texts from social media and Wikipedia, which is why the collected data is suited to numerous practical tasks. RuPAWS is the PAWS dataset translated with the help of neural machine translation (NMT) and then double-checked manually.

Paraphrase classification datasets for the Russian language already exist, but they lack tricky examples. One of the benchmark datasets, ParaPhraser, is comparable to RuPAWS in volume (9,151 pairs of sentences) and is successfully used to train and test ML models. But even the state-of-the-art (SoTA) solution for Russian-language paraphrase classification and RuBERT, a customized multilingual version of BERT, label sentences with a high degree of word overlap as paraphrases even when they are not.
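The word-overlap problem can be illustrated with a minimal sketch. It is not the method used by the MTS AI researchers; it only assumes a simple Jaccard measure over lowercased word sets (the function names `tokens` and `jaccard_overlap` are hypothetical) to show that two of the article's non-paraphrase pairs are indistinguishable when only shared words are counted.

```python
import re


def tokens(sentence: str) -> set:
    """Lowercase a sentence and return its set of distinct word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard_overlap(a: str, b: str) -> float:
    """Share of distinct words common to both sentences, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty sentences are treated as identical
    return len(ta & tb) / len(ta | tb)


# Sentence pairs quoted in the article; neither pair is a paraphrase.
pairs = [
    ("Can a good man turn bad?", "Can a bad man turn good?"),
    ("Which airline offers cheap flights from Amsterdam to Jakarta?",
     "Which airline offers cheap flights from Jakarta to Amsterdam?"),
]

for first, second in pairs:
    # Both pairs use exactly the same words, so the overlap is 1.00.
    print(f"{jaccard_overlap(first, second):.2f}  {first} / {second}")
```

Because word order is ignored, both pairs score 1.00, which is why a model that leans on lexical overlap tends to label them as paraphrases.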

Case studies completed at the MTS-Skoltech joint lab have shown that the RuPAWS dataset can successfully handle this issue.

“Experiments have shown that a model trained on data from both datasets demonstrates almost no quality losses when classifying examples from ParaPhraser, whereas its accuracy in processing sophisticated examples almost doubles, up to 79%,” said Irina Krotova, Senior Developer at MTS AI’s NLP team.

Let us take a look at several examples of sentences that contain many identical words but have different meanings. A language model trained on ParaPhraser alone recognized them as paraphrases even though they were semantically different. A language model trained on both datasets, ParaPhraser and RuPAWS, did not make the same mistakes.

| Phrase 1 | Phrase 2 | Comment | Recognized as paraphrases (ParaPhraser) | Recognized as paraphrases (ParaPhraser + RuPAWS) |
| --- | --- | --- | --- | --- |
| Can a good man turn bad? | Can a bad man turn good? | replacement of adjectives | 0.96 | 0.02 |
| Which airline offers cheap flights from Amsterdam to Jakarta? | Which airline offers cheap flights from Jakarta to Amsterdam? | replacement of nouns | 0.97 | 0.08 |
| Another rendition of the opera by Karl Aage Rasmussen was recorded in 2005 and released in 2006. | Another adaptation of the opera by Karl Aage Rasmussen was released in 2005 and recorded in 2006. | replacement of verbs | 0.96 | 0.03 |
| Evariste Baizeau (3 June 1821 – 6 February 1910, Nantes) was a French military physician. | Evariste Baizeau (3 June 1821 – 6 February 1910, Nantes) was a French military physicist. | replacement of one word with another | 0.96 | 0.02 |
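As a rough illustration of what the numbers in the table above imply, the sketch below turns them into yes/no decisions. It assumes the values are model scores between 0 and 1 and that a pair counts as a paraphrase at 0.5 or higher; the article states neither, and the names `THRESHOLD`, `is_paraphrase` and `false_positives` are hypothetical.

```python
# Scores copied from the table above:
# (comment, trained on ParaPhraser only, trained on ParaPhraser + RuPAWS)
scores = [
    ("replacement of adjectives", 0.96, 0.02),
    ("replacement of nouns", 0.97, 0.08),
    ("replacement of verbs", 0.96, 0.03),
    ("replacement of one word with another", 0.96, 0.02),
]

THRESHOLD = 0.5  # assumed decision cutoff, not taken from the article


def is_paraphrase(score: float, threshold: float = THRESHOLD) -> bool:
    """Treat a score at or above the threshold as a 'paraphrase' decision."""
    return score >= threshold


def false_positives(column: int) -> int:
    """Count pairs labeled as paraphrases in the given score column.

    None of the four pairs is a true paraphrase, so every positive
    decision is a false positive.
    """
    return sum(is_paraphrase(row[column]) for row in scores)


print(false_positives(1))  # ParaPhraser only: 4
print(false_positives(2))  # ParaPhraser + RuPAWS: 0
```

Under these assumptions, the model trained on ParaPhraser alone misclassifies all four pairs, while the model trained on both datasets misclassifies none.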

MTS AI’s NLP researchers Nikita Martynov and Irina Krotova will present a paper on the new RuPAWS linguistic corpus at LREC 2022, a major international conference to be held in Marseille on 20-25 June.
