08.06.2022

NLP Researchers Create Paraphrase Identification Dataset

RuPAWS will help train search engines, AI assistants, chatbots and voice bots to interpret user queries correctly.

What makes RuPAWS special

Working together with Skoltech, NLP researchers at MTS AI have created RuPAWS, a unique dataset for training and testing paraphrase identification models.

A paraphrase is a sentence that has the same meaning as the original but is put in different words. Accurate paraphrase identification, and the right datasets to train it, are essential for search engines, voice assistants, chatbots and voice bots. With accurate paraphrase recognition, AI assistants can respond correctly to users of apps and web services and return information that fully matches their queries.

The RuPAWS dataset includes 17,346 sentence pairs, among them many pairs that share most of their words but differ in meaning. These are sentences like “Can a bad man turn good?” and “Can a good man turn bad?”.

It is clear to people that these sentences are not paraphrases, but ML models trained on traditional datasets can get it wrong.
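A rough illustration of why such pairs are hard: the sketch below measures word overlap between the two example sentences using plain Python. The helper names `tokens` and `jaccard_overlap` are hypothetical and are not part of RuPAWS or of any model mentioned in this article; the point is only that a signal based on shared words cannot tell the two sentences apart.

```python
import re


def tokens(sentence):
    # Lowercase the sentence and split it into word tokens (punctuation is dropped).
    return re.findall(r"\w+", sentence.lower())


def jaccard_overlap(a, b):
    # Share of distinct words that the two sentences have in common.
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


first = "Can a bad man turn good?"
second = "Can a good man turn bad?"

overlap = jaccard_overlap(first, second)
same_word_order = tokens(first) == tokens(second)

print(overlap)          # 1.0: both sentences use exactly the same words
print(same_word_order)  # False: only the order of the words differs
```

Both sentences contain the same set of words, so the overlap is maximal even though the meanings are opposite. A model that leans mainly on shared vocabulary would therefore tend to label the pair as a paraphrase.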

“RuPAWS differs from other Russian-language datasets by focusing on rare examples of paraphrases that are extremely hard to classify – this approach is known as adversarial attacks against machine learning systems,” said MTS AI’s engineer Nikita Martynov.

The idea was first proposed by the authors of PAWS, a similar dataset for the English language. PAWS is based on texts from social media and Wikipedia, which is why the collected data is suited to numerous practical tasks. RuPAWS is the PAWS dataset translated with neural machine translation (NMT) and then checked manually.

Paraphrase classification datasets for the Russian language already exist, but they lack tricky examples. One of the benchmark datasets, ParaPhraser, is comparable to RuPAWS in size (9,151 sentence pairs) and is successfully used to train and test ML models. But even the state-of-the-art (SoTA) solution for Russian-language paraphrase classification and RuBERT, a customized version of multilingual BERT, label sentences with high word overlap as paraphrases even when they are not.

Experiments at the MTS-Skoltech joint lab have shown that training on the RuPAWS dataset addresses this issue.

“Experiments have shown that a model trained on data from both datasets demonstrates almost no quality losses when classifying examples from ParaPhraser, whereas its accuracy in processing sophisticated examples almost doubles, up to 79%,” said Irina Krotova, Senior Developer at MTS AI’s NLP team.

Let us take a look at several examples of sentences that contain many identical words but have different meanings. A language model trained on ParaPhraser alone recognized them as paraphrases even though they were semantically different. A language model trained on both ParaPhraser and RuPAWS did not make the same mistakes.

Phrase 1	Phrase 2	Comment	Paraphrase score (ParaPhraser)	Paraphrase score (ParaPhraser + RuPAWS)
Can a good man turn bad?	Can a bad man turn good?	replacement of adjectives	0.96	0.02
Which airline offers cheap flights from Amsterdam to Jakarta?	Which airline offers cheap flights from Jakarta to Amsterdam?	replacement of nouns	0.97	0.08
Another rendition of the opera by Karl Aage Rasmussen was recorded in 2005 and released in 2006.	Another adaptation of the opera by Karl Aage Rasmussen was released in 2005 and recorded in 2006.	replacement of verbs	0.96	0.03
Evariste Baizeau (3 June 1821 – 6 February 1910, Nantes) was a French military physician.	Evariste Baizeau (3 June 1821 – 6 February 1910, Nantes) was a French military physicist.	replacement of one word with another	0.96	0.02

MTS AI’s NLP researchers Nikita Martynov and Irina Krotova will present a paper on the new RuPAWS corpus at LREC 2022, a major international conference to be held in Marseille on 20-25 June.
