NLP Researchers Create Paraphrase Identification Dataset

RuPAWS will help train search engines, AI assistants, chat- and voice bots to correctly interpret user queries.

What makes RuPAWS special

Working together with Skoltech, NLP researchers at MTS AI created RuPAWS, a unique dataset that can be used to train and test paraphrase identification models.

A paraphrase is a sentence with the same meaning as the original, but put in different words. Accurate paraphrase identification, backed by appropriate datasets, is essential for training search engines, voice assistants, chatbots and voice bots. With accurate paraphrase recognition, AI assistants can respond correctly to users of apps and web services and provide information that fully answers their queries.

The RuPAWS dataset includes 17,346 sentence pairs, including a variety of sentences that share many identical words but have different meanings. These are sentences like “Can a bad man turn good?” and “Can a good man turn bad?”.

It is clear to people that these sentences are not paraphrases, but ML models trained on traditional datasets can get it wrong.
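To see why such pairs are hard, consider a minimal sketch of a word-overlap measure. It is not the method used by the RuPAWS authors; the helper names `tokens` and `jaccard_overlap` are hypothetical and serve only to illustrate that the two example sentences contain exactly the same words and differ only in word order.

```python
def tokens(sentence: str) -> list[str]:
    """Lowercase a sentence and split it into words, dropping punctuation."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in sentence.lower())
    return cleaned.split()


def jaccard_overlap(a: str, b: str) -> float:
    """Share of distinct words that two sentences have in common (order ignored)."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0  # two empty sentences are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


first = "Can a bad man turn good?"
second = "Can a good man turn bad?"

# The word sets are identical, so the overlap is maximal...
overlap = jaccard_overlap(first, second)
# ...but the word order, and therefore the meaning, is different.
same_word_order = tokens(first) == tokens(second)

print(overlap, same_word_order)
```

Any signal that relies mainly on shared vocabulary would rate this pair as a near-perfect match, which is the weakness that adversarial pairs of this kind are meant to expose.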

“RuPAWS differs from other Russian-language datasets by focusing on rare examples of paraphrases that are extremely hard to classify – this approach is known as adversarial attacks against machine learning systems,” said MTS AI’s engineer Nikita Martynov.

The idea behind this dataset originated with the authors of PAWS, a similar dataset for the English language. It is based on texts from social media and Wikipedia, which is why the collected data is suited to numerous practical tasks. RuPAWS is the PAWS dataset translated with the help of neural machine translation (NMT) and then double-checked manually.

Paraphrase classification datasets for the Russian language already exist, but they lack tricky examples. One of the benchmark datasets, ParaPhraser, is comparable to RuPAWS in terms of its volume (9,151 pairs of sentences) and is being successfully used to train and test ML models. But even the state-of-the-art (SoTA) solution for Russian-language paraphrase classification and RuBERT, a multilingual customized version of BERT, label sentences with a high degree of word overlap as paraphrases even when they are not.

Case studies completed at the MTS-Skoltech joint lab have shown that the RuPAWS dataset can successfully handle this issue.

“Experiments have shown that a model trained on data from both datasets demonstrates almost no quality losses when classifying examples from ParaPhraser, whereas its accuracy in processing sophisticated examples almost doubles, up to 79%,” said Irina Krotova, Senior Developer at MTS AI’s NLP team.

Let us take a look at several examples of sentences that contain many identical words but have different meanings. A language model trained using ParaPhraser recognized them as paraphrases even though they were semantically different. At the same time, a language model trained on two datasets, ParaPhraser and RuPAWS, did not make the same mistakes.

| Phrase 1 | Phrase 2 | Comment | Recognized as paraphrases (ParaPhraser) | Recognized as paraphrases (ParaPhraser + RuPAWS) |
| --- | --- | --- | --- | --- |
| Can a good man turn bad? | Can a bad man turn good? | replacement of adjectives | 0.96 | 0.02 |
| Which airline offers cheap flights from Amsterdam to Jakarta? | Which airline offers cheap flights from Jakarta to Amsterdam? | replacement of nouns | 0.97 | 0.08 |
| Another rendition of the opera by Karl Aage Rasmussen was recorded in 2005 and released in 2006. | Another adaptation of the opera by Karl Aage Rasmussen was released in 2005 and recorded in 2006. | replacement of verbs | 0.96 | 0.03 |
| Evariste Baizeau (3 June 1821 – 6 February 1910, Nantes) was a French military physician. | Evariste Baizeau (3 June 1821 – 6 February 1910, Nantes) was a French military physicist. | replacement of one word with another | 0.96 | 0.02 |
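The scores in the table above can be turned into yes/no decisions with a cutoff. The sketch below assumes a threshold of 0.5, which the article does not state, and simply counts how many of the four non-paraphrase pairs each model would wrongly accept. The names `THRESHOLD` and `count_false_positives` are illustrative only.

```python
# Assumed decision threshold; the article does not say which cutoff was used.
THRESHOLD = 0.5

# (comment, score from ParaPhraser-only model, score from ParaPhraser + RuPAWS model)
# Scores are copied from the table; none of these pairs is a true paraphrase.
examples = [
    ("replacement of adjectives", 0.96, 0.02),
    ("replacement of nouns", 0.97, 0.08),
    ("replacement of verbs", 0.96, 0.03),
    ("replacement of one word with another", 0.96, 0.02),
]


def count_false_positives(scores, threshold=THRESHOLD):
    """Count non-paraphrase pairs that a model would label as paraphrases."""
    return sum(score >= threshold for score in scores)


paraphraser_only = count_false_positives(row[1] for row in examples)
combined = count_false_positives(row[2] for row in examples)

print(paraphraser_only, combined)
```

Under this assumed threshold, the model trained on ParaPhraser alone accepts all four pairs, while the model trained on both datasets rejects all four.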

MTS AI’s NLP researchers Nikita Martynov and Irina Krotova will present a paper on the new RuPAWS linguistic corpus at LREC 2022, a major international conference to be held in Marseille on 20-25 June.
