NLP Researchers Create Paraphrase Identification Dataset

RuPAWS will help train search engines, AI assistants, chatbots, and voice bots to interpret user queries correctly.

What makes RuPAWS special

Working together with Skoltech, NLP researchers at MTS AI created RuPAWS, a unique dataset for training and testing paraphrase identification models.

A paraphrase is a sentence with the same meaning as the original, but put in different words. Accurate paraphrase identification, supported by appropriate datasets, is essential for training search engines, voice assistants, chatbots, and voice bots. With accurate paraphrase recognition, AI assistants can respond correctly to users of apps and web services and return information that fully answers their queries.

The RuPAWS dataset includes 17,346 pairs of sentences, among them many pairs that share most of their words but have different meanings. These are sentences like “Can a bad man turn good?” and “Can a good man turn bad?”.

It is clear to people that these sentences are not paraphrases, but ML models trained on traditional datasets can get it wrong.
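
To see why word overlap alone can mislead a model, here is a minimal sketch that scores the example pair with Jaccard similarity over word sets. This is an illustration only, not the method used by the RuPAWS authors, and the function names are hypothetical.

```python
import re


def tokens(sentence: str) -> set:
    """Lowercase the sentence and return its set of word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard_overlap(a: str, b: str) -> float:
    """Share of distinct words common to both sentences (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # assumption: two empty sentences count as identical
    return len(ta & tb) / len(ta | tb)


first = "Can a bad man turn good?"
second = "Can a good man turn bad?"

# Both sentences use exactly the same words, so the overlap is maximal
# even though the meanings are opposite.
print(jaccard_overlap(first, second))  # prints 1.0
```

Any classifier that leans mainly on shared vocabulary would see these two sentences as identical, which is the weakness that adversarial pairs of this kind are meant to expose.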

“RuPAWS differs from other Russian-language datasets by focusing on rare examples of paraphrases that are extremely hard to classify – this approach is known as adversarial attacks against machine learning systems,” said MTS AI’s engineer Nikita Martynov.

The idea for this kind of dataset comes from the authors of PAWS, a similar dataset for the English language. PAWS is based on texts from social media and Wikipedia, which is why the collected data is suited to numerous practical tasks. RuPAWS is the PAWS dataset translated with neural machine translation (NMT) and then double-checked manually.

Paraphrase classification datasets for the Russian language already exist, but they lack tricky examples. One of the benchmark datasets, ParaPhraser, is comparable to RuPAWS in volume (9,151 pairs of sentences) and is successfully used to train and test ML models. But even the state-of-the-art (SoTA) solution for Russian-language paraphrase classification, RuBERT, a customized version of multilingual BERT, labels sentences with a high rate of word overlap as paraphrases even when they are not.

Case studies completed at the MTS-Skoltech joint lab have shown that the RuPAWS dataset can successfully handle this issue.

“Experiments have shown that a model trained on data from both datasets demonstrates almost no quality losses when classifying examples from ParaPhraser, whereas its accuracy in processing sophisticated examples almost doubles, up to 79%,” said Irina Krotova, Senior Developer at MTS AI’s NLP team.

Let us take a look at several examples of sentences that contain many identical words but have different meanings. A language model trained on ParaPhraser alone recognized them as paraphrases even though they were semantically different. A language model trained on both datasets, ParaPhraser and RuPAWS, did not make the same mistakes.

| Phrase 1 | Phrase 2 | Comment | Recognized as paraphrases (ParaPhraser) | Recognized as paraphrases (ParaPhraser + RuPAWS) |
| --- | --- | --- | --- | --- |
| Can a good man turn bad? | Can a bad man turn good? | replacement of adjectives | 0.96 | 0.02 |
| Which airline offers cheap flights from Amsterdam to Jakarta? | Which airline offers cheap flights from Jakarta to Amsterdam? | replacement of nouns | 0.97 | 0.08 |
| Another rendition of the opera by Karl Aage Rasmussen was recorded in 2005 and released in 2006. | Another adaptation of the opera by Karl Aage Rasmussen was released in 2005 and recorded in 2006. | replacement of verbs | 0.96 | 0.03 |
| Evariste Baizeau (3 June 1821 – 6 February 1910, Nantes) was a French military physician. | Evariste Baizeau (3 June 1821 – 6 February 1910, Nantes) was a French military physicist. | replacement of one word with another | 0.96 | 0.02 |
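
As a rough illustration of how the scores in the table translate into decisions, the sketch below applies a cutoff to each score and counts how many of the four non-paraphrase pairs would be wrongly accepted. It assumes the figures are the models' paraphrase scores, and the 0.5 threshold is an assumption, since the article does not state a decision rule.

```python
# Scores copied from the table above: (ParaPhraser only, ParaPhraser + RuPAWS).
# None of the four pairs is a true paraphrase.
scores = {
    "replacement of adjectives": (0.96, 0.02),
    "replacement of nouns": (0.97, 0.08),
    "replacement of verbs": (0.96, 0.03),
    "replacement of one word with another": (0.96, 0.02),
}

THRESHOLD = 0.5  # assumption: the article does not state a decision cutoff


def is_paraphrase(score: float, threshold: float = THRESHOLD) -> bool:
    """Treat a pair as a paraphrase when its score reaches the threshold."""
    return score >= threshold


# Every accepted pair is a false positive, because no pair is a paraphrase.
false_positives_base = sum(is_paraphrase(base) for base, _ in scores.values())
false_positives_combined = sum(is_paraphrase(comb) for _, comb in scores.values())

print(false_positives_base, false_positives_combined)  # prints: 4 0
```

Under this assumed cutoff, the model trained on ParaPhraser alone accepts all four pairs, while the model trained on both datasets rejects all of them.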

MTS AI’s NLP researchers Nikita Martynov and Irina Krotova will present a paper on the new RuPAWS linguistic corpus at LREC 2022, a major international conference to be held in Marseille on 20-25 June.
