NLP Researchers Create Paraphrase Identification Dataset


RuPAWS will help train search engines, AI assistants, chat- and voice bots to correctly interpret user queries.

What makes RuPAWS special

Working together with Skoltech, NLP researchers at MTS AI created RuPAWS, a unique dataset that can be used to train and test paraphrase identification models.

A paraphrase is a sentence that has the same meaning as the original but is worded differently. Accurate paraphrase identification, and datasets suited to it, are essential for training search engines, voice assistants, and chat- and voice bots. With accurate paraphrase recognition, AI assistants can respond correctly to users of apps and web services and provide information that fully matches their queries.

The RuPAWS dataset includes 17,346 sentence pairs, among them a variety of sentences that share many words but have different meanings, such as “Can a bad man turn good?” and “Can a good man turn bad?”.

It is clear to people that these sentences are not paraphrases, but ML models trained on traditional datasets can get it wrong.
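
To make the difficulty concrete, the minimal sketch below measures plain word overlap for the example pair. The helper names `tokens` and `jaccard_overlap` are illustrative assumptions and are not part of RuPAWS or any model mentioned in the article; the point is only that a signal based on shared words cannot separate the two sentences.

```python
import re


def tokens(sentence: str) -> set:
    """Lowercase the sentence and return its set of distinct word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard_overlap(a: str, b: str) -> float:
    """Jaccard similarity: shared distinct words divided by all distinct words."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty sentences are treated as identical
    return len(ta & tb) / len(ta | tb)


# The example pair from the article: same words, opposite meaning.
first = "Can a bad man turn good?"
second = "Can a good man turn bad?"

overlap = jaccard_overlap(first, second)
print(overlap)  # 1.0: the two sentences use exactly the same set of words
```

Because word order is ignored, the overlap is maximal even though the meanings differ, which is the kind of example a dataset of this type is meant to cover.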

“RuPAWS differs from other Russian-language datasets by focusing on rare examples of paraphrases that are extremely hard to classify – this approach is known as adversarial attacks against machine learning systems,” said MTS AI engineer Nikita Martynov.

The authors of PAWS, a similar dataset for the English language, were the first to come up with the idea for this kind of dataset. PAWS is based on texts from social media and Wikipedia, which is why the collected data is suited to numerous practical tasks. RuPAWS is the PAWS dataset translated with neural machine translation (NMT) and then checked manually.

Paraphrase classification datasets for the Russian language already exist, but they lack tricky examples. One of the benchmark datasets, ParaPhraser, is comparable to RuPAWS in size (9,151 pairs of sentences) and is being used successfully to train and test ML models. But even the state-of-the-art (SoTA) solution for Russian-language paraphrase classification, RuBERT, a customized version of multilingual BERT, labels sentences with a high degree of word overlap as paraphrases when they are not.

Case studies completed at the MTS-Skoltech joint lab have shown that the RuPAWS dataset can successfully handle this issue.

“Experiments have shown that a model trained on data from both datasets demonstrates almost no quality losses when classifying examples from ParaPhraser, whereas its accuracy in processing sophisticated examples almost doubles, up to 79%,” said Irina Krotova, Senior Developer at MTS AI’s NLP team.

Let us take a look at several examples of sentences that contain many identical words but have different meanings. A language model trained on ParaPhraser alone recognized them as paraphrases even though they were semantically different. A language model trained on both datasets, ParaPhraser and RuPAWS, did not make the same mistakes.

| phrase 1 | phrase 2 | comment | recognized as paraphrases (ParaPhraser) | recognized as paraphrases (ParaPhraser + RuPAWS) |
| --- | --- | --- | --- | --- |
| Can a good man turn bad? | Can a bad man turn good? | replacement of adjectives | 0.96 | 0.02 |
| Which airline offers cheap flights from Amsterdam to Jakarta? | Which airline offers cheap flights from Jakarta to Amsterdam? | replacement of nouns | 0.97 | 0.08 |
| Another rendition of the opera by Karl Aage Rasmussen was recorded in 2005 and released in 2006. | Another adaptation of the opera by Karl Aage Rasmussen was released in 2005 and recorded in 2006. | replacement of verbs | 0.96 | 0.03 |
| Evariste Baizeau (3 June 1821 – 6 February 1910, Nantes) was a French military physician. | Evariste Baizeau (3 June 1821 – 6 February 1910, Nantes) was a French military physicist. | replacement of one word with another | 0.96 | 0.02 |
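
The scores in the table can be read as follows. This is a small sketch under two assumptions not stated in the article: that each value is the model's paraphrase score for the pair, and that a pair counts as "recognized as a paraphrase" when the score reaches a cutoff of 0.5. The names `SCORES`, `THRESHOLD` and `false_positives` are illustrative only.

```python
# Scores copied from the table: (comment, ParaPhraser only, ParaPhraser + RuPAWS).
# According to the article, none of these four pairs is a true paraphrase.
SCORES = [
    ("replacement of adjectives", 0.96, 0.02),
    ("replacement of nouns", 0.97, 0.08),
    ("replacement of verbs", 0.96, 0.03),
    ("replacement of one word with another", 0.96, 0.02),
]

THRESHOLD = 0.5  # assumed decision cutoff, not given in the article


def false_positives(scores, column, threshold=THRESHOLD):
    """Count non-paraphrase pairs whose score in `column` reaches the cutoff."""
    return sum(1 for row in scores if row[column] >= threshold)


baseline_errors = false_positives(SCORES, 1)  # model trained on ParaPhraser only
combined_errors = false_positives(SCORES, 2)  # model trained on both datasets

print(baseline_errors, combined_errors)  # 4 0
```

Under these assumptions, the ParaPhraser-only model mislabels all four pairs, while the model trained on both datasets mislabels none of them.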

MTS AI’s NLP researchers Nikita Martynov and Irina Krotova will present an article on the new RuPAWS linguistic corpus at LREC 2022, one of the major international conferences to be held in Marseille on 20-25 June.
