At the Finopolis 2025 forum, MWS AI presented Cotype VL, its first vision language model (VLM), capable of analyzing and interpreting images and text simultaneously. The model can be delivered both as a standalone product and as part of AI assistants for a broad range of business scenarios: from searching documents that contain visual data to screenshot-based client support and preparing reports with graphical data.
Cotype VL has 32 billion parameters, recognizes printed, handwritten, and mixed text in images, and takes visual context into account when translating text between languages. The model also generates both brief and detailed image descriptions and answers complex, logic-based questions about image content that require reasoning, comparison, and drawing conclusions.
Cotype VL supports Russian, English, Chinese, and other languages, making it convenient for companies with international document flows. It can be deployed on-premises and, if needed, further fine-tuned on client data.
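For teams evaluating such a deployment, the interaction pattern is typically a single multimodal request combining an image with a text prompt. The sketch below illustrates one plausible setup, assuming the on-premises instance exposes an OpenAI-compatible chat completions endpoint; the endpoint URL, model identifier, and request shape are assumptions for illustration, not documented Cotype VL interfaces.

```python
# A minimal sketch, assuming an on-premises Cotype VL deployment behind an
# OpenAI-compatible chat completions endpoint. The URL and model name below
# are hypothetical placeholders, not documented product details.
import base64
import requests

ENDPOINT = "http://cotype-vl.internal:8000/v1/chat/completions"  # hypothetical
MODEL = "cotype-vl-32b"                                          # hypothetical

# Encode a scanned document as base64 so it can be embedded in the request.
with open("contract_scan.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Summarize this scanned contract and list its key dates."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
}

response = requests.post(ENDPOINT, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```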
“Vision language models are a key element in creating a new generation of AI assistants capable of understanding both text and complex visual information, giving more precise recommendations that account for all the input data provided by the user, and autonomously interacting with the interfaces of various corporate systems and applications. Our new model can handle drawings, schematics, technical illustrations, maps, and other visual data. That makes it well suited for AI solutions intended for design and engineering services; legal, finance, and HR departments; and marketing, where handling content in various formats is required,” said Denis Filippov, CEO of MWS AI.
To train Cotype VL, the team compiled a set of Russian-language data from various domains, including finance, industry, IT, telecom, and healthcare. The dataset comprises over 150,000 documents with visual data such as scans, screenshots, contracts, letters, agreements, diagrams, spreadsheets, and schematics with maps and drawings, where structure and element placement play a key role. The training dataset also includes handwritten notes and exercise books; documents such as certificates and postcards that combine handwritten and printed information; and printed receipts, tickets, letters of commendation, and medical test results. In addition, interface screenshots from business applications, engineering software, MTS ecosystem apps, and other sources were used to train the model. MWS AI also developed a tool for generating synthetic data based on real-life examples. All text and visual data from open sources were depersonalized.