A large language model, KazLLM, has been developed.

13 February 2025

A large language model, KazLLM, has been developed under a directive issued by the Head of State to promote artificial intelligence in the Kazakh language.

To implement this directive, the Ministry of Science and Higher Education of the Republic of Kazakhstan, in collaboration with the Institute of Smart Systems and Artificial Intelligence (ISSAI), research institutes, and higher education institutions, has built a comprehensive Kazakh language corpus for the national language model KazLLM.

This initiative is expected to contribute to the development of effective solutions for processing, translating, and analyzing text data in the Kazakh language, as well as integrating it into modern technologies. In the context of globalization and the country's efforts to preserve its cultural identity, the project holds particular significance.

More than 140 researchers and staff from 26 leading scientific institutes and universities participated in the development of the Kazakh language corpus for KazLLM. They worked on compiling vast amounts of data across 115 scientific fields, including economics, finance, mathematics, history, biology, chemistry, medicine, and technology. For example, Al-Farabi Kazakh National University contributed data on philosophy, ethics, PR, astronomy, astrophysics, and information technologies. The Institute of Mathematics and Mathematical Modeling compiled data in the field of mathematics, while the Sh. Ualikhanov Institute of History and Ethnology provided historical content. Medical universities were responsible for assembling medical data. This collaboration with scientific and academic institutions has enabled the creation of a unique Kazakh-language dataset, ensuring high-quality and efficient model development.

As of today, an open-source version of KazLLM is available on Hugging Face at https://huggingface.co/issai.

As an important component of digital infrastructure, KazLLM can be used for non-commercial scientific and academic purposes, as well as for developing chatbots, virtual assistants, and automatic translation systems similar to Google Translate.
