TurkicNLP

Open-source NLP toolkit for 24 Turkic languages
From Turkish to Sakha, Kazakh to Uyghur — tokenization, morphology, POS tagging, dependency parsing, NER, transliteration, embeddings, and machine translation in one pip install.

Why TurkicNLP? Over 200 million people speak a Turkic language, yet most lack basic NLP tools. TurkicNLP bridges this gap with a unified Python library covering 6 language branches — Oghuz, Kipchak, Karluk, Siberian, Oghur, and Arghu — plus historical languages like Ottoman Turkish and Old Turkic runic inscriptions.

Highlights

Morphological analysis via rule-based FSTs (~20 languages) and neural models (21 languages)
Multi-script support — automatic detection and bidirectional transliteration (Cyrillic ↔ Latin, Arabic ↔ Latin, Runic → Latin)
Neural pipelines — POS tagging, lemmatization, dependency parsing, and NER models
Embeddings & translation — sentence vectors and machine translation
Morpheme tokenizer — hybrid neural+FST segmentation with labeled morphemes for 16 languages

Quick start

pip install "turkicnlp[all]"

import turkicnlp
turkicnlp.download("kaz")
nlp = turkicnlp.Pipeline("kaz", processors=["tokenize", "pos", "lemma", "depparse"])
doc = nlp("Мен мектепке бардым")

Explore


Library	turkic-nlp/turkicnlp
Code samples	turkic-nlp/turkic-nlp-code-samples
Paper	arXiv:2602.19174
Website	turkic-nlp.github.io
Community & communication	TurkicNLP Discord
Datasets & models	🤗 HuggingFace

Turkic languages map
Turkic languages span from Turkey to Siberia, China to the Balkans

Maintained by Sherzod Hakimov · Contributions welcome