LughaatNLP: Urdu Language Preprocessing Library

Welcome to LughaatNLP: Simplifying Urdu Text Processing

About LughaatNLP

LughaatNLP is a new open-source Python library designed to streamline natural language processing tasks for the Urdu language. Developed by Muhammad Noman, a student at Iqra University in Pakistan, this toolkit offers features tailored specifically for Urdu text preprocessing. It provides a suite of tools for Urdu text normalization, tokenization, lemmatization, stop word removal, stemming, spell checking, part-of-speech tagging, and named entity recognition.

Why LughaatNLP?

LughaatNLP is built specifically for Urdu text preprocessing. Unlike generic natural language processing libraries, it is designed to address the challenges posed by the Urdu script and language structure. It handles normalization, tokenization, lemmatization, stop word removal, stemming, spell checking, part-of-speech tagging, and named entity recognition within a single library, which streamlines the NLP pipeline for developers and researchers working with Urdu text data.

LughaatNLP is also open source, so it can evolve with contributions from the community. This collaborative model helps the library keep up with advances in Urdu language processing and lets users customize it for diverse use cases.

Whether you are working on sentiment analysis, information retrieval, machine translation, or any other NLP task involving Urdu text, LughaatNLP offers a foundation to build upon. It helps users explore Urdu language data with ease and precision.

Key Features of LughaatNLP

Tokenization

Accurate tokenization is a crucial first step in NLP pipelines: it breaks text down into individual units (words, numbers, punctuation marks) for further processing. LughaatNLP's tokenization module is designed to handle the intricacies of the Urdu script and language structure.

Example:

Input: میرا نام نومان ہے

Output: ['میرا', 'نام', 'نومان', 'ہے']

Lemmatization

Lemmatization is the process of converting inflected words to their base or dictionary form. LughaatNLP's lemmatization module enhances text analysis and comprehension by reducing the complexity of Urdu words, enabling more accurate understanding of their meanings.

Example:

Input: کھاتے ہیں

Output: کھانا

Stop Word Removal

Stop words are common words that carry little to no semantic value, such as articles, prepositions, and conjunctions. LughaatNLP's stop word removal module allows users to filter out these words from Urdu text, focusing the analysis on meaningful content.

Example:

Input: میں نے کتاب پڑھی اور اچھی لگی

Output: ['کتاب', 'پڑھی', 'اچھی', 'لگی']

Normalization

Urdu text often contains diacritics, character variations, and orthographic variations that can introduce noise and inconsistencies. LughaatNLP's normalization module standardizes Urdu text by removing diacritics, normalizing character variations, handling common orthographic variations, and preserving special characters used in Urdu.

Example:

Input: بَاغ

Output: باغ

Stemming

Stemming is the process of reducing words to their root or stem form, which can be beneficial for NLP tasks such as information retrieval and text categorization. LughaatNLP's stemming module extracts the stem forms of Urdu words.

Example:

Input: کھاتے

Output: کھا

Spell Checking

Misspelled words can introduce noise and errors in NLP systems. LughaatNLP's spell checking module identifies and corrects misspelled words in Urdu text, enhancing text quality and readability.

Example:

Input: میری بیٹھی ہے

Output: میری بیٹی ہے

Part-of-Speech Tagging

Part-of-speech tagging assigns grammatical categories (e.g., nouns, verbs, adjectives) to words in text. LughaatNLP's POS tagging module facilitates syntactic analysis and understanding of sentence structures in Urdu text, enabling more advanced NLP tasks.

Example:

Input: وہ کھیل رہا ہے

Output: [('وہ', 'PN'), ('کھیل', 'V'), ('رہا', 'AUX'), ('ہے', 'AUX')]

Named Entity Recognition

Named entity recognition (NER) is the task of identifying and categorizing named entities, such as persons, organizations, and locations, within text. LughaatNLP's NER module enables information extraction and semantic analysis of Urdu text by recognizing and classifying these entities.

Example:

Input: علی کراچی سے آیا

Output: [('علی', 'PERSON'), ('کراچی', 'LOCATION')]

Get Started with LughaatNLP

Follow these simple steps to integrate LughaatNLP into your projects:

Installation: Install LughaatNLP via PyPI using pip.
Import & Use: Import LughaatNLP and its modules for various text processing tasks.
Explore Functions: Discover and apply functions for normalization, tokenization, lemmatization, and more.
Enhance Your Projects: Leverage LughaatNLP to build powerful Urdu language applications and research projects.
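As a rough illustration of the steps above, the sketch below chains three of the methods documented later in this post (normalize, remove_stopwords, urdu_tokenize) into one helper. The helper name preprocess_urdu and the order of the steps are assumptions made for illustration, not something the library prescribes.

```python
def preprocess_urdu(processor, text):
    """Normalize, drop stop words, then tokenize.

    `processor` is assumed to be a LughaatNLP() instance; `preprocess_urdu`
    itself is a hypothetical helper, not part of the library.
    """
    normalized = processor.normalize(text)
    without_stopwords = processor.remove_stopwords(normalized)
    return processor.urdu_tokenize(without_stopwords)


# Usage, once the package is installed:
#   from LughaatNLP import LughaatNLP
#   tokens = preprocess_urdu(LughaatNLP(), "میں اس کتاب کو پڑھنا چاہتا ہوں۔")
```

Passing the processor in as an argument keeps the helper easy to test with a stand-in object and avoids creating a new LughaatNLP instance on every call.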

Watch Tutorial Videos

Learn how to use LughaatNLP effectively with our tutorial videos:

Installation and Usage

LughaatNLP is available on PyPI (PyPI link) and can be installed with pip from a notebook or the command line:

pip install lughaatNLP

Test it On Google Colab

In this notebook, you can run all functions!

(Google Colab Link)

Urdu Language Preprocessing using LughaatNLP

Import Libraries and Create an instance of an object

# For normalization, lemmatization/stemming, stop word removal, and spell checking
from LughaatNLP import LughaatNLP
urdu_text_processing = LughaatNLP()

# For Part of Speech
from LughaatNLP import POS_urdu
pos_tagger = POS_urdu()

# For named entity recognition
from LughaatNLP import NER_Urdu
ner_urdu = NER_Urdu()

Normalization

LughaatNLP provides various text normalization functions:

normalize_characters(text): Normalizes Urdu characters in the given text by mapping incorrect Urdu characters to their correct forms.
normalize_combine_characters(text): Simplifies Urdu characters by combining certain combinations into their correct single forms.
normalize(text): Performs all-in-one normalization on the Urdu text, including character normalization, diacritic removal, punctuation handling, digit conversion, and special character preservation.
remove_diacritics(text): Removes diacritics (zabar, zer, pesh) from the Urdu text.
punctuations_space(text): Removes spaces after punctuations (excluding numbers) and removes spaces before punctuations in the Urdu text.
replace_digits(text): Replaces English digits with Urdu digits.
remove_numbers_urdu(text): Removes Urdu numbers from the Urdu text.
remove_numbers_english(text): Removes English numbers from the Urdu text.
remove_whitespace(text): Removes extra whitespaces from the Urdu text.
preserve_special_characters(text): Adds spaces around special characters in the Urdu text to facilitate tokenization.
remove_numbers(text): Removes both Urdu and English numbers from the Urdu text.
remove_english(text): Removes English characters from the Urdu text.
pure_urdu(text): Removes all non-Urdu characters and numbers from the text, leaving only Urdu characters and special characters used in Urdu.
just_urdu(text): Removes all non-Urdu characters, numbers, and special characters, leaving only pure Urdu text (no special characters used in Urdu).
remove_urls(text): Removes URLs from the Urdu text.
remove_special_characters(text): Removes all special characters from the Urdu text.
remove_special_characters_exceptUrdu(text): Removes all special characters from the Urdu text, except for those commonly used in the Urdu language (e.g., ؟, ۔ , ،).
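To make a few of these steps concrete, here is a minimal standard-library sketch of diacritic removal, digit replacement, and whitespace cleanup. It is not LughaatNLP's implementation: the function names are hypothetical, and treating every Unicode nonspacing mark as a diacritic is a simplifying assumption.

```python
import re
import unicodedata

# Extended Arabic-Indic digits (U+06F0..U+06F9), the digit forms used in Urdu.
URDU_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
DIGIT_TABLE = str.maketrans("0123456789", URDU_DIGITS)


def strip_diacritics(text):
    """Drop Unicode nonspacing marks (category Mn), which covers zabar, zer, and pesh."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def to_urdu_digits(text):
    """Replace ASCII digits with their Urdu counterparts."""
    return text.translate(DIGIT_TABLE)


def collapse_whitespace(text):
    """Squeeze runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def normalize_sketch(text):
    """Apply the three illustrative steps in sequence."""
    return collapse_whitespace(to_urdu_digits(strip_diacritics(text)))
```

The library's own normalize function does more than this (character mapping and punctuation handling, for example), so the sketch only shows the general shape of such a pipeline.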

Example 1:

The normalize function performs all-in-one normalization on the Urdu text, including character normalization, diacritic removal, punctuation handling, digit conversion, and special character preservation.

from LughaatNLP import LughaatNLP

urdu_text_processing = LughaatNLP()

text = "آپ کیسے ہیں؟ میں 23 سال کا ہوں۔"

normalized_text = urdu_text_processing.normalize(text)

print(normalized_text)

Output:

اپ کیسے ہیں ؟ میں ۲۳ سال کا ہوں ۔

Lemmatization and Stemming

Lemmatization and stemming are text normalization techniques used in natural language processing to reduce words to their base or root forms. Stemming strips affixes without considering linguistic meaning, so the result may not be a valid word, whereas lemmatization aims to return a meaningful base form based on dictionary definitions or word context.
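The difference can be seen in a toy sketch. The suffix list and the lemma dictionary below are tiny illustrative samples, and toy_stem and toy_lemmatize are hypothetical names; LughaatNLP's own rules and vocabulary are not shown here.

```python
# Toy data: both the suffix list and the lemma dictionary are illustrative only.
SUFFIXES = ("یں", "تا", "تے", "تی")
LEMMAS = {"کتابیں": "کتاب", "پڑھتا": "پڑھنا"}


def toy_stem(word):
    """Strip the first matching suffix, with no check that the result is a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms; unknown words pass through."""
    return LEMMAS.get(word, word)


for word in LEMMAS:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer works on any input but only cuts characters, while the lemmatizer returns a full dictionary form and is limited to the words it knows.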

Example 1: Lemmatization

The lemmatize_sentence function performs lemmatization on the Urdu sentence, replacing words with their base or dictionary form.

from LughaatNLP import LughaatNLP

urdu_text_processing = LughaatNLP()

sentence = "میں کتابیں پڑھتا ہوں۔"
lemmatized_sentence = urdu_text_processing.lemmatize_sentence(sentence)
print(lemmatized_sentence)

Output:

میں کتاب پڑھنا ہوں۔

Example 2: Stemming

The urdu_stemmer function performs stemming on the Urdu sentence, reducing words to their root or stem form.

from LughaatNLP import LughaatNLP

urdu_text_processing = LughaatNLP()

sentence = "میں کتابیں پڑھتا ہوں۔"
stemmed_sentence = urdu_text_processing.urdu_stemmer(sentence)
print(stemmed_sentence)

Output:

میں کتاب پڑھ ہوں۔

Stop Word Removal

Stop words are common words in a language (such as "کہ", "کیا", "اور", "لیکن", "بھی") that are often filtered out during text processing or analysis because they contribute little to tasks like search or natural language understanding.
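The idea can be sketched in a few lines of plain Python. The stop word set below contains only the five words named above, and drop_stop_words is a hypothetical helper; the list shipped with LughaatNLP is presumably much larger.

```python
# The five stop words named in the text; a real list would be much longer.
STOP_WORDS = {"کہ", "کیا", "اور", "لیکن", "بھی"}


def drop_stop_words(text, stop_words=STOP_WORDS):
    """Split on whitespace and keep only tokens that are not stop words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


print(drop_stop_words("کتاب اور قلم"))
```

Because the sketch splits on whitespace only, punctuation attached to a word would prevent a match, which is one reason normalization usually runs first.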

Example 1:

The remove_stopwords function removes stop words from the Urdu text.

from LughaatNLP import LughaatNLP

urdu_text_processing = LughaatNLP()

text = "میں اس کتاب کو پڑھنا چاہتا ہوں۔"
filtered_text = urdu_text_processing.remove_stopwords(text)
print(filtered_text)

Output:

کتاب پڑھنا چاہتا ہوں۔

Spell Checker

Spell checking involves identifying and correcting misspelled words in Urdu text using various functions:

corrected_sentence_spelling(input_word, threshold): Takes an input sentence and a similarity threshold and returns the corrected sentence with potentially misspelled words replaced by the most similar words from the vocabulary.
most_similar_word(input_word, threshold): Takes an input word and a similarity threshold and returns the most similar word from the vocabulary based on the Levenshtein distance.
get_similar_words_percentage(input_word, threshold): Takes an input word and a similarity threshold and returns a list of tuples containing similar words and their corresponding similarity percentages.
get_similar_words(input_word, threshold): Takes an input word and a similarity threshold and returns a list of similar words from the vocabulary based on the Levenshtein distance.

These functions leverage the Levenshtein distance algorithm to calculate the similarity between the input word or sentence and the words in the vocabulary. The threshold parameter is used to filter out words with a similarity percentage below the specified threshold.
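For readers unfamiliar with the algorithm, the sketch below computes the Levenshtein distance and turns it into a percentage that can be compared against a threshold. The scoring formula (one minus the distance divided by the longer length) and the function names are assumptions for illustration; the post does not say how LughaatNLP derives its percentages.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percentage(a, b):
    """Assumed scoring: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def similar_words(word, vocabulary, threshold):
    """Return (candidate, score) pairs at or above the threshold, best first."""
    scored = [(cand, similarity_percentage(word, cand)) for cand in vocabulary]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Under this scoring, a three-letter misspelling that is one edit away from a vocabulary word scores about 66.7, so it would clear a threshold of 60 but not one of 70.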

Example 1:

The corrected_sentence_spelling function takes a sentence and a similarity threshold and returns the sentence with potentially misspelled words replaced by the most similar words from the vocabulary.

from LughaatNLP import LughaatNLP

spell_checker = LughaatNLP()

sentence = 'سسب سےا بڑاا ملکا ہے'
corrected_sentence = spell_checker.corrected_sentence_spelling(sentence, 60)
print(corrected_sentence)

Output:

سب سے بڑا ملک ہے

Tokenization

Tokenization involves breaking down Urdu text into individual tokens, considering the intricacies of Urdu script and language structure.

Example 1:

The urdu_tokenize function tokenizes the Urdu text into individual tokens (words, numbers, and punctuation marks).

from LughaatNLP import LughaatNLP

tokens_words = LughaatNLP()

text = "میں پاکستان سے ہوں۔"
tokens = tokens_words.urdu_tokenize(text)
print(tokens)

Output:

['میں', 'پاکستان', 'سے', 'ہوں۔']
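For comparison, a generic regex tokenizer can be written with the standard library alone. This is a sketch of one common approach, not how urdu_tokenize works; regex_tokenize is a hypothetical name, and unlike the output above it places trailing punctuation in its own token.

```python
import re

# \w+ matches runs of letters and digits (Unicode-aware in Python 3);
# [^\w\s] matches any single character that is neither, such as ۔ ، ؟
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def regex_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(regex_tokenize("میں پاکستان سے ہوں۔"))
```

Combining diacritics are not matched by \w, so this sketch assumes diacritics were removed during normalization; otherwise they would split words apart.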

Part of Speech

The pos_tags_urdu function is used for part-of-speech tagging in Urdu text. It takes an Urdu sentence as input and returns a list of dictionaries where each word is paired with its assigned part-of-speech tag, such as nouns (NN), verbs (VB), adjectives (ADJ), etc.

Example 1:

The call below tags each word of the sentence with its part of speech.

from LughaatNLP import POS_urdu

pos_tagger = POS_urdu()

sentence = "میرے والدین نے میری تعلیم اور تربیت میں بہت محنت کی تاکہ میں اپنی زندگی میں کامیاب ہو سکوں۔"
predicted_pos_tags = pos_tagger.pos_tags_urdu(sentence)
print(predicted_pos_tags)

Named Entity Recognition

The ner_tags_urdu function performs named entity recognition on Urdu text, assigning named entity tags (such as U-LOCATION for locations) to identified entities in the input sentence. It outputs a dictionary that maps words to their named entity tags, which supports tasks like information extraction and text analysis for the Urdu language.

Example 1:

The ner_tags_urdu function returns a dictionary that maps each word to its named entity tag.

from LughaatNLP import NER_Urdu

ner_urdu = NER_Urdu()

sentence = "اس کتاب میں پاکستان کی تاریخ بیان کی گئی ہے۔"
word_tag_dict = ner_urdu.ner_tags_urdu(sentence)
print(word_tag_dict)

Output:

{'اس': 'O', 'کتاب': 'O', 'میں': 'O', 'پاکستان': 'U-LOCATION', ...}
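A dictionary in this shape can be filtered down to just the entities. The sketch below assumes that 'O' marks words outside any entity and that other tags consist of a one-letter position prefix, a hyphen, and the entity type, as in U-LOCATION; extract_entities is a hypothetical helper, not part of the library.

```python
def extract_entities(word_tag_dict):
    """Return (word, entity_type) pairs for every word whose tag is not 'O'."""
    entities = []
    for word, tag in word_tag_dict.items():
        if tag == "O":
            continue
        # Assumed tag layout: a one-letter position prefix, a hyphen, then the type.
        _, _, entity_type = tag.partition("-")
        entities.append((word, entity_type or tag))
    return entities


sample = {"اس": "O", "کتاب": "O", "میں": "O", "پاکستان": "U-LOCATION"}
print(extract_entities(sample))
```

Since the result is keyed by word, a word that appears twice in a sentence has only one entry, which is worth keeping in mind when counting entities.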

See other functions in the documentation: Documentation Link

Contribute to LughaatNLP

Join our open-source community and contribute to LughaatNLP's development:

Report issues or suggest improvements on our GitHub repository.
Submit pull requests to enhance LughaatNLP's functionality.

Explore LughaatNLP Today

LughaatNLP represents a significant step forward in enabling natural language processing for the Urdu language. By providing a comprehensive set of tools for Urdu text preprocessing, this library aims to facilitate the development of various NLP applications and research projects involving Urdu text data. Whether you're a researcher, developer, or enthusiast interested in Urdu language processing, LughaatNLP is an invaluable resource worth exploring.

Start using LughaatNLP and experience the convenience of Urdu language preprocessing in your NLP projects.

For more information, visit the GitHub repository or reach out to Muhammad Noman by email at muhammadnomanshafiq76@gmail.com or on LinkedIn.
