Linguist.Page 🇵🇸


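This guide explains how to use the egrep tool to search inside Arabic text files on Linux/Unix systems. It's especially useful for linguists, translators, and anyone doing text analysis. On GNU grep 3.8 and later, egrep prints an obsolescence warning; grep -E is the equivalent command and accepts the same patterns.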

Basics of Searching Arabic Text

# Simple search for an Arabic word
egrep 'word' file.txt

# Case-insensitive search (Arabic has no letter case, so -i only matters for Latin text mixed into the file)
egrep -i 'word' file.txt

# Show line numbers with the result
egrep -n 'word' file.txt

# Search for a full sentence or phrase
egrep 'this is a full sentence' file.txt

Searching for Specific Words

# Search for multiple words (logical OR)
egrep 'word1|word2|word3' file.txt

# Search for a word at the beginning of a line
egrep '^word' file.txt

# Search for a word at the end of a line
egrep 'word$' file.txt

# Search for exact word matches (using word boundaries)
egrep '\bword\b' file.txt
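
The difference between a substring match and a whole-word match can be checked on a small sample. The following is a minimal sketch, not a definitive recipe: it uses grep -E (equivalent to egrep) with -w, which behaves much like wrapping the pattern in \b on both sides. The variable names and the sample sentences are illustrative assumptions, and the sketch assumes at least one UTF-8 locale is installed, because word boundaries around Arabic letters are only recognized under a UTF-8 locale.

```shell
# Assumption: some UTF-8 locale is installed; fall back to the name C.UTF-8.
utf8_locale=$(locale -a 2>/dev/null | grep -i 'utf-*8' | head -n 1)
utf8_locale=${utf8_locale:-C.UTF-8}

# Throwaway sample: the letters كتب appear as a word, inside مكتبة, and inside الكتب.
words=$(mktemp)
printf '%s\n' 'كتب الولد الدرس' 'مكتبة المدينة' 'هذه الكتب جديدة' > "$words"

# Substring search: counts every line that contains the letters كتب anywhere.
substring_lines=$(LC_ALL=$utf8_locale grep -E -c 'كتب' "$words")

# Whole-word search: counts only lines where كتب stands alone as a word.
whole_word_lines=$(LC_ALL=$utf8_locale grep -E -w -c 'كتب' "$words")

echo "substring: $substring_lines, whole word: $whole_word_lines"
rm -f "$words"
```

Under a byte-oriented locale such as C, Arabic bytes are not treated as word characters, so the whole-word count would no longer be reliable.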

Searching Using Ranges

# Search for letters within a specific range (e.g. from Alef (أ) to Ṣād (ص))
egrep '[أ-ص]' file.txt

# Search for any Arabic letter (approximate: by code point, the range ء-ي also includes the tatweel ـ)
egrep '[ء-ي]' file.txt

# Search for words starting with a specific letter (here, any form of alif)
egrep '\b[اأإآ]' file.txt

Searching for Specific Characters or Symbols

# Search for diacritics (tanween, short vowels, shadda, sukun, dagger alif)
egrep '[ًٌٍَُِّْٰ]' file.txt

# Search for different types of hamzas
egrep '[أإآءؤئ]' file.txt

# Search for alif maqsura
egrep 'ى' file.txt

# Search for tanween markers
egrep '[ًٌٍ]' file.txt

# Search for Arabic-Indic digits
egrep '[٠-٩]' file.txt

# Search for both Arabic-Indic and Western digits
egrep '[٠-٩0-9]' file.txt

Advanced Search Patterns

# Search for a word with optional diacritics (? makes the preceding mark optional)
egrep 'سَ?لَ?مَ?' file.txt

# Search for a letter followed by one or more diacritics
# (append ([[:space:]]|$) to restrict matches to the end of a word)
egrep '[ء-ي][ًٌٍَُِّْٰ]+' file.txt

# Search for a line containing one word followed by another
egrep 'word.*word2' file.txt

# Ignore diacritics in the search (allow any number of marks after each letter)
egrep 'س[ًٌٍَُِّْٰ]*ل[ًٌٍَُِّْٰ]*م' file.txt
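
Writing the diacritic class after every letter by hand gets tedious for longer words. As a rough sketch, the pattern can be generated from the bare word with sed. The helper name loose_pattern, the variable marks, and the sample words are hypothetical, and the sketch assumes the C.UTF-8 locale is installed so that sed's "." matches whole characters rather than single bytes; substitute any UTF-8 locale listed by locale -a.

```shell
# Marks to tolerate: tanween, fatha, damma, kasra, shadda, sukun, dagger alif.
marks='[ًٌٍَُِّْٰ]'

# Hypothetical helper: print a pattern that matches the bare word $1 with
# any number of the marks above after each of its letters.
loose_pattern() {
    printf '%s' "$1" | LC_ALL=C.UTF-8 sed "s/./&${marks}*/g"
}

# Throwaway sample: three spellings built on س ل م, plus سالم, which has
# an extra letter (alif) and therefore should not match.
sample=$(mktemp)
printf '%s\n' 'سَلِمَ' 'سلم' 'سالم' 'سُلَّم' > "$sample"

# grep -E is equivalent to egrep.
matches=$(LC_ALL=C.UTF-8 grep -E -c "$(loose_pattern 'سلم')" "$sample")
echo "matching lines: $matches"
rm -f "$sample"
```

An alternative design is to strip the diacritics from a copy of the text and search that copy with the plain word; generating the pattern keeps the original file and its line numbers intact.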

Important Notes About Encoding

Make sure the file is UTF-8 encoded:

file -i file.txt

Convert from Windows-1256 encoding to UTF-8:

iconv -f WINDOWS-1256 -t UTF-8 file.txt > file_utf8.txt
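
The check and the conversion can be combined so that files already in UTF-8 are left alone. This is a minimal sketch under stated assumptions: is_utf8 and to_utf8 are hypothetical helper names, and the only non-UTF-8 encoding considered is Windows-1256. It relies on iconv exiting with an error when the input is not valid in the declared source encoding.

```shell
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical helper: write a UTF-8 copy of $1 to $2, converting from
# Windows-1256 only when the input is not already valid UTF-8.
to_utf8() {
    if is_utf8 "$1"; then
        cp "$1" "$2"
    else
        iconv -f WINDOWS-1256 -t UTF-8 "$1" > "$2"
    fi
}

# Demo on a throwaway file holding the word سلام as Windows-1256 bytes.
demo_dir=$(mktemp -d)
printf '\323\341\307\343\n' > "$demo_dir/legacy.txt"
to_utf8 "$demo_dir/legacy.txt" "$demo_dir/legacy_utf8.txt"
converted=$(cat "$demo_dir/legacy_utf8.txt")
echo "$converted"
rm -rf "$demo_dir"
```

A short Windows-1256 file can occasionally happen to be valid UTF-8 as well, so the result is worth a visual check on unfamiliar data.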

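To make egrep interpret the input as UTF-8, run it under a UTF-8 locale. This affects how ranges and \b are matched, not only how results are displayed: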

LANG=ar_SA.UTF-8 egrep 'word' file.txt

Install and generate the Arabic locale if it is missing (Debian/Ubuntu shown; other distros differ):

sudo apt install locales
sudo dpkg-reconfigure locales

Quick Reference

Description                  Command
Simple search                egrep 'word' file
Case-insensitive search      egrep -i 'word' file
Show line numbers            egrep -n 'word' file
Search multiple words        egrep 'word1|word2|word3' file
Word at beginning of line    egrep '^word' file
Word at end of line          egrep 'word$' file
Any Arabic letter            egrep '[ء-ي]' file
Diacritics                   egrep '[ًٌٍَُِّْٰ]' file
Alif maqsura                 egrep 'ى' file
Arabic-Indic digits          egrep '[٠-٩]' file
Convert encoding to UTF-8    iconv -f WINDOWS-1256 -t UTF-8 file.txt > file_utf8.txt

Additional Tips

Count occurrences of a word

egrep -o 'word' file.txt | wc -l
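
The -o form counts every occurrence, whereas -c counts matching lines, and the two differ when a word repeats on one line. A small sketch of the difference follows; the sample sentences and variable names are illustrative assumptions, and grep -E stands in for egrep.

```shell
# Throwaway sample: the word سلام appears twice on line 1 and once on line 2.
sample=$(mktemp)
printf '%s\n' 'سلام يا صديقي سلام' 'قال سلام' 'وداعا' > "$sample"

# -c counts lines that contain at least one match.
lines_with_word=$(grep -E -c 'سلام' "$sample")

# -o prints each match on its own line, so wc -l counts occurrences.
# tr strips the padding that some wc implementations add.
total_occurrences=$(grep -E -o 'سلام' "$sample" | wc -l | tr -d ' ')

echo "lines: $lines_with_word, occurrences: $total_occurrences"
# prints "lines: 2, occurrences: 3"
rm -f "$sample"
```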

Combine egrep with less for interactive viewing

egrep 'word' file.txt | less

Search multiple files recursively

egrep -r 'word' folder/

Enable result highlighting (if disabled)

# GREP_OPTIONS is obsolete and ignored by current GNU grep; use an alias instead
alias egrep='egrep --color=auto'