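This guide explains how to use the egrep tool to search inside Arabic text files on Linux/Unix systems. It's especially useful for linguists, translators, and anyone doing text analysis. egrep is equivalent to grep -E; recent GNU grep releases print an obsolescence warning when egrep is invoked, so every command below also works with grep -E in its place.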
Basics of Searching Arabic Text
# Simple search for an Arabic word
egrep 'word' file.txt
# Case-insensitive search (Arabic script has no letter case, so this only matters for mixed Arabic/Latin text)
egrep -i 'word' file.txt
# Show line numbers with the result
egrep -n 'word' file.txt
# Search for a full sentence or phrase
egrep 'this is a full sentence' file.txt
Searching for Specific Words
# Search for multiple words (logical OR)
egrep 'word1|word2|word3' file.txt
# Search for a word at the beginning of a line
egrep '^word' file.txt
# Search for a word at the end of a line
egrep 'word$' file.txt
# Search for exact word matches (using word boundaries)
egrep '\bword\b' file.txt
Searching Using Ranges
# Search for letters within a specific range (e.g. from أ, Alef with hamza above, to Ṣād)
egrep '[أ-ص]' file.txt
# Search for any Arabic letter (approximate: the range ء-ي spans U+0621 to U+064A, which also includes the tatweel)
egrep '[ء-ي]' file.txt
# Search for words starting with a specific letter (here: Alef in any of its forms)
egrep '\b[اأإآ]' file.txt
Searching for Specific Characters or Symbols
# Search for diacritics (short vowels, tanween, shadda, sukun, superscript alef)
egrep '[ًٌٍَُِّْٰ]' file.txt
# Search for different types of hamzas
egrep '[أإآءؤئ]' file.txt
# Search for alif maqsura
egrep 'ى' file.txt
# Search for tanween markers
egrep '[ًٌٍ]' file.txt
# Search for Arabic-Indic digits
egrep '[٠-٩]' file.txt
# Search for both Arabic-Indic and Western digits
egrep '[٠-٩0-9]' file.txt
Advanced Search Patterns
# Search for a word with optional diacritics (? makes the preceding diacritic optional)
egrep 'سَ?لَ?مَ?' file.txt
# Search for a letter followed by one or more diacritics
egrep '[ء-ي][ًٌٍَُِّْٰ]+' file.txt
# Search for a line containing one word followed by another
egrep 'word.*word2' file.txt
# Ignore diacritics in the search (allow any run of diacritics between the letters)
egrep 'س[ًٌٍَُِّْٰ]*ل[ًٌٍَُِّْٰ]*م' file.txt
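The diacritic-tolerant pattern above can also be generated rather than typed by hand. The following is a minimal sketch, not a standard tool: the helper name diacritic_pattern and the sample file are hypothetical, the input word is assumed to carry no diacritics, and the shell, sed, and grep are assumed to run under the same locale.

```shell
# grep -E is equivalent to egrep.
# Hypothetical helper: append an optional-diacritics class after every
# character of a bare (undiacritized) word.
diacritic_pattern() {
  marks='[ًٌٍَُِّْٰ]*'
  printf '%s' "$1" | sed "s/./&${marks}/g"
}

# Throwaway sample file, made up for illustration.
sample=$(mktemp)
printf '%s\n' 'سَلِمَ الرجل' 'سلم' 'كتاب' > "$sample"

pattern=$(diacritic_pattern 'سلم')
count=$(grep -E -c "$pattern" "$sample")   # lines matching with or without diacritics
echo "$count"

rm -f "$sample"
```

Both the vocalized first line and the bare second line of the sample match, while the unrelated third line does not.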
Important Notes About Encoding
Make sure the file is UTF-8 encoded:
file -i file.txt
Convert from Windows-1256 encoding to UTF-8:
iconv -f WINDOWS-1256 -t UTF-8 file.txt > file_utf8.txt
To avoid display and matching issues, run the search under a UTF-8 locale:
LANG=ar_SA.UTF-8 egrep 'word' file.txt
Install better Arabic locale support (depends on your distro):
sudo apt install locales
sudo dpkg-reconfigure locales
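The encoding check and the conversion step above can be combined into one guard. This is a sketch under stated assumptions: the helper name to_utf8 and the file names are hypothetical, any input that is not valid UTF-8 is assumed to be Windows-1256, and validity is tested by round-tripping the file through iconv rather than with file -i.

```shell
# Hypothetical helper: produce a UTF-8 copy of a file, converting only
# when the input is not already valid UTF-8.
to_utf8() {
  src=$1
  dst=$2
  if iconv -f UTF-8 -t UTF-8 "$src" > /dev/null 2>&1; then
    cp "$src" "$dst"                                  # already valid UTF-8
  else
    iconv -f WINDOWS-1256 -t UTF-8 "$src" > "$dst"    # assumed legacy encoding
  fi
}

workdir=$(mktemp -d)
# The word سلام written as Windows-1256 bytes (octal escapes).
printf '\323\341\307\343\n' > "$workdir/legacy.txt"

to_utf8 "$workdir/legacy.txt" "$workdir/legacy_utf8.txt"
matches=$(grep -c 'سلام' "$workdir/legacy_utf8.txt")  # search the converted copy
echo "$matches"

rm -rf "$workdir"
```

The assumption about the source encoding matters: iconv cannot detect a legacy encoding, so a file in another code page would convert without error but yield wrong characters.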
Quick Reference
Description                  Command
Simple search                egrep 'word' file
Case-insensitive search      egrep -i 'word' file
Show line numbers            egrep -n 'word' file
Search multiple words        egrep 'word1|word2' file
Word at beginning of line    egrep '^word' file
Word at end of line          egrep 'word$' file
Any Arabic letter            egrep '[ء-ي]' file
Diacritics                   egrep '[ًٌٍَُِّْٰ]' file
Alif Maqsura                 egrep 'ى' file
Arabic-Indic digits          egrep '[٠-٩]' file
Convert encoding to UTF-8    iconv -f WINDOWS-1256 -t UTF-8 file.txt > file_utf8.txt
Additional Tips
Count occurrences of a word
egrep -o 'word' file.txt | wc -l
Combine egrep with less for interactive viewing
egrep 'word' file.txt | less
Search multiple files recursively
egrep -r 'word' folder/
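Enable result highlighting (if disabled). Recent GNU grep releases ignore the GREP_OPTIONS variable, so pass the flag directly or define an alias
egrep --color=auto 'word' file.txt
alias egrep='egrep --color=auto'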
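The counting tip above counts every match, whereas grep's -c option counts matching lines; the two numbers differ when a word occurs more than once on a line. A minimal sketch with a throwaway sample file (its contents are made up for illustration):

```shell
# grep -E is equivalent to egrep.
sample=$(mktemp)
printf '%s\n' 'كتاب جديد وكتاب قديم' 'قلم' 'كتاب' > "$sample"

# -c reports the number of lines that contain the word.
line_count=$(grep -E -c 'كتاب' "$sample")

# -o prints each match on its own line, so wc -l counts occurrences.
# tr strips the padding that some wc implementations add.
match_count=$(grep -E -o 'كتاب' "$sample" | wc -l | tr -d ' ')

echo "lines: $line_count, occurrences: $match_count"

rm -f "$sample"
```

Here the first sample line contains the word twice, so the occurrence count is one higher than the line count.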