This guide explains how to use the egrep tool (equivalent to grep -E) to search inside Arabic text files on Linux/Unix systems. It's especially useful for linguists, translators, and anyone interested in text analysis.
Basics of Searching Arabic Text
# Simple search for an Arabic word
egrep 'word' file.txt
# Case-insensitive search (Arabic has no letter case, so this only affects Latin text mixed into the file)
egrep -i 'word' file.txt
# Show line numbers with the result
egrep -n 'word' file.txt
# Search for a full sentence or phrase
egrep 'this is a full sentence' file.txt
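The commands in this section can be tried end to end on a throwaway file. The sketch below is only an illustration: the sample sentences and the temporary file are assumptions, not part of the guide, and it uses grep -E, which behaves the same as egrep.

```shell
# Create a throwaway UTF-8 sample file (hypothetical content, for illustration only).
sample=$(mktemp)
printf '%s\n' 'السلام عليكم' 'مرحبا بالعالم' 'hello world' > "$sample"

# Count the lines that contain the word.
count=$(grep -E -c 'السلام' "$sample")
echo "$count"

# Print the matching line prefixed with its line number.
numbered=$(grep -E -n 'مرحبا' "$sample")
echo "$numbered"

# Search for a full phrase (spaces are matched literally).
phrase=$(grep -E -c 'مرحبا بالعالم' "$sample")
echo "$phrase"

rm -f "$sample"
```

Literal searches like these compare bytes, so they should behave the same whatever locale is active, as long as the file and the pattern are both UTF-8.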
Searching for Specific Words
# Search for multiple words (logical OR)
egrep 'word1|word2|word3' file.txt
# Search for a word at the beginning of a line
egrep '^word' file.txt
# Search for a word at the end of a line
egrep 'word$' file.txt
# Search for exact word matches (using word boundaries)
egrep '\bword\b' file.txt
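Word boundaries matter a lot in Arabic because prefixes such as the definite article attach directly to the word. As a rough sketch (assuming GNU grep and that a locale named C.UTF-8 is available; the sample lines are made up), the -w option can be compared with a plain substring search. grep -E is used here as the equivalent of egrep.

```shell
# Assumption: a UTF-8 locale named C.UTF-8 exists on this system.
export LC_ALL=C.UTF-8

sample=$(mktemp)
# Line 1 has the bare word, line 2 has it with the definite article attached.
printf '%s\n' 'هذا كتاب جديد' 'قرأت الكتاب أمس' > "$sample"

# Plain search: matches the word anywhere, including inside a longer word.
substring_lines=$(grep -E -c 'كتاب' "$sample")

# -w: match only when the word is delimited by non-word characters.
whole_word_lines=$(grep -E -c -w 'كتاب' "$sample")

echo "substring: $substring_lines, whole word: $whole_word_lines"
rm -f "$sample"
```

Without a UTF-8 locale, grep sees individual bytes rather than letters, so the whole-word test may give different results.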
Searching Using Ranges
# Search for letters within a specific range (e.g. from Alef to Ṣād)
egrep '[أ-ص]' file.txt
# Search for any Arabic letter (approximate range: Hamza to Yeh)
egrep '[ء-ي]' file.txt
# Search for words starting with a specific letter (here, any form of Alef)
egrep '\b[اأإآ]' file.txt
Searching for Specific Characters or Symbols
# Search for diacritics (tashkeel: short vowels, tanween, shadda, sukun)
egrep '[ًٌٍَُِّْٰ]' file.txt
# Search for different types of hamzas
egrep '[أإآءؤئ]' file.txt
# Search for alif maqsura
egrep 'ى' file.txt
# Search for tanween markers
egrep '[ًٌٍ]' file.txt
# Search for Arabic-Indic digits
egrep '[٠-٩]' file.txt
# Search for both Arabic-Indic and Western digits
egrep '[٠-٩0-9]' file.txt
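To pull the numbers themselves out of a text rather than whole lines, -o can be combined with a digit class. This is a minimal sketch under a few assumptions: GNU grep, an available C.UTF-8 locale, and made-up sample lines. The Arabic-Indic digits are listed one by one instead of as a range so the result does not depend on how the locale orders ranges.

```shell
# Assumption: a UTF-8 locale named C.UTF-8 exists on this system.
export LC_ALL=C.UTF-8

sample=$(mktemp)
printf '%s\n' 'العام ٢٠٢٤ والشهر ٠٣' 'page 15 of 200' 'لا أرقام هنا' > "$sample"

# -o prints each match on its own line; + groups consecutive digits into one number.
arabic_numbers=$(grep -E -o '[٠١٢٣٤٥٦٧٨٩]+' "$sample")
echo "$arabic_numbers"

# Count the lines that contain a digit from either set.
digit_lines=$(grep -E -c '[٠١٢٣٤٥٦٧٨٩0-9]' "$sample")
echo "$digit_lines"

rm -f "$sample"
```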
Advanced Search Patterns
# Search for a word with optional diacritics (? makes the preceding diacritic optional)
egrep 'سَ?لَ?مَ?' file.txt
# Search for a letter followed by one or more diacritics
egrep '[ء-ي][ًٌٍَُِّْٰ]+' file.txt
# Search for a line containing one word followed by another
egrep 'word1.*word2' file.txt
# Ignore diacritics in the search (allow any run of diacritics between the letters)
egrep 'س[ًٌٍَُِّْٰ]*ل[ًٌٍَُِّْٰ]*م' file.txt
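Writing a diacritic class between every letter by hand gets tedious. One possible shortcut, sketched below, is to generate the pattern from an undiacritized word with sed. The helper name diacritic_insensitive, the sample words, and the C.UTF-8 locale are assumptions made for illustration; grep -E stands in for egrep.

```shell
# Assumption: a UTF-8 locale named C.UTF-8 exists, so sed's "." matches whole characters.
export LC_ALL=C.UTF-8

# Hypothetical helper: append an optional run of diacritics after every character.
diacritic_insensitive() {
    printf '%s' "$1" | sed 's/./&[ًٌٍَُِّْٰ]*/g'
}

sample=$(mktemp)
# One fully vocalized spelling, one bare spelling, one unrelated word.
printf '%s\n' 'سَلَامٌ عليكم' 'سلام' 'كتاب' > "$sample"

pattern=$(diacritic_insensitive 'سلام')
matches=$(grep -E -c "$pattern" "$sample")
echo "$matches"

rm -f "$sample"
```

The helper is only meant for plain words: any regex metacharacter in its input would be passed through to the pattern unescaped.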
Important Notes About Encoding
Make sure the file is UTF-8 encoded:
file -i file.txt
Convert from Windows-1256 encoding to UTF-8:
iconv -f WINDOWS-1256 -t UTF-8 file.txt > file_utf8.txt
To avoid matching and display issues, run egrep under a UTF-8 locale:
LANG=ar_SA.UTF-8 egrep 'word' file.txt
Install better Arabic locale support (depends on your distro):
sudo apt install locales
sudo dpkg-reconfigure locales
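The check-then-convert workflow in this section can be scripted. The following is a sketch, not a definitive tool: it assumes an iconv that recognizes the WINDOWS-1256 name (as glibc's does), and the is_utf8 helper and the file names are invented for the example. It relies on iconv exiting with a non-zero status when the input is not valid in the declared source encoding.

```shell
workdir=$(mktemp -d)

# Build a legacy-encoded test file from a known UTF-8 string.
printf '%s\n' 'مرحبا بالعالم' > "$workdir/original.txt"
iconv -f UTF-8 -t WINDOWS-1256 "$workdir/original.txt" > "$workdir/legacy.txt"

# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

if is_utf8 "$workdir/legacy.txt"; then
    legacy_status='utf-8'
else
    legacy_status='not utf-8'
    # Convert only when the file is not already UTF-8.
    iconv -f WINDOWS-1256 -t UTF-8 "$workdir/legacy.txt" > "$workdir/converted.txt"
fi

# The converted file should be byte-identical to the original UTF-8 text.
if cmp -s "$workdir/original.txt" "$workdir/converted.txt"; then
    roundtrip='ok'
else
    roundtrip='mismatch'
fi

echo "$legacy_status, round trip: $roundtrip"
rm -rf "$workdir"
```

A validity check like this can only say that a file is not UTF-8. It cannot tell which legacy encoding it uses, so WINDOWS-1256 remains an assumption about the source.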
Quick Reference
Description | Command |
---|---|
Simple search | egrep 'word' file |
Case-insensitive search | egrep -i 'word' file |
Show line numbers | egrep -n 'word' file |
Search multiple words | egrep -e 'word1' -e 'word2' -e 'word3' file |
Word at beginning of line | egrep '^word' file |
Word at end of line | egrep 'word$' file |
Any Arabic letter | egrep '[ء-ي]' file |
Diacritics | egrep '[ًٌٍَُِّْٰ]' file |
Alif maqsura | egrep 'ى' file |
Arabic-Indic digits | egrep '[٠-٩]' file |
Convert encoding to UTF-8 | iconv -f WINDOWS-1256 -t UTF-8 file.txt > file_utf8.txt |
Additional Tips
Count occurrences of a word
egrep -o 'word' file.txt | wc -l
Combine egrep with less for interactive viewing
egrep 'word' file.txt | less
Search multiple files recursively
egrep -r 'word' folder/
Enable result highlighting (if disabled; current GNU grep ignores the old GREP_OPTIONS variable, so use an alias)
alias egrep='egrep --color=auto'
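For the counting tip above, note that counting occurrences is not the same as counting matching lines. A small sketch of the difference (the sample text is made up, and grep -E stands in for egrep):

```shell
sample=$(mktemp)
# The word appears twice on the first line and once on the third.
printf '%s\n' 'سلام يا سلام' 'مرحبا' 'سلام عليكم' > "$sample"

# -o prints every match on its own line, so wc -l counts occurrences.
occurrences=$(grep -E -o 'سلام' "$sample" | wc -l | tr -d ' ')

# -c counts matching lines, not individual matches.
matching_lines=$(grep -E -c 'سلام' "$sample")

echo "occurrences: $occurrences, lines: $matching_lines"
rm -f "$sample"
```

The tr call only strips the padding that some wc implementations add around the number.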