The egrep Tool

Linguist.Page 🇵🇸

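This guide explains how to use the egrep tool to search Arabic text files on Linux/Unix systems. It’s especially useful for linguists, translators, and anyone doing text analysis. Recent GNU grep releases mark egrep as obsolescent; grep -E accepts the same patterns and can be substituted in every command below.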

Basics of Searching Arabic Text

# Simple search for an Arabic word
egrep 'word' file.txt

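# Case-insensitive search (Arabic has no letter case, so -i only matters for cased scripts such as Latin mixed into the file)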
egrep -i 'word' file.txt

# Show line numbers with the result
egrep -n 'word' file.txt

# Search for a full sentence or phrase
egrep 'this is a full sentence' file.txt
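
To try the commands above on actual Arabic text, a small sample file helps. The sketch below is one way to set that up; the temporary file and its three lines are made up for illustration and are not part of the original guide.

```shell
# Create a throwaway sample file (contents are illustrative only).
sample=$(mktemp)
printf '%s\n' \
  'السلام عليكم' \
  'كتاب جديد' \
  'قرأت الكتاب أمس' > "$sample"

# grep -E is the same matcher as egrep.
# -n prefixes each matching line with its line number.
basic_hits=$(grep -E -n 'كتاب' "$sample")
printf '%s\n' "$basic_hits"
```

Both line 2 and line 3 are reported, because the pattern also matches inside the longer word الكتاب. A plain search is a substring search, not a whole-word search.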

Searching for Specific Words

# Search for multiple words (logical OR)
egrep 'word1|word2|word3' file.txt

# Search for a word at the beginning of a line
egrep '^word' file.txt

# Search for a word at the end of a line
egrep 'word$' file.txt

# Search for exact word matches (using word boundaries)
egrep '\bword\b' file.txt
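
As a rough illustration of alternation and line anchors, the following sketch runs them against a hypothetical five-line file. The words and the file are assumptions chosen only to make the differences visible.

```shell
# Sample data (illustrative only).
words=$(mktemp)
printf '%s\n' \
  'كتاب جديد' \
  'قلم أحمر' \
  'هذا كتاب' \
  'دفتر وقلم' \
  'ورقة بيضاء' > "$words"

# Alternation: count lines that contain either word (grep -E behaves like egrep).
either=$(grep -E -c 'كتاب|قلم' "$words")

# ^ anchors the match at the start of the line, $ at the end.
at_start=$(grep -E '^كتاب' "$words")
at_end=$(grep -E 'كتاب$' "$words")

printf 'lines with either word: %s\n' "$either"
printf 'starts with the word:   %s\n' "$at_start"
printf 'ends with the word:     %s\n' "$at_end"
```

Note that "start" and "end" refer to the logical order of the characters in the file, so for right-to-left text the start of the line is displayed on the right.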

Searching Using Ranges

# Search for letters within a specific range (e.g. from Alef with hamza above to Ṣād)
egrep '[أ-ص]' file.txt

# Search for any Arabic letter (approximate: covers the basic alphabet, not extended letters such as پ or گ)
egrep '[ء-ي]' file.txt

# Search for words starting with a specific letter (here, any form of Alef)
egrep '\b[اأإآ]' file.txt

Searching for Specific Characters or Symbols

# Search for diacritics (tashkeel: short vowels, tanween, shadda, sukun, dagger alef)
egrep '[ًٌٍَُِّْٰ]' file.txt

# Search for the different forms of hamza
egrep '[أإآءؤئ]' file.txt

# Search for alif maqsura
egrep 'ى' file.txt

# Search for tanween markers
egrep '[ًٌٍ]' file.txt

# Search for Arabic/Indic digits
egrep '[٠-٩]' file.txt

# Search for both Indic and Western digits
egrep '[٠-٩0-9]' file.txt

Advanced Search Patterns

# Search for a word with optional diacritics (? makes the preceding diacritic optional)
egrep 'سَ?لَ?مَ?' file.txt

# Search for a letter followed by one or more diacritics
egrep '[ء-ي][ًٌٍَُِّْٰ]+' file.txt

# Search for a line containing one word followed later by another
egrep 'word1.*word2' file.txt

# Ignore diacritics in the search (allow any run of diacritics after each letter)
egrep 'س[ًٌٍَُِّْٰ]*ل[ًٌٍَُِّْٰ]*م' file.txt
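
To see diacritic-insensitive matching in action, the sketch below allows an optional run of diacritics after each letter and compares the result with a plain search. Writing the diacritics as an alternation group instead of a bracket expression is an assumption made here so the example does not depend on how the current locale handles multibyte bracket ranges. The variable name d and the sample words are hypothetical.

```shell
# Sample data: the same three letters with and without tashkeel,
# plus one word that has an extra letter in between.
vowels=$(mktemp)
printf '%s\n' \
  'سلم' \
  'سَلِمَ' \
  'سَلَّمَ' \
  'سلام' > "$vowels"

# Zero or more diacritics: tanween, short vowels, shadda, sukun, dagger alef.
d='(ً|ٌ|ٍ|َ|ُ|ِ|ّ|ْ|ٰ)*'

# Plain search: only the undiacritized spelling matches.
plain=$(grep -E -c 'سلم' "$vowels")

# Diacritic-insensitive search: diacritics may follow each of the first two letters.
loose=$(grep -E -c "س${d}ل${d}م" "$vowels")

printf 'plain: %s line(s), diacritic-insensitive: %s line(s)\n' "$plain" "$loose"
```

The fourth word is not matched by either pattern because it contains an additional letter (alef), not just diacritics, between ل and م.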

Important Notes About Encoding

Make sure the file is UTF-8 encoded:

file -i file.txt
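
The charset reported by file -i is a guess based on the file's content. A stricter check, sketched here as an addition to the original guide, is to have iconv read the whole file as UTF-8 and see whether it succeeds. The helper name is_utf8 and the two sample files are hypothetical.

```shell
# One valid UTF-8 file and one containing bytes that are never valid in UTF-8.
good=$(mktemp)
bad=$(mktemp)
printf '%s\n' 'نص عربي' > "$good"
printf 'abc\377\376\n' > "$bad"

# Succeeds only if every byte sequence in the file decodes as UTF-8.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

if is_utf8 "$good"; then good_status=0; else good_status=1; fi
if is_utf8 "$bad"; then bad_status=0; else bad_status=1; fi

printf 'good file status: %s, bad file status: %s\n' "$good_status" "$bad_status"
```

A non-zero status suggests the file is in another encoding (or is damaged) and should be converted before searching.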

Convert from Windows-1256 encoding to UTF-8:

iconv -f WINDOWS-1256 -t UTF-8 file.txt > file_utf8.txt

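To make egrep treat the input as UTF-8, so that ranges and character classes match whole characters rather than individual bytes, run it under a UTF-8 locale (the locale must be installed on the system):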

LANG=ar_SA.UTF-8 egrep 'word' file.txt

Install and generate Arabic locales (Debian/Ubuntu shown; other distributions use different commands):

sudo apt install locales
sudo dpkg-reconfigure locales

Quick Reference

Description	Command
Simple search	egrep 'word' file
Case-insensitive search	egrep -i 'word' file
Show line numbers	egrep -n 'word' file
Search multiple words	egrep 'word1|word2|word3' file
Word at beginning of line	egrep '^word' file
Word at end of line	egrep 'word$' file
Any Arabic letter	egrep '[ء-ي]' file
Diacritics	egrep '[ًٌٍَُِّْٰ]' file
Alif Maqsura	egrep 'ى' file
Arabic/Indic digits	egrep '[٠-٩]' file
Convert encoding to UTF-8	iconv -f WINDOWS-1256 -t UTF-8 file.txt > file_utf8.txt

Additional Tips

Count occurrences of a word

egrep -o 'word' file.txt | wc -l
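
The -o form counts every occurrence, whereas -c counts matching lines. A minimal sketch of the difference, using a made-up three-line file:

```shell
# Sample data: the word appears twice on the first line (illustrative only).
counts=$(mktemp)
printf '%s\n' \
  'كتاب فوق كتاب' \
  'لا شيء هنا' \
  'كتاب واحد' > "$counts"

# -c counts lines that contain at least one match (grep -E behaves like egrep).
line_count=$(grep -E -c 'كتاب' "$counts")

# -o prints each match on its own line, so wc -l counts occurrences.
match_count=$(grep -E -o 'كتاب' "$counts" | wc -l | tr -d ' ')

printf 'matching lines: %s, total occurrences: %s\n' "$line_count" "$match_count"
```

For word-frequency work the occurrence count is usually the one that matters.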

Combine egrep with less for interactive viewing

egrep 'word' file.txt | less

Search multiple files recursively

egrep -r 'word' folder/
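
When a folder mixes file types, the recursive search can be narrowed. The sketch below adds two flags not used elsewhere in this guide, -l (print only file names) and --include (restrict the search to files matching a glob). The directory layout is hypothetical.

```shell
# Build a small throwaway corpus (illustrative only).
corpus=$(mktemp -d)
mkdir -p "$corpus/sub"
printf '%s\n' 'كتاب جديد' > "$corpus/a.txt"
printf '%s\n' 'كتاب قديم' > "$corpus/sub/c.txt"
printf '%s\n' 'كتاب آخر'  > "$corpus/notes.md"
printf '%s\n' 'قلم'        > "$corpus/d.txt"

# -r descends into subfolders, -l lists matching files only,
# --include limits the search to .txt files.
matches=$(grep -E -r -l --include='*.txt' 'كتاب' "$corpus" | sort)
found=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

printf '%s\n' "$matches"
```

Here notes.md is skipped despite containing the word, and d.txt is skipped because it has no match.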

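Enable result highlighting (if disabled); the older GREP_OPTIONS variable is obsolete and ignored by current GNU grep, so use an alias instead

alias egrep='egrep --color=auto'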