The Linguist

This guide explains how to use egrep (equivalent to grep -E) to search Arabic text files on Linux/Unix systems. It is especially useful for linguists, translators, and anyone doing text analysis.

Basics of Searching Arabic Text

# Simple search for an Arabic word
egrep 'word' file.txt

# Case-insensitive search (Arabic script has no letter case, so -i only matters for Latin text mixed into the file)
egrep -i 'word' file.txt

# Show line numbers with the result
egrep -n 'word' file.txt

# Search for a full sentence or phrase
egrep 'this is a full sentence' file.txt
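
The placeholders above can be tried on real Arabic text. The following is a minimal sketch, assuming a UTF-8 terminal; the sample sentences and the variable names are made up for illustration, and grep -E is used as the equivalent of egrep.

```shell
# Build a small sample file (hypothetical content, for illustration only)
sample=$(mktemp)
printf '%s\n' 'السلام عليكم' 'كتاب جديد' 'سلام' > "$sample"

# -n prefixes each matching line with its line number
matches=$(grep -E -n 'سلام' "$sample")
printf '%s\n' "$matches"
# Prints:
# 1:السلام عليكم
# 3:سلام

rm -f "$sample"
```

The first line matches because the search is a substring search: the pattern occurs inside the longer word.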

Searching for Specific Words

# Search for multiple words (logical OR)
egrep 'word1|word2|word3' file.txt

# Search for a word at the beginning of a line
egrep '^word' file.txt

# Search for a word at the end of a line
egrep 'word$' file.txt

# Search for exact word matches (using word boundaries)
egrep '\bword\b' file.txt
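
As a rough illustration of alternation and anchors on Arabic text, here is a small sketch; the sample lines and variable names are invented for the example. Note that ^ and $ refer to the logical start and end of the stored line, so in right-to-left display the "beginning" of a line appears on the right.

```shell
# Hypothetical sample data
sample=$(mktemp)
printf '%s\n' 'كتاب جديد' 'قرأت كتاب' 'قلم أزرق' 'دفتر' > "$sample"

# Lines containing either word (logical OR)
either=$(grep -E 'كتاب|قلم' "$sample")

# Lines that begin with the word
starts=$(grep -E '^كتاب' "$sample")

# Lines that end with the word
ends=$(grep -E 'كتاب$' "$sample")

printf 'OR:\n%s\n\nstart:\n%s\n\nend:\n%s\n' "$either" "$starts" "$ends"

rm -f "$sample"
```

Here the OR search returns the first three lines, the ^ search returns only the first line, and the $ search returns only the second.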

Searching Using Ranges

# Search for letters within a specific range (e.g. from Alef to Ṣād)
egrep '[أ-ص]' file.txt

# Search for any Arabic letter (approximate: the range covers the basic letters from hamza to yaa)
egrep '[ء-ي]' file.txt

# Search for words starting with a specific letter (here: alif in any of its forms)
egrep '\b[اأإآ]' file.txt

Searching for Specific Characters or Symbols

# Search for diacritics (short vowels, tanween, shadda, sukun, superscript alif)
egrep '[ًٌٍَُِّْٰ]' file.txt

# Search for different types of hamzas
egrep '[أإآءؤئ]' file.txt

# Search for alif maqsura
egrep 'ى' file.txt

# Search for tanween markers
egrep '[ًٌٍ]' file.txt

# Search for Arabic-Indic digits
egrep '[٠-٩]' file.txt

# Search for both Arabic-Indic and Western digits
egrep '[٠-٩0-9]' file.txt

Advanced Search Patterns

# Search for a word with optional diacritics (? makes the preceding mark optional)
egrep 'سَ?لَ?مَ?' file.txt

# Search for letters that carry one or more diacritics
egrep '[ء-ي][ًٌٍَُِّْٰ]+' file.txt

# Search for a line containing one word followed by another
egrep 'word.*word2' file.txt

# Ignore diacritics in the search (allow any number of diacritics after each letter)
egrep 'س[ًٌٍَُِّْٰ]*ل[ًٌٍَُِّْٰ]*م' file.txt
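
One way to make a search insensitive to diacritics is to allow zero or more diacritic marks after each base letter. The sketch below builds such a pattern from a reusable fragment; the variable names (marks, pattern, hits) and the sample words are assumptions made for the example.

```shell
# Hypothetical sample: the same root with and without diacritics, plus a different word
sample=$(mktemp)
printf '%s\n' 'سلم' 'سَلَّمَ' 'سُلَّم' 'سالم' > "$sample"

# Zero or more diacritic marks (tanween, short vowels, shadda, sukun, superscript alif)
marks='[ًٌٍَُِّْٰ]*'

# Letters seen, lam, meem, each optionally followed by diacritics
pattern="س${marks}ل${marks}م"

# -c counts matching lines
hits=$(grep -E -c "$pattern" "$sample")
printf 'matching lines: %s\n' "$hits"
# Prints: matching lines: 3

rm -f "$sample"
```

The last sample word does not match because it has an alif, which is a letter and not a diacritic, between the first two letters.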

Important Notes About Encoding

Make sure the file is UTF-8 encoded:

file -i file.txt

Convert from Windows-1256 encoding to UTF-8:

iconv -f WINDOWS-1256 -t UTF-8 file.txt > file_utf8.txt
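
The following sketch shows why the conversion matters: a pattern typed in a UTF-8 terminal does not match the same word stored as Windows-1256 bytes. The file names and sample text are invented for the example, and it assumes the installed iconv supports the WINDOWS-1256 charset name.

```shell
utf8=$(mktemp); legacy=$(mktemp); back=$(mktemp)
printf '%s\n' 'السلام عليكم' > "$utf8"

# Simulate a legacy file by converting the sample to Windows-1256
iconv -f UTF-8 -t WINDOWS-1256 "$utf8" > "$legacy"

# The UTF-8 pattern finds nothing in the legacy bytes (grep exits non-zero, hence || true)
before=$(grep -c 'سلام' "$legacy" || true)

# Convert back to UTF-8 and search again
iconv -f WINDOWS-1256 -t UTF-8 "$legacy" > "$back"
after=$(grep -c 'سلام' "$back")

# Windows-1256 uses one byte per Arabic letter, UTF-8 uses two
utf8_bytes=$(wc -c < "$utf8")
legacy_bytes=$(wc -c < "$legacy")

# Check that the round trip reproduced the original file exactly
cmp -s "$utf8" "$back" && roundtrip=ok || roundtrip=failed

printf 'before=%s after=%s roundtrip=%s\n' "$before" "$after" "$roundtrip"
rm -f "$utf8" "$legacy" "$back"
```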

To avoid display and matching issues, run egrep under a UTF-8 locale:

LANG=ar_SA.UTF-8 egrep 'word' file.txt

Install better Arabic locale support (depends on your distro):

sudo apt install locales
sudo dpkg-reconfigure locales

Quick Reference

Description                  Command
Simple search                egrep 'word' file
Case-insensitive search      egrep -i 'word' file
Show line numbers            egrep -n 'word' file
Search multiple words        egrep 'word1|word2' file
Word at beginning of line    egrep '^word' file
Word at end of line          egrep 'word$' file
Any Arabic letter            egrep '[ء-ي]' file
Diacritics                   egrep '[ًٌٍَُِّْٰ]' file
Alif Maqsura                 egrep 'ى' file
Arabic-Indic digits          egrep '[٠-٩]' file
Convert encoding to UTF-8    iconv -f WINDOWS-1256 -t UTF-8 file.txt > file_utf8.txt

Additional Tips

Count occurrences of a word

egrep -o 'word' file.txt | wc -l
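
A small sketch of the difference between counting matching lines and counting occurrences, using made-up sample text: -c counts lines that contain the word, while -o prints each match on its own line so that wc -l counts every occurrence.

```shell
# Hypothetical sample: the word appears twice on line 1 and once on line 3
sample=$(mktemp)
printf '%s\n' 'كتاب بعد كتاب' 'قلم' 'هذا كتاب' > "$sample"

# Number of lines containing the word
line_count=$(grep -E -c 'كتاب' "$sample")

# Number of occurrences of the word
occurrences=$(grep -E -o 'كتاب' "$sample" | wc -l)

printf 'lines=%s occurrences=%s\n' "$line_count" "$occurrences"

rm -f "$sample"
```

With this sample the line count is 2 and the occurrence count is 3.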

Combine egrep with less for interactive viewing

egrep 'word' file.txt | less

Search multiple files recursively

egrep -r 'word' folder/

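Enable result highlighting (if disabled). The GREP_OPTIONS variable is obsolescent in current GNU grep, so pass the option directly or define an alias:

alias egrep='egrep --color=auto'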
