In the name of Allah, and may peace and blessings be upon the Messenger of Allah.
O Allah, we ask You for forgiveness and well-being in this life and the Hereafter.
Introduction
About a year ago, Allah blessed me with the opportunity to volunteer as a developer for the “Rasaif” project.
What is the Rasaif Project?
Rasaif al-Siḥāḥ li-Tarājim al-Fuṣḥāḥ is a platform that displays words in context alongside their English equivalents. This remarkable project stands out for its curated selection of eloquent classical Arabic texts. It was launched by the Saudi translator Ahmad al-Ghamdi, author of the book Al-Aranjiyyah, with the help of dedicated volunteers—may Allah accept their efforts. You can watch this video to learn more about the initiative.
Challenges
Sustaining Volunteer Efforts
Ensuring that volunteers continue formatting books using Microsoft Word.
Displaying Search Results in Context
The ability to show search results within their full context, along with their corresponding text in the other language.
Fast and Comprehensive Search
Providing an instant, full-text search across all books, similar to Shamela Library.
Easier Book Uploads
Enabling users to upload an entire formatted book in one step, instead of paragraph-by-paragraph uploads as before.
Custom Search Filters
Allowing users to filter searches by:
- Book
- Category
- Author
- Language
Sequential Result Navigation
Allowing users to navigate forward and backward between results within the same book.
Search Data Analysis
Tracking the words visitors search for, how often they are searched, and whether they exist in the corpus.
Tools Used
Pandoc
To convert Word documents into HTML format.
Jupyter Notebook
To process HTML files, clean the texts using the Pandas library, and export the results into JSON-ND for use in Elasticsearch.
Elasticsearch
To provide advanced text search functionality.
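To give an idea of how Elasticsearch can cover several of the requirements listed under Challenges at once (results in context, custom filters, and sequential navigation), here is an illustrative query body. It is only a sketch: the field names original, book, and language are my assumptions, not the project's actual mapping:
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"original": "الزهد"}}      # full-text match on the searched word
            ],
            "filter": [
                {"term": {"book": "zohd"}},           # filter by book
                {"term": {"language": "French"}}      # filter by language (author and category work the same way)
            ]
        }
    },
    "highlight": {"fields": {"original": {}}},        # return each hit with its surrounding context highlighted
    "from": 0,                                        # paging allows moving forward and backward through results
    "size": 10
}
The same body can be posted to the index's _search endpoint or passed to the official Python client.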
Steps
- Use Pandoc to convert the file:
pandoc akhlaq.docx -t html -o akhlaq.html
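If there are many Word files to convert, the same Pandoc command can be scripted from Python. This is just a convenience sketch; it assumes the .docx files sit in a single folder (here called books_docx) and that pandoc is available on the PATH:
import subprocess
from pathlib import Path

# Run Pandoc on every Word file in the folder and write the HTML next to it
for docx in Path('books_docx').glob('*.docx'):
    html_file = docx.with_suffix('.html')
    subprocess.run(['pandoc', str(docx), '-t', 'html', '-o', str(html_file)], check=True)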
- Import the necessary libraries in Jupyter Notebook:
import os
import sys
import pandas as pd
from bs4 import BeautifulSoup
from pathlib import Path
import io
- Define variables:
book_name = 'Kitab al-Zuhd al-Kabir'
book_language = 'French'
book_author = 'Abu Bakr al-Bayhaqi'
file_no = "1"
book = "zohd"
- Parse the HTML table and load it into a Pandas DataFrame:
# Define file path
path = rf'organized/books/{book}/{file_no}.html'
# Parse the HTML using BeautifulSoup
with open(path, 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'html.parser')
# Extract table headers
try:
    table = soup.find("table")
    header = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]
except AttributeError:
    raise ValueError("Table or headers not found in the HTML file.")
# Extract row data
data = []
for row in table.find_all("tr")[1:]:  # Skip header row
    row_data = [td.get_text(strip=True) for td in row.find_all("td")]
    data.append(row_data)
# Convert to Pandas DataFrame
dataFrame = pd.DataFrame(data=data, columns=header)
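Before cleaning, it is worth a quick check that the table parsed as expected (a convenience step, not part of the original workflow):
# Confirm the number of rows and columns, and peek at the first few entries
print(dataFrame.shape)
print(dataFrame.head())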
- Clean and modify the data:
# Drop unnecessary columns, selecting them by position since the header names come from the table
df = dataFrame.drop(columns=dataFrame.columns[[0, 2, 4]])
# Rename the remaining columns: the second column holds the translation, the fourth the original text
df.columns = ['Translation', 'Original']
# Define patterns to escape or replace before the JSON-ND export
patterns = [
    ('"', '\\"'),    # Escape straight double quotes
    ('“', '\\"'),    # Replace left curly quotes with an escaped straight quote
    ('”', '\\"'),    # Replace right curly quotes with an escaped straight quote
    (':', '\\"'),    # Replace colons with an escaped quote
    (r'\]', '\\"'),  # Replace right square brackets with an escaped quote
    (r'\[', '\\"'),  # Replace left square brackets with an escaped quote
    (r'\n', ' ')     # Replace newlines with spaces
]
# Apply the cleaning to both columns
for column in ['Original', 'Translation']:
    for pattern, replacement in patterns:
        df[column] = df[column].str.replace(pattern, replacement, regex=True)
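The remaining step, exporting the cleaned rows as JSON-ND so Elasticsearch can ingest them, is not shown above. A minimal sketch might look like the following; the output path, the field names, and the rasaif index name are my assumptions, not the project's actual setup:
# Write one bulk-action line and one document line per row (Elasticsearch bulk / JSON-ND format)
output_path = rf'organized/books/{book}/{file_no}.json'
with io.open(output_path, 'w', encoding='utf-8') as out:
    for _, row in df.iterrows():
        out.write('{"index": {"_index": "rasaif"}}\n')
        out.write(
            f'{{"book": "{book_name}", "author": "{book_author}", "language": "{book_language}", '
            f'"original": "{row["Original"]}", "translation": "{row["Translation"]}"}}\n'
        )
Because the quotes were already escaped in the cleaning step, the text can be embedded directly and each line stays valid JSON. The resulting file can then be sent to Elasticsearch's _bulk endpoint (with the application/x-ndjson content type) or indexed with the official Python client's bulk helpers.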