Al-Rasaif (Why & How) | Linguist | اللُّــغَــوِيّــــ
O Allah, lift the affliction from Gaza 🇵🇸

In the name of Allah, and may peace and blessings be upon the Messenger of Allah.
O Allah, we ask You for pardon and well-being in this world and the Hereafter.

Introduction

Around a year ago, Allah blessed me with the opportunity to volunteer as a developer for the “Rasaif” project.


What is Rasaif?

Rasaif Al-Sihah Li Tarajim Al-Fusah is a platform that displays words in their context alongside their English equivalents. This remarkable project is distinguished by its selection of eloquent books from classical authors. It was initiated by the Saudi translator Ahmad Al-Ghamdi, author of Al-Aranjiyah, in collaboration with dedicated volunteers—may Allah accept their efforts. You can watch this video to learn more about the initiative.


Challenges

  1. Sustaining Volunteer Efforts
    Ensuring the continuity of volunteers working on formatting books using Word.

  2. Displaying Search Results in Context
    The ability to show search results within their context, along with their equivalent text in the other language.

  3. Comprehensive and Fast Search Feature
    Providing an instant search functionality that spans all books, similar to Al-Maktabah Al-Shamilah.

  4. Streamlined Book Upload
    Enabling the upload of a fully formatted book in one step, rather than paragraph by paragraph as before.

  5. Customizable Search Options
    Allowing search customization by:

    • Book
    • Category
    • Author
    • Language

  6. Sequential Result Navigation
    Allowing users to navigate to previous and next results within the same book.

  7. Search Analytics
    Tracking the words searched for by visitors, their search frequency, and whether they were found in the books.
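As a rough illustration of challenge 5, the customizable search could be expressed as an Elasticsearch bool query that combines a full-text match with optional exact filters. This is only a sketch under assumptions: the function name build_search_query and the field names book_name, book_category, book_author, and book_language are hypothetical, and the filters presume those fields are indexed as keywords.

```python
from typing import Optional


def build_search_query(
    text: str,
    book: Optional[str] = None,
    category: Optional[str] = None,
    author: Optional[str] = None,
    language: Optional[str] = None,
) -> dict:
    """Build a bool query: full-text match plus optional exact filters."""
    # Hypothetical field names; adjust them to the real index mapping.
    candidate_filters = {
        "book_name": book,
        "book_category": category,
        "book_author": author,
        "book_language": language,
    }
    # Only the options the visitor actually selected become filters.
    filters = [
        {"term": {field: value}}
        for field, value in candidate_filters.items()
        if value is not None
    ]
    return {
        "query": {
            "bool": {
                "must": [{"match": {"Original": text}}],
                "filter": filters,
            }
        }
    }


query = build_search_query("الزهد", author="أبو بكر البيهقي", language="الفرنسية")
```

Putting the options in the filter clause keeps them out of relevance scoring, so they only narrow the result set.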


Tools Used

  1. Pandoc
    For converting Word files into HTML format.

  2. Jupyter Notebook
    To process HTML files, refine texts using the Pandas library, and then convert the output into NDJSON (newline-delimited JSON) for Elasticsearch.

  3. Elasticsearch
    For advanced text search capabilities.
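To make the newline-delimited JSON step more concrete, here is a minimal sketch of the layout the Elasticsearch bulk API accepts: one action line followed by one document line per record. The helper name to_bulk_ndjson, the index name "rasaif", and the sample rows are assumptions for illustration, not the project's actual code.

```python
import json


def to_bulk_ndjson(rows, index_name):
    """Serialize dict rows into the bulk format: action line, then document line."""
    lines = []
    for row in rows:
        # Action line telling Elasticsearch which index receives the document.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Document line; ensure_ascii=False keeps Arabic text readable.
        lines.append(json.dumps(row, ensure_ascii=False))
    if not lines:
        return ""
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample rows, shaped like the Original/Translation pairs.
sample_rows = [
    {"Original": "نص عربي", "Translation": "Texte français"},
    {"Original": "نص آخر", "Translation": "Autre texte"},
]
payload = to_bulk_ndjson(sample_rows, "rasaif")
```

Each record takes two lines in the payload, so a file with N rows produces 2N lines.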


Steps

1- Using Pandoc for file conversion:

pandoc akhlaq.docx -t html -o akhlaq.html
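When there are many Word files, the same command can be driven from Python. The sketch below assumes pandoc is installed and on the PATH; the helper names build_pandoc_command and convert_docx are hypothetical, and only the flags from the command above are used.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(docx_path):
    """Return the pandoc argument list for converting one .docx file to HTML."""
    src = Path(docx_path)
    # Write the output next to the input, swapping the extension for .html.
    return ["pandoc", str(src), "-t", "html", "-o", str(src.with_suffix(".html"))]


def convert_docx(docx_path):
    """Run pandoc for one file, failing early if pandoc is not available."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(build_pandoc_command(docx_path), check=True)


command = build_pandoc_command("akhlaq.docx")
```

Calling convert_docx in a loop over Path("books").glob("*.docx") would convert a whole folder, where "books" stands in for wherever the Word files live.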

2- Importing essential libraries in Jupyter Notebook

import os
import sys
import pandas as pd
from bs4 import BeautifulSoup
from pathlib import Path
import io

3- Declaring Variables

book_name = 'كتاب الزهد الكبير'
book_language = 'الفرنسية'
book_author = 'أبو بكر البيهقي'
file_no = "1"
book = "zohd"

4- Parsing the HTML table and loading it into a DataFrame using Pandas.

# Define the file path
path = rf'organized/books/{book}/{file_no}.html'

# Parse the HTML file with BeautifulSoup
with open(path, 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'html.parser')

# Extract table headers
try:
    table = soup.find("table")
    header = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]
except AttributeError:
    raise ValueError("Table or headers not found in the HTML file.")

# Extract table rows
data = []
for row in table.find_all("tr")[1:]:  # Skip the header row
    row_data = [td.get_text(strip=True) for td in row.find_all("td")]
    data.append(row_data)

# Convert to a Pandas DataFrame
dataFrame = pd.DataFrame(data=data, columns=header)

5- Data cleaning and modification operations.

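# Keep only the translation (column 1) and the original (column 3) by position;
# the header labels read from the HTML are strings, so integer labels would not match
df = dataFrame.iloc[:, [1, 3]].copy()

# Rename the two remaining columns
df.columns = ['Translation', 'Original']

# Define patterns to replace (each pattern is treated as a regular expression)
patterns = [
    ('\"', '\\\"'),  # Escape double quotes
    ('“', '\\\"'),   # Replace left double quotation mark with an escaped double quote
    ('”', '\\\"'),   # Replace right double quotation mark with an escaped double quote
    (':', '\\\"'),   # Replace colon with an escaped double quote
    (r'\]', '\\\"'), # Replace closing square bracket with an escaped double quote
    (r'\[', '\\\"'), # Replace opening square bracket with an escaped double quote
    (r'\n', ' ')     # Replace newline with space
]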

# Clean both 'Original' and 'Translation' columns
for column in ['Original', 'Translation']:
    for pattern, replacement in patterns:
        df[column] = df[column].str.replace(pattern, replacement, regex=True)
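To see what these replacements do to a single cell, the same patterns can be applied with the standard re module, which follows the same regular-expression rules as str.replace with regex=True. This is a small sketch: the function name clean_text and the sample sentence are made up for illustration.

```python
import re

# Same (pattern, replacement) pairs as in the cleaning step above.
patterns = [
    ('\"', '\\\"'),
    ('“', '\\\"'),
    ('”', '\\\"'),
    (':', '\\\"'),
    (r'\]', '\\\"'),
    (r'\[', '\\\"'),
    (r'\n', ' '),
]


def clean_text(text):
    """Apply every replacement in order, as the DataFrame loop does per column."""
    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text)
    return text


sample = 'He said: "yes" [1]\nend'
cleaned = clean_text(sample)
print(cleaned)  # prints: He said\" \"yes\" \"1\" end
```

Order matters here: plain double quotes are escaped first, so the escaped quotes produced by the later patterns are not escaped a second time.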