Detect the encoding of a file
file -i corpus.txt
List all available encodings on the system
iconv -l
Convert a file from Windows Arabic encoding (CP1256) to UTF-8
iconv -f CP1256 -t UTF-8 input.txt -o output.txt
Convert a file from Latin-1 to UTF-8
iconv -f ISO-8859-1 -t UTF-8 input.txt -o output.txt
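Guard a conversion so a file that is already UTF-8 is not converted twice
A minimal sketch, assuming a POSIX shell and an iconv that exits non-zero on invalid input (GNU iconv does). The names is_utf8 and to_utf8 are hypothetical helpers, not part of iconv.

```shell
#!/bin/sh

# is_utf8 FILE: succeeds if FILE decodes cleanly as UTF-8.
# Assumption: iconv returns a non-zero status on an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# to_utf8 FILE FROM_ENCODING: hypothetical helper, writes FILE to stdout as UTF-8.
# FROM_ENCODING is only used when FILE is not already valid UTF-8.
to_utf8() {
    if is_utf8 "$1"; then
        cat "$1"
    else
        iconv -f "$2" -t UTF-8 "$1"
    fi
}
```

Usage: to_utf8 input.txt ISO-8859-1 > output.txt
The check only shows that a file is not valid UTF-8; it cannot tell which legacy encoding it uses, so the source encoding still has to be known or guessed.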
Inspect raw bytes of a file for encoding issues
xxd corpus.txt | head -n 20
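Check whether a file starts with a UTF-8 byte order mark
One thing worth looking for in the raw bytes is a leading EF BB BF, the UTF-8 BOM. A rough sketch, assuming a head that supports -c (common, but not required by POSIX). The name has_utf8_bom is a hypothetical helper.

```shell
#!/bin/sh

# has_utf8_bom FILE: succeeds if the first three bytes of FILE are EF BB BF.
# od -An -tx1 prints the bytes as hex without an address column;
# tr strips the spaces and newline so the result can be compared directly.
has_utf8_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}
```

Usage: has_utf8_bom corpus.txt && echo "BOM present"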
Check locale and language settings
locale
List all available locales on the system
locale -a
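Check whether any UTF-8 locale is installed
A small sketch that filters the locale list for UTF-8 entries. It assumes locale names carry a utf8 or UTF-8 suffix, which is the usual spelling on Linux and macOS but not guaranteed everywhere. The name has_utf8_locale is a hypothetical helper that reads the list from standard input.

```shell
#!/bin/sh

# has_utf8_locale: reads locale names on stdin, one per line, and succeeds
# if at least one contains "utf8" or "utf-8" (case-insensitive).
has_utf8_locale() {
    grep -Eiq 'utf-?8'
}
```

Usage: locale -a | has_utf8_locale && echo "UTF-8 locale available"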