linguist.page@gmail.com

Detect the encoding of a file

file -i corpus.txt
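
The result of `file -i` is a heuristic guess, so a stricter check can be useful before processing a corpus. The sketch below assumes an `iconv` that exits non-zero on invalid input (GNU iconv does) and round-trips each file from UTF-8 to UTF-8. The file names and the `is_utf8` helper are illustrative, not part of any tool.

```shell
# Work in a scratch directory; the file names here are illustrative.
tmpdir=$(mktemp -d)
printf 'caf\303\251\n' > "$tmpdir/good.txt"   # "café" encoded as UTF-8
printf 'caf\351\n'     > "$tmpdir/bad.txt"    # "café" encoded as Latin-1

# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
# iconv exits non-zero when the input is invalid in the source encoding.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

for f in "$tmpdir/good.txt" "$tmpdir/bad.txt"; do
  if is_utf8 "$f"; then
    echo "$(basename "$f"): valid UTF-8"
  else
    echo "$(basename "$f"): not valid UTF-8"
  fi
done
```

A file that passes this check may still be pure ASCII, which is a subset of UTF-8, so the check confirms validity rather than identifying the original encoding.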

List all available encodings on the system

iconv -l

Convert a file from a Windows encoding (here CP1256, Windows Arabic) to UTF-8

iconv -f CP1256 -t UTF-8 input.txt -o output.txt

Convert a file from Latin-1 to UTF-8

iconv -f ISO-8859-1 -t UTF-8 input.txt -o output.txt
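
The `-o` option is not available in every iconv implementation, so redirecting standard output is a more portable way to write the result. The following is a minimal sketch under that assumption. It builds a small Latin-1 sample (the file names are made up), converts it, and dumps the bytes with `od` to confirm the conversion.

```shell
tmpdir=$(mktemp -d)
printf 'caf\351\n' > "$tmpdir/latin1.txt"   # Latin-1 bytes: 63 61 66 e9 0a

# Write to a new file via stdout; discard the output if iconv fails.
if iconv -f ISO-8859-1 -t UTF-8 "$tmpdir/latin1.txt" > "$tmpdir/utf8.txt"; then
  echo "converted"
else
  echo "conversion failed" >&2
  rm -f "$tmpdir/utf8.txt"
fi

# Dump the result as hex: the single byte e9 should now be the pair c3 a9.
hex=$(od -An -tx1 "$tmpdir/utf8.txt" | tr -d ' \n')
echo "$hex"
```

Writing to a separate output file rather than over the input keeps the original available in case the source encoding was misidentified.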

Inspect raw bytes of a file for encoding issues

xxd corpus.txt | head -n 20
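
One common thing to look for in the raw bytes is a UTF-8 byte order mark (EF BB BF) at the start of the file. The sketch below shows one way to test for it. It uses `head -c` and `od` instead of `xxd`, on the assumption that `xxd` may not be installed everywhere; `has_bom` and the sample files are hypothetical.

```shell
tmpdir=$(mktemp -d)
printf '\357\273\277hello\n' > "$tmpdir/bom.txt"     # EF BB BF, then text
printf 'hello\n'             > "$tmpdir/nobom.txt"

# Hypothetical helper: succeeds if the first three bytes are ef bb bf.
has_bom() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

for f in "$tmpdir/bom.txt" "$tmpdir/nobom.txt"; do
  if has_bom "$f"; then
    echo "$(basename "$f"): BOM present"
  else
    echo "$(basename "$f"): no BOM"
  fi
done
```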

Check locale and language settings

locale

List all available locales on the system

locale -a
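
When working with non-ASCII text it often helps to know whether any UTF-8 locale is installed at all. A rough sketch: filter the `locale -a` list for names that mention UTF-8. Spellings vary between systems ("UTF-8", "utf8"), so the pattern is deliberately loose. `has_utf8_locale` is an illustrative helper, not a standard command.

```shell
# Hypothetical helper: reads a locale list on stdin and succeeds
# if any name mentions UTF-8 in either common spelling.
has_utf8_locale() {
  grep -Eiq 'utf-?8'
}

if locale -a 2>/dev/null | has_utf8_locale; then
  echo "UTF-8 locale available"
else
  echo "no UTF-8 locale found"
fi
```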