Count lines, words, and bytes in a file
wc filename
Count only lines in a corpus file
wc -l corpus.txt
Count words in a corpus file
wc -w corpus.txt
Count bytes in a corpus file (use -m instead of -c to count characters)
wc -c corpus.txt
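When a script needs only the number, reading from standard input keeps wc from printing the filename. A minimal sketch, assuming a throwaway sample file made with mktemp (the file contents and variable names are illustrative, not from the original notes):

```shell
# Create a small sample file (hypothetical content for illustration).
tmp=$(mktemp)
printf 'one two\nthree\n' > "$tmp"

# Reading from stdin makes wc omit the filename; tr strips the
# padding that some wc implementations put before the number.
lines=$(wc -l < "$tmp" | tr -d ' ')
words=$(wc -w < "$tmp" | tr -d ' ')
bytes=$(wc -c < "$tmp" | tr -d ' ')

echo "lines=$lines words=$words bytes=$bytes"
rm -f "$tmp"
```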
Sort file contents alphabetically
sort filename
Sort in reverse order
sort -r filename
Collapse adjacent duplicate lines (sort the file first to catch all duplicates)
uniq filename
Remove duplicate lines from sorted output
sort corpus.txt | uniq
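Why the sort matters: uniq compares each line only with the line directly before it. A small sketch using inline sample input (the three sample lines are made up for illustration):

```shell
# Without sorting, the two "b" lines are not adjacent, so both survive.
unsorted=$(printf 'b\na\nb\n' | uniq | wc -l | tr -d ' ')

# After sorting, the duplicates sit next to each other and collapse.
sorted=$(printf 'b\na\nb\n' | sort | uniq | wc -l | tr -d ' ')

echo "without sort: $unsorted lines"
echo "with sort:    $sorted lines"
```

sort -u filename gives the same result as sort filename | uniq in a single command.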
Count frequency of each unique line
sort corpus.txt | uniq -c
Sort by frequency, descending (gives word counts when the file has one word per line)
sort corpus.txt | uniq -c | sort -rn
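If the corpus holds running text rather than one word per line, the text has to be split into words first. One possible sketch, assuming whitespace-separated tokens and a made-up two-line sample file; the tr and awk steps are additions for illustration, not part of the original pipeline:

```shell
# Sample text (hypothetical content for illustration).
tmp=$(mktemp)
printf 'the cat saw the dog\nthe dog ran\n' > "$tmp"

# tr turns every run of whitespace into a single newline: one word per line.
# awk rewrites uniq -c's padded output as "count word".
top=$(tr -s '[:space:]' '\n' < "$tmp" | sort | uniq -c | sort -rn |
  awk '{print $1, $2}' | head -n 1)

echo "$top"
rm -f "$tmp"
```

This does no case folding or punctuation stripping, so "The" and "the," would be counted as separate words.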
Print one copy of each line that appears more than once
sort corpus.txt | uniq -d
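For comparison, uniq -u does the opposite of -d and prints only the lines that occur exactly once. A short sketch on inline sample input (the four sample lines are made up for illustration):

```shell
# Sorted input is: a a b c
dups=$(printf 'a\nb\na\nc\n' | sort | uniq -d)

# paste -s joins the remaining lines with spaces for easier display.
singles=$(printf 'a\nb\na\nc\n' | sort | uniq -u | paste -s -d ' ' -)

echo "repeated: $dups"
echo "single:   $singles"
```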
Shuffle lines randomly for training data prep
shuf corpus.txt > corpus_shuffled.txt
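A quick sanity check after shuffling is to confirm that no lines were lost or duplicated. A minimal sketch, assuming shuf from GNU coreutils is installed and using temporary files with made-up content in place of corpus.txt:

```shell
# Sample input and output files (hypothetical content for illustration).
tmp_in=$(mktemp)
tmp_out=$(mktemp)
printf 'a\nb\nc\nd\n' > "$tmp_in"

shuf "$tmp_in" > "$tmp_out"

# Sorting both files removes the ordering difference; identical sorted
# content means the shuffle kept every line exactly once.
sort "$tmp_in" > "$tmp_in.sorted"
sort "$tmp_out" > "$tmp_out.sorted"
if cmp -s "$tmp_in.sorted" "$tmp_out.sorted"; then
  status="same lines"
else
  status="lines differ"
fi

echo "$status"
rm -f "$tmp_in" "$tmp_out" "$tmp_in.sorted" "$tmp_out.sorted"
```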