NLTK Cheatsheet | Linguist | اللُّــغَــوِيّــــ

Python Import

import nltk
nltk.download()  # opens a downloader window in which you can select 'All Corpora'
from nltk.book import *

View Tokens

text1[0:100]  # first 100 tokens (indices 0 to 99)
text2[5]  # sixth token (indexing starts at 0)
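
The texts loaded from nltk.book can be indexed and sliced like ordinary Python lists, so the usual off-by-one rules apply. A minimal sketch on a hand-made token list (the `tokens` name and its contents are illustrative, not taken from an NLTK corpus):

```python
# Illustrative token list; the nltk.book texts index and slice the same way.
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", "-", "never", "mind"]

first_three = tokens[0:3]  # the stop index is excluded, so this yields 3 tokens
sixth = tokens[5]          # indexing starts at 0, so index 5 is the sixth token

print(first_three)
print(sixth)
```

The same reasoning explains why `text1[0:100]` returns 100 tokens rather than 101.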

Concordance

# basic keyword-in-context
text3.concordance("begat")
# show a number of lines other than the default 25
text1.concordance("sea", lines=100)
# show all results: len(text1) is an upper bound on the number of matches
text1.concordance("sea", lines=len(text1))
# set the display width of each line to 10 characters (default 79) and show all results
text1.concordance("sea", width=10, lines=len(text1))
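
To make the idea of keyword-in-context concrete, here is a rough pure-Python sketch of what a concordance view shows. The `kwic` function, its `context` parameter, and the sample sentence are hypothetical and written for illustration; this is not NLTK's implementation, and matching case-insensitively is a choice made for this sketch.

```python
def kwic(tokens, word, context=3, lines=25):
    """Return up to `lines` keyword-in-context rows for `word` (case-insensitive)."""
    rows = []
    target = word.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # Collect up to `context` tokens on each side of the match.
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            rows.append(f"{left} [{tok}] {right}".strip())
            if len(rows) == lines:
                break
    return rows


sample = "the sea was calm but the Sea can turn and the sea rose".split()
for row in kwic(sample, "sea", context=2):
    print(row)
```

Passing a smaller `lines` value caps the number of rows, which mirrors the role of the `lines` argument in the calls above.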

Common Contexts

text1.common_contexts(["sea", "ocean"])
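
A context here can be thought of as the pair of words immediately before and after a given word. The following sketch finds the contexts two words share in a toy sentence; the `contexts` helper and the sample text are hypothetical, and NLTK's own method may filter or format results differently.

```python
from collections import defaultdict


def contexts(tokens):
    """Map each lowercased word to the set of (previous, next) pairs around it."""
    seen = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        seen[word.lower()].add((prev.lower(), nxt.lower()))
    return seen


sample = "the sea is deep and the ocean is deep but a sea of fog".split()
ctx = contexts(sample)
shared = ctx["sea"] & ctx["ocean"]  # contexts in which both words occur
print(shared)
```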

Similar

A concordance permits us to see words in context. For example, monstrous occurs in contexts such as the ___ pictures and a ___ size. What other words appear in a similar range of contexts? We can find out by calling the similar method on the text in question and passing the relevant word in parentheses:

text1.similar("monstrous")
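
One simple way to approximate "appears in a similar range of contexts" is to rank other words by how many (previous word, next word) pairs they share with the target. The sketch below does that on a toy sentence; `similar_words` and its scoring are assumptions made for illustration and are not NLTK's actual similarity measure.

```python
from collections import Counter, defaultdict


def similar_words(tokens, target, top=5):
    """Rank other words by the number of (previous, next) contexts shared with `target`."""
    ctx = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word].add((prev, nxt))
    target_ctx = ctx.get(target, set())
    scores = Counter()
    for word, pairs in ctx.items():
        overlap = len(pairs & target_ctx)
        if word != target and overlap:
            scores[word] = overlap
    return [word for word, _ in scores.most_common(top)]


sample = (
    "a monstrous size and a modest size and "
    "the monstrous pictures and the curious pictures"
).split()
print(similar_words(sample, "monstrous"))
```

Here modest and curious are returned because each fills a slot that monstrous also fills (a ___ size and the ___ pictures).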

Word Location

We can also determine the location of a word in the text, that is, how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text.

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
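
The data behind a dispersion plot is just the list of token positions at which each word occurs. A minimal sketch of collecting those positions, using a hypothetical `word_offsets` helper and a toy sentence (the plot itself additionally requires matplotlib):

```python
def word_offsets(tokens, words):
    """Return {word: [positions]}, where a position counts tokens from the start."""
    targets = set(words)
    offsets = {w: [] for w in words}
    for position, tok in enumerate(tokens):
        if tok in targets:
            offsets[tok].append(position)
    return offsets


sample = "citizens value freedom and citizens defend freedom".split()
offsets = word_offsets(sample, ["citizens", "freedom", "duties"])
for word, positions in offsets.items():
    print(word, positions)
```

Each list of positions corresponds to one row of stripes in the plot; a word that never occurs, such as duties here, gives an empty row.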

Generate Random Text

text3.generate()

Counting

len(text3)  # Genesis
44764

So Genesis has 44,764 words and punctuation symbols, or “tokens.” A token is the technical name for a sequence of characters — such as hairy, his, or :) — that we want to treat as a group.

Count a String

len("this is a string of text")  # number of characters

Count a list of tokens

len(text1)  # number of tokens

Make and Count a list of unique tokens

len(set(text1))  # set() returns the unique tokens (as a set, not a list)
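
The three counts above measure different things: characters in a string, tokens in a list, and distinct tokens in a set. A small sketch that puts them side by side, assuming naive whitespace splitting as a stand-in for real tokenization (the sentence and variable names are illustrative):

```python
sentence = "the cat sat on the mat"
tokens = sentence.split()   # naive whitespace tokenization, for illustration only

n_chars = len(sentence)     # characters, spaces included
n_tokens = len(tokens)      # tokens, repeats included
n_types = len(set(tokens))  # distinct tokens only

print(n_chars, n_tokens, n_types)
```

The repeated word the is counted twice among the tokens but only once among the unique tokens.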

Count Occurrences

text1.count("heaven")  # how many times does a word occur?
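
Counting compares tokens exactly, so capitalized and lowercase forms are tallied separately. The sketch below shows the difference on a plain token list and one way to count case-insensitively; the sample tokens are made up, and it is worth confirming on your own text that its count method behaves like `list.count`.

```python
tokens = "Heaven and earth , heaven and hell".split()

exact = tokens.count("heaven")                        # case-sensitive match
folded = [t.lower() for t in tokens].count("heaven")  # case-insensitive match

print(exact, folded)
```
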
Frequency