NLTK Cheatsheet

Python Import

import nltk
nltk.download()
# This step opens a window in which you can download 'All Corpora'
from nltk.book import *

View Tokens

text1[0:100]  # first 100 tokens (indices 0 to 99; the end index is exclusive)
text2[5]  # sixth token (indexing starts at 0)
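
Slices and indexes are easy to get wrong by one, so here is a small sketch on a made-up token list. The name `tokens` and its contents are hypothetical; the texts loaded from `nltk.book` can be sliced and indexed in the same way.

```python
# Hypothetical token list standing in for one of the nltk.book texts.
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", "--", "never", "mind"]

first_three = tokens[0:3]  # the end index is exclusive, so this holds 3 tokens
sixth = tokens[5]          # indexing starts at 0, so index 5 is the sixth token

print(first_three)
print(sixth)
```

A slice that runs past the end of the list is simply cut short, so `tokens[0:100]` on this ten-token list returns all ten tokens rather than raising an error.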

Concordance

# basic keyword-in-context
text3.concordance('begat')
# show more than the default 25 lines
text1.concordance('sea', lines=100)
# show all results: lines must be a number, so pass an upper bound such as len(text1)
text1.concordance('sea', lines=len(text1))
# set the width of each printed line to 10 characters and show all results
text1.concordance('sea', width=10, lines=len(text1))
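
To make the idea of keyword-in-context concrete, the following is a minimal plain-Python sketch of what a concordance view does. It is not NLTK's implementation: the function name `kwic`, its parameters, and the sample tokens are all assumptions chosen for illustration, and here `width` counts context characters per side rather than the width of the whole line.

```python
def kwic(tokens, word, width=30, lines=25):
    """Return up to `lines` keyword-in-context strings for `word`.

    `width` (assumed >= 1) is the number of context characters kept on
    each side of the keyword. Matching ignores case.
    """
    target = word.lower()
    results = []
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        # Each token contributes at least one character, so `width` tokens
        # on either side are always enough to fill the context window.
        left = " ".join(tokens[max(0, i - width):i])[-width:]
        right = " ".join(tokens[i + 1:i + 1 + width])[:width]
        results.append(f"{left:>{width}} {token} {right}")
        if len(results) == lines:
            break
    return results


# Hypothetical sample text, split on whitespace.
tokens = "the sea was calm and the Sea was wide".split()
hits = kwic(tokens, "sea", width=10)
for line in hits:
    print(line)
```

Right-aligning the left context places the keyword at the same offset on every line, which is what makes a concordance easy to scan.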

Common Contexts

text1.common_contexts(['sea', 'ocean'])
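
A context here means the pair of words immediately before and after a word. The sketch below shows one plausible way to find the contexts that two or more words share. It is an assumption-level illustration, not NLTK's code: the helper names `contexts` and `shared_contexts` and the sample tokens are hypothetical, and NLTK's own handling of case and text boundaries may differ.

```python
def contexts(tokens, word):
    """Return the set of (previous, next) token pairs around `word`."""
    found = set()
    # Skip the first and last positions, which lack a neighbour on one side.
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found


def shared_contexts(tokens, words):
    """Return the contexts in which every word in `words` appears."""
    context_sets = [contexts(tokens, w) for w in words]
    return set.intersection(*context_sets) if context_sets else set()


# Hypothetical sample text, split on whitespace.
tokens = "the sea is deep and the ocean is wide but a sea of grass".split()
common = shared_contexts(tokens, ["sea", "ocean"])
print(common)
```

In this sample, "sea" appears in two contexts but shares only one of them with "ocean", so the intersection contains a single pair.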

Similar

A concordance lets us see words in context. For example, "monstrous" occurs in contexts such as "the ___ pictures" and "a ___ size". What other words appear in a similar range of contexts? To find out, call similar on the text in question and pass the relevant word in parentheses:

text1.similar("monstrous")

Word Location

We can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text.

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
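
The positions behind such a plot are just token offsets from the start of the text. As a minimal sketch, the hypothetical helper `offsets` below collects them for a made-up token list; the plot itself is drawn by NLTK and requires Matplotlib to be installed.

```python
def offsets(tokens, word):
    """Return the 0-based token positions at which `word` occurs."""
    return [i for i, token in enumerate(tokens) if token == word]


# Hypothetical sample text, split on whitespace.
tokens = "we the people of America hold that freedom and duties belong to America".split()

positions = {word: offsets(tokens, word) for word in ["America", "freedom", "democracy"]}
print(positions)
```

Each list corresponds to one row of the plot, with one stripe per position; a word that never occurs, such as "democracy" in this sample, yields an empty row.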

Generate Random Text

text3.generate()

Counting

len(text3) #Genesis
44764

So Genesis has 44,764 words and punctuation symbols, or “tokens.” A token is the technical name for a sequence of characters — such as hairy, his, or :) — that we want to treat as a group.

Count a String

len('this is a string of text')  # number of characters

Count a list of tokens

len(text1)  # number of tokens

Make and Count a list of unique tokens

len(set(text1))  # number of unique tokens; set() returns a set (not a list) of the distinct tokens
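
The two counts above combine naturally into a ratio of distinct tokens to total tokens, often called lexical diversity. The sketch below is illustrative only: the function name `lexical_diversity` and the sample tokens are assumptions, and the same expression should work on a text from `nltk.book`.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical sample text, split on whitespace.
tokens = "the cat sat on the mat and the dog sat too".split()

total = len(tokens)        # every token, repeats included
unique = len(set(tokens))  # each distinct token counted once
print(total, unique, lexical_diversity(tokens))
```

Note that `set` is case-sensitive, so "The" and "the" would count as two distinct tokens unless the text is lowercased first.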

Count Occurrences

text1.count('heaven')  # how many times does a word occur?
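
A raw count is often more useful as a share of the whole text. The following sketch uses a made-up token list (the variable names are hypothetical) to count one word, express it as a percentage, and tally every token at once with `collections.Counter` from the standard library.

```python
from collections import Counter

# Hypothetical sample text, split on whitespace.
tokens = "in the beginning God created the heaven and the earth".split()

count = tokens.count("heaven")        # exact, case-sensitive match
percent = 100 * count / len(tokens)   # share of all tokens, in percent
counts = Counter(tokens)              # tally of every distinct token

print(count, percent)
print(counts.most_common(1))
```

Calling `count` once per word rescans the whole list each time, so a single `Counter` pass is the better choice when many words need counting.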
Frequency