Natural Language

Natural Language Processing; Chart Parsing and PageRanking (Chapter 22-23)

nlp.Rules(**rules)[source]

Create a dictionary mapping symbols to alternative sequences. >>> Rules(A = “B C | D E”) {‘A’: [[‘B’, ‘C’], [‘D’, ‘E’]]}

nlp.Lexicon(**rules)[source]

Create a dictionary mapping symbols to alternative words. >>> Lexicon(Article = “the | a | an”) {‘Article’: [‘the’, ‘a’, ‘an’]}

class nlp.Grammar(name, rules, lexicon)[source]

Bases: object

A context-free grammar defined by a set of rewrite rules and a lexicon.

Rules map non-terminal categories to alternative right-hand sides, while the lexicon maps categories to the words that belong to them.

rewrites_for(cat)[source]

Return a sequence of possible rhs’s that cat can be rewritten as.

isa(word, cat)[source]

Return True iff word is of category cat

cnf_rules()[source]

Returns the tuple (X, Y, Z) for rules in the form: X -> Y Z

generate_random(S='S')[source]

Replace each token in S by a random entry in grammar (recursively).

nlp.ProbRules(**rules)[source]

Create a dictionary mapping symbols to alternative sequences, with probabilities. >>> ProbRules(A = “B C [0.3] | D E [0.7]”) {‘A’: [([‘B’, ‘C’], 0.3), ([‘D’, ‘E’], 0.7)]}

nlp.ProbLexicon(**rules)[source]

Create a dictionary mapping symbols to alternative words, with probabilities. >>> ProbLexicon(Article = “the [0.5] | a [0.25] | an [0.25]”) {‘Article’: [(‘the’, 0.5), (‘a’, 0.25), (‘an’, 0.25)]}

class nlp.ProbGrammar(name, rules, lexicon)[source]

Bases: object

A probabilistic context-free grammar.

Like Grammar, but every rule and lexicon entry carries a probability, so derivations can be sampled and scored by likelihood.

rewrites_for(cat)[source]

Return a sequence of possible rhs’s that cat can be rewritten as.

isa(word, cat)[source]

Return True iff word is of category cat

cnf_rules()[source]

Returns the tuple (X, Y, Z, p) for rules in the form: X -> Y Z [p]

generate_random(S='S')[source]

Replace each token in S by a random entry in grammar (recursively). Returns a tuple of (sentence, probability).

class nlp.Chart(grammar, trace=False)[source]

Bases: object

Class for parsing sentences using a chart data structure. >>> chart = Chart(E0) >>> len(chart.parses(‘the stench is in 2 2’)) 1

parses(words, S='S')[source]

Return a list of parses; words can be a list or string.

parse(words, S='S')[source]

Parse a list of words; according to the grammar. Leave results in the chart.

add_edge(edge)[source]

Add edge to chart, and see if it extends or predicts another edge.

scanner(j, word)[source]

For each edge expecting a word of this category here, extend the edge.

predictor(edge)[source]

Add to chart any rules for B that could help extend this edge.

extender(edge)[source]

See what edges can be extended by this edge.

nlp.CYK_parse(words, grammar)[source]

[Figure 23.5]

nlp.load_page_html(address_list)[source]

Download HTML page content for every URL address passed as argument

nlp.init_pages(address_list)[source]

Create a dictionary of pages from a list of URL addresses

nlp.strip_raw_html(raw_html)[source]

Remove the <head> section of the HTML which contains links to stylesheets etc., and remove all other unnecessary HTML

Given a set of pages that have their outlinks determined, we can fill out a page’s inlinks by looking through all other page’s outlinks

Search a page’s HTML content for URL links to other pages

nlp.only_wikipedia_urls(urls)[source]

Some example HTML page data is from wikipedia. This function converts relative wikipedia links to full wikipedia URLs

nlp.expand_pages(pages)[source]

Adds in every page that links to or is linked from one of the relevant pages.

nlp.relevant_pages(query)[source]

Relevant pages are pages that contain all of the query words. They are obtained by intersecting the hit lists of the query words.

nlp.normalize(pages)[source]

Normalize divides each page’s score by the sum of the squares of all pages’ scores (separately for both the authority and hub scores).

class nlp.ConvergenceDetector[source]

Bases: object

If the hub and authority values of the pages are no longer changing, we have reached a convergence and further iterations will have no effect. This detects convergence so that we can stop the HITS algorithm as early as possible.

detect()[source]

Return True once the average change in hub and authority values across all indexed pages falls below the convergence threshold, otherwise False.

Return the addresses of indexed pages that link into the given page.

Return the addresses of indexed pages that the given page links out to.

class nlp.Page(address, in_links=None, out_links=None, hub=0, authority=0)[source]

Bases: object

A web page used by the HITS algorithm, holding its address, inbound and outbound links, and its current hub and authority scores.

nlp.HITS(query)[source]

The HITS algorithm for computing hubs and authorities with respect to a query.

Natural Language Processing (Chapter 22)

nlp4e.Rules(**rules)[source]

Create a dictionary mapping symbols to alternative sequences. >>> Rules(A = “B C | D E”) {‘A’: [[‘B’, ‘C’], [‘D’, ‘E’]]}

nlp4e.Lexicon(**rules)[source]

Create a dictionary mapping symbols to alternative words. >>> Lexicon(Article = “the | a | an”) {‘Article’: [‘the’, ‘a’, ‘an’]}

class nlp4e.Grammar(name, rules, lexicon)[source]

Bases: object

A context-free grammar defined by a set of rewrite rules and a lexicon.

Rules map non-terminal categories to alternative right-hand sides, while the lexicon maps categories to the words that belong to them.

rewrites_for(cat)[source]

Return a sequence of possible rhs’s that cat can be rewritten as.

isa(word, cat)[source]

Return True iff word is of category cat

cnf_rules()[source]

Returns the tuple (X, Y, Z) for rules in the form: X -> Y Z

generate_random(S='S')[source]

Replace each token in S by a random entry in grammar (recursively).

nlp4e.ProbRules(**rules)[source]

Create a dictionary mapping symbols to alternative sequences, with probabilities. >>> ProbRules(A = “B C [0.3] | D E [0.7]”) {‘A’: [([‘B’, ‘C’], 0.3), ([‘D’, ‘E’], 0.7)]}

nlp4e.ProbLexicon(**rules)[source]

Create a dictionary mapping symbols to alternative words, with probabilities. >>> ProbLexicon(Article = “the [0.5] | a [0.25] | an [0.25]”) {‘Article’: [(‘the’, 0.5), (‘a’, 0.25), (‘an’, 0.25)]}

class nlp4e.ProbGrammar(name, rules, lexicon)[source]

Bases: object

A probabilistic context-free grammar.

Like Grammar, but every rule and lexicon entry carries a probability, so derivations can be sampled and scored by likelihood.

rewrites_for(cat)[source]

Return a sequence of possible rhs’s that cat can be rewritten as.

isa(word, cat)[source]

Return True iff word is of category cat

cnf_rules()[source]

Returns the tuple (X, Y, Z, p) for rules in the form: X -> Y Z [p]

generate_random(S='S')[source]

Replace each token in S by a random entry in grammar (recursively). Returns a tuple of (sentence, probability).

class nlp4e.Chart(grammar, trace=False)[source]

Bases: object

Class for parsing sentences using a chart data structure. >>> chart = Chart(E0) >>> len(chart.parses(‘the stench is in 2 2’)) 1

parses(words, S='S')[source]

Return a list of parses; words can be a list or string.

parse(words, S='S')[source]

Parse a list of words; according to the grammar. Leave results in the chart.

add_edge(edge)[source]

Add edge to chart, and see if it extends or predicts another edge.

scanner(j, word)[source]

For each edge expecting a word of this category here, extend the edge.

predictor(edge)[source]

Add to chart any rules for B that could help extend this edge.

extender(edge)[source]

See what edges can be extended by this edge.

class nlp4e.Tree(root, *args)[source]

Bases: object

A simple parse-tree node with a root label and a list of child leaves (which may themselves be subtrees), as built by CYK parsing.

nlp4e.CYK_parse(words, grammar)[source]

[Figure 22.6]

nlp4e.subspan(N)[source]

returns all tuple(i, j, k) covering a span (i, k) with i <= j < k

class nlp4e.TextParsingProblem(initial, grammar, goal='S')[source]

Bases: Problem

A search problem that parses a list of words bottom-up into a goal symbol.

States are partially-reduced word/category sequences; actions replace words with their lexical categories and replace spans of categories by the rules that produce them, so that a solution path corresponds to a valid parse.

actions(state)[source]

Return the successor states reachable from state: first replace each word by each of its lexical categories, and once only categories remain replace spans that match a grammar rule’s right-hand side by that rule.

result(state, action)[source]

Return the state resulting from applying action (the action is itself the already-computed successor state).

h(state)[source]

Heuristic estimating remaining cost as the number of symbols still left to reduce in state.

nlp4e.astar_search_parsing(words, gramma)[source]

bottom-up parsing using A* search to find whether a list of words is a sentence

nlp4e.beam_search_parsing(words, gramma, b=3)[source]

bottom-up text parsing using beam search

Statistical Language Processing tools (Chapter 22)

We define Unigram and Ngram text models, use them to generate random text, and show the Viterbi algorithm for segmentation of letters into words. Then we show a very simple Information Retrieval system, and an example working on a tiny sample of Unix manual pages.

class text.UnigramWordModel(observations, default=0)[source]

Bases: CountingProbDist

This is a discrete probability distribution over words, so you can add, sample, or get P[word], just like with CountingProbDist. You can also generate a random text, n words long, with P.samples(n).

samples(n)[source]

Return a string of n words, random according to the model.

class text.NgramWordModel(n, observation_sequence=None, default=0)[source]

Bases: CountingProbDist

This is a discrete probability distribution over n-tuples of words. You can add, sample or get P[(word1, …, wordn)]. The method P.samples(n) builds up an n-word sequence; P.add_cond_prob and P.add_sequence add data.

add_cond_prob(ngram)[source]

Build the conditional probabilities P(wn | (w1, …, wn-1)

add_sequence(words)[source]

Add each tuple words[i:i+n], using a sliding window.

samples(nwords)[source]

Generate an n-word sentence by picking random samples according to the model. At first pick a random n-gram and from then on keep picking a character according to P(c|wl-1, wl-2, …, wl-n+1) where wl-1 … wl-n+1 are the last n - 1 words in the generated sentence so far.

class text.NgramCharModel(n, observation_sequence=None, default=0)[source]

Bases: NgramWordModel

An n-gram model over characters rather than words, trained on the character n-grams of each word (prefixed with a space marker).

add_sequence(words)[source]

Add an empty space to every word to catch the beginning of words.

class text.UnigramCharModel(observation_sequence=None, default=0)[source]

Bases: NgramCharModel

A unigram (single-character) probability model: the frequency distribution of individual characters across the observed words.

add_sequence(words)[source]

Count every character of every word in words into the model.

text.viterbi_segment(text, P)[source]

Find the best segmentation of the string of characters, given the UnigramWordModel P.

class text.IRSystem(stopwords='the a of')[source]

Bases: object

A very simple Information Retrieval System, as discussed in Sect. 23.2. The constructor s = IRSystem(‘the a’) builds an empty system with two stopwords. Next, index several documents with s.index_document(text, url). Then ask queries with s.query(‘query words’, n) to retrieve the top n matching documents. Queries are literal words from the document, except that stopwords are ignored, and there is one special syntax: The query “learn: man cat”, for example, runs “man cat” and indexes it.

index_collection(filenames)[source]

Index a whole collection of files.

index_document(text, url)[source]

Index the text of a document.

query(query_text, n=10)[source]

Return a list of n (score, docid) pairs for the best matches. Also handle the special syntax for ‘learn: command’.

score(word, docid)[source]

Compute a score for this word on the document with this docid.

total_score(words, docid)[source]

Compute the sum of the scores of these words on the document with this docid.

present(results)[source]

Present the results as a list.

present_results(query_text, n=10)[source]

Get results for the query and present them.

class text.UnixConsultant[source]

Bases: IRSystem

A trivial IR system over a small collection of Unix man pages.

class text.Document(title, url, nwords)[source]

Bases: object

Metadata for a document: title and url; maybe add others later.

text.words(text, reg=re.compile('[a-z0-9]+'))[source]

Return a list of the words in text, ignoring punctuation and converting everything to lowercase (to canonicalize).

>>> words("``EGAD!'' Edgar cried.")
['egad', 'edgar', 'cried']
text.canonicalize(text)[source]

Return a canonical text: only lowercase letters and blanks.

>>> canonicalize("``EGAD!'' Edgar cried.")
'egad edgar cried'
text.shift_encode(plaintext, n)[source]

Encode text with a shift cipher that moves each letter up by n letters. >>> shift_encode(‘abc z’, 1) ‘bcd a’

text.rot13(plaintext)[source]

Encode text by rotating letters by 13 spaces in the alphabet. >>> rot13(‘hello’) ‘uryyb’ >>> rot13(rot13(‘hello’)) ‘hello’

text.translate(plaintext, function)[source]

Translate chars of a plaintext with the given function.

text.maketrans(from_, to_)[source]

Create a translation table and return the proper function.

text.encode(plaintext, code)[source]

Encode text using a code which is a permutation of the alphabet.

text.bigrams(text)[source]

Return a list of pairs in text (a sequence of letters or words). >>> bigrams(‘this’) [‘th’, ‘hi’, ‘is’] >>> bigrams([‘this’, ‘is’, ‘a’, ‘test’]) [[‘this’, ‘is’], [‘is’, ‘a’], [‘a’, ‘test’]]

class text.ShiftDecoder(training_text)[source]

Bases: object

There are only 26 possible encodings, so we can try all of them, and return the one with the highest probability, according to a bigram probability distribution.

score(plaintext)[source]

Return a score for text based on how common letters pairs are.

decode(ciphertext)[source]

Return the shift decoding of text with the best score.

text.all_shifts(text)[source]

Return a list of all 26 possible encodings of text by a shift cipher.

class text.PermutationDecoder(training_text, ciphertext=None)[source]

Bases: object

This is a much harder problem than the shift decoder. There are 26! permutations, so we can’t try them all. Instead we have to search. We want to search well, but there are many things to consider: Unigram probabilities (E is the most common letter); Bigram probabilities (TH is the most common bigram); word probabilities (I and A are the most common one-letter words, etc.); etc. We could represent a search state as a permutation of the 26 letters, and alter the solution through hill climbing. With an initial guess based on unigram probabilities, this would probably fare well. However, I chose instead to have an incremental representation. A state is represented as a letter-to-letter map; for example {‘z’: ‘e’} to represent that ‘z’ will be translated to ‘e’.

decode(ciphertext)[source]

Search for a decoding of the ciphertext.

score(code)[source]

Score is product of word scores, unigram scores, and bigram scores. This can get very small, so we use logs and exp.

class text.PermutationDecoderProblem(initial=None, goal=None, decoder=None)[source]

Bases: Problem

A search problem for breaking a permutation (substitution) cipher.

A state is a partial mapping from plaintext characters to cipher characters; the search incrementally assigns the most frequent unassigned character until every character in the decoder’s domain has been mapped.

actions(state)[source]

Yield (plain_char, cipher_char) assignments extending state: pick the highest-probability unassigned plaintext char and pair it with each unused cipher char.

result(state, action)[source]

Return a new state extending state with the (plain, cipher) mapping given by action.

goal_test(state)[source]

We’re done when all letters in search domain are assigned.