Natural Language
Natural Language Processing; Chart Parsing and PageRanking (Chapter 22-23)
- nlp.Rules(**rules)[source]
Create a dictionary mapping symbols to alternative sequences. >>> Rules(A = “B C | D E”) {‘A’: [[‘B’, ‘C’], [‘D’, ‘E’]]}
- nlp.Lexicon(**rules)[source]
Create a dictionary mapping symbols to alternative words. >>> Lexicon(Article = “the | a | an”) {‘Article’: [‘the’, ‘a’, ‘an’]}
- class nlp.Grammar(name, rules, lexicon)[source]
Bases:
objectA context-free grammar defined by a set of rewrite rules and a lexicon.
Rules map non-terminal categories to alternative right-hand sides, while the lexicon maps categories to the words that belong to them.
- nlp.ProbRules(**rules)[source]
Create a dictionary mapping symbols to alternative sequences, with probabilities. >>> ProbRules(A = “B C [0.3] | D E [0.7]”) {‘A’: [([‘B’, ‘C’], 0.3), ([‘D’, ‘E’], 0.7)]}
- nlp.ProbLexicon(**rules)[source]
Create a dictionary mapping symbols to alternative words, with probabilities. >>> ProbLexicon(Article = “the [0.5] | a [0.25] | an [0.25]”) {‘Article’: [(‘the’, 0.5), (‘a’, 0.25), (‘an’, 0.25)]}
- class nlp.ProbGrammar(name, rules, lexicon)[source]
Bases:
objectA probabilistic context-free grammar.
Like
Grammar, but every rule and lexicon entry carries a probability, so derivations can be sampled and scored by likelihood.
- class nlp.Chart(grammar, trace=False)[source]
Bases:
objectClass for parsing sentences using a chart data structure. >>> chart = Chart(E0) >>> len(chart.parses(‘the stench is in 2 2’)) 1
- nlp.load_page_html(address_list)[source]
Download HTML page content for every URL address passed as argument
- nlp.strip_raw_html(raw_html)[source]
Remove the <head> section of the HTML which contains links to stylesheets etc., and remove all other unnecessary HTML
- nlp.determine_inlinks(page)[source]
Given a set of pages that have their outlinks determined, we can fill out a page’s inlinks by looking through all other page’s outlinks
- nlp.find_outlinks(page, handle_urls=None)[source]
Search a page’s HTML content for URL links to other pages
- nlp.only_wikipedia_urls(urls)[source]
Some example HTML page data is from wikipedia. This function converts relative wikipedia links to full wikipedia URLs
- nlp.expand_pages(pages)[source]
Adds in every page that links to or is linked from one of the relevant pages.
- nlp.relevant_pages(query)[source]
Relevant pages are pages that contain all of the query words. They are obtained by intersecting the hit lists of the query words.
- nlp.normalize(pages)[source]
Normalize divides each page’s score by the sum of the squares of all pages’ scores (separately for both the authority and hub scores).
- class nlp.ConvergenceDetector[source]
Bases:
objectIf the hub and authority values of the pages are no longer changing, we have reached a convergence and further iterations will have no effect. This detects convergence so that we can stop the HITS algorithm as early as possible.
- nlp.get_in_links(page)[source]
Return the addresses of indexed pages that link into the given page.
- nlp.get_out_links(page)[source]
Return the addresses of indexed pages that the given page links out to.
- class nlp.Page(address, in_links=None, out_links=None, hub=0, authority=0)[source]
Bases:
objectA web page used by the HITS algorithm, holding its address, inbound and outbound links, and its current hub and authority scores.
- nlp.HITS(query)[source]
The HITS algorithm for computing hubs and authorities with respect to a query.
Natural Language Processing (Chapter 22)
- nlp4e.Rules(**rules)[source]
Create a dictionary mapping symbols to alternative sequences. >>> Rules(A = “B C | D E”) {‘A’: [[‘B’, ‘C’], [‘D’, ‘E’]]}
- nlp4e.Lexicon(**rules)[source]
Create a dictionary mapping symbols to alternative words. >>> Lexicon(Article = “the | a | an”) {‘Article’: [‘the’, ‘a’, ‘an’]}
- class nlp4e.Grammar(name, rules, lexicon)[source]
Bases:
objectA context-free grammar defined by a set of rewrite rules and a lexicon.
Rules map non-terminal categories to alternative right-hand sides, while the lexicon maps categories to the words that belong to them.
- nlp4e.ProbRules(**rules)[source]
Create a dictionary mapping symbols to alternative sequences, with probabilities. >>> ProbRules(A = “B C [0.3] | D E [0.7]”) {‘A’: [([‘B’, ‘C’], 0.3), ([‘D’, ‘E’], 0.7)]}
- nlp4e.ProbLexicon(**rules)[source]
Create a dictionary mapping symbols to alternative words, with probabilities. >>> ProbLexicon(Article = “the [0.5] | a [0.25] | an [0.25]”) {‘Article’: [(‘the’, 0.5), (‘a’, 0.25), (‘an’, 0.25)]}
- class nlp4e.ProbGrammar(name, rules, lexicon)[source]
Bases:
objectA probabilistic context-free grammar.
Like
Grammar, but every rule and lexicon entry carries a probability, so derivations can be sampled and scored by likelihood.
- class nlp4e.Chart(grammar, trace=False)[source]
Bases:
objectClass for parsing sentences using a chart data structure. >>> chart = Chart(E0) >>> len(chart.parses(‘the stench is in 2 2’)) 1
- class nlp4e.Tree(root, *args)[source]
Bases:
objectA simple parse-tree node with a root label and a list of child leaves (which may themselves be subtrees), as built by CYK parsing.
- class nlp4e.TextParsingProblem(initial, grammar, goal='S')[source]
Bases:
ProblemA search problem that parses a list of words bottom-up into a goal symbol.
States are partially-reduced word/category sequences; actions replace words with their lexical categories and replace spans of categories by the rules that produce them, so that a solution path corresponds to a valid parse.
- actions(state)[source]
Return the successor states reachable from
state: first replace each word by each of its lexical categories, and once only categories remain replace spans that match a grammar rule’s right-hand side by that rule.
- nlp4e.astar_search_parsing(words, gramma)[source]
bottom-up parsing using A* search to find whether a list of words is a sentence
Statistical Language Processing tools (Chapter 22)
We define Unigram and Ngram text models, use them to generate random text, and show the Viterbi algorithm for segmentation of letters into words. Then we show a very simple Information Retrieval system, and an example working on a tiny sample of Unix manual pages.
- class text.UnigramWordModel(observations, default=0)[source]
Bases:
CountingProbDistThis is a discrete probability distribution over words, so you can add, sample, or get P[word], just like with CountingProbDist. You can also generate a random text, n words long, with P.samples(n).
- class text.NgramWordModel(n, observation_sequence=None, default=0)[source]
Bases:
CountingProbDistThis is a discrete probability distribution over n-tuples of words. You can add, sample or get P[(word1, …, wordn)]. The method P.samples(n) builds up an n-word sequence; P.add_cond_prob and P.add_sequence add data.
- class text.NgramCharModel(n, observation_sequence=None, default=0)[source]
Bases:
NgramWordModelAn n-gram model over characters rather than words, trained on the character n-grams of each word (prefixed with a space marker).
- class text.UnigramCharModel(observation_sequence=None, default=0)[source]
Bases:
NgramCharModelA unigram (single-character) probability model: the frequency distribution of individual characters across the observed words.
- text.viterbi_segment(text, P)[source]
Find the best segmentation of the string of characters, given the UnigramWordModel P.
- class text.IRSystem(stopwords='the a of')[source]
Bases:
objectA very simple Information Retrieval System, as discussed in Sect. 23.2. The constructor s = IRSystem(‘the a’) builds an empty system with two stopwords. Next, index several documents with s.index_document(text, url). Then ask queries with s.query(‘query words’, n) to retrieve the top n matching documents. Queries are literal words from the document, except that stopwords are ignored, and there is one special syntax: The query “learn: man cat”, for example, runs “man cat” and indexes it.
- query(query_text, n=10)[source]
Return a list of n (score, docid) pairs for the best matches. Also handle the special syntax for ‘learn: command’.
- class text.UnixConsultant[source]
Bases:
IRSystemA trivial IR system over a small collection of Unix man pages.
- class text.Document(title, url, nwords)[source]
Bases:
objectMetadata for a document: title and url; maybe add others later.
- text.words(text, reg=re.compile('[a-z0-9]+'))[source]
Return a list of the words in text, ignoring punctuation and converting everything to lowercase (to canonicalize).
>>> words("``EGAD!'' Edgar cried.") ['egad', 'edgar', 'cried']
- text.canonicalize(text)[source]
Return a canonical text: only lowercase letters and blanks.
>>> canonicalize("``EGAD!'' Edgar cried.") 'egad edgar cried'
- text.shift_encode(plaintext, n)[source]
Encode text with a shift cipher that moves each letter up by n letters. >>> shift_encode(‘abc z’, 1) ‘bcd a’
- text.rot13(plaintext)[source]
Encode text by rotating letters by 13 spaces in the alphabet. >>> rot13(‘hello’) ‘uryyb’ >>> rot13(rot13(‘hello’)) ‘hello’
- text.translate(plaintext, function)[source]
Translate chars of a plaintext with the given function.
- text.encode(plaintext, code)[source]
Encode text using a code which is a permutation of the alphabet.
- text.bigrams(text)[source]
Return a list of pairs in text (a sequence of letters or words). >>> bigrams(‘this’) [‘th’, ‘hi’, ‘is’] >>> bigrams([‘this’, ‘is’, ‘a’, ‘test’]) [[‘this’, ‘is’], [‘is’, ‘a’], [‘a’, ‘test’]]
- class text.ShiftDecoder(training_text)[source]
Bases:
objectThere are only 26 possible encodings, so we can try all of them, and return the one with the highest probability, according to a bigram probability distribution.
- text.all_shifts(text)[source]
Return a list of all 26 possible encodings of text by a shift cipher.
- class text.PermutationDecoder(training_text, ciphertext=None)[source]
Bases:
objectThis is a much harder problem than the shift decoder. There are 26! permutations, so we can’t try them all. Instead we have to search. We want to search well, but there are many things to consider: Unigram probabilities (E is the most common letter); Bigram probabilities (TH is the most common bigram); word probabilities (I and A are the most common one-letter words, etc.); etc. We could represent a search state as a permutation of the 26 letters, and alter the solution through hill climbing. With an initial guess based on unigram probabilities, this would probably fare well. However, I chose instead to have an incremental representation. A state is represented as a letter-to-letter map; for example {‘z’: ‘e’} to represent that ‘z’ will be translated to ‘e’.
- class text.PermutationDecoderProblem(initial=None, goal=None, decoder=None)[source]
Bases:
ProblemA search problem for breaking a permutation (substitution) cipher.
A state is a partial mapping from plaintext characters to cipher characters; the search incrementally assigns the most frequent unassigned character until every character in the decoder’s domain has been mapped.
- actions(state)[source]
Yield (plain_char, cipher_char) assignments extending
state: pick the highest-probability unassigned plaintext char and pair it with each unused cipher char.