Chapter 2: Accessing Text Corpora and Lexical Resources

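  1. What are some common and useful text corpora and lexical resources, and how can we access them with Python?
  2. Which Python concepts are most helpful for this work?
  3. How do we avoid repeating ourselves (wasted effort) when writing Python code?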


2.1 Accessing Text Corpora

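As mentioned earlier, a text corpus is a large body of text, and many corpora are built from rich and varied sources. The US presidential inaugural addresses (text4) that we used in Chapter 1, for example, are simply one of the predefined texts loaded by the command "from nltk.book import *". From now on, however, we need to learn to work with other, external texts (including the ones for your own research). This section works through several kinds of corpora and texts so that we can see how to select and use them.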

Gutenberg Corpus


>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

From this list we pick one text, Jane Austen's Emma, give it the short name "emma", and check its length:

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma) 


>>> from nltk.corpus import gutenberg 
>>> gutenberg.fileids() 
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...] 
>>> emma = gutenberg.words('austen-emma.txt')


>>> for fileid in gutenberg.fileids(): 
...         num_chars = len(gutenberg.raw(fileid)) 
...         num_words = len(gutenberg.words(fileid)) 
...         num_sents = len(gutenberg.sents(fileid)) 
...         num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)])) 
...         print(int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid)
4 21 26 austen-emma.txt 
4 23 16 austen-persuasion.txt 
4 24 22 austen-sense.txt 
4 33 79 bible-kjv.txt 
4 18 5 blake-poems.txt 
4 17 14 bryant-stories.txt 
4 17 12 burgess-busterbrown.txt 
4 16 12 carroll-alice.txt 
4 17 11 chesterton-ball.txt 
4 19 11 chesterton-brown.txt 
4 16 10 chesterton-thursday.txt 
4 18 24 edgeworth-parents.txt 
4 24 15 melville-moby_dick.txt 
4 52 10 milton-paradise.txt 
4 12 8 shakespeare-caesar.txt 
4 13 7 shakespeare-hamlet.txt 
4 13 6 shakespeare-macbeth.txt 
4 35 12 whitman-leaves.txt

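This program reports three statistics for each text: average word length, average sentence length, and the average number of times each vocabulary item appears in the text (the lexical diversity score discussed in the previous chapter). Notice that the average word length is 4 for every text, which suggests that words in these texts are typically about 4 characters long. (In fact the true figure is 3 rather than 4, because the character count of each text also includes the spaces.) Average sentence length and lexical diversity, by contrast, do reflect the characteristics of individual authors.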
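The three ratios can be reproduced without NLTK on a tiny hand-made sample. This is only a sketch: the sample text, its tokenization, and the helper name corpus_stats are invented for illustration, and the inputs stand in for what raw() and sents() would return.

```python
def corpus_stats(raw, sents):
    """Return (avg word length, avg sentence length, avg uses per vocabulary item)."""
    words = [w for sent in sents for w in sent]
    num_chars = len(raw)          # counts spaces and punctuation too
    num_words = len(words)        # punctuation tokens count as words here
    num_sents = len(sents)
    num_vocab = len(set(w.lower() for w in words))
    return (num_chars / num_words, num_words / num_sents, num_words / num_vocab)


# Hypothetical sample; a real run would use gutenberg.raw() and gutenberg.sents().
raw = "The cat sat. The cat ran."
sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "."]]

avg_word_len, avg_sent_len, avg_uses = corpus_stats(raw, sents)
print(avg_word_len, avg_sent_len, avg_uses)
```

Because len(raw) includes the spaces between words, the first ratio overstates the real word length, which is the effect the paragraph above points out.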

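The example above also shows that we accessed the raw data of each text rather than its tokens. The raw() function gives us the contents of the file exactly as stored, without any linguistic processing. So "len(gutenberg.raw('blake-poems.txt'))" tells us how many characters the text contains, including all the spaces between words. The sents() function, by contrast, returns the text already divided into sentences, each of which is a list of words: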

>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
>>> macbeth_sentences 
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]
>>> macbeth_sentences[1037] 
['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';', 'Fire', 'burne', ',', 'and', 'Cauldron', 'bubble']
>>> longest_len = max([len(s) for s in macbeth_sentences]) 
>>> [s for s in macbeth_sentences if len(s) == longest_len] 
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', ...], ...]
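To make the difference between the character view, the word view, and the sentence view concrete, here is a rough standard-library approximation. It is not how NLTK tokenizes these corpora: the regular expressions and the sample string are assumptions chosen for illustration.

```python
import re

# Hypothetical sample string standing in for the output of raw().
raw = "Double, double toil and trouble; Fire burn. Cauldron bubble!"

# Word-level view: runs of word characters, or runs of punctuation.
TOKEN_RE = r"\w+|[^\w\s]+"
words = re.findall(TOKEN_RE, raw)

# Sentence-level view: naive split after ., ! or ? followed by whitespace,
# then tokenize each sentence into a list of words.
sents = [re.findall(TOKEN_RE, s) for s in re.split(r"(?<=[.!?])\s+", raw)]

print(len(raw), len(words), len(sents))
```

The character count is much larger than the token count because it includes every space and letter, while each sentence is just a sublist of the same tokens.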


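Although Project Gutenberg contains tens of thousands of e-books, its contents are well-structured, established literature. We also need to analyze less formal language. NLTK collects a small set of web text of this kind: discussion threads from a Firefox forum, conversations overheard on the streets of New York, the script of the film Pirates of the Caribbean, personal advertisements, and wine reviews: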

>>> from nltk.corpus import webtext 
>>> for fileid in webtext.fileids(): 
...      print(fileid, webtext.raw(fileid)[:65], '...')
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se... 
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop... 
overheard.txt White guy: So, do you have any plans for this evening? Asian girl... 
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr... 
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun... 
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb...

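There is also a corpus of instant messaging chat sessions collected by the Naval Postgraduate School. It contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "UserNNN" and manually edited to remove other identifying information. The corpus is organized into 15 files; each file holds several hundred posts collected on a given day, and its name records the chatroom, which is age-specific (teens, 20s, 30s, 40s, and so on). For example, "10-19-20s_706posts.xml" contains 706 posts gathered from the 20s chatroom on October 19, 2006.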

>>> from nltk.corpus import nps_chat 
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123] 
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
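The file naming scheme described above can be unpacked with a small regular expression. This is a sketch under the assumption that every file follows the layout MM-DD-room_Nposts.xml; the helper name parse_chat_fileid is invented here and is not part of NLTK.

```python
import re

# Assumed layout: MM-DD-<chatroom>_<N>posts.xml (the year is not in the name).
FILEID_RE = re.compile(r"(\d{2})-(\d{2})-([A-Za-z0-9]+)_(\d+)posts\.xml")


def parse_chat_fileid(fileid):
    """Split a chat fileid into month, day, chatroom and number of posts."""
    match = FILEID_RE.fullmatch(fileid)
    if match is None:
        raise ValueError("unexpected fileid layout: %r" % fileid)
    month, day, chatroom, posts = match.groups()
    return {"month": int(month), "day": int(day),
            "chatroom": chatroom, "posts": int(posts)}


info = parse_chat_fileid("10-19-20s_706posts.xml")
print(info)
```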

Brown Corpus


ID   File  Genre           Description
A16  ca16  news            Chicago Tribune: Society Reportage
B02  cb02  editorial       Christian Science Monitor: Editorials
C17  cc17  reviews         Time Magazine: Reviews
D12  cd12  religion        Underwood: Probing the Ethics of Realtors
E36  ce36  hobbies         Norling: Renting a Car in Europe
F25  cf25  lore            Boroff: Jewish Teenage Culture
G22  cg22  belles_lettres  Reiner: Coping with Runaway Technology
H15  ch15  government      US Office of Civil and Defence Mobilization: The Family Fallout Shelter


>>> from nltk.corpus import brown 
>>> brown.categories() 
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction'] 
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] 
>>> brown.words(fileids=['cg22']) 
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...] 
>>> brown.sents(categories=['news', 'editorial', 'reviews']) 
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

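The Brown Corpus is a convenient resource for anyone studying differences between genres. Let's compare how genres use a few modal verbs. First, select the text of one particular genre (remember to import nltk first):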

>>> from nltk.corpus import brown 
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print(m + ':', fdist[m], end=' ')
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
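The same counting idea extends to a comparison across genres. The sketch below uses collections.Counter as a stand-in for nltk.FreqDist, and the two word lists are invented toy data rather than Brown Corpus text; with NLTK installed, brown.words(categories=...) could supply the words instead.

```python
from collections import Counter

modals = ["can", "could", "may", "might", "must", "will"]

# Hypothetical toy data standing in for brown.words(categories=genre).
toy_genres = {
    "news": "The jury said it will act and may meet again".split(),
    "romance": "She could not stay but she could hope he might call".split(),
}


def modal_counts(words, modals):
    """Count how often each modal occurs, ignoring case."""
    freq = Counter(w.lower() for w in words)
    return {m: freq[m] for m in modals}


table = {genre: modal_counts(words, modals) for genre, words in toy_genres.items()}
for genre, counts in table.items():
    print(genre, counts)
```

A Counter returns 0 for words it has never seen, so modals missing from a genre show up as zero instead of raising an error.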



Reuters Corpus


>>> from nltk.corpus import reuters 
>>> reuters.fileids() 
['test/14826', 'test/14828', 'test/14829', 'test/14832', ...] 
>>> reuters.categories() 
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', ...]


>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880']) 
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', ...]
>>> reuters.fileids(['barley', 'corn']) 
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287', 'test/15341', 'test/15618', 'test/15618', 'test/15648', ...]


>>> reuters.words('training/9865')[:14] 
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865', 'training/9880']) 
>>> reuters.words(categories='barley')
>>> reuters.words(categories=['barley', 'corn']) 
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

Inaugural Address Corpus


>>> from nltk.corpus import inaugural 
>>> inaugural.fileids() 
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
>>> [fileid[:4] for fileid in inaugural.fileids()] 
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]


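Next, let's look at how the words "America" and "citizen" are used over time. The code below first converts each word to lowercase with w.lower(), then counts it if it begins with one of the target words, checked with startswith(), so forms such as American's and Citizens are counted too. Here is the conditional frequency distribution and the plot it produces: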

>>> cfd = nltk.ConditionalFreqDist( 
...     (target, fileid[:4]) 
...     for fileid in inaugural.fileids() 
...     for w in inaugural.words(fileid) 
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()
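What the conditional frequency distribution stores can be sketched with the standard library alone: one counter per target word, keyed by year. The fileids and word lists below are invented for illustration, and defaultdict(Counter) is only a stand-in for nltk.ConditionalFreqDist.

```python
from collections import Counter, defaultdict

# Hypothetical fileids and tokens standing in for inaugural.words(fileid).
toy_addresses = {
    "1789-Example.txt": "Fellow Citizens of America".split(),
    "1793-Example.txt": "American citizens and every citizen".split(),
}
targets = ["america", "citizen"]

# condition (target word) -> Counter mapping year -> number of matches
cfd = defaultdict(Counter)
for fileid, words in toy_addresses.items():
    year = fileid[:4]                      # the first four characters are the year
    for w in words:
        for target in targets:
            if w.lower().startswith(target):
                cfd[target][year] += 1

for target in targets:
    print(target, sorted(cfd[target].items()))
```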

Annotated Text Corpora

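Many more corpora are available. This part covers texts with linguistic annotations, such as part-of-speech (POS) tags, named entities, syntactic structures, and semantic roles. Accessing them with NLTK is easy, because the NLTK data package already includes these corpora, or samples of them, and they can be downloaded free of charge. The table below lists some of them. For details see http://www.nltk.org/data, and for examples see http://www.nltk.org/howto.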

Corpus                                 Compiler            Contents
Brown Corpus                           Francis, Kucera     15 genres, 1.15M words, tagged, categorized
CESS Treebanks                         CLiC-UB             1M words, tagged and parsed (Catalan, Spanish)
Chat-80 Data Files                     Pereira & Warren    World Geographic Database
CMU Pronouncing Dictionary             CMU                 127k entries
CoNLL 2000 Chunking Data               CoNLL               270k words, tagged and chunked
CoNLL 2002 Named Entity                CoNLL               700k words, pos- and named-entity-tagged (Dutch, Spanish)
CoNLL 2007 Dependency Treebanks (sel)  CoNLL               150k words, dependency parsed (Basque, Catalan)
Dependency Treebank                    Narad               Dependency parsed version of Penn Treebank sample
Floresta Treebank                      Diana Santos et al  9k sentences, tagged and parsed (Portuguese)
Gazetteer Lists                        Various             Lists of cities and countries
Genesis Corpus                         Misc web sources    6 texts, 200k words, 6 languages
Gutenberg (selections)                 Hart, Newby, et al  18 texts, 2M words



>>> nltk.corpus.cess_esp.words() 
['El', 'grupo', 'estatal', 'Electricit\xe9_de_France', ...] 
>>> nltk.corpus.floresta.words() 
['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...] 
>>> nltk.corpus.indian.words('hindi.pos')
['\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xa3', '\xe0\xa4\xaa\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\xac\xe0\xa4\x82\xe0\xa4\xa7', ...]
>>> nltk.corpus.udhr.fileids() 
['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1', 'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1', 'Aguaruna-Latin1', 'Akuapem_Twi-UTF8', 'Albanian_Shqip-Latin1', 'Amahuaca', 'Amahuaca-Latin1', ...] 
>>> nltk.corpus.udhr.words('Javanese-Latin1')[11:] 
[u'Saben', u'umat', u'manungsa', u'lair', u'kanthi', ...]


>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)





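Example                    Description
fileids()                  the files of the corpus
fileids([categories])      the files of the corpus belonging to these categories
categories()               the categories of the corpus
categories([fileids])      the categories of the corpus corresponding to these files
raw()                      the raw content of the corpus
raw(fileids=[f1,f2,f3])    the raw content of the specified files
raw(categories=[c1,c2])    the raw content of the specified categories
words()                    the words of the whole corpus
words(fileids=[f1,f2,f3])  the words of the specified files
words(categories=[c1,c2])  the words of the specified categories
sents()                    the sentences of the whole corpus
sents(fileids=[f1,f2,f3])  the sentences of the specified files
sents(categories=[c1,c2])  the sentences of the specified categories
abspath(fileid)            the location of the given file on disk
encoding(fileid)           the encoding of the file (if known)
open(fileid)               open a stream for reading the given corpus file
root()                     the path to the root of the locally installed corpus
readme()                   the contents of the README file of the corpus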


>>> raw = gutenberg.raw("burgess-busterbrown.txt")
>>> raw[1:20]
'The Adventures of B'
>>> words = gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20] 
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.', 'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster', 'Bear']
>>> sents = gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20] 
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as', 'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched', 'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', ...], ...]


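All right! After all that, this may be the only part you came for. If you want to access the text files on your own computer in the same way as above, try NLTK's PlaintextCorpusReader. Check where the files are located on your file system; in the example below they are under /usr/share/dict on Linux. Store that location in the variable corpus_root. The second argument is either a list of file names, such as ['a.txt', 'test/b.txt'], or a pattern that matches the file names you want, such as '[abc]/.*\.txt' (that is, a regular expression; Section 3.4 covers these in more detail):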

>>> from nltk.corpus import PlaintextCorpusReader 
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')  
>>> wordlists.fileids() 
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
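To see which files a pattern such as '[abc]/.*\.txt' would select, the matching step can be imitated with the standard library. This is only an approximation written for illustration, not NLTK's own code: the helper find_fileids is hypothetical, and it assumes the pattern must match the whole path relative to the corpus root, written with forward slashes.

```python
import os
import re
import tempfile


def find_fileids(root, pattern):
    """Return sorted relative paths under root that fully match pattern."""
    regex = re.compile(pattern)
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        parts = [] if rel_dir == "." else rel_dir.split(os.sep)
        for name in filenames:
            fileid = "/".join(parts + [name])   # always use forward slashes
            if regex.fullmatch(fileid):
                found.append(fileid)
    return sorted(found)


# Build a throwaway directory tree with hypothetical file names.
with tempfile.TemporaryDirectory() as root:
    for rel in ["a/x.txt", "b/y.txt", "d/z.txt", "a/notes.md"]:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("sample text\n")
    matched = find_fileids(root, r"[abc]/.*\.txt")

print(matched)
```

Only the .txt files inside directories a, b, or c are selected; d/z.txt fails the directory part of the pattern and a/notes.md fails the extension part.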

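Now suppose you have a copy of the Penn Treebank on your computer, stored under C:\corpora on Windows. You can use BracketParseCorpusReader to access it. This time we point corpus_root at the Wall Street Journal portion of the corpus and use a file pattern, as described above, to list the files we want: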

>>> from nltk.corpus import BracketParseCorpusReader 
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
>>> file_pattern = r".*/wsj_.*\.mrg"
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern) 
>>> ptb.fileids() 
['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]
>>> len(ptb.sents()) 
>>> ptb.sents(fileids='20/wsj_2013.mrg')[19] 
['The', '55-year-old', 'Mr.', 'Noriega', 'is', "n't", 'as', 'smooth', 'as', 'the', 'shah', 'of', 'Iran', ',', 'as', 'well-born', 'as', 'Nicaragua', "'s", 'Anastasio', 'Somoza', ',', 'as', 'imperial', 'as', 'Ferdinand', 'Marcos', 'of', 'the', 'Philippines', 'or', 'as', 'bloody', 'as', 'Haiti', "'s", 'Baby', 'Doc', 'Duvalier', '.']

