python - Use of PunktSentenceTokenizer in NLTK
I am learning natural language processing using NLTK. I came across some code using PunktSentenceTokenizer whose actual use I cannot understand. The code is given below:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)  # A
tokenized = custom_sent_tokenizer.tokenize(sample_text)     # B

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
So, why do we use PunktSentenceTokenizer, and what is going on in the lines marked A and B? I mean, there is a training text and a sample text, but why do we need two data sets for part-of-speech tagging? It is the lines marked A and B that I am not able to understand.

PS: I did try looking in the NLTK book, but I could not understand the real use of PunktSentenceTokenizer.
PunktSentenceTokenizer is the abstract class for the default sentence tokenizer, i.e. sent_tokenize(), provided in NLTK. It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79
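To make the question's lines A and B concrete, here is a minimal sketch using short inline strings in place of the State of the Union files (the training and sample texts below are made up for illustration). The constructor call (line A) runs Punkt's unsupervised training on whatever text you hand it, and tokenize() (line B) applies the learned model to a different text; neither line has anything to do with POS tagging itself.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Stand-in training text; the question uses state_union.raw("2005-GWBush.txt")
train_text = ("The economy grew last quarter. "
              "Congress passed the budget. "
              "The nation moves forward.")

# Line A: learn abbreviations, collocations and sentence starters from train_text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

# Line B: apply the learned model to split a different text into sentences
sample_text = "We gather tonight in a time of change. Our course is set."
sentences = custom_sent_tokenizer.tokenize(sample_text)
print(sentences)
```

So the second data set is not needed "for POS tagging"; it is simply the text being split into sentences by a model that was trained on the first.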
Given a paragraph with multiple sentences, e.g.:

>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '
you can use sent_tokenize():

>>> from nltk import sent_tokenize
>>> sent_tokenize(train_text[11])
[u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', u'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print sent
...     print '--------'
...
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world.
--------
sent_tokenize() uses a pre-trained model, nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the list of available languages with pre-trained models in NLTK is:
alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle    PY3                turkish.pickle
estonian.pickle  italian.pickle  README
Given a text in another language, do this:

>>> german_text = u"Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "
>>> for sent in sent_tokenize(german_text, language='german'):
...     print sent
...     print '---------'
...
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten.
---------
To train your own punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and the question "training data format for nltk punkt".
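As a rough sketch of that training API (assuming the PunktTrainer and PunktSentenceTokenizer interfaces in nltk.tokenize.punkt; the toy corpus below is made up and far too small for real training), you can feed raw text to a trainer and then build a tokenizer from the learned parameters:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Toy corpus for illustration; real training wants a large body of raw text
corpus = ("Mr. Smith spoke first. Mr. Jones replied. "
          "Mr. Brown closed the meeting. Mr. Gray took notes.")

trainer = PunktTrainer()
trainer.train(corpus)              # unsupervised pass over the raw text
params = trainer.get_params()
print(sorted(params.abbrev_types)) # abbreviations Punkt decided on, if any

# Build a tokenizer directly from the learned parameters
tokenizer = PunktSentenceTokenizer(params)
print(tokenizer.tokenize("The vote passed. The session ended."))
```

This is the same thing line A of the question does in one step: PunktSentenceTokenizer(train_text) trains on the string and keeps the resulting parameters internally.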