python - Use of PunktSentenceTokenizer in NLTK -


I am learning natural language processing using NLTK. I came across some code using PunktSentenceTokenizer whose actual use I cannot understand. The code is given below:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)  # A
tokenized = custom_sent_tokenizer.tokenize(sample_text)     # B

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()

So, why use PunktSentenceTokenizer? And what is going on in the lines marked A and B? I mean, there is a training text and another sample text, but why do we need two data sets for part-of-speech tagging?

It is the lines marked A and B that I am not able to understand.

PS: I did try looking in the NLTK book but could not understand the real use of PunktSentenceTokenizer.

PunktSentenceTokenizer is the class behind the default sentence tokenizer, i.e. sent_tokenize(), provided in NLTK. It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79
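To make the two lines from the question concrete: line A trains a new Punkt model on raw, unannotated text (the algorithm is unsupervised, so it just learns statistics such as likely abbreviations from the training text), and line B applies that model to split new text into sentences. A minimal sketch, using made-up training and test strings instead of the State of the Union corpus:

```python
from nltk.tokenize import PunktSentenceTokenizer

# A: build a tokenizer by learning sentence-boundary statistics
# (abbreviations, collocations, sentence starters) from raw text.
# Real training should use far more text than this toy example.
train_text = ("The U.S. economy grew last year. Mr. Brown said so. "
              "Growth was approx. three percent. That is good news.")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

# B: use the learned model to split unseen text into sentences.
sentences = custom_sent_tokenizer.tokenize("Mr. Brown spoke briefly. Then he left.")
print(sentences)
```

The POS tagging afterwards does not need two data sets; the second text is only there so the tokenizer trained on one document can be demonstrated on another.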

Given a paragraph with multiple sentences, e.g.:

>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '

You can use sent_tokenize():

>>> from nltk import sent_tokenize
>>> sent_tokenize(train_text[11])
[u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', u'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print sent
...     print '--------'
...
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world.
--------

The sent_tokenize() function uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the list of languages with pre-trained models available in NLTK is:

alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle    py3                turkish.pickle
estonian.pickle  italian.pickle  README

Given a text in another language, do this:

>>> german_text = u"Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "
>>> for sent in sent_tokenize(german_text, language='german'):
...     print sent
...     print '---------'
...
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten.
---------

To train your own Punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and the question "training data format for nltk punkt".
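For a rough idea of what such training looks like, PunktTrainer can be used directly instead of passing raw text to the PunktSentenceTokenizer constructor. A minimal sketch (the corpus string here is made up; a real model needs a large amount of text in the target language):

```python
from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

# Toy training corpus, for illustration only.
corpus = ("The meeting is at 10 a.m. tomorrow. Dr. Lee will attend. "
          "We expect approx. fifty people. The agenda is short.")

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True   # learn collocations more aggressively
trainer.train(corpus, finalize=False)  # can be called repeatedly on more text
trainer.finalize_training()

# Build a tokenizer from the learned parameters.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
result = tokenizer.tokenize("Dr. Lee arrived early. The talk began on time.")
print(result)
```

Calling train() with finalize=False lets you feed the trainer several documents before finalizing, which is the usual pattern when training on a whole corpus.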

