Monday, March 26, 2012

Word Tokenization

Word tokenization is an NLP task where one breaks a string of characters--letters, punctuation, and spaces--into words. In English and other languages written with spaces, this is pretty easy; the writer has essentially already done it for us (with exceptions like sentence-final periods, n't, and 's).
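To make that concrete, here's a minimal sketch of an English tokenizer in Python. It just pads punctuation with spaces, splits off the clitics n't and 's, and then splits on whitespace; the function name and regexes are my own illustration, not any particular library's API.

    import re

    def tokenize(text):
        # Split off the clitics n't and 's as their own tokens.
        text = re.sub(r"n't\b", " n't", text)
        text = re.sub(r"'s\b", " 's", text)
        # Pad punctuation with spaces so it becomes a separate token.
        text = re.sub(r"([.,!?;:])", r" \1 ", text)
        # Whitespace now marks every token boundary.
        return text.split()

    print(tokenize("She isn't at the dog's house."))
    # ['She', 'is', "n't", 'at', 'the', 'dog', "'s", 'house', '.']

A real tokenizer has to handle abbreviations, numbers, hyphens, and so on, but the basic idea is that the spaces do most of the work.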

However, other languages, like Chinese, require a lot more work. Chinese doesn't put spaces between words.

Often, this is written about as a purely NLP task. This isn't quite correct. Spaces proper weren't invented until the middle ages. ASI'LLDEMONSTRATEHERE,YOUREALLYDON'TNEEDSPACES.ITMIGHTBEALITTLEHARDERTOREAD,BUTTHEANCIENTROMANSBUILTANEMPIREWRITINGTHISWAY.WORDTOKENIZATIONINTHEAUDITORYANDVISUALSENSEISEASILYPARTOFOURLANGUAGEFACULTY.

Since we can do the task naturally, it's just as much a problem of linguistics as of natural language processing.
