Tokenizer+Tagger trip over Abbreviations
Reported by nott | September 29th, 2008 @ 11:48 AM | in 0.2
Sentences including abbreviations, such as "Mr. Peterson likes doing things with Mrs. Peterson.", seem to fool the tokenizer and tagger. "Mr." should be tokenized as #Mr.#, not #Mr#.#. However, even with correct tokenization the tagger might still not be able to resolve these to the correct tags.
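To illustrate the expected tokenization, here is a minimal sketch in Python. It is not the project's tokenizer; the `ABBREVIATIONS` set and the `tokenize` function are hypothetical names made up for this example, and the list is far too short for real use. It only shows the idea of keeping a known abbreviation together with its period.

```python
import re

# Hypothetical abbreviation list (an assumption for this sketch);
# a real list would be much longer and probably language-specific.
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}

def tokenize(text):
    """Split text into tokens, keeping known abbreviations in one piece."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            # Keep "Mr." as a single token instead of "Mr" + "."
            tokens.append(chunk)
        else:
            # Separate runs of word characters from single punctuation marks
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print("#".join(tokenize("Mr. Peterson likes doing things with Mrs. Peterson.")))
```

With this sketch the sentence-final period is still split off as its own token, while "Mr." and "Mrs." stay intact. Whether the tagger then assigns the right tag to them is a separate question.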
The behavior of the sentence segmentizer might be problematic too; we should check this at some point.
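As a rough idea of what to check in the segmentizer, here is a small sketch in Python of a sentence splitter that does not break after a known abbreviation. The names `ABBREVIATIONS` and `split_sentences` are hypothetical and not part of the project; this is an assumption about one possible approach, not a description of the current code.

```python
# Hypothetical abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}

def split_sentences(text):
    """Split on ., ! or ? at the end of a word, unless the word is an abbreviation."""
    sentences = []
    current = []
    for word in text.split():
        current.append(word)
        # A word ending in sentence punctuation closes the sentence,
        # except when it is a known abbreviation such as "Mr."
        if word[-1] in ".!?" and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no closing punctuation
        sentences.append(" ".join(current))
    return sentences

for sentence in split_sentences("Mr. Peterson likes doing things with Mrs. Peterson. He is happy."):
    print(sentence)
```

A known weakness of this approach is that a sentence that really ends in an abbreviation will not be split, so a plain lookup list is probably only a starting point.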
Comments and changes to this ticket
adimit September 29th, 2008 @ 02:05 PM
- Assigned user set to adimit
- State changed from new to open
- Milestone set to 0.2
- Tag changed from pos-tagging, lingpipe, preprocessing to pos-tagging, sentence boundary detection, lingpipe, preprocessing
ok, thanks, that's been on the TODO list for a while. I'm open to suggestions :-)
Integration of the LingPipe sentence boundary detector is a separate ticket, but I don't know if we'll be using LingPipe at all.
A platform for aiding second language learners through texts acquired from the Internet.