#19 open
nott

Tokenizer+Tagger trip over Abbreviations

Reported by nott | September 29th, 2008 @ 11:48 AM | in 0.2

Sentences containing abbreviations, such as "Mr. Peterson likes doing things with Mrs. Peterson.", seem to fool both the tokenizer and the tagger. "Mr." should be tokenized as #Mr.#, not #Mr#.#. Even with correct tokenization, however, the tagger might still not resolve these tokens to the correct tags.
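To illustrate the expected tokenization, here is a minimal sketch in Python. It is not the project's tokenizer: the abbreviation list and the function name `tokenize` are assumptions made up for this example, and a real list would need to be much longer.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Prof."}


def tokenize(sentence):
    """Split on whitespace, then detach trailing punctuation from each
    chunk unless the whole chunk is a known abbreviation."""
    tokens = []
    for chunk in sentence.split():
        if chunk in ABBREVIATIONS:
            # Keep the period attached: "Mr." stays a single token.
            tokens.append(chunk)
            continue
        match = re.fullmatch(r"(.*?)([.,;:!?]+)", chunk)
        if match and match.group(1):
            tokens.append(match.group(1))
            # Each trailing punctuation character becomes its own token.
            tokens.extend(match.group(2))
        else:
            tokens.append(chunk)
    return tokens


print("#".join(tokenize("Mr. Peterson likes doing things with Mrs. Peterson.")))
```

A plain lookup like this only covers abbreviations that are listed in advance, and it does not help the tagger assign the right tag to the resulting token.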

The behavior of the sentence segmenter might be problematic too; we should check this at some point.
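The kind of check meant here could look like the following Python sketch. It is a naive splitter written for this ticket, not the project's segmenter and not LingPipe's; the abbreviation list and the name `split_sentences` are hypothetical.

```python
# Hypothetical abbreviation list, for illustration only.
KNOWN_ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Prof."}


def split_sentences(text):
    """End a sentence at a word ending in '.', '!' or '?', unless that
    word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word.endswith((".", "!", "?")) and word not in KNOWN_ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no closing punctuation.
        sentences.append(" ".join(current))
    return sentences


for sentence in split_sentences("Mr. Peterson left. Mrs. Peterson stayed."):
    print(sentence)
```

Without the abbreviation check, "Mr." and "Mrs." would each end a sentence, which is the failure we should test the segmenter for.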

Comments and changes to this ticket

  • adimit

    adimit September 29th, 2008 @ 02:05 PM

    • Assigned user set to “adimit”
    • State changed from “new” to “open”
    • Milestone set to 0.2
    • Tag changed from pos-tagging, lingpipe, preprocessing to pos-tagging, sentence boundary detection, lingpipe, preprocessing

    ok, thanks, that's been on the TODO list for a while. I'm open to suggestions :-)

    Integration of the LingPipe sentence boundary detector is a separate ticket, but I don't know if we'll be using LingPipe at all.


A platform for aiding second language learners through texts acquired from the Internet.
