Convert HTML entities within RelevantTexts
Reported by adimit | July 23rd, 2008 @ 04:09 AM | in 0.2
Currently, HTML entities such as &auml; are simply left in the text. Taggers (and tokenizers) might choke on them, so we should convert them to Unicode and only then pass the text to the tokenization process.
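A minimal sketch of the proposed conversion step, written in Python purely for illustration. The ticket does not state the project's implementation language or its tokenizer API, so `decode_entities` and `prepare_for_tokenizer` are hypothetical names, and representing the RelevantTexts as a list of strings is an assumption. The only library call used is `html.unescape` from the Python standard library.

```python
import html


def decode_entities(text: str) -> str:
    """Replace HTML character references (named or numeric) with Unicode."""
    return html.unescape(text)


def prepare_for_tokenizer(relevant_texts):
    """Decode entities in every text before it reaches the tokenizer.

    `relevant_texts` is assumed to be an iterable of plain strings;
    the real RelevantText type is not described in this ticket.
    """
    return [decode_entities(t) for t in relevant_texts]


if __name__ == "__main__":
    samples = ["M&auml;dchen", "caf&eacute; &amp; bar", "&#228;"]
    for decoded in prepare_for_tokenizer(samples):
        print(decoded)
```

Decoding before tokenization means taggers never see the raw `&...;` sequences, which is the ordering the ticket asks for.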
Comments and changes to this ticket

adimit October 10th, 2008 @ 11:35 PM
- State changed from new to open
- Milestone set to 0.2

adimit October 10th, 2008 @ 11:48 PM
It seems this also screws with the enhancers, as discussed in #24. Makes this just a bit more urgent.
A platform for aiding second language learners through texts acquired from the Internet.
Referenced by
- #24 HTML entities damaged by TokenEnhancer: "#8 is the cause of this. It's a known problem and happens..."