Opened 13 years ago

Closed 12 years ago

#120 closed defect (fixed)

SpellChecker plugin truncates words with French accents

Reported by: mokhet
Owned by: niko
Priority: normal
Milestone: Version 1.0
Component: Plugin_SpellChecker
Version:
Severity: normal
Keywords: spellchecker i18n ut8
Cc:

Description

When the checked word contains accented characters (éèàöïë, etc.), the word is truncated incorrectly.

Misspelt word: féderation

Correctly spelt word: fédération

Word analysed by the spellchecker: deration

I think the problem is around this code in spell-check-logic.cgi, but my Perl knowledge is too poor to be sure. Probably IsWord (which seems to be a constant, but I don't really know) is not correctly localised:

    while ($node->getNodeValue =~ /([\p{IsWord}']+)/) {
        my $word = $1;

Or perhaps the issue lies in the aspell version:

# aspell -v
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.2)

I wonder whether French is the only language with this issue; it looks like a problem that would affect every language with its own specific characters.
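To make the symptom concrete, here is a minimal sketch in Python (not the plugin's Perl code) of how a word pattern that only knows ASCII letters splits an accented word, while a Unicode-aware pattern keeps it whole. The names `ASCII_WORD`, `UNICODE_WORD` and `split_words` are hypothetical and exist only for this illustration; whether spell-check-logic.cgi fails for this exact reason is an assumption.

```python
import re

# Hypothetical patterns for illustration only; not taken from the plugin.
ASCII_WORD = re.compile(r"[A-Za-z']+")   # knows only unaccented letters
UNICODE_WORD = re.compile(r"[\w']+")     # \w is Unicode-aware for str patterns


def split_words(text, pattern):
    """Return every run of characters that the pattern treats as a word."""
    return pattern.findall(text)


word = "f\u00e9deration"  # "féderation", written with an escape for clarity

print(split_words(word, ASCII_WORD))    # ['f', 'deration']
print(split_words(word, UNICODE_WORD))  # ['féderation']
```

The ASCII-only pattern produces the same "deration" fragment as the one reported above, which is why the character class used to extract words is a plausible place to look.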

Change History (7)

comment:1 Changed 13 years ago by niko

I tested the spell-checker with féderation and it worked perfectly (although I don't have a French dictionary installed...).

However, I do not use the Perl version of the spell-checker; I have the PHP version.

So it is surely a problem (as you suggested) in spell-check-logic.cgi or in aspell.

aspell -v
@(#) International Ispell Version 3.1.20 (but really Aspell 0.50.5)

comment:2 Changed 13 years ago by gogo

  • Owner changed from gogo to niko

Use the PHP version of the spellchecker; the Perl version is unmaintained. You must also use a recent version of aspell, as earlier versions did not behave well with international character sets. That said, you will probably not have much luck spellchecking non-English words or documents, particularly if they are not in utf-8, due to limitations in aspell (even if they are utf-8).

Also see http://xinha.python-hosting.com/wiki/SpellChecker

I'm assigning this to niko as the resident internationalization expert. niko, feel free to close this as "worksforme" if it works OK for you.

comment:3 Changed 13 years ago by mokhet

Does that mean the cgi file should not be used at all anymore?

When the plugin is called, the file loaded is spell-check-ui.html (isn't it?). It contains the little form for changing the dictionary, and that form still references the cgi file:

  <body onload="initDocument()">

    <form style="display: none;" action="spell-check-logic.cgi"
          method="post" target="framecontent"

Is that part of the code not used at all anymore?

To come back to the initial ticket, my aspell is not an early version: I have 0.60.2 :)
The word I am looking for (fédération) is known to my fr dictionary.

$ echo féderation | aspell -a --lang=fr
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.2)
& féderation 15 0: fédération, fédérations, fédérassions, génération, fédératif, fédératifs, fédérative, modération, rudération, vénération, dérations, fédérions, fédéralisons, réitération, itération

$ echo féderation | aspell -a --lang=en
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.2)
& féderation 4 0: federation, federations, federating, federation's
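For readers unfamiliar with the pipe-mode output above, a line starting with `&` has the form `& word count offset: suggestion, suggestion, ...`. The following Python sketch parses such a line; `Miss` and `parse_miss_line` are hypothetical names for this illustration and are not part of the plugin or of aspell.

```python
from typing import NamedTuple, Optional


class Miss(NamedTuple):
    word: str          # the word as aspell received it
    count: int         # number of suggestions reported
    offset: int        # position of the word in the input line
    suggestions: list  # suggested replacements, in aspell's order


def parse_miss_line(line: str) -> Optional[Miss]:
    """Parse one '& word count offset: s1, s2' line; return None otherwise."""
    if not line.startswith("& "):
        return None
    head, _, tail = line[2:].partition(": ")
    word, count, offset = head.split(" ")
    return Miss(word, int(count), int(offset), tail.split(", "))


line = "& féderation 4 0: federation, federations, federating, federation's"
miss = parse_miss_line(line)
print(miss.word, miss.count, miss.offset)
print(miss.suggestions)
```

In both runs above aspell echoes the whole word féderation at offset 0, so on this machine the command-line tool itself receives the accented word intact.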

I know I am using utf-8 encoding exclusively in all forms and editors, but I think gogo has a point there and I am going to investigate further in this direction.

Anyway, on another topic, I still have a lot of different issues with the spellchecker (translation, logic, layout, etc.). I think I should create one ticket for each, as that sounds like the best way to track down those nasty bugs correctly. Is that right?

comment:4 Changed 13 years ago by gogo

  • Component changed from Xinha Core to Plugin_SpellChecker
  • Milestone set to Version 1.0

comment:5 Changed 12 years ago by gogo

  • Priority changed from normal to low

comment:6 Changed 12 years ago by mokhet

  • Keywords ut8 added
  • Priority changed from low to normal

The word "fédération" is sent as "f&#233;d&#233;ration" in the temp file because of this part of the code in spell-check-logic.php:

  $text = preg_replace('/([\xC0-\xDF][\x80-\xBF])/e', "'&#' . utf8_ord('\$1') . ';'", $text);
  $text = preg_replace('/([\xE0-\xEF][\x80-\xBF][\x80-\xBF])/e',             "'&#' . utf8_ord('\$1') . ';'",  $text);
  $text = preg_replace('/([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])/e', "'&#' . utf8_ord('\$1') . ';'",   $text);

The comment says: "Convert UTF-8 multi-bytes into decimal character entities. This is because aspell isn't fully utf8-aware."

However, my local aspell seems to be fully utf8-aware, and these 3 lines, instead of helping, break the spellchecking completely. So perhaps a test is needed to determine whether the transformation is required, or perhaps a configuration variable.
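As a rough illustration of the configuration-variable idea, here is a Python sketch (not the PHP code from spell-check-logic.php) of the same transformation guarded by a switch. The function `prepare_for_aspell` and its `convert_to_entities` parameter are hypothetical names chosen for this example only.

```python
import re

# Any character outside the 7-bit ASCII range, i.e. anything that UTF-8
# encodes with more than one byte.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def prepare_for_aspell(text, convert_to_entities=True):
    """Optionally replace each non-ASCII character with a decimal entity.

    convert_to_entities is a hypothetical switch: True mimics the current
    behaviour, False passes the text through untouched.
    """
    if not convert_to_entities:
        return text
    return NON_ASCII.sub(lambda m: "&#%d;" % ord(m.group()), text)


word = "f\u00e9d\u00e9ration"  # "fédération"

print(prepare_for_aspell(word))                             # f&#233;d&#233;ration
print(prepare_for_aspell(word, convert_to_entities=False))  # fédération
```

With the switch off, a utf8-aware aspell would receive the word unchanged; with it on, the word reaches aspell broken up by entities.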

comment:7 Changed 12 years ago by gogo

  • Resolution set to fixed
  • Status changed from new to closed

I have added a config value that allows these replacements to be disabled; changeset:498 has all the gory details.
