slugify destroys data

Post by ashnur on Wed Jul 23, 2008 11:14 pm

Hi,

I just want to note that if someone has a native language like mine (Hungarian), which uses a lot of accented letters, then slugify returns something that is anything but "nice": lots of dashes, not much to understand.

E.g. I had a post with the title: "A Reformáció Genfi Emlékműve Előtt",
slugify default returned: "a-reform-ci-genfi-eml-km-ve-el-tt"
remaccents + slugify: "a-reformacio-genfi-emlekmuve-elott"

I added a remove-accents function to my code, which resolves the problem, and I thought maybe you could use it too :)

It's fairly simple, and I have had no problems with it yet:


Code:
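/**
 * Remove accents from a given string.
 *
 * @param string $string text to convert
 * @param string $chC    input character encoding
 * @return string the text with accented letters replaced by plain letters
 */
function remAccents($string, $chC = "UTF-8") {
   // The table below holds single-byte codes, so convert to ISO-8859-2 first.
   $converted = iconv($chC, "ISO-8859-2", $string);
   if ($converted === false) {
      // iconv() returns false when a character has no ISO-8859-2 equivalent;
      // keep the input unchanged in that case.
      return $string;
   }
   // Both strtr() arguments must have the same length (62 bytes each).
   $converted = strtr($converted,
      "\xe1\xc1\xe0\xc0\xe2\xc2\xe4\xc4\xe3\xc3\xe5\xc5".
      "\xaa\xe7\xc7\xe9\xc9\xe8\xc8\xea\xca\xeb\xcb\xed".
      "\xcd\xec\xcc\xee\xce\xef\xcf\xf1\xd1\xf3\xd3\xf2".
      "\xd2\xf4\xd4\xf6\xd6\xf5\xd5\xf8\xd8\xba\xf0\xfa\xda".
      "\xf9\xd9\xfb\xdb\xfc\xdc\xfd\xdd\xff\xe6\xc6\xdf\xf8",
      "aAaAaAaAaAaAacCeEeEeEeEiIiIiIiInNo".
      "OoOoOoOoOoOoouUuUuUuUyYyaAso");
   // Convert the result back to the caller's encoding.
   return iconv("ISO-8859-2", $chC, $converted);
}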


The only problem with it is that it converts everything to ISO-8859-2, which works for Hungarian, but I do not know about other languages.
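One way to avoid the ISO-8859-2 round trip, sketched here on the assumption that the input is already UTF-8: the array form of strtr() replaces whole multi-byte sequences, so no iconv() call is needed. The name remAccentsUtf8 and the map are only illustrative; the map covers just the Hungarian accented vowels and would need extending for other languages.

```php
<?php
// Hypothetical helper, not part of LifeType: strips Hungarian accents from UTF-8 text.
function remAccentsUtf8($string) {
    // The array form of strtr() matches whole substrings, so multi-byte
    // UTF-8 characters are replaced without converting the charset.
    static $map = array(
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ö' => 'o',
        'ő' => 'o', 'ú' => 'u', 'ü' => 'u', 'ű' => 'u',
        'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ö' => 'O',
        'Ő' => 'O', 'Ú' => 'U', 'Ü' => 'U', 'Ű' => 'U',
    );
    return strtr($string, $map);
}

echo remAccentsUtf8("A Reformáció Genfi Emlékműve Előtt"), "\n";
// prints "A Reformacio Genfi Emlekmuve Elott"
```

Characters that are not in the map pass through untouched, so this never fails the way a charset conversion can.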
ashnur
 
Posts: 3
Joined: Wed Jul 23, 2008 10:34 pm
LifeType Version: lifetype-1.2_r6501

Re: slugify destroys data

Post by jondaley on Thu Jul 24, 2008 7:32 am

I think the real answer is to switch to UTF-8; otherwise, I think each language's fix conflicts with the others, though people who only want to use Hungarian could certainly use your code.
jondaley
Lifetype Expert
 
Posts: 6169
Joined: Thu May 20, 2004 6:19 pm
Location: Pittsburgh, PA, USA
LifeType Version: 1.2.11 devel branch

Re: slugify destroys data

Post by ashnur on Thu Jul 24, 2008 8:05 am

I'm not sure what you mean. I am using UTF-8 everywhere (except inside this function). All my post data and layout are in UTF-8.

As far as I understand, the slugify function replaces everything that does not match the '{postname}' => '([_0-9a-zA-Z.-]+)?' pattern with $separator. This does not depend on the charset I'm using: whatever charset I use, this function will still replace accented letters with the $separator. So I decided to replace the accented letters with their non-accented variants first and then let everything go on as before. Please let me know if I am wrong about anything.
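To illustrate, here is a minimal sketch of that behaviour. sketchSlugify is a made-up stand-in that assumes slugify replaces every run of characters outside the class with the separator; LifeType's real implementation may differ in its details.

```php
<?php
// Hypothetical stand-in for slugify; LifeType's real implementation may differ.
function sketchSlugify($title, $separator = "-") {
    // Without the /u modifier PCRE works byte by byte, so both bytes of a
    // UTF-8 "á" fall outside the ASCII class and the run becomes one separator.
    $slug = preg_replace('/[^_0-9a-zA-Z.-]+/', $separator, $title);
    // Only ASCII letters are left at this point, so strtolower() is safe.
    return strtolower(trim($slug, $separator));
}

echo sketchSlugify("A Reformáció Genfi Emlékműve Előtt"), "\n";
// prints "a-reform-ci-genfi-eml-km-ve-el-tt"
echo sketchSlugify("A Reformacio Genfi Emlekmuve Elott"), "\n";
// prints "a-reformacio-genfi-emlekmuve-elott"
```

The second call stands for a title whose accents were already removed, which is why stripping them before slugify gives the readable result.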
ashnur
 
Posts: 3
Joined: Wed Jul 23, 2008 10:34 pm
LifeType Version: lifetype-1.2_r6501

Re: slugify destroys data

Post by jondaley on Thu Jul 24, 2008 10:35 pm

What I meant was UTF-8 everywhere in LT's code. Certain bits of the code make various assumptions about plain ASCII. I believe 2.0 has been converted over, and since it requires PHP 5, it was an easier task than in 1.2.x, which runs on PHP 4 or PHP 5.
jondaley
Lifetype Expert
 
Posts: 6169
Joined: Thu May 20, 2004 6:19 pm
Location: Pittsburgh, PA, USA
LifeType Version: 1.2.11 devel branch

Re: slugify destroys data

Post by ashnur on Thu Jul 24, 2008 11:41 pm

Sorry, maybe I'm too slow :?

I think the regexp pattern [_0-9a-zA-Z.-]+ will cause the accented letters to be replaced even if you have UTF-8 everywhere. I'm almost sure about this :mrgreen:
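A quick sketch of why: even with the /u modifier, which makes PCRE read the text as UTF-8 characters, an ASCII-only class still rejects accented letters. Only a Unicode-aware class such as \p{L} keeps them. replaceOutside is a made-up helper for the comparison, not anything from LifeType.

```php
<?php
// Hypothetical helper: replace every run of characters NOT in $class.
function replaceOutside($class, $title, $separator = "-") {
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_replace('/[^' . $class . ']+/u', $separator, $title);
}

$title = "Emlékműve Előtt";
echo replaceOutside('_0-9a-zA-Z.-', $title), "\n"; // prints "Eml-km-ve-El-tt"
echo replaceOutside('_0-9\p{L}.-', $title), "\n";  // prints "Emlékműve-Előtt"
```

Keeping the accented letters, as the second call does, only makes sense if the rest of the URL handling accepts non-ASCII slugs, which I have not checked.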

But never mind, it works for me - and I do not want to start an endless thread.
ashnur
 
Posts: 3
Joined: Wed Jul 23, 2008 10:34 pm
LifeType Version: lifetype-1.2_r6501

Re: slugify destroys data

Post by jondaley on Fri Jul 25, 2008 10:51 am

Yes, I agree. This bit of code (among others) only works for "plainer" languages. Different people have contributed language-specific fixes, and some of them are incompatible with others, so we haven't been able to incorporate all of the changes. Even in English, I'd like to see it not replace ' with - since that looks strange in the case of "this-isn-t-a-funny-blog-post".
I just took a look at the current 2.0 code, and it still has that same character replacement in it. I believe that can be changed now that we are on PHP 5 exclusively, and replaced with a PHP function that knows what language you are in and correctly replaces the proper characters in all languages.
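A rough sketch of what that could look like, not actual LifeType code. sketchSlug is a made-up name, and the transliteration step assumes the optional intl extension is installed. It is skipped otherwise, and it is script-aware rather than truly language-aware, so it is not necessarily the kind of function meant above.

```php
<?php
// Hypothetical sketch, not LifeType code: a slug without the "isn-t" artifact.
function sketchSlug($title, $separator = "-") {
    // Optional step: the intl extension can transliterate to ASCII when installed.
    if (function_exists('transliterator_transliterate')) {
        $ascii = transliterator_transliterate('Any-Latin; Latin-ASCII', $title);
        if ($ascii !== false) {
            $title = $ascii;
        }
    }
    // Drop apostrophes instead of turning them into separators.
    $title = str_replace("'", "", $title);
    // Replace every remaining run of other characters with the separator.
    $slug = preg_replace('/[^_0-9a-zA-Z.-]+/', $separator, $title);
    return strtolower(trim($slug, $separator));
}

echo sketchSlug("This isn't a funny blog post"), "\n";
// prints "this-isnt-a-funny-blog-post"
```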
jondaley
Lifetype Expert
 
Posts: 6169
Joined: Thu May 20, 2004 6:19 pm
Location: Pittsburgh, PA, USA
LifeType Version: 1.2.11 devel branch

