The GEDCOM file written by Reunion 10 is improperly encoding
utf-8 encoding for Swedish ä and ö
This topic is closed.
-
Re: Improper utf-8 encoding for Swedish
This is not improper UTF-8. Unicode text can be stored as a series of code points in either a "composed" or "decomposed" form. Reunion is writing its GEDCOMs in the decomposed form, in which characters such as "[...]
Brad Mohr
https://bradandkathy.com/genealogy/
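To make the composed/decomposed distinction above concrete, here is a minimal Python sketch. It is not taken from Reunion or TNG, and the helper name `utf8_hex` is made up for illustration. It uses ä as the example character and shows that the two forms are different code point sequences, and therefore different UTF-8 bytes, even though they display identically.

```python
import unicodedata

def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex (illustrative helper)."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))

composed = "\u00e4"      # ä as one precomposed code point (NFC form)
decomposed = "a\u0308"   # "a" followed by U+0308 COMBINING DIAERESIS (NFD form)

print(utf8_hex(composed))      # c3 a4
print(utf8_hex(decomposed))    # 61 cc 88
print(composed == decomposed)  # False: a plain comparison sees different code points

# Normalizing both strings to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

Both byte sequences are valid UTF-8; they differ only in which normalization form the text is in.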
-
Re: Improper utf-8 encoding for Swedish
Thanks for the interesting post on Unicode fine points. I understand that two Unicode strings may display identically and that GEDCOMs don't need to use any canonical forms. So now I have another basic question.
After I've exported a GEDCOM from Reunion I upload it to my web site and import it into TNG. Then I want to search for a string that contains a Swedish character such as ä. Some are found and some aren't, and the reason is that the search code doesn't consider that ä has been encoded in more than one way.
What code must be changed so that the search works for all possible byte strings that represent ä?
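One general way to approach this kind of search, independent of TNG's own code, is to normalize both the stored text and the search term to the same Unicode form before comparing. The sketch below is Python rather than TNG's PHP, and `to_nfc` and `search_names` are hypothetical names; it only illustrates the idea and is not a patch for TNG.

```python
import unicodedata

def to_nfc(text):
    """Convert text to the composed (NFC) normalization form."""
    return unicodedata.normalize("NFC", text)

def search_names(names, query):
    """Case-insensitive substring search over NFC-normalized text."""
    needle = to_nfc(query).casefold()
    return [name for name in names if needle in to_nfc(name).casefold()]

# Sample data (invented): the first name is composed, the second decomposed.
names = ["P\u00e4r Svensson", "Pa\u0308r Nilsson", "Per Olsson"]

matches = search_names(names, "p\u00e4r")
print(len(matches))  # 2: both encodings of "Pär" are found
```

The same normalization could instead be applied once to the exported file before import, so that the database only ever holds one form.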
-
Re: Improper utf-8 encoding for Swedish
Originally posted by Paul Johnson:
After I've exported a GEDCOM from Reunion I upload it to my web site and import it into TNG. Then I want to search for a string that contains a Swedish character such as [...]
Brad Mohr
https://bradandkathy.com/genealogy/
-
Re: Improper utf-8 encoding for Swedish
Originally posted by bmohr:
It sounds like you're using the wrong collation setting for your MySQL tables. The utf8_bin collation makes comparisons codepoint-by-codepoint, so semantically identical strings won't necessarily match if they're composed differently. In most situations, you would probably want to use the utf8_unicode_ci collation (utf8_general_ci would work, too, but there's no real reason to use it over utf8_unicode_ci these days).
-
Re: Improper utf-8 encoding for Swedish
Originally posted by Paul Johnson:
I set up the collation sequence as utf8_swedish_ci. This is required in order for the 3 Swedish characters ([...]
Brad Mohr
https://bradandkathy.com/genealogy/
-
Re: Improper utf-8 encoding for Swedish
Originally posted by bmohr:
I noticed that TNG sets the database connection character set to utf8 only if the browser session charset is UTF-8. Most modern browsers default to UTF-8, but it's worth checking. You might also verify that your database has the same collation settings at the field, table, and database levels.
Code:
if ($session_charset == 'UTF-8') @mysql_query("SET NAMES 'utf8'");
As I noted above, a GEDCOM file I exported from Paul's Reunion database imported, apparently correctly, into one of my TNG testing sites set to utf8_unicode_ci. Using utf8_swedish_ci should not have made a difference.
Roger