I have an application that deals with music metadata from all over the world, so I use Unicode. I want a search function that native English speakers can use without knowing the accents and diacritics of other languages. MySQL’s utf8_general_ci collation is ideal. For example, the letter “A” in a search key matches any of these:
A,a,À,Á,Â,Ã,Ä,Å,à,á,â,ã,ä,å,Ā,ā,Ă,ă,Ą,ą
The search uses SphinxSearch, so I want to configure it to use character matching tables that are compatible with utf8_general_ci. Sphinx’s charset_table allows any character folding to be configured, but it isn’t going to be trivial to write down all the rules.
Special note: I wrote this long before Sphinx 2 came along. 2.0.1-beta introduced support for collations and utf8_general_ci in particular. I would think that in many situations it would suffice to use that rather than resort to this method for creating a custom charset_table. If you are still on 0.9.[789] then this should work just as well as Sphinx 2's built-in collation. And Sphinx 2 users may still benefit from this method as it allows customization of the collation. I give examples of the customizations I used below.
The basic idea is that you can dump out any of MySQL’s collations by populating a CHAR(1) column with every character you care about and then running
SELECT GROUP_CONCAT(mychar) FROM mytable GROUP BY mychar
the output of which can be processed into charset_table rules for a Sphinx config file.
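To make that concrete, here is a minimal Python sketch of the dump. It is an illustration rather than one of my scripts: the mysql-connector-python driver, the credentials and the table layout are all assumptions.

import mysql.connector  # assumption: any Python DB-API driver would do

LO, HI = 0x20, 0x17F                     # code point range to examine
COLLATION = "utf8_general_ci"

cnx = mysql.connector.connect(user="scratch", password="scratch",
                              database="scratch", charset="utf8")
cur = cnx.cursor()
cur.execute("DROP TABLE IF EXISTS mytable")
cur.execute("CREATE TABLE mytable (cp INT NOT NULL, "
            "mychar CHAR(1) CHARACTER SET utf8 COLLATE " + COLLATION + ")")
cur.executemany("INSERT INTO mytable (cp, mychar) VALUES (%s, %s)",
                [(cp, chr(cp)) for cp in range(LO, HI + 1)])

# GROUP BY mychar groups the rows the collation treats as equal; the groups
# come out in collation order, each listing its members by code point.
cur.execute("SELECT GROUP_CONCAT(mychar ORDER BY cp SEPARATOR '') "
            "FROM mytable GROUP BY mychar ORDER BY mychar")
for (chars,) in cur:
    print(",".join(chars) + "\t" + ",".join("%04x" % ord(c) for c in chars))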
I broke the process into three steps:
- A script generates a human-readable file describing the collation rules
- I manually edit the file to define the exact rules I want Sphinx to use
- A second script turns the edited file into a charset_table definition
The first script takes as input a specification of a MySQL utf8 collation and a numeric range of Unicode code points. It creates the table, populates it, and runs the SELECT query (in the style above) to generate the human-readable output file. For example, run on utf8_general_ci from 0x20 to 0x17f, the output looks like this:
⇥0020
!⇥0021
"⇥0022
#⇥0023
$⇥0024
%⇥0025
…
=⇥003d
>⇥003e
?⇥003f
@⇥0040
A,a,À,Á,Â,Ã,Ä,Å,à,á,â,ã,ä,å,Ā,ā,Ă,ă,Ą,ą⇥0041,0061,00c0,00c1,00c2,00c3,00c4,00c5,00e0,00e1,00e2,00e3,00e4,00e5,0100,0101,0102,0103,0104,0105
B,b⇥0042,0062
C,c,Ç,ç,Ć,ć,Ĉ,ĉ,Ċ,ċ,Č,č⇥0043,0063,00c7,00e7,0106,0107,0108,0109,010a,010b,010c,010d
D,d,Ď,ď⇥0044,0064,010e,010f
…
W,w,Ŵ,ŵ⇥0057,0077,0174,0175
X,x⇥0058,0078
Y,y,Ý,ý,ÿ,Ŷ,ŷ,Ÿ⇥0059,0079,00dd,00fd,00ff,0176,0177,0178
Z,z,Ź,ź,Ż,ż,Ž,ž⇥005a,007a,0179,017a,017b,017c,017d,017e
[⇥005b
\⇥005c
]⇥005d
^⇥005e
…
Ł,ł⇥0141,0142
ʼn⇥0149
Ŋ,ŋ⇥014a,014b
Œ,œ⇥0152,0153
Ŧ,ŧ⇥0166,0167
µ⇥00b5
Each line in the file represents a set of characters the collation treats as equivalent. A line has one or more characters (comma separated), followed by a tab, followed by those characters’ respective Unicode code points.
With an understanding of how the second script works, I can edit the file to get the Sphinx charset_table rules I want.
A line with only one character will be translated to a singleton (i.e. terminal) character in the charset_table. For example, from the last line above, µ will become “U+00b5” standing on its own in the charset_table, with “->” neither before nor after it.
A line with two or more characters does two things. First, the leftmost character in the set becomes a singleton. Then all the characters to the right of the first are folded to that first character. For example, take the line:
D,d,Ď,ď⇥0044,0064,010e,010f
This will produce the following charset_table rules:
U+0044, U+0064->U+0044, U+010e->U+0044, U+010f->U+0044
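In Python the per-line conversion amounts to something like this sketch (charset_rules is my name for it here; it parses the hex code points after the tab rather than the characters themselves, so the line for the comma character can’t confuse the split):

def charset_rules(line):
    # The ⇥ shown in the listings above is a literal tab character;
    # everything after it is the comma-separated list of hex code points.
    chars, cps = line.rstrip("\n").split("\t")
    points = cps.split(",")
    head = "U+" + points[0]          # leftmost character becomes a singleton
    return [head] + ["U+%s->%s" % (p, head) for p in points[1:]]

print(", ".join(charset_rules("D,d,Ď,ď\t0044,0064,010e,010f")))
# prints: U+0044, U+0064->U+0044, U+010e->U+0044, U+010f->U+0044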
When I was editing my file I first reviewed the folding rules (the lines with more than
one character) to see that they made sense. Then I carefully thought about all the
characters I didn’t want Sphinx to index at all and deleted those lines from the file.
For example, in a song’s artist name field I want !, ' and ’ indexed but not ", $ or ?. Finally, thinking about the equivalence of “O'brien” and “O’brien”, I replaced the singleton ' line with this:
',’⇥0027,2019
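Run through the second script, that line becomes exactly the rule I want:
U+0027, U+2019->U+0027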
The second script then reads the edited file and generates the rules as I described.
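The outer loop is then just a filter over the edited file; a sketch reusing the charset_rules() helper above (sphinx.conf allows a long value to continue across lines with a trailing backslash):

import sys

rules = []
for line in sys.stdin:
    if line.strip():                      # skip any blank lines
        rules.extend(charset_rules(line))

# one charset_table directive, ready to paste into sphinx.conf
print("charset_table = " + ", \\\n    ".join(rules))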
Feel free to play with the scripts. The output of the first can be piped straight into the second, but I feel that manually editing the intermediate file is an important step.
MySQL collation to Sphinx charset_table conversion by tom--/thefsb is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.