[PATCH v2] Feature request for a new collation

Stanislav Zudin szudin at tarantool.org
Tue Feb 26 13:40:42 MSK 2019


Add a new predefined collation 'unicode_s2' (ICU, locale ru_RU,
secondary strength) that distinguishes the Cyrillic letters 'Е' and 'Ё'
while still ignoring case. The standard case-insensitive collation
('unicode_ci') treats these letters as equal.

Closes #4007
---
Branch: https://github.com/tarantool/tarantool/tree/stanztt/gh-4007-new-default-collation-2.1
Issue: https://github.com/tarantool/tarantool/issues/4007
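
For reviewers: a toy model, in plain Lua, of the three behaviours the new
tests exercise. It is only a sketch. The weights table and the helper names
(sort_key, accepted) are hypothetical and hand-assigned for these four
letters; this is not how ICU derives its keys. It also assumes that
'unicode' compares at tertiary strength and 'unicode_ci' at primary
strength, which matches the test results in this patch.

```lua
-- Hand-assigned comparison levels for the four letters used in the tests.
-- base = primary level, accent = secondary level, case = tertiary level.
local weights = {
    ['е'] = {base = 'е', accent = 0, case = 0},
    ['Е'] = {base = 'е', accent = 0, case = 1},
    ['ё'] = {base = 'е', accent = 1, case = 0},
    ['Ё'] = {base = 'е', accent = 1, case = 1},
}

-- Build the key that a collation of the given strength would compare.
local function sort_key(letter, strength)
    local w = assert(weights[letter], 'letter is not in the toy table')
    if strength == 'primary' then
        return w.base
    elseif strength == 'secondary' then
        return w.base .. ':' .. w.accent
    end
    -- Anything else is treated as tertiary.
    return w.base .. ':' .. w.accent .. ':' .. w.case
end

-- Simulate inserts into a unique index: keep a letter only if no earlier
-- letter produced the same key. Returns the accepted letters in order.
local function accepted(letters, strength)
    local seen, result = {}, {}
    for _, letter in ipairs(letters) do
        local key = sort_key(letter, strength)
        if not seen[key] then
            seen[key] = true
            table.insert(result, letter)
        end
    end
    return result
end

print(#accepted({'Ё', 'Е', 'ё', 'е'}, 'tertiary'))   -- models 'unicode'
print(#accepted({'Ё', 'е', 'Е', 'ё'}, 'primary'))    -- models 'unicode_ci'
print(#accepted({'Ё', 'е', 'Е', 'ё'}, 'secondary'))  -- models 'unicode_s2'
```

The three calls mirror the insert order of the three test cases: tertiary
keeps all four letters, primary keeps only the first one inserted, and
secondary keeps one letter per accent class regardless of case.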

 src/box/bootstrap.snap          | Bin 1831 -> 1864 bytes
 src/box/lua/upgrade.lua         |   1 +
 test/sql-tap/collation.test.lua |   7 +-
 test/sql/collation.result       | 111 ++++++++++++++++++++++++++++++++
 test/sql/collation.test.lua     |  41 ++++++++++++
 5 files changed, 157 insertions(+), 3 deletions(-)

diff --git a/src/box/bootstrap.snap b/src/box/bootstrap.snap
index 0bb446fb6903ac3ef630c419b909f7db3df0372a..1b590939f1a9ca95cc93745b81a0271643055b21 100644
GIT binary patch
delta 1860
zcmV-K2fO&E4#*CW8Gkb|I4x&pGBIUjVP#}63Q2BrbYX5|WjY`@I5=cDVl-wgIbvZk
zEi^Y`W-Vc6IW#S1IWaIcH#K88V=`t6RzqxWV{1AfdmtbnARr)p3JTS_3%bn)$N<h`
z at 9{dN0000004TLD{Qyw?H2|7Oj1NiR7^eU*%rIlf7&At2IDeps#-Ld**&@FN<E?;T
zjEO{1U^gODl9iIaG?(lz^0bn%T^oR1jGiyH=bLhRt!YE*qWbbWGg_Ha3f}<Q0O<fP
z=KCUH`p!n!ADOW<-kq01F(ebHyWT}!*S9=wYxHHn-TRWCL$Q;Z=h^p~g}D}OAHq8A
z!}h%!a@)JErhf?$(GS~D1UJhTG(Ur_Bi_xb+xO4^+3)xM)qgg_G=5nrlu{%mok_r_
zLGSskCx;a``GH#_mjJMRHCB}z0Itpvwyz8Cu-57<{kwBp5caH5b?IRdV at Cf2J*>qV
zE$T`fSLcX?$&Va5HDatN>1$BaW>Ht->KrGn8pM46tACb+a&?YL6zRhj=(@^S_d at 5q
z@*=cF3z954GFWu1xYQ{W*W=~QcJGa-z at vY#>$_RC7v7;IY080YKNnFYuFkR5#Vgy-
zMN86CgCw37S1(&q8wOmRqu8G(@wB+IB&}Ic^s+VDz3dy3cvb}TpT at SNGzylytP!0_
z(C=rTP=6T&T%Du7pFvQSJ=nLr>ydrm&PkxFv1_)U^<Cpy_?^?1=ofvS+Z&BJ at bmb^
zVZ8RTH40;(U)Kg#CF|xD*|i9I_REh~M;f#Amr86&O$E3*$MCvlyoVs|dSEU?Sl_RI
zK(a}Z264YRgPt{_B?4TXW0lXUo3p4b$43D+BY%c$MAU?+0m0Qd4%K|<P_gbr;~_?K
zGt*&)!_0;jzbzRJF*Y$ZF&SVmz+8Z_P^+l|jSW$uCJLsG(h?0<=UAFBV?$M5^ubyz
zgZ_7?%@>MP)X~gvb&f|F^o;HoiW<DNynbCI$4C124R2q at EpJKJ`0ma%=-DGH`-FZn
zTz{RT>+<r`w5(@7uV+OehQxGE1N!hRtNIMpZ|64qaBD<`Ncg_r{rXv!V<$&MQ0R^b
zT%98s`+hww?rv|C2ZpP2Bx14f?$$<kOkAC#NQ-qYX3yGa4ulupA<hG?&XM&Qzwi!k
zHwP0us><PR<tc(=QABKpbV6o=P#7RwoquC3Q6#edpc=&8)8fijsjN%+NJ>h&LVIv^
zj;8s1{|jr;#)GSKbg7fdx>V2pP~;??u*1VS{Cz|@9iPEX^Ke#)VM(j3JLS;W^<9!u
z99*3v3f8I>!g$t^@xt4H1IYitYV=bm<L6(k*A&!k at ebi_dAFebT%>jWB8#L`)_<LD
zaCMG4Ju_w$Txoo>pNls)K%goLrLwNZ2K{Pqb&ekm05PH=Gh#wy*eK7A7(6?42gc3|
zn&Tp;1r7_G6*xMvvZiuoN(C1qN~J)dK%GFDIF&fzoKiL<0)cU;;ap^>Mn({T00lr0
z0MP{pG3bg05 at 4`Ej-xP+ffxuw5PymS8wWrD3_=Pth*~g0>jrfT|AK-%Fwi^??L~cP
zFRJeV2h3G7q|)?g1ltBdbA9g{k3fJTifkx)20uq8G4^{(SsZ(Gnp9X;6y_zYPz{;9
z+2QQ9dU_K$iZf%@jB7X8^NgkWg#HYZ?}9K{vvYnc<gkTve6-tGQ&d9GF at LX%DvmDT
zLdd}HsIc-7z0uo)RUzt_IuKj^J*u8)9)UaY=BYEbU<eFaLTJQx)ok8^3yF0PtvyNf
zI{K`r+kQtw*aLb)zOgsVyZQ*5Juuj*`!o>HW#Gwh({vi*9x at nT51bR*BsZazTL%mT
zY9MG~C*UDoHFPbHZ++7cwts-$kZ<e_^Oi3 at K%3?L`bzN9MVN-r4O=6a`vFR at c-Lp-
zd`kqWq|crheFPM+#PxC at i}2K{q9y;A+iE^tb}1Gw)4<oCX)a@;jfpZ2K})|FrOc{y
zAwn()`AS>SPv>D<5Ps?DTHW?JP<tK}yV at qJ);&K*7RUT3vOaNgD1Xy1(y~46#?%7X
zpCqNC`nU`~UIN$%*XeQ3xmzzp4qOd#;<bC<(hf_Gr^KlSWP*Lkq+0~FjM)soz(U(P
z6(ug at oG0kkiI;@Qf7(KDo<c at qBGGT%%L>X|L0CEZyU<GH$ly`vSKX1SYk3=WYko61
zq}rxnjOk#W`uhr;G=C)nr(&FTY{Ot at I&#i<$AF?|;5#8YG30KoT%SY+oMEU_0GGvw
zV!_gndReldjKY9^IQRvceH3K<B!|1~#JJE%@2xbKf!*ZQR%HLjSwSqKkLV?yBnDNK
zK-xRGo3g|6`H8BMw8_|t2p(*RIa=AM8-`>kNnU+NJY&t(k3C*s4KCZ?&pU)5m~c9`
y8Ijm@$x<vOpc~cTd((}Qsc`RP7$KBCnYE(>A_cy5E1a9;kYKJ_ata>R5UuSHt$oD+

delta 1827
zcmV+;2i*9`4yO)~8GkZ0H7#c}H)3WoGdVK~Np5p=VQyn(Iv`^*Gd5&pFkvk=W at R@m
zG%zzYEjT!4F)d?aGh{VnH8Wx{HDL-?Lu_wjYdRo%eF_TIx(m9^2Ce|k)s!0pr2qf`
z001bpFZ}>eEj0jYM??=v;25U>Fw8KA5aR at KMG#m>L*QW(A%7yj2IH-OV2p`GQD8SB
zQ<9aEzBH}eU*u^eqq>gJE=JE6+w)C1z1FnIIjXms%tRqmN&(pb)d1!IBbLvpo3p4b
z$3!pY+ag{1#zxmGiG3RH&dZ<IlG)Q)&mu4DR~@%A`ZD0|eaX+E*h$TE%zOP|u0`61
zt}gSi{qBaG_J6FaWj{gG!!{Jb{n32pXK<yYyE#?+{`o!Y{ob#7&t{j#En9_BTGWIi
z32^PJajM(`Y;}&VeOY*ivsM4}@6KsHxU)kQfDaclV)Q+*!&;nCA#Sv7b&gA(2-$&i
zV at 8XTz6P~y3UQ;Y&T-PJLCp8JW=!Z-=h%f7dD!+`R)6{8Ug(@!UVgZAS1>>+ZNbR`
zpw6Eex9|>YH{XtzHQT#Sk^+hT!LINAW-Yu!OH$Mbwx5e92V0%vs*4x4pNp2HrUpzp
zEvj0!q%;h+I!CcTXVPg=Wl1{oBGj at o$}QM8Bymir=RJ*WNo5o`Y1ttflb~PEN}&`G
zY;}(Met!nRRPJ2A>TXB&eLE+6uEwm`{_nfSv+z5YIngifJhwFpbAab at i@W%&WoNWx
zpkLMoRpocHitJhhJL~1gDk6<P{bdtdQc{7f&M~~K81Eq{yB(Oz(AD>A9*~40q(R(o
z&R}PU=!n2pXC at OgC1^+h(Ttc8F%x2|a|D_>aet;42STRiCG#Q1LrjMl4l90F6lh>(
zU^c*LC}k*NC|#(PWP!ebu22$3lA&}|VXJc-#mpBFmKS%h7E7Q1-DUHI771}O6t+4?
zq6~IJ*9%1r)>d7=ERtg){riSBFXL6`q-%V4=Njy+5|x!gJ(#V|k#%|LX<62>o!2p;
z5r0M|bS~@ha4e^K?9^}PHt%pdM2kr9zTf@&Kg)3xqM#|XM`^2bBwycer^VfEjp~rL
zI>(^z!@FA=&55wpIf}AZ=VI<Gjov_L;T_sM*y<d=4;51M-fT#!j<q>*GD1^?BN8)&
zmV|6|j&($f$Qp!d5cf`tDq5v54h5u3mwzr==#Q<=ku;z0Z(%Lcd~9`&0P3VL4%K6S
zC~}fk;L+h6{wAW^j?dthbvUQQu%s2nnRIB(@-9hfj;+p7#iv?@E}n5{#OONc81g)>
z8{HJr;Q7|7W#x2N{6lzGUCn1d7iFEl_>pwNI1`So&Jo8)jTi-08rSUSVogmCn17-|
zDU6%BK|LE=ougMHz$?fIjaLyJ8lBN2M at I(DiID??-n_VRVbj6}$L8i$RuqegTCu%?
zP%2I-PAE<%Og1J|fo-VqTx6&OMi77i1war0(FF%F=!ymsV6Z at pqcDzv7zjfUiXs~a
zKmZIv3N(mXFhZ*abuTo5t}r0bJb#VtMSW;5s(;`EhOL>ZG at V_Dw&p-8+e?Qb5ZH(U
z8w&cNkRT-mlY5E*j$JyRR7_U;;{~ix4OzC?q3X4Ia+*1cr!ninwax4~%hK_LeTB&v
zLjVBRs^4;QWQeYPv}dd-Dk13b>Y^0K&07c!{KzY;+^{zwba*L5k*TZ4R)2A5)rMV5
z1W#@}b)GGRnL+;xjWVv<D$TnPrhAM5C5b{upNYCzXQ?ndZ_JlrcixyU53||%^AMcJ
zXn-!gCygMQX9hfEOg&wHPV7l;!Y8+w%G6)<X<;+(Ard$AAApCZBTbsYJgW!(nR&|J
zKW8&<KWHVG1Y*2nb*s||dVfD!iR^d$j8txkAl2j9AER4<f-G5IreqOrTQ9UL|8iT+
z`jed*i{oi{_|G(#F*%Eg@(e*se+*q_EnS!*7j%53&FC-ksx8Q~^uVod^Bkx>ON?E8
zB&%*d-&__)6H<((ZgQw$7zx{+W69LE*&h?7B7a<sAC~}Z0mJEG=YKq-7ghsTx}5mh
zm2YW>?Z;EnR68??zGSv7f?8y1z%Mw_woZ)_#d6LLT2tbG!rMPqA;^9qOVp6)w)Uk3
zWj-M at 9sOHqC33{?D9#~wr0H7TN8Lc)Ob&^*Dez)Cm?wW<i_>jj;1n9Cw{1vsrpxC{
z>={t>jCkioCWhRtm4EX|*239%3IO;oK41!#Ues%p1qCUL?4u+<m}YW<L(kyhjyf?G
zTJAly-7>J7ywZd0|9Dm~is&PHOHUGW6(y5)oZ3z2 at GO5Ksw8hRS@Hb?aZr+0cGNIK
z43t=2$s?w>#u23tKrzBb1(Hfa^nV<p45fgF?oxj_`)Ep_UK{`xaiyL&vvbKES-e0i
R9JX=Dom4Gm<qp*lt?kA^T~Gi3

diff --git a/src/box/lua/upgrade.lua b/src/box/lua/upgrade.lua
index 70cfb4f2e..84c559dac 100644
--- a/src/box/lua/upgrade.lua
+++ b/src/box/lua/upgrade.lua
@@ -610,6 +610,7 @@ local function upgrade_to_2_1_0()
 
     box.space._collation:replace{0, "none", ADMIN, "BINARY", "", setmap{}}
     box.space._collation:replace{3, "binary", ADMIN, "BINARY", "", setmap{}}
+    box.space._collation:replace{4, "unicode_s2", ADMIN, "ICU", "ru_RU", {strength='secondary'}}
 
     upgrade_priv_to_2_1_0()
 end
diff --git a/test/sql-tap/collation.test.lua b/test/sql-tap/collation.test.lua
index 1e55b0092..b8bc02317 100755
--- a/test/sql-tap/collation.test.lua
+++ b/test/sql-tap/collation.test.lua
@@ -21,9 +21,10 @@ test:do_execsql_test(
         1,"unicode",
         2,"unicode_ci",
         3,"binary",
-        4,"unicode_numeric",
-        5,"unicode_numeric_s2",
-        6,"unicode_tur_s2"
+        4,"unicode_s2",
+        5,"unicode_numeric",
+        6,"unicode_numeric_s2",
+        7,"unicode_tur_s2"
     }
 )
 
diff --git a/test/sql/collation.result b/test/sql/collation.result
index daea35543..7697f03d4 100644
--- a/test/sql/collation.result
+++ b/test/sql/collation.result
@@ -427,3 +427,114 @@ box.space.T4A:drop()
 box.space.T4B:drop()
 ---
 ...
+--
+-- gh-4007 Feature request for a new collation
+--
+-- Default unicode collation deals with russian letters
+s = box.schema.space.create('t1')
+---
+...
+s:format({{name='s1', type='string', collation = 'unicode'}})
+---
+...
+idx = s:create_index('pk', {unique = true, type='tree', parts={{'s1', collation = 'unicode'}}})
+---
+...
+s:insert{'Ё'}
+---
+- ['Ё']
+...
+s:insert{'Е'}
+---
+- ['Е']
+...
+s:insert{'ё'}
+---
+- ['ё']
+...
+s:insert{'е'}
+---
+- ['е']
+...
+-- all 4 letters are in the table
+s:select{}
+---
+- - ['е']
+  - ['Е']
+  - ['ё']
+  - ['Ё']
+...
+s:drop()
+---
+...
+-- unicode_ci collation doesn't distinguish russian letters 'Е' and 'Ё'
+s = box.schema.space.create('t1')
+---
+...
+s:format({{name='s1', type='string', collation = 'unicode_ci'}})
+---
+...
+idx = s:create_index('pk', {unique = true, type='tree', parts={{'s1', collation = 'unicode_ci'}}})
+---
+...
+s:insert{'Ё'}
+---
+- ['Ё']
+...
+-- the following calls should fail
+s:insert{'е'}
+---
+- error: Duplicate key exists in unique index 'pk' in space 't1'
+...
+s:insert{'Е'}
+---
+- error: Duplicate key exists in unique index 'pk' in space 't1'
+...
+s:insert{'ё'}
+---
+- error: Duplicate key exists in unique index 'pk' in space 't1'
+...
+-- return single 'Ё'
+s:select{}
+---
+- - ['Ё']
+...
+s:drop()
+---
+...
+-- unicode_s2 collation does distinguish russian letters 'Е' and 'Ё'
+s = box.schema.space.create('t1')
+---
+...
+s:format({{name='s1', type='string', collation = 'unicode_s2'}})
+---
+...
+idx = s:create_index('pk', {unique = true, type='tree', parts={{'s1', collation = 'unicode_s2'}}})
+---
+...
+s:insert{'Ё'}
+---
+- ['Ё']
+...
+s:insert{'е'}
+---
+- ['е']
+...
+-- the following calls should fail
+s:insert{'Е'}
+---
+- error: Duplicate key exists in unique index 'pk' in space 't1'
+...
+s:insert{'ё'}
+---
+- error: Duplicate key exists in unique index 'pk' in space 't1'
+...
+-- return two: 'Ё' and 'е'
+s:select{}
+---
+- - ['е']
+  - ['Ё']
+...
+s:drop()
+---
+...
diff --git a/test/sql/collation.test.lua b/test/sql/collation.test.lua
index 713a9bd89..e125274ef 100644
--- a/test/sql/collation.test.lua
+++ b/test/sql/collation.test.lua
@@ -172,3 +172,44 @@ box.sql.execute("SELECT a FROM t4b ORDER BY a || b")
 
 box.space.T4A:drop()
 box.space.T4B:drop()
+
+--
+-- gh-4007 Feature request for a new collation
+--
+-- Default unicode collation deals with russian letters
+s = box.schema.space.create('t1')
+s:format({{name='s1', type='string', collation = 'unicode'}})
+idx = s:create_index('pk', {unique = true, type='tree', parts={{'s1', collation = 'unicode'}}})
+s:insert{'Ё'}
+s:insert{'Е'}
+s:insert{'ё'}
+s:insert{'е'}
+-- all 4 letters are in the table
+s:select{}
+s:drop()
+
+-- unicode_ci collation doesn't distinguish russian letters 'Е' and 'Ё'
+s = box.schema.space.create('t1')
+s:format({{name='s1', type='string', collation = 'unicode_ci'}})
+idx = s:create_index('pk', {unique = true, type='tree', parts={{'s1', collation = 'unicode_ci'}}})
+s:insert{'Ё'}
+-- the following calls should fail
+s:insert{'е'}
+s:insert{'Е'}
+s:insert{'ё'}
+-- return single 'Ё'
+s:select{}
+s:drop()
+
+-- unicode_s2 collation does distinguish russian letters 'Е' and 'Ё'
+s = box.schema.space.create('t1')
+s:format({{name='s1', type='string', collation = 'unicode_s2'}})
+idx = s:create_index('pk', {unique = true, type='tree', parts={{'s1', collation = 'unicode_s2'}}})
+s:insert{'Ё'}
+s:insert{'е'}
+-- the following calls should fail
+s:insert{'Е'}
+s:insert{'ё'}
+-- return two: 'Ё' and 'е'
+s:select{}
+s:drop()
-- 
2.17.1



