Accent insensitive

vanderghast · Nov 16, 2009

Is there an easy way to make a comparison accent-insensitive, a bit like
this in MS SQL Server:

SELECT * FROM table WHERE project LIKE '%belanger%' COLLATE Latin1_General_CI_AI

String.Compare() allows case insensitivity (CI), but I am looking for
accent insensitivity (é == e, for example), without having to explicitly
Replace every possible accented variant of a base letter that should be
considered equal to it.

AntFarmer · Nov 18, 2009

Try using the CompareOptions.IgnoreNonSpace option:

String.Compare("e", "é", CultureInfo.CurrentCulture,
CompareOptions.IgnoreNonSpace)

If you want to make it case insensitive too:

String.Compare("e", "é", CultureInfo.CurrentCulture,
CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase)

Hope that works!

vanderghast · Nov 18, 2009

That works fine. I skipped over the "IgnoreNonSpace" at first, but that is
exactly what I was looking for.

Peter Duniho · Nov 18, 2009

vanderghast said:
That works fine. I skipped over the "IgnoreNonSpace" at first, but that
is exactly what I was looking for.

Careful. It will only ignore "nonspacing combining characters". There
are characters in the Unicode space that are accented as individual
characters, and these will still be compared as non-equal to unaccented
characters.

The success of using the IgnoreNonSpace flag will depend entirely on the
source of the accented characters and how they are entered.

Pete
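
To make the distinction between combining characters and individually accented characters concrete, here is a minimal sketch that prints the UTF-16 code units of the same letter in both forms. The class name AccentForms and the helper CodeUnits are hypothetical names introduced for illustration; String.Normalize and NormalizationForm are the standard .NET APIs.

```csharp
using System;
using System.Text;

static class AccentForms
{
    // Hypothetical helper: lists the UTF-16 code units of a string as "U+XXXX" tokens.
    public static string CodeUnits(string s)
    {
        var parts = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
            parts[i] = "U+" + ((int)s[i]).ToString("X4");
        return string.Join(" ", parts);
    }

    static void Main()
    {
        // Precomposed form: one code unit for the accented letter.
        string precomposed = "\u00e9";
        // Decomposed form (NFD): base letter followed by a combining accent.
        string decomposed = precomposed.Normalize(NormalizationForm.FormD);

        Console.WriteLine(CodeUnits(precomposed)); // U+00E9
        Console.WriteLine(CodeUnits(decomposed));  // U+0065 U+0301

        // Recomposing (NFC) gives back the single-character form.
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC) == precomposed); // True
    }
}
```

Which form a given string arrives in depends on where the text came from (keyboard input, a file, a database), which is why the two are worth telling apart.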

vanderghast · Nov 18, 2009

I have heard of such characters (Swedish/Finnish come to mind, but I could
be wrong), but fortunately, for now, they are not part of the targeted
"cultures".

Peter Duniho · Nov 18, 2009

vanderghast said:
I have heard of such characters (Swedish/Finnish come to mind, but I could
be wrong), but fortunately, for now, they are not part of the targeted
"cultures".

I'm not sure you're getting what I mean. For example: the letter 'é'
can be represented as a plain e followed by a combining accent (in
UTF-16, 0x0065 followed by 0x0301) or as a single Unicode character (in
UTF-16, 0x00e9).

It's not a language-specific issue.

That said, I wrote a quick test (see below), and discovered that the
IgnoreNonSpace option actually does more work than the documentation
describes. In particular, it appears to actually handle the situation
you're specifically dealing with, by treating \u00e9 as the same as
\u0065 (in addition to ignoring the non-space \u0301 character, had I
included that).

Whether it's safe to rely on this undocumented behavior, I'm not
entirely sure. However, Microsoft has always held
backward-compatibility as a high priority, and even if the current
behavior was eventually deemed incorrect according to their
specification, I'd be really surprised if they changed it due to the
potential of breaking lots of existing code. It's probably more likely
they'd update the specification and documentation.

So, in other words, ignore what I wrote about the difference between
combining characters and individual accented characters. I mean, don't
ignore the specific data, but do ignore my conclusion based on the data
as it applies to your string comparison scenario.

Pete

using System;
using System.Globalization;

namespace TestAccentCompare
{
    class Program
    {
        static void Main(string[] args)
        {
            // Precomposed e-acute (U+00E9) versus plain e (U+0065).
            string strAccented = "\u00e9", strPlain = "\u0065";

            // Baseline: culture-sensitive comparison with no options.
            Console.WriteLine("'{0}' == '{1}': {2} (CompareOptions.None)",
                strAccented, strPlain, (String.Compare(strAccented,
                strPlain, CultureInfo.CurrentCulture,
                CompareOptions.None) == 0).ToString());

            // Same comparison, ignoring nonspacing (accent) differences.
            Console.WriteLine("'{0}' == '{1}': {2} (CompareOptions.IgnoreNonSpace)",
                strAccented, strPlain, (String.Compare(strAccented,
                strPlain, CultureInfo.CurrentCulture,
                CompareOptions.IgnoreNonSpace) == 0).ToString());

            Console.ReadLine();
        }
    }
}
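
The test above only covers the precomposed \u00e9. As a hedged extension, the sketch below also compares the decomposed form (plain e followed by the combining \u0301) that the post mentions but does not include. The class name AccentCompareSketch and the helper AreEqual are hypothetical, and CultureInfo.InvariantCulture is used here instead of CurrentCulture only to keep the result independent of the machine's settings; that substitution is an assumption, not something from the thread.

```csharp
using System;
using System.Globalization;

static class AccentCompareSketch
{
    // Hypothetical helper: true when the two strings compare equal under the given options.
    // Assumption: InvariantCulture instead of the CurrentCulture used in the thread.
    public static bool AreEqual(string a, string b, CompareOptions options)
    {
        return String.Compare(a, b, CultureInfo.InvariantCulture, options) == 0;
    }

    static void Main()
    {
        string plain = "e";
        string precomposed = "\u00e9"; // single accented character
        string decomposed = "e\u0301"; // base letter + combining acute accent

        foreach (CompareOptions options in
                 new[] { CompareOptions.None, CompareOptions.IgnoreNonSpace })
        {
            Console.WriteLine(
                "{0}: precomposed == plain: {1}, decomposed == plain: {2}",
                options,
                AreEqual(precomposed, plain, options),
                AreEqual(decomposed, plain, options));
        }
    }
}
```

Running it on the target framework and culture is the simplest way to check that both forms are treated alike before relying on the behavior.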

vanderghast · Nov 19, 2009

To behave like the SQL operator LIKE, unless I missed a shortcut, you have
to go through a CompareInfo object. Adding a few lines to your code (it is
fairly trivial from the documentation, once you know where to look):

using System;
using System.Globalization;

namespace TestAccentCompare
{
    class Program
    {
        static void Main(string[] args)
        {
            string strAccented = "\u00e9", strPlain = "\u0065";

            Console.WriteLine("'{0}' == '{1}': {2} (CompareOptions.None)",
                strAccented, strPlain, (String.Compare(strAccented,
                strPlain, CultureInfo.CurrentCulture,
                CompareOptions.None) == 0).ToString());

            Console.WriteLine("'{0}' == '{1}': {2} (CompareOptions.IgnoreNonSpace)",
                strAccented, strPlain, (String.Compare(strAccented,
                strPlain, CultureInfo.CurrentCulture,
                CompareOptions.IgnoreNonSpace) == 0).ToString());

            #region simulate SQL operator LIKE

            string AccentedSentence = "... found near Baie d'Urfée, under...";
            string SearchingKey = "urfee";

            // CompareInfo exposes culture-aware substring search.
            CompareInfo ci = new CultureInfo("fr-CA").CompareInfo;

            // IndexOf returns -1 when the key is not found.
            Boolean found = -1 != ci.IndexOf(
                AccentedSentence,
                SearchingKey,
                CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace);

            Console.WriteLine("'{0}' in '{1}' : {2} (CIAI)",
                SearchingKey, AccentedSentence, found);

            #endregion

            Console.ReadLine();
        }
    }
}
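
The IndexOf call above corresponds to LIKE '%key%'. As a rough sketch, the other common LIKE shapes can be covered the same way with CompareInfo.IsPrefix and CompareInfo.IsSuffix. The class LikeSketch and its method names are hypothetical, the sample strings other than the one from the post are made up for illustration, and "fr-CA" is simply carried over from the code above.

```csharp
using System;
using System.Globalization;

static class LikeSketch
{
    // Case-insensitive, accent-insensitive options (the "CI_AI" of the SQL collation).
    const CompareOptions CiAi =
        CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;

    // Roughly LIKE '%key%'
    public static bool Contains(CompareInfo ci, string text, string key)
    {
        return ci.IndexOf(text, key, CiAi) >= 0;
    }

    // Roughly LIKE 'key%'
    public static bool StartsWith(CompareInfo ci, string text, string key)
    {
        return ci.IsPrefix(text, key, CiAi);
    }

    // Roughly LIKE '%key'
    public static bool EndsWith(CompareInfo ci, string text, string key)
    {
        return ci.IsSuffix(text, key, CiAi);
    }

    static void Main()
    {
        CompareInfo ci = new CultureInfo("fr-CA").CompareInfo;

        Console.WriteLine(Contains(ci, "... found near Baie d'Urf\u00e9e, under...", "urfee"));
        Console.WriteLine(StartsWith(ci, "\u00c9cole primaire", "ecole"));
        Console.WriteLine(EndsWith(ci, "Baie d'Urf\u00e9", "URFE"));
    }
}
```

This does not handle the LIKE wildcards _ and [] or patterns with % in the middle; those would need a real pattern matcher on top of the comparison.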
