Accent insensitive



vanderghast

Is there an easy way to make a string comparison accent-insensitive,
somewhat like this in MS SQL Server:


SELECT * FROM table WHERE project LIKE '%belanger%' COLLATE Latin1_General_CI_AI


String.Compare() supports case insensitivity (CI), but I am looking for
accent insensitivity (é == e, for example) without having to explicitly
Replace every possible accented form of each base letter.
 

AntFarmer

Try using the CompareOptions.IgnoreNonSpace option:

String.Compare("e", "é", CultureInfo.CurrentCulture,
CompareOptions.IgnoreNonSpace)

If you want to make it case insensitive too:

String.Compare("e", "é", CultureInfo.CurrentCulture,
CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase)

Hope that works!
 

vanderghast

That works fine. I skipped over the "IgnoreNonSpace" at first, but that is
exactly what I was looking for.
 

Peter Duniho

vanderghast said:
That works fine. I skipped over the "IgnoreNonSpace" at first, but that
is exactly what I was looking for.

Careful. It will only ignore "nonspacing combining characters". There
are characters in the Unicode space that are accented as individual
characters, and these will still be compared as non-equal to unaccented
characters.

The success of using the IgnoreNonSpace flag will depend entirely on the
source of the accented characters and how they are entered.

Pete
 

vanderghast

I have heard of such characters (Swedish/Finnish come to mind, but I could
be wrong), but fortunately, for now, they are not part of the targeted
"cultures".
 

Peter Duniho

vanderghast said:
I have heard of such characters (Swedish/Finnish come to mind, but I
could be wrong), but fortunately, for now, they are not part of the
targeted "cultures".

I'm not sure you're getting what I mean. For example: the letter 'é'
can be represented as a plain e followed by a combining accent (in
UTF-16, 0x0065 followed by 0x0301) or as a single Unicode character (in
UTF-16, 0x00e9).

It's not a language-specific issue.

That said, I wrote a quick test (see below), and discovered that the
IgnoreNonSpace option actually does more work than the documentation
describes. In particular, it appears to actually handle the situation
you're specifically dealing with, by treating \u00e9 as the same as
\u0065 (in addition to ignoring the non-space \u0301 character, had I
included that).

Whether it's safe to rely on this undocumented behavior, I'm not
entirely sure. However, Microsoft has always held
backward-compatibility as a high priority, and even if the current
behavior was eventually deemed incorrect according to their
specification, I'd be really surprised if they changed it due to the
potential of breaking lots of existing code. It's probably more likely
they'd update the specification and documentation.

So, in other words, ignore what I wrote about the difference between
combining characters and individual accented characters. I mean, don't
ignore the specific data, but do ignore my conclusion based on the data
as it applies to your string comparison scenario. :)

Pete



using System;
using System.Globalization;

namespace TestAccentCompare
{
    class Program
    {
        static void Main(string[] args)
        {
            // Precomposed 'é' (U+00E9) versus plain 'e' (U+0065).
            string strAccented = "\u00e9", strPlain = "\u0065";

            // Culture-sensitive comparison with no options.
            Console.WriteLine("'{0}' == '{1}': {2} (CompareOptions.None)",
                strAccented, strPlain,
                String.Compare(strAccented, strPlain,
                    CultureInfo.CurrentCulture,
                    CompareOptions.None) == 0);

            // Same comparison with IgnoreNonSpace.
            Console.WriteLine("'{0}' == '{1}': {2} (CompareOptions.IgnoreNonSpace)",
                strAccented, strPlain,
                String.Compare(strAccented, strPlain,
                    CultureInfo.CurrentCulture,
                    CompareOptions.IgnoreNonSpace) == 0);

            Console.ReadLine();
        }
    }
}
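
For anyone who wants to see the two representations side by side, here is a minimal sketch. The class name AccentForms and the helper EqualIgnoringAccents are made up for illustration, and the sketch assumes the runtime is using its normal culture data (not invariant-globalization mode). It builds the precomposed and decomposed forms of 'é', checks them with an ordinal and a culture-sensitive comparison, and then compares each against a plain 'e' with IgnoreNonSpace.

```csharp
using System;
using System.Globalization;
using System.Text;

static class AccentForms
{
    // Hypothetical helper (not from the thread): true when the two strings
    // compare as equal once nonspacing (accent) differences are ignored.
    // Case is still significant here.
    public static bool EqualIgnoringAccents(string a, string b)
    {
        return string.Compare(a, b, CultureInfo.InvariantCulture,
                              CompareOptions.IgnoreNonSpace) == 0;
    }

    public static void Main()
    {
        string composed = "\u00e9";     // single code unit: é
        string decomposed = "e\u0301";  // plain e, then combining acute accent

        // Canonical decomposition (form D) turns the first form into the second.
        Console.WriteLine("FormD(composed) is decomposed: {0}",
            composed.Normalize(NormalizationForm.FormD) == decomposed);

        // Ordinal comparison looks at the raw UTF-16 code units.
        Console.WriteLine("Ordinal equal: {0}",
            string.Equals(composed, decomposed, StringComparison.Ordinal));

        // Culture-sensitive comparison of the two representations.
        Console.WriteLine("Culture-sensitive equal: {0}",
            string.Compare(composed, decomposed, CultureInfo.InvariantCulture,
                           CompareOptions.None) == 0);

        // Each representation against a plain 'e', ignoring accents.
        Console.WriteLine("composed ~ e: {0}", EqualIgnoringAccents(composed, "e"));
        Console.WriteLine("decomposed ~ e: {0}", EqualIgnoringAccents(decomposed, "e"));
    }
}
```

InvariantCulture is used only to keep the sketch independent of the machine's settings; swap in whatever culture your data calls for.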
 

vanderghast

And to behave like the SQL LIKE operator, unless I missed a shortcut, one
has to get a CompareInfo object and call its IndexOf method. Adding a few
lines to your code (even if it is fairly trivial from the documentation,
once you know where to look):


using System;
using System.Globalization;


namespace TestAccentCompare
{
    class Program
    {
        static void Main(string[] args)
        {
            string strAccented = "\u00e9", strPlain = "\u0065";

            Console.WriteLine("'{0}' == '{1}': {2} (CompareOptions.None)",
                strAccented, strPlain,
                String.Compare(strAccented, strPlain,
                    CultureInfo.CurrentCulture,
                    CompareOptions.None) == 0);

            Console.WriteLine("'{0}' == '{1}': {2} (CompareOptions.IgnoreNonSpace)",
                strAccented, strPlain,
                String.Compare(strAccented, strPlain,
                    CultureInfo.CurrentCulture,
                    CompareOptions.IgnoreNonSpace) == 0);

            #region simulate SQL operator LIKE

            string AccentedSentence = "... found near Baie d'Urfée, under...";
            string SearchingKey = "urfee";

            // CompareInfo carries the culture's comparison rules.
            CompareInfo ci = new CultureInfo("fr-CA").CompareInfo;

            // IndexOf returns -1 when the key is not found.
            Boolean found = -1 != ci.IndexOf(
                AccentedSentence,
                SearchingKey,
                CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace);

            Console.WriteLine("'{0}' in '{1}' : {2} (CIAI)",
                SearchingKey, AccentedSentence, found);

            #endregion

            Console.ReadLine();
        }
    }
}
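
If the LIKE-style check is needed in more than one place, the IndexOf call above could be wrapped in a small helper. This is only a sketch: the class name LikeSearch and the method ContainsIgnoringCaseAndAccents are hypothetical names, it only covers the '%key%' form of LIKE (no wildcards inside the key), and it assumes the runtime is using its normal culture data.

```csharp
using System;
using System.Globalization;

static class LikeSearch
{
    // Hypothetical helper: rough equivalent of
    //   text LIKE '%key%' COLLATE ..._CI_AI
    // using the comparison rules of the given culture.
    public static bool ContainsIgnoringCaseAndAccents(
        string text, string key, CultureInfo culture)
    {
        const CompareOptions options =
            CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;

        // CompareInfo.IndexOf returns -1 when the key does not occur.
        return culture.CompareInfo.IndexOf(text, key, options) >= 0;
    }

    public static void Main()
    {
        string sentence = "... found near Baie d'Urfée, under...";

        // InvariantCulture is an assumption made to keep the sketch
        // independent of machine settings; pass new CultureInfo("fr-CA")
        // or another culture as needed.
        CultureInfo culture = CultureInfo.InvariantCulture;

        Console.WriteLine(ContainsIgnoringCaseAndAccents(sentence, "urfee", culture));
        Console.WriteLine(ContainsIgnoringCaseAndAccents(sentence, "URFÉE", culture));
        Console.WriteLine(ContainsIgnoringCaseAndAccents(sentence, "montreal", culture));
    }
}
```

Taking the culture as a parameter keeps the choice of comparison rules with the caller, the same way the COLLATE clause does in the SQL version.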





