Removing binary string from text

Isa Janfada

Hello,

I have an HTML text string like this:

" When  creating  a 
new  message,  Reset  occurs  when  '\' 
is  entered  in  Address.     
S-1a<BR>¥á¡¼¥ëºîÀ®»þ¥¢¥É¥ì¥¹¤Ë\¤òÆþ¤ì¤ë¤È¥ê¥»¥Ã¥È¤¹¤ë¡¡

How can I find and remove the strings like
¥á¡¼¥ëºîÀ®»þ¥¢¥É¥ì¥¹¤Ë\¤òÆþ¤ì¤ë¤È¥ê¥»¥Ã¥È¤¹¤ë¡¡ from the text.

Thanks for any tips
 
Jon Skeet [C# MVP]

Isa Janfada said:
I have an HTML text string like this:

" When&nbsp; creating&nbsp; a&nbsp;
new&nbsp; message,&nbsp; Reset&nbsp; occurs&nbsp; when&nbsp; '\'&nbsp;
is&nbsp; entered&nbsp; in&nbsp; Address.&nbsp; &nbsp; &nbsp;
S-1a<BR>¥á¡¼¥ëºîÀ®»þ¥¢¥É¥ì¥¹¤Ë\¤òÆþ¤ì¤ë¤È¥ê¥»¥Ã¥È¤¹¤ë¡¡

How can I find and remove the strings like
¥á¡¼¥ëºîÀ®»þ¥¢¥É¥ì¥¹¤Ë\¤òÆþ¤ì¤ë¤È¥ê¥»¥Ã¥È¤¹¤ë¡¡ from the text.

Well, there are two things to worry about here:

1) How do you want to distinguish between "real" data and "bad" data?
Should your real data always be in ASCII, for instance? If so, you
could create a StringBuilder, and then go through each character in the
string, appending it if its integer value is 127 or less. (It would
be more efficient to append a whole substring at a time, but slightly
more complicated - unless efficiency is a concern, I'd go with the
"character at a time" route to start with.)

2) Why have you got bogus data in the first place?

The second point is a more important one, to my mind - if you find out
why you're getting data you don't want, you may find you've got a
problem higher up the food chain, or you may find a way of not
receiving the "bad" data in the first place, which is better than
filtering it out later.
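
The "character at a time" route from point 1 might look something like the sketch below. It assumes the text is already in a .NET string and that everything worth keeping is plain ASCII; the class and method names (AsciiFilter, KeepAscii) are made up for illustration.

```csharp
using System;
using System.Text;

static class AsciiFilter
{
    // Hypothetical helper: returns a copy of input containing only the
    // characters whose integer value is 127 or less (the ASCII range).
    public static string KeepAscii(string input)
    {
        StringBuilder result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c <= 127)
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

Calling AsciiFilter.KeepAscii on a string that ends in characters such as ¥ and á returns only the ASCII part. Note that an ASCII character sitting inside the unwanted run, like the '\' in the sample text, passes the test and stays in the output, as do entities such as &nbsp;, which are themselves plain ASCII.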
 
Isa Janfada

2) Why have you got bogus data in the first place?

The HTML page mixes two different character sets. I want to store only the ASCII
characters in the database, so I must remove the bad characters.

Thank you Jon Skeet
 
Jon Skeet [C# MVP]

Isa Janfada said:
2) Why have you got bogus data in the first place?

The HTML page mixes two different character sets. I want to store only the ASCII
characters in the database, so I must remove the bad characters.

So are you assuming that *none* of the data you're interested in will be
non-ASCII?
 
rossum

2) Why have you got bogus data in the first place?

The HTML page mixes two different character sets. I want to store only the ASCII
characters in the database, so I must remove the bad characters.

Thank you Jon Skeet

Some ideas:

- look for any characters outside the normal range of ASCII, this
might be anything above 255.

- does the different character set always start with the same
character? If so then you might be able to look for the first
occurrence of that character and ignore anything from then on.

- does the different character set always come at the end of the
string? If not then you are going to have to think of a way to pick
up normal ASCII again.

rossum



The ultimate truth is that there is no ultimate truth
 
Jon Skeet [C# MVP]

rossum said:
Some ideas:

- look for any characters outside the normal range of ASCII, this
might be anything above 255.

Not quite - anything above 127. That's where ASCII ends.

- does the different character set always start with the same
character? If so then you might be able to look for the first
occurrence of that character and ignore anything from then on.

- does the different character set always come at the end of the
string? If not then you are going to have to think of a way to pick
up normal ASCII again.

If the character encodings are okay, that should be fine - hopefully
it has been correctly decoded to Unicode, and it's just the non-ASCII
characters which can be discarded.
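
As a rough illustration of why the 127 cutoff matters for the sample in this thread: the unwanted characters shown (¥, á, ¡, ¼ and so on) all have values between 128 and 255, so a test for "above 255" would let them through. The sketch below counts how many characters exceed a given limit; ThresholdCheck and CountAbove are hypothetical names used only for this example.

```csharp
using System;

static class ThresholdCheck
{
    // Hypothetical helper: counts the characters in text whose integer
    // value is greater than limit.
    public static int CountAbove(string text, int limit)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (c > limit)
            {
                count++;
            }
        }
        return count;
    }
}
```

For the first four characters of the unwanted run, written as escapes ("\u00A5\u00E1\u00A1\u00BC"), CountAbove with a limit of 255 gives 0, while a limit of 127 gives 4.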
 
