Removing binary string from text

  • Thread starter: Isa Janfada

Isa Janfada

Hello,

I have an HTML text string like this:

" When  creating  a 
new  message,  Reset  occurs  when  '\' 
is  entered  in  Address.     
S-1a<BR>¥á¡¼¥ëºîÀ®»þ¥¢¥É¥ì¥¹¤Ë\¤òÆþ¤ì¤ë¤È¥ê¥»¥Ã¥È¤¹¤ë¡¡

How can I find and remove the strings like
¥á¡¼¥ëºîÀ®»þ¥¢¥É¥ì¥¹¤Ë\¤òÆþ¤ì¤ë¤È¥ê¥»¥Ã¥È¤¹¤ë¡¡ from the text?

Thanks for any tips
 
Isa Janfada said:
I have an HTML text string like this:

" When&nbsp; creating&nbsp; a&nbsp;
new&nbsp; message,&nbsp; Reset&nbsp; occurs&nbsp; when&nbsp; '\'&nbsp;
is&nbsp; entered&nbsp; in&nbsp; Address.&nbsp; &nbsp; &nbsp;
S-1a<BR>¥á¡¼¥ëºîÀ®»þ¥¢¥É¥ì¥¹¤Ë\¤òÆþ¤ì¤ë¤È¥ê¥»¥Ã¥È¤¹¤ë¡¡

How can I find and remove the strings like
¥á¡¼¥ëºîÀ®»þ¥¢¥É¥ì¥¹¤Ë\¤òÆþ¤ì¤ë¤È¥ê¥»¥Ã¥È¤¹¤ë¡¡ from the text?

Well, there are two things to worry about here:

1) How do you want to distinguish between "real" data and "bad" data?
Should your real data always be in ASCII, for instance? If so, you
could create a StringBuilder, and then go through each character in the
string, appending it if its integer value is 127 or less. (It would
be more efficient to append a whole substring at a time, but slightly
more complicated - unless efficiency is a concern, I'd go with the
"character at a time" route to start with.)
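
A minimal sketch of the "character at a time" approach, written in Java as an assumption, since the thread does not name a language. The class and method names (AsciiFilter, keepAscii) are hypothetical, and the sample string is a shortened stand-in for the text in the question.

```java
class AsciiFilter {
    // Copies only characters in the ASCII range (0..127) into a StringBuilder.
    static String keepAscii(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c <= 127) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: ASCII text followed by a few non-ASCII characters.
        String sample = "S-1a<BR>\u00A5\u00E1\u00A1\u00BC\u00A5\u00EB";
        System.out.println(keepAscii(sample)); // prints S-1a<BR>
    }
}
```

This drops every non-ASCII character wherever it appears, so it is only appropriate if none of the wanted data is outside ASCII.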

2) Why have you got bogus data in the first place?

The second point is a more important one, to my mind - if you find out
why you're getting data you don't want, you may find you've got a
problem higher up the food chain, or you may find a way of not
receiving the "bad" data in the first place, which is better than
filtering it out later.
 
2) Why have you got bogus data in the first place?

The HTML page mixes two different character sets. I want to store only the
ASCII characters in the database, so I must remove the bad characters.

Thank you, Jon Skeet
 
Isa Janfada said:
2) Why have you got bogus data in the first place?

The HTML page mixes two different character sets. I want to store only the
ASCII characters in the database, so I must remove the bad characters.

So are you assuming that *none* of the data you're interested in will be
non-ASCII?
 
2) Why have you got bogus data in the first place?

The HTML page mixes two different character sets. I want to store only the
ASCII characters in the database, so I must remove the bad characters.

Thank you, Jon Skeet

Some ideas:

- look for any characters outside the normal range of ASCII, this
might be anything above 255.

- does the different character set always start with the same
character? If so then you might be able to look for the first
occurrence of that character and ignore anything from then on.

- does the different character set always come at the end of the
string? If not then you are going to have to think of a way to pick
up normal ASCII again.
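
As a rough illustration of the "ignore anything from then on" idea, here is a sketch in Java (an assumed language; the thread does not name one). It takes code 127 as the end of ASCII, and the names AsciiPrefix, firstNonAscii and truncateAtFirstNonAscii are hypothetical.

```java
class AsciiPrefix {
    // Returns the index of the first character above 127, or -1 if there is none.
    static int firstNonAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return i;
            }
        }
        return -1;
    }

    // Keeps everything before the first non-ASCII character and drops the rest.
    static String truncateAtFirstNonAscii(String s) {
        int idx = firstNonAscii(s);
        return idx < 0 ? s : s.substring(0, idx);
    }

    public static void main(String[] args) {
        // Hypothetical sample: ASCII text, then non-ASCII characters, then ASCII again.
        String sample = "Address. S-1a<BR>\u00A5\u00E1\u00A1\u00BC tail";
        System.out.println(truncateAtFirstNonAscii(sample)); // prints Address. S-1a<BR>
    }
}
```

Note that the trailing " tail" is lost here, which is exactly the case where the foreign text does not come at the end and a per-character filter would be needed instead.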

rossum



 
rossum said:
Some ideas:

- look for any characters outside the normal range of ASCII, this
might be anything above 255.

Not quite - anything above 127. That's where ASCII ends.

- does the different character set always start with the same
character? If so then you might be able to look for the first
occurrence of that character and ignore anything from then on.

- does the different character set always come at the end of the
string? If not then you are going to have to think of a way to pick
up normal ASCII again.

If the character encodings are okay, that should be fine - hopefully
the text has been correctly decoded to Unicode, and it's just the
non-ASCII characters which can be discarded.
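
To see why the cutoff matters for this particular data: the unwanted characters shown in the question (such as ¥, á, ¡ and ¼) all have codes between 128 and 255, so a test for "above 255" would keep them. A small Java sketch (an assumed language; CodeThreshold and removeAbove are hypothetical names) compares the two cutoffs:

```java
class CodeThreshold {
    // Removes every character whose code is greater than maxCode.
    static String removeAbove(String s, int maxCode) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= maxCode) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the question's string: "S-1a<BR>" followed by ¥ á ¡ ¼.
        String sample = "S-1a<BR>\u00A5\u00E1\u00A1\u00BC";
        System.out.println(removeAbove(sample, 255).equals(sample)); // prints true: nothing removed
        System.out.println(removeAbove(sample, 127));                // prints S-1a<BR>
    }
}
```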
 