Convert large XML file to UTF-8

bbb

Hi,
I need to convert XML files from Japanese encoding to UTF-8.
I was using the following code:

using (FileStream fs = File.OpenRead(fromFile))
{
    int fileSize = (int)fs.Length;
    int buffer = fileSize;
    byte[] b = new byte[buffer];

    using (StreamWriter sw = new StreamWriter(toFile, true, toEnc))
    {
        while (fs.Read(b, 0, buffer) > 0)
        {
            byte[] utf8Bytes = Encoding.Convert(fromEnc, toEnc, b);

            // Convert the new byte[] into a char[] and then into a string.
            char[] utf8Chars = new char[toEnc.GetCharCount(utf8Bytes, 0, utf8Bytes.Length)];
            toEnc.GetChars(utf8Bytes, 0, utf8Bytes.Length, utf8Chars, 0);

            string utfString = new string(utf8Chars);

            sw.Write(replaceXmlEncodingHeader(utfString, fromEncHeader, toEncHeader));
        }
    }
}

Everything worked fine until we got a 100 MB file, which threw an
OutOfMemoryException.
I tried reading it in pieces:
if (fileSize > 30000000)
{
    buffer = 1024;
}
but then the end of the buffer is filled with other bytes (say the last
chunk is 900 bytes: 124 bytes from somewhere else get added), so my
converted XML is not well-formed.

Please help.
Thanks in advance.
Regards
 

Nicholas Paldino [.NET/C# MVP]

Your code to read in chunks isn't correct. You are assuming that the
buffer is filled completely on every read, when in reality it is not.
Chances are you are adding the 124 bytes yourself, instead of writing only
what was read.

You might also want to consider some mechanism other than XML for storing
the data in these files. At 100 MB, it's going to be very difficult to
process this file for updates if the time comes.

Hope this helps.
 

bbb

Nicholas,
Thanks for the reply.
Unfortunately I cannot choose the format (XML); that's a given.
100 MB files appear only once in a while, but they need to be processed.
I understand that I'm doing it the wrong way.
Can you please show me how to do it correctly?

Thanks in advance.
Regards


Nicholas Paldino [.NET/C# MVP]

bbb,

You aren't checking the return value of the call to Read. That value
tells you how many bytes were read into the buffer. You should then
convert only that number of bytes, not the whole buffer.
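
To make the point concrete, here is a minimal sketch of what Read reports for the 1024-byte buffer case described in the question. The class and method names (ReadDemo, ChunkSizes) are made up for illustration, and a MemoryStream stands in for the real file.

```csharp
using System.Collections.Generic;
using System.IO;

static class ReadDemo
{
    // Returns the byte count reported by each successive Read call.
    // ReadDemo and ChunkSizes are hypothetical names used only for this sketch.
    public static int[] ChunkSizes(Stream s, int bufferSize)
    {
        List<int> sizes = new List<int>();
        byte[] b = new byte[bufferSize];
        int bytesRead;

        while ((bytesRead = s.Read(b, 0, b.Length)) > 0)
        {
            // Only b[0] .. b[bytesRead - 1] holds fresh data here.
            // Anything past that is left over from the previous pass.
            sizes.Add(bytesRead);
        }

        return sizes.ToArray();
    }
}
```

Calling ReadDemo.ChunkSizes(new MemoryStream(new byte[1924]), 1024) reports 1024 and then 900. On that second pass the last 124 bytes of the buffer still hold data from the first pass, which is where the extra bytes in the output come from.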


--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)


Greg Young

int bytesread = fs.Read(b, 0, buffer);

On the last chunk:
bytesread != 1024
bytesread == 900

but then you process the entire array:
byte[] utf8Bytes = Encoding.Convert(fromEnc, toEnc, b);

Just make this process only bytesread bytes of the buffer. This overload:
http://msdn.microsoft.com/library/d...frlrfSystemTextEncodingClassConvertTopic1.asp
should handle this for you, ending up with:

byte[] utf8Bytes = Encoding.Convert(fromEnc, toEnc, b, 0, bytesread);

Cheers,

Greg Young
MVP - C#
http://codebetter.com/blogs/gregyoung
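
One caveat with the chunked approach: Japanese encodings are multi-byte, so a fixed-size byte chunk can end in the middle of a character, and converting each chunk on its own with Encoding.Convert may then garble that character. A possible alternative, sketched below, is to let StreamReader and StreamWriter do the decoding and encoding, since they carry partial characters over between reads. This is only a sketch under some assumptions: XmlTranscoder and ConvertFile are made-up names, the 4096-char buffer size is arbitrary, the output file is overwritten rather than appended to, and rewriting the XML declaration (the replaceXmlEncodingHeader step from the question) is not shown and would only need to touch the first chunk.

```csharp
using System.IO;
using System.Text;

static class XmlTranscoder
{
    // Streams text from fromFile to toFile, re-encoding it on the way.
    // XmlTranscoder and ConvertFile are hypothetical names for this sketch.
    public static void ConvertFile(string fromFile, string toFile,
                                   Encoding fromEnc, Encoding toEnc)
    {
        using (StreamReader reader = new StreamReader(fromFile, fromEnc))
        using (StreamWriter writer = new StreamWriter(toFile, false, toEnc))
        {
            // Buffer size is an arbitrary choice; memory use stays constant
            // regardless of the file size.
            char[] chars = new char[4096];
            int charsRead;

            while ((charsRead = reader.Read(chars, 0, chars.Length)) > 0)
            {
                // Write only the characters that were actually read.
                writer.Write(chars, 0, charsRead);
            }
        }
    }
}
```

Usage would look like XmlTranscoder.ConvertFile(fromFile, toFile, fromEnc, toEnc), with the same variables as in the original code.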



bbb

Thank you very much for your help.
It works perfectly.

Regards,
