Closed

Loading a file leaves the umlauts undisplayable/unreadable ...

description

I loaded an HTML file starting with the following declarations:
 
"<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de">
<head profile="http://purl.org/uF/2008/03">
    <meta http-equiv="Content-Type"         content="application/xhtml+xml; charset=ISO-8859-1" />
    ...
</head>
...
</html>"
 
Loading the 62 kB file took 4 minutes on a 2 GHz dual-core machine with 4 GB RAM, with the "preserve HTML" option enabled. Afterwards the special characters (here German umlauts) looked very strange and were unreadable, as in this example: "Ank�ndigung".
 
I have also attached a screenshot of the loaded file showing some of these special-character occurrences.

file attachments

Closed Apr 19, 2010 at 9:03 AM by joanfusan
Fixed in local testing; awaiting tests on other computers.

comments

joanfusan wrote Apr 14, 2010 at 11:51 AM

It appears that Visual Basic's file-reading method does not read non-ASCII characters correctly. I'll look for a solution...

joanfusan wrote Apr 15, 2010 at 11:16 AM

Solved. The new file-loading system can load and encode a 200 KB file in 4 seconds on a 1.86 GHz Intel Core 2.



** Closed by joanfusan 15/04/2010 3:14

joanfusan wrote Apr 15, 2010 at 11:17 AM

Can you send me the text so I can test it in the development version?

MartinLemburg wrote Apr 15, 2010 at 12:29 PM

Here it comes!

MartinLemburg wrote Apr 15, 2010 at 12:39 PM

Regarding your comment that VB doesn't read non-ASCII characters well ...
  1. Does it really try to read ASCII characters (8-bit or 7-bit)?
  2. How is the text read from the file stored? In which type of variable, e.g. a string variable with a particular encoding (Unicode, UTF-8, ASCII, etc.)?
  3. What does the "text" widget/control expect as input (Unicode, UTF-8, ASCII, etc.)?
What did you change to solve this problem?

joanfusan wrote Apr 15, 2010 at 4:51 PM

It's weird. When I open the file that you sent me, Text to HTML can't read the special characters correctly. However, if I copy the text, paste it into a new file in Notepad++, and save it, the characters display correctly. To read the file I use the following code (in the new version):

Dim srLectura As IO.StreamReader
...

' txtTexto is a TextBox control.
Me.txtTexto.Text = srLectura.ReadToEnd

joanfusan wrote Apr 15, 2010 at 4:52 PM

Try the file a.html saved with Notepad++

joanfusan wrote Apr 15, 2010 at 4:57 PM

Solved! VB.NET uses UTF-8 encoding by default, so whether a file is read correctly depends on how the file itself is encoded. If I add the following statements:

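' Needs Imports System.Text at the top of the file.
Dim enCodificacion As Encoding
' Encoding.Default is the system's ANSI code page on the .NET Framework.
enCodificacion = Encoding.Default

' Pass the encoding explicitly instead of relying on the UTF-8 default.
srLectura = New IO.StreamReader(Me.ofdAbrir.FileName, enCodificacion)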

Works!!
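
For anyone hitting the same problem, here is a minimal sketch of why the umlauts broke and of a more explicit variant of the fix. It assumes the file really is ISO-8859-1, as its XML declaration says. The module name and the helper ReadAllWithEncoding are hypothetical and not part of Text to HTML; only the StreamReader and Encoding calls are standard .NET. Note that Encoding.Default follows the machine's ANSI code page on the .NET Framework, so it may behave differently on a system that is not set to a Western European locale, whereas naming the encoding does not depend on the machine.

```vbnet
Imports System.IO
Imports System.Text

Module EncodingSketch
    ' Hypothetical helper (not from the project): reads a whole file
    ' using an explicitly named encoding such as "iso-8859-1".
    Function ReadAllWithEncoding(path As String, encodingName As String) As String
        Dim enc As Encoding = Encoding.GetEncoding(encodingName)
        ' False: do not look for a byte order mark that could override enc.
        Using reader As New StreamReader(path, enc, False)
            Return reader.ReadToEnd()
        End Using
    End Function

    Sub Main()
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")

        ' "Ankündigung", built with ChrW so the source file's own
        ' encoding does not matter. In ISO-8859-1, ü is the single byte &HFC.
        Dim word As String = "Ank" & ChrW(&HFC) & "ndigung"
        Dim fileBytes As Byte() = latin1.GetBytes(word)

        ' A lone &HFC byte is not valid UTF-8, so the UTF-8 decoder
        ' substitutes the replacement character U+FFFD for it.
        Dim decodedAsUtf8 As String = Encoding.UTF8.GetString(fileBytes)
        Dim decodedAsLatin1 As String = latin1.GetString(fileBytes)

        Console.WriteLine("UTF-8 decode contains U+FFFD: " &
                          (decodedAsUtf8.IndexOf(ChrW(&HFFFD)) >= 0).ToString())
        Console.WriteLine("ISO-8859-1 decode round-trips: " &
                          (decodedAsLatin1 = word).ToString())
    End Sub
End Module
```

In the code above, the load step would then be something like Me.txtTexto.Text = ReadAllWithEncoding(Me.ofdAbrir.FileName, "iso-8859-1"), again only as an illustration. This also touches the earlier question about storage: a .NET String holds UTF-16 text, so the encoding only matters at the moment the file's bytes are decoded.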
