
Suggestion: Change HTML preservation to shorten loading of bigger HTML files ...

description

I would suggest loading the HTML file into a DOM tree and converting only the text nodes and (perhaps configurable) HTML attributes such as "title" and "alt".
 
The "href", "src", and "data" HTML attributes need special care, because URLs must be 7-bit ASCII and may contain HTTP GET query-specific characters like "&", "?", and "=".
Thus any HTML attribute containing a URL should be treated so that the protocol, domain, path, and file name are not converted at all, while the query values are converted via MIME, and the query parameter delimiter "&" is written as an HTML entity (symbolic or numeric).
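
A minimal sketch of this URL rule in plain Tcl. Two things here are assumptions, not part of the suggestion itself: "converted via MIME" is read as percent-encoding of the UTF-8 bytes, and the proc names encodeQueryValue and escapeUrlAttribute are made up for illustration. It also assumes the query values are not already percent-encoded.

```tcl
# Percent-encode one query value.
# Assumption: UTF-8 bytes, RFC 3986 unreserved characters are kept as-is.
proc encodeQueryValue {value} {
    set out ""
    foreach byte [split [encoding convertto utf-8 $value] ""] {
        if {[regexp {^[A-Za-z0-9._~-]$} $byte]} {
            append out $byte
        } else {
            scan $byte %c code
            append out [format %%%02X $code]
        }
    }
    return $out
}

# Leave protocol, domain, path and file name alone; encode only the query
# values and join the parameters with the symbolic entity "&amp;".
proc escapeUrlAttribute {url} {
    # Split off a fragment ("#...") first so it is not treated as a value.
    set fragment ""
    set hash [string first "#" $url]
    if {$hash >= 0} {
        set fragment [string range $url $hash end]
        set url [string range $url 0 [expr {$hash - 1}]]
    }
    set pos [string first "?" $url]
    if {$pos < 0} {
        return $url$fragment
    }
    set base  [string range $url 0 [expr {$pos - 1}]]
    set query [string range $url [expr {$pos + 1}] end]
    set pairs {}
    foreach pair [split $query "&"] {
        set eq [string first "=" $pair]
        if {$eq < 0} {
            # Parameter without a value: nothing to encode.
            lappend pairs $pair
        } else {
            set name  [string range $pair 0 [expr {$eq - 1}]]
            set value [string range $pair [expr {$eq + 1}] end]
            lappend pairs "$name=[encodeQueryValue $value]"
        }
    }
    return "$base?[join $pairs {&amp;}]$fragment"
}

puts [escapeUrlAttribute "http://example.com/a.php?q=Gr\u00fc\u00dfe&lang=de"]
# prints http://example.com/a.php?q=Gr%C3%BC%C3%9Fe&amp;lang=de
```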
 
If the HTML file can be loaded via a parser into a DOM tree with XPath capabilities, then all text nodes can be referenced with the XPath expression:
 
//text()[string-length()>0]
 
The result should be a list of DOM text nodes that probably contain text to be converted.
 
To access all interesting HTML attributes via XPath, the following expressions may be used:
 
  • to get all attribute nodes: //@*
  • to get all (e.g. title) nodes with contents: //@title[string-length() > 0]
     
This way, only the relevant text inside an HTML document would be converted, which shortens the runtime of the conversion.
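
A minimal sketch of both selections, assuming the tDOM package is available (the suggestion does not name a specific parser). The sample HTML and the use of "string toupper" as a stand-in for the real conversion are made up for illustration. As far as I know, tDOM returns attribute matches such as //@title as name/value pairs rather than as node objects, so the sketch selects the owning elements and uses getAttribute/setAttribute instead.

```tcl
package require tdom

# Hypothetical sample document.
set html {<html><body><p title="greeting">Hello <img src="a.png" alt="a picture"> world</p></body></html>}

set doc  [dom parse -html $html]
set root [$doc documentElement]

# Text nodes: only non-empty ones are selected and converted.
foreach node [$root selectNodes {//text()[string-length() > 0]}] {
    # "string toupper" stands in for the real character conversion.
    $node nodeValue [string toupper [$node nodeValue]]
}

# Attributes: select the elements that carry the attribute, then rewrite it.
foreach attr {title alt} {
    foreach el [$root selectNodes "//*\[@$attr\]"] {
        $el setAttribute $attr [string toupper [$el getAttribute $attr]]
    }
}

# Serialize the modified tree back to text.
set serialized [$doc asHTML]
$doc delete
puts $serialized
```

Element names, the "src" value, and all markup stay untouched; only the selected text and attribute values pass through the conversion.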
     
At the end of the conversion, the DOM tree is serialized (or "pretty printed") back into text to be shown in the HTML view pane.

In the Tcl programming language, it could be written like this:
     
# Convert text nodes and the "alt" and "title" attributes.
foreach xpath {{//text()} {//@alt} {//@title}} {
    foreach node [$doc selectNode $xpath] {
        $node nodeValue [htmlEntities map -symbolic true text [$node nodeValue]]
    }
}

# URL attributes: only replace the "&" delimiter in the query part.
foreach xpath {{//@href} {//@src} {//@data}} {
    foreach node [$doc selectNode $xpath] {
        set uri [uri::split [$node nodeValue]]
        dict set uri query [string map {& &amp;} [dict get $uri query]]
        set url [uri::join {*}$uri]
        $node nodeValue $url
    }
}

set serialized [$doc asXML]
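
The htmlEntities command in the snippet above is not a built-in Tcl command and appears to be a placeholder. A minimal stand-in, assuming numeric entities are acceptable and using the made-up name toNumericEntities, could look like this:

```tcl
# Replace every character outside printable 7-bit ASCII with a numeric entity.
# Hypothetical helper; markup characters such as "<" and "&" are left alone.
proc toNumericEntities {text} {
    set out ""
    foreach ch [split $text ""] {
        scan $ch %c code
        if {$code > 126} {
            append out "&#$code;"
        } else {
            append out $ch
        }
    }
    return $out
}

puts [toNumericEntities "Gr\u00fc\u00dfe"]
# prints Gr&#252;&#223;e
```

A DOM serializer will typically escape "&" itself when it writes text nodes, so it is worth checking whether this step belongs before or after serialization.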
     
And if an HTML file is not parsable, then the former way of converting an HTML document could be used as a fallback.
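
The fallback could be sketched like this, assuming Tcl 8.6 or later for the try command. The proc names convertWithFallback, strictConvert, and legacyConvert are hypothetical; the two converters are dummies that only show the control flow.

```tcl
# Try the DOM-based conversion first; on any error use the former method.
# Both converters are passed in as command prefixes.
proc convertWithFallback {html domConvert legacyConvert} {
    try {
        return [{*}$domConvert $html]
    } on error {msg} {
        return [{*}$legacyConvert $html]
    }
}

# Dummy converters for illustration only.
proc strictConvert {html} {
    if {![string match "<*" $html]} {
        error "not parsable"
    }
    return "dom:$html"
}
proc legacyConvert {html} {
    return "legacy:$html"
}

puts [convertWithFallback "<p>ok</p>" strictConvert legacyConvert]
# prints dom:<p>ok</p>
puts [convertWithFallback "plain" strictConvert legacyConvert]
# prints legacy:plain
```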

comments

joanfusan wrote Apr 14, 2010 at 11:52 AM

Hi!

I'll think about it. Thanks! ;)

wrote Feb 2, 2013 at 3:09 AM