Opened 13 years ago

Closed 12 years ago

#253 closed defect (fixed)

improved getHTML

Reported by: troels_kn Owned by: gogo
Priority: normal Milestone: 2.0
Component: Plugin_Other Version:
Severity: normal Keywords:


I have been developing a plugin for htmlarea, and I'm considering porting it to Xinha, since the development of htmlarea has stalled.
While working on this plugin I also had to make a few patches/hacks to htmlarea, so I know its insides pretty well. I don't currently have the time to join Xinha development, but I have a few suggestions for improvements.
For one thing, I had to pull out well-formed XHTML content in a more efficient way. Using JavaScript to traverse the DOM and build the markup is quite CPU-hungry. The solution I came up with was to use an arsenal of regexes to correct the HTML output into XHTML. The speed improvement is considerable (more than 1:1000). I also made a function for correcting the indentation of source code. These functions could quite easily be integrated into Xinha by substituting HTMLArea.getHTML() / HTMLArea.getHTMLWrapper().

The source in question can be found by downloading the Indite plugin. You should use the file xml/XML_Utility.js and the functions XML_Utility.cleanHTML() and XML_Utility.indent(), which take the "raw" markup from editor.getInnerHTML(). Works with Mozilla and IE.

Attachments (9)

XML_Utility.js (10.7 KB) - added by troels_kn 13 years ago.
(4.4 KB) - added by wymsy 13 years ago.
GetHtml plugin
(3.6 KB) - added by wymsy 13 years ago.
GetHtml plugin (updated)
.2 (0 bytes) - added by wymsy 13 years ago.
GetHtml plugin (v3)
(3.6 KB) - added by wymsy 13 years ago.
GetHtml plugin (v3)
(3.7 KB) - added by wymsy 13 years ago.
GetHtml plugin (v4)
(3.8 KB) - added by wymsy 13 years ago.
GetHtml plugin (v5)
(3.7 KB) - added by wymsy 13 years ago.
GetHtml plugin (v5)
full_example-menu.html (7.6 KB) - added by mharrisonline 13 years ago.
Here is an updated example menu with this plugin and the rest that are in the folder


Change History (87)

Changed 13 years ago by troels_kn

comment:1 Changed 13 years ago by wymsy

A nice piece of work, but I have found a couple of problems. Testing in Firefox (haven't got to IE yet), regexp 00 not only lowercases tags and attribute names, it also lowercases all text content. Similarly, regexp 02 also puts quotes around text following an = sign in the text content. The program needs to be modified to only process tags and not the rest of the page contents.

Also, I had to modify the indent function to replace \n's with a space instead of removing them, otherwise words were running together.

comment:2 Changed 13 years ago by wymsy

I've got it working quite well now - changed the way I was calling the RE's so only tags are processed. I need to test it some more, especially look at how paths are handled by the different browsers, but if all goes well I'll post my code soon.
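A minimal sketch of the "only process tags" idea, assuming regexp 00 is the pattern listed in XML_Utility.RegExpCache further down. The helper names (toLower, lowercaseEverywhere, lowercaseTagsOnly) and the sample string are made up for illustration; the actual change is the replace() callback in the code posted in comment 3.

```javascript
// Regexp 00 from XML_Utility.RegExpCache: lowercases tag and attribute names.
var LOWERCASE_NAMES = /[< ]+([^= ]+)/g;

function toLower(match) {
  return match.toLowerCase();
}

// Naive use: the pattern runs over the whole document, text included.
function lowercaseEverywhere(html) {
  return html.replace(LOWERCASE_NAMES, toLower);
}

// Tags-only use: find each <...> chunk first, then clean inside it.
function lowercaseTagsOnly(html) {
  return html.replace(/<[^>]*>/g, function (tag) {
    return tag.replace(LOWERCASE_NAMES, toLower);
  });
}

var sample = '<P CLASS="Intro">Hello World, x = 1</P>';
console.log(lowercaseEverywhere(sample)); // text content is affected too
console.log(lowercaseTagsOnly(sample));   // only the markup changes
```

Attribute values that contain spaces can still be touched inside a tag, so this narrows the problem rather than removing it.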

Indent, which is a completely separate function, works beautifully in both browsers. I made a couple of tweaks, and I'll post that shortly also.

comment:3 Changed 13 years ago by wymsy

I've been testing this for over a week now, and it was looking good until I tried it with a Flash Movie plugin I have been working on. The plugin inserts an <object > tag, several <param /> tags, an <embed /> tag, and a </object> tag, in the usual way for Flash movies. It works well in HTMLArea 3 and in unmodified Xinha. The problem is that with the modified code below, the object and param tags tend to disappear, leaving only the embed tag! It happens after I load the editor with a page already containing the object tags. The editor loads correctly, which I verified with an alert at the end of initIframe, but then after anything that invokes innerHTML the object tag is gone.

I haven't been able to prove whether this is a problem in Xinha or in IE, or possibly just my copy of IE, so I am posting the code in case someone else would like to try it.

HTMLArea.getHTMLWrapper = function(root, outputRoot, editor, indent) {
  var html = "";
//  if(!indent) indent = '';
  switch (root.nodeType) {
    case 10:// Node.DOCUMENT_TYPE_NODE
    case 6: // Node.ENTITY_NODE
    case 12:// Node.NOTATION_NODE
      // these are all for the document type, probably not necessary
      break;

    case 2: // Node.ATTRIBUTE_NODE
      // Never get here, this has to be handled in the ELEMENT case because
      // of IE crapness requiring that some attributes are grabbed directly from
      // the attribute (nodeValue doesn't return correct values), see
      // for information
      break;

    case 4: // Node.CDATA_SECTION_NODE
      // Mozilla seems to convert CDATA into a comment when going into wysiwyg mode,
      //  don't know about IE
      html += (HTMLArea.is_ie ? ('\n' + indent) : '') + '<![CDATA[' + root.data + ']]>' ;
      break;

    case 5: // Node.ENTITY_REFERENCE_NODE
      html += '&' + root.nodeValue + ';';
      break;

    case 7: // Node.PROCESSING_INSTRUCTION_NODE
      // PI's don't seem to survive going into the wysiwyg mode, (at least in moz)
      // so this is purely academic
      html += (HTMLArea.is_ie ? ('\n' + indent) : '') + '<?' + root.target + ' ' + root.data + ' ?>';
      break;

    case 1: // Node.ELEMENT_NODE
    case 11: // Node.DOCUMENT_FRAGMENT_NODE
    case 9: // Node.DOCUMENT_NODE
    {
      var closed;
      var i;
      var root_tag = (root.nodeType == 1) ? root.tagName.toLowerCase() : '';
   //   if (root_tag == 'br' && !root.nextSibling)
   //     break;
      if (outputRoot)
        outputRoot = !(editor.config.htmlRemoveTags && editor.config.htmlRemoveTags.test(root_tag));

      if (outputRoot) {
        closed = (!(root.hasChildNodes() || HTMLArea.needsClosingTag(root)));
        html += "<" + root.tagName.toLowerCase();
        var attrs = root.attributes;
        for (i = 0; i < attrs.length; ++i) {
          var a = attrs.item(i);
          if (!a.specified && !(root.tagName.toLowerCase().match(/input|option/) && a.nodeName == 'value')) {
            continue;
          }
          var name = a.nodeName.toLowerCase();
          var value;
          if (name != "style") {
            if (typeof root[a.nodeName] != "undefined" && name != "href" && name != "src" && !/^on/.test(name)) {
              value = root[a.nodeName];
            } else {
              value = a.nodeValue;
              // IE seems not willing to return the original values - it converts to absolute
              // links using a.nodeValue, a.value, a.stringValue, root.getAttribute("href")
              // So we have to strip the baseurl manually :-/
              if (HTMLArea.is_ie && (name == "href" || name == "src")) {
                value = editor.stripBaseURL(value);
              }
            }
          } else { // IE fails to put style in attributes list
            // FIXME: cssText reported by IE is UPPERCASE
            value = root.style.cssText;
          }
          html += " " + name + '="' + HTMLArea.htmlEncode(value) + '"';
        }
        if (html != "") {
          html += closed ? " />" : ">";
        }
      }
      // clean the browser's own serialization tag by tag instead of walking the DOM
      html += editor.getInnerHTML().replace(/<[^>]*>/gi, function($1){return XML_Utility.cleanHTML($1,false)});
      if (outputRoot && !closed) {
        html += "</" + root.tagName.toLowerCase() + ">";
      }
      html = XML_Utility.indent(html);
      break;
    }

    case 3: // Node.TEXT_NODE
      html = /^script|style$/i.test(root.parentNode.tagName) ? root.data : HTMLArea.htmlEncode(root.data);
      break;

    case 8: // Node.COMMENT_NODE
      html = "<!--" + root.data + "-->";
      break;
  }
  return html;
};

XML_Utility = {};

XML_Utility.RegExpCache = [
/*00*/ // new RegExp().compile(/[< ]+([^= ]+)/gi),//lowercase tags/attribute names DOESN'T WORK!!! lowercases content also!!
/*00*/  new RegExp().compile(/[< ]+([^= ]+)/gi),//lowercase tags/attribute names DOESN'T WORK!!! lowercases content also!!
/*01*/  new RegExp().compile(/(\S*\s*=\s*)?_moz[^=>]*(=\s*[^>]*)?/gi),//strip _moz attributes
/*02*/  new RegExp().compile(/\s*=\s*(['"])?(([^>" ]| (?=[^"=]+['"]))+)\1?/gi),//add attribute quotes
/*03*/  new RegExp().compile(/\/>/g),//strip singlet terminators
/*04*/  new RegExp().compile(/<(br|hr|img|input|link|meta|param|embed)([^>]*)>/g),//terminate singlet tags
/*05*/  new RegExp().compile(/(checked|compact|declare|defer|disabled|ismap|multiple|no(href|resize|shade|wrap)|readonly|selected)/gi),//expand singlet attributes
/*06*/  new RegExp().compile(/(="[^']*)'([^'"]*")/),//check quote nesting
/*07*/  new RegExp().compile(/&(?=[^<]*>)/g),//expand query ampersands
/*08*/  new RegExp().compile(/<\s+/g),//strip tagstart whitespace
/*09*/  new RegExp().compile(/\s+(\/)?>/g),//trim whitespace
/*10*/  new RegExp().compile(/\s{2,}/g),//trim extra whitespace
/*11*/  new RegExp().compile(/&\w*;/g),
/*12*/  new RegExp().compile(/^<body>\s*/gi),
/*13*/  new RegExp().compile(/\s*<\/body>/gi),
/*14*/  new RegExp().compile(/<\/?(div|p|h[1-6]|table|tr|td|th|ul|ol|li|blockquote|object|br|hr|img|embed|param)[^>]*>/g),
/*15*/  new RegExp().compile(/<\/(div|p|h[1-6]|table|tr|td|th|ul|ol|li|blockquote|object)( [^>]*)?>/g),//blocklevel closing tag
/*16*/  new RegExp().compile(/<(div|p|h[1-6]|table|tr|td|th|ul|ol|li|blockquote|object)( [^>]*)?>/g),//blocklevel opening tag
/*17*/  new RegExp().compile(/<(br|hr|img|embed|param)[^>]*>/g)//singlet tag
];

/**
  * Cleans HTML into wellformed xhtml
  * A much faster way of retrieving the html-source of the document than the default supplied by HtmlArea
  * mishoo should feel free to copy this to the main distribution
  * credits goes to adios, who helped me out with this one :
  */
XML_Utility.cleanHTML = function(sHtml, bReplaceEntities) {
        var c = XML_Utility.RegExpCache;

        sHtml = sHtml.
                replace(c[0], function($1) { return $1.toLowerCase(); } ).//lowercase tags/attribute names
                replace(c[1], ' ').//strip _moz attributes
                replace(c[2], '="$2"').//add attribute quotes
                replace(c[3], '>').//strip singlet terminators
                replace(c[9], '$1>').//trim whitespace
                replace(c[4], '<$1$2 />').//terminate singlet tags
                replace(c[5], '$1="$1"').//expand singlet attributes
                replace(c[6], '$1$2').//check quote nesting
                replace(c[7], '&amp;').//expand query ampersands
                replace(c[8], '<').//strip tagstart whitespace
                replace(c[10], ' ');//trim extra whitespace
        if ((typeof(bReplaceEntities) == "boolean") ? bReplaceEntities : true) { // fix entities ? default = yes
                return XML_Utility.replaceEntities(sHtml);
        }
        return sHtml;
};

/**
  * Prettifies html by inserting linebreaks before tags, and indenting blocklevel tags
  * @todo    linebreaks are not preserved in preformatted tags, which likely will cause trouble.
  *          some unmotivated extra whitespace ends up at the end of lines. not really a problem, but
  *          annoying none the less.
  */
XML_Utility.indent = function(s, sindentChar) {
        XML_Utility.__nindent = 0;
        XML_Utility.__sindent = "";
        XML_Utility.__sindentChar = (typeof sindentChar == "undefined") ? "  " : sindentChar;
        var c = XML_Utility.RegExpCache;
        s = s.replace(/[\n\r]/gi, " ").replace(/\s+/gi," ").replace(c[14], function($1) {
                        if ($1.match(c[16])) {
                                var s = "\n" + XML_Utility.__sindent + $1;
                                // blocklevel openingtag - increase indent
                                ++XML_Utility.__nindent;
                                XML_Utility.__sindent += XML_Utility.__sindentChar;
                                return s;
                        } else if ($1.match(c[15])) {
                                // blocklevel closingtag - decrease indent
                                --XML_Utility.__nindent;
                                XML_Utility.__sindent = "";
                                for (var i=XML_Utility.__nindent;i>0;--i) {
                                        XML_Utility.__sindent += XML_Utility.__sindentChar;
                                }
                                return "\n" + XML_Utility.__sindent + $1;
                        } else if ($1.match(c[17])) {
                                // singlet tag
                                return "\n" + XML_Utility.__sindent + $1;
                        }
                        return $1; // this won't actually happen
        });
        if (s.charAt(0) == "\n") {
                return s.substring(1, s.length);
        }
        return s;
};

It looks to me like a problem in IE in innerHTML, but I have not found any references to any known bugs like this. So it might just be something messed up in my PC. I did reinstall IE, with no effect. I can't find anything in Xinha to explain it, either. If anyone can reproduce the problem, or fail to, I'd be interested to hear.

comment:4 Changed 13 years ago by anonymous

This is one of the problems that ticket 287 documents and fixes, at least for older versions of Xinha.

comment:5 Changed 13 years ago by wymsy

It's actually not quite the same. Ticket 287 fixes the problem of embed tags being lost when constructing the html from the DOM. What I'm seeing is the object and param tags being lost when using innerHTML.

Upon further testing, I am finding that I have the same problem in Xinha (version 193) with the regular getHTMLWrapper function, modified slightly with the essence of the 287 fix. So it has nothing to do with the code in this ticket, and it still may turn out to be some weirdness in my PC.

comment:6 Changed 13 years ago by mharrisonline

Wow, it works great! When I tested the example above in IE6 it improved the code for Flash and made it easier to read, and preserved the code for scripting and noscript.

The only problem I saw was that when you are in full HTML mode the body tag becomes:

<body contenteditable="true">

comment:7 Changed 13 years ago by mharrisonline

One other possible problem: I had previously noticed that, with the current download, if I replaced HTMLArea.getHTMLWrapper with the one I had submitted in ticket 287, the fix in ticket 127 no longer made HTMLArea.htmlEncode work. I was able to add the fixes from 287 to the current download's HTMLArea.getHTMLWrapper, and then the fix in ticket 127 was again able to convert symbols to HTML entities.

The same thing happens with this code: it makes the souped-up HTMLArea.htmlEncode in ticket 127 unable to capture symbols from the CharacterMap plugin and convert them to entities. Some part of this modification probably needs to be updated to allow HTMLArea.htmlEncode to work properly again (at least when 127 is applied).

Except for that, I definitely like this better than what I submitted in 287.

comment:8 Changed 13 years ago by wymsy

The code submitted by troels_kn at the beginning of this ticket includes an encoding utility similar to ticket 127's which, for simplicity's sake, I did not include in my tests (yet). But the hooks are there, just change the second parameter passed to cleanHTML to true and copy the replaceEntities utility into xinha.js.

comment:9 Changed 13 years ago by mharrisonline

Hmmm, I tried what you described, but can't get it to keep the symbols as entities. I noticed another problem: you get a JavaScript error (line 4241, character 9) when you try to undo.

comment:10 Changed 13 years ago by mharrisonline

To be more exact, undo works with the code posted above on Jun 9, but if you go to Full Page mode, undo no longer works, and the body node becomes <body contenteditable="true">.

comment:11 Changed 13 years ago by wymsy

Ah, well, I just took a closer look at the two encoding functions, and they are not at all the same. The one in this ticket just translates named entities to the numeric representation, where the one in ticket 127 translates the actual character into the named entity. So to preserve symbols we need ticket 127. The encoding function in this ticket doesn't add anything particularly useful.
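The difference between the two encoders can be sketched roughly as below. The lookup tables are tiny hypothetical stand-ins with three entries each, and the function names namedToNumeric and charToNamed are invented for this illustration; neither is the code from this ticket or from ticket 127.

```javascript
// Direction used by the encoder in this ticket: named entity -> numeric entity.
var NAMED_TO_NUMERIC = { "&copy;": "&#169;", "&nbsp;": "&#160;", "&eacute;": "&#233;" };

function namedToNumeric(html) {
  return html.replace(/&\w+;/g, function (entity) {
    return NAMED_TO_NUMERIC.hasOwnProperty(entity) ? NAMED_TO_NUMERIC[entity] : entity;
  });
}

// Direction used by ticket 127: literal character -> named entity.
var CHAR_TO_NAMED = { "\u00a9": "&copy;", "\u00a0": "&nbsp;", "\u00e9": "&eacute;" };

function charToNamed(html) {
  return html.replace(/[\u00a0-\uffff]/g, function (ch) {
    return CHAR_TO_NAMED.hasOwnProperty(ch) ? CHAR_TO_NAMED[ch] : ch;
  });
}

console.log(namedToNumeric("&copy; 2005")); // rewrites an existing entity
console.log(namedToNumeric("\u00a9 2005")); // a literal symbol passes through untouched
console.log(charToNamed("\u00a9 2005"));    // a literal symbol becomes an entity
```

Only the second direction helps when the browser has already turned an entity into a literal character, which is the CharacterMap case described above.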

comment:12 Changed 13 years ago by mharrisonline

Do you think this could be made to work in Full Page Mode too?

comment:13 Changed 13 years ago by wymsy

One more regexp to strip out the contenteditable=true attribute would be a good place to start. I don't know if that would make undo work, but it's possible.
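Something along these lines might do as a starting point. It is only a sketch: the function name stripContentEditable is made up, the pattern is not the one that ended up in the plugin, and it assumes the attribute sits on the body tag.

```javascript
// Remove a contenteditable attribute from the <body> tag only.
// Handles double-quoted, single-quoted and unquoted values.
function stripContentEditable(html) {
  return html.replace(/<body\b[^>]*>/i, function (bodyTag) {
    return bodyTag.replace(/\s+contenteditable\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/i, "");
  });
}

console.log(stripContentEditable('<body contenteditable="true"><p>text</p></body>'));
console.log(stripContentEditable('<body class="x" contentEditable=true bgcolor="#fff">'));
```

Limiting the inner replace to the body tag keeps the word "contenteditable" safe if it happens to appear in the page text.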

comment:14 Changed 13 years ago by mharrisonline

I'm feeling pretty regex challenged. I've been trying this for days, and no matter what I do, either contenteditable="true" reappears, or I get the message that I messed up the DOM.

The current HTMLArea.getHTMLWrapper catches contenteditable with:

  if (/(_moz)|(contenteditable)|(_msh)/.test(name)) {
          // avoid certain attributes
          continue;
  }

but even if I restore those lines it still keeps happening.

comment:15 Changed 13 years ago by anonymous

Great work on this guys, it's looking very promising!

comment:16 Changed 13 years ago by mharrisonline

Does anybody have a clue how to make this work in full page mode?

comment:17 Changed 13 years ago by mharrisonline

Whoops! It looks like this works fine (except for contenteditable in the body). Undo in Full Page mode is completely broken in Xinha, period; it has nothing to do with this at all.

comment:18 Changed 13 years ago by mharrisonline

...and it does bypass HTMLArea.htmlEncode, preventing the fix in ticket 127 from converting characters to HTML entities.

comment:19 Changed 13 years ago by gogo

  • Milestone changed from Version 1.0 to 2.0

I'm going to bump this to version 2.0, I'm not keen on making such a large modification to core functionality just now.

comment:20 Changed 13 years ago by mharrisonline

I did figure out how to make this work with ticket 127, after all. This new code already takes care of the simple replacements for < and >, etc., so I altered the HTMLArea.htmlEncode in ticket 127 by removing all the original regex expressions and leaving just the Latin, Greek, math, etc. I then used the HTMLArea.htmlEncode function on the final output, which normally would have turned the < and > symbols in the HTML into entities.

comment:21 Changed 13 years ago by mharrisonline

Wymsy's modification above to preserve Flash code works great for that purpose, and I had noticed that it also doesn't empty the Script node like the normal HTMLArea.getHTMLWrapper function does.

However, after testing this to see how it handles JavaScript in the code, I've found that because it isn't preserving formatting in script nodes, it causes unterminated string errors, etc. So, as it is right now, it isn't something that can be used with content that contains scripting. Also, in a case-sensitive Linux environment it can be problematic when more than just tags are being converted to lowercase.

comment:22 Changed 13 years ago by wymsy

The formatting is done in the indent() function, separate from cleaning the tags. You might try commenting that line (html = XML_Utility.indent(html);) out to see if scripts work better that way. I'm looking at changing the indent function to not strip line breaks inside script and pre tags. I'll report back when I get that working.
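One possible way to keep line breaks inside script and pre blocks is to lift those blocks out before formatting and put them back afterwards. The sketch below assumes that approach; indentPreserving, the @@KEEPn@@ placeholder and the collapseWhitespace stand-in are all hypothetical, and this is not necessarily how the plugin ended up doing it.

```javascript
// Run a whitespace-destroying formatter while leaving <pre> and <script>
// blocks exactly as they were.
function indentPreserving(html, formatFn) {
  var kept = [];
  // Swap each protected block for a placeholder token.
  var masked = html.replace(/<(pre|script)\b[^>]*>[\s\S]*?<\/\1>/gi, function (block) {
    kept.push(block);
    return "@@KEEP" + (kept.length - 1) + "@@";
  });
  var formatted = formatFn(masked);
  // Put the untouched blocks back.
  return formatted.replace(/@@KEEP(\d+)@@/g, function (token, index) {
    return kept[Number(index)];
  });
}

// Stand-in for XML_Utility.indent(): collapses every whitespace run.
function collapseWhitespace(s) {
  return s.replace(/\s+/g, " ");
}

console.log(indentPreserving("<p>a\n   b</p>\n<pre>x\n  y</pre>", collapseWhitespace));
```

The placeholder must not contain whitespace or angle brackets, otherwise the formatter could alter it and the blocks would not be restored.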

comment:23 Changed 13 years ago by wymsy

  • Component changed from Xinha Core to Plugin_Other

I've done some more work on this, and I now have a version that I think takes care of all the issues noted above. To make it easy for others to try, I have packaged it as a plugin and attached it to this ticket.

Plugin features:

  • Much faster than HTMLArea.getHTML
  • Eliminates the hacks to accommodate browser quirks
  • Returns correct code for Flash objects and scripts
  • Formats html in an indented, readable format in html mode
  • Preserves script and pre formatting
  • Removes contenteditable from body tag in full-page mode
  • Does not require stripBaseURL()
  • Includes the expanded htmlEncode() function from ticket 127

It works well in my application, which does not use full-page and does not require stripBaseURL(). However, the limited testing I have done in those areas leads me to believe that stripBaseURL() is not needed ever, and the special requirements of full-page are handled.

I encourage others to try the plugin. If no other insurmountable issues are uncovered, eventually this could be integrated into the core htmlarea.js.

Changed 13 years ago by wymsy

GetHtml plugin

comment:24 Changed 13 years ago by niko

looks nice! It creates valid XHTML-code!! amazing :D
and much simpler than the old getHTML
And it is really nice as a plugin - so we can include it into xinha and people can test it - without dumping the tested and working old getHTML-functions.

a few things i noticed: (using Firefox)

  • the expand singlet attributes is buggy, try the following html-code:
      <option value="1">asdf"</option>
      <option value="1" selected="selected">

cleanHTML turns that into:

<select><option value="1">asdf"</option><option value="1"="selected="selected"=" selected="selected"="selected="selected"">asdf</option></select> 
  • the stripBaseURL-function is missing (as you pointed out already) - the function IS necessary! (at least for me :D)

probably using such reg-exprs you could call the stripBaseURL-functions:


(there are probably better ones, i'm not that good in regexpr-writing :D)

  • the expand query ampersands is buggy:
    <a href="blah?param&otherparam">

gets converted into

<a href="blah?param&amp;otherparam">
  • this html-code
    <a href="asdf" onclick="'asdfadf')"">asdf</a>

gets converted into

<a href="asdf" onclick="try{if(document.designMode" && document.designmode="='on') return false;}catch(e){}'asdfadf')"">asdf</a>
                                                 ^^^                           ^^^
  • and the last thing: imho the htmlEncode-function isn't necessary - with the right encoding all these characters should be saved correctly.
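The "expand singlet attributes" problem from the first point can be reproduced in isolation. In the sketch below the naive pattern is a shortened version of regexp 05 (fewer attribute names), and the guarded pattern is only one hypothetical way to avoid the double expansion, not the fix that went into the plugin.

```javascript
var BOOLEAN_ATTRS = "checked|disabled|multiple|readonly|selected";

// Same idea as regexp 05: matches the attribute name wherever it occurs.
var NAIVE = new RegExp("(" + BOOLEAN_ATTRS + ")", "gi");

// Guarded variant: the name must follow whitespace, must not already be
// followed by "=", and must end at whitespace, "/" or ">".
var GUARDED = new RegExp("(\\s)(" + BOOLEAN_ATTRS + ")(?!\\s*=)(?=[\\s/>])", "gi");

function expandNaive(tag) {
  return tag.replace(NAIVE, '$1="$1"');
}

function expandGuarded(tag) {
  return tag.replace(GUARDED, '$1$2="$2"');
}

var alreadyExpanded = '<option value="1" selected="selected">';
console.log(expandNaive(alreadyExpanded));   // mangles the attribute
console.log(expandGuarded(alreadyExpanded)); // leaves it alone
console.log(expandGuarded('<option value="1" selected>'));
```

The guarded pattern can still be fooled by an attribute value such as title="not selected yet", so quoted values would need to be skipped as well in a complete solution.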

comment:25 Changed 13 years ago by wymsy

Niko, thanks for the feedback. Here is another version to try. Changes made:

  • Fixed the regexp for expand singlet attributes.
  • Added the stripBaseURL function. Now behaves the same as unmodified Xinha.
  • Removed expand query ampersands. Now behaves the same as unmodified Xinha - the &amp; appears in html view, but reverts to & on output.
  • Fixed the problem with onclick. This was coming from the inwardHTML and outwardHTML functions. The regexps were modifying the string and preventing outwardHTML from matching it. Fixed with a patch to outwardHTML.
  • Took out the htmlEncode function. For those who feel they need it, probably best to implement it as a separate plugin.

Changed 13 years ago by wymsy

GetHtml plugin (updated)

Changed 13 years ago by wymsy

GetHtml plugin (v3)

Changed 13 years ago by wymsy

GetHtml plugin (v3)

comment:26 Changed 13 years ago by niko

thanks! almost everything is fixed :D

  • the onclick=" isn't working perfectly yet; give the Linker-Plugin a try, it will insert html-code like this:
    <a onclick=", 'popupwindow',  'toolbar=yes,scrollbars=yes,resizeable=yes');return false;" title="" target="popup" href="">consectetuer</a>

it gets a bit messed up into this:

<a onclick=", 'popupwindow'," 'toolbar="yes,scrollbars=yes,resizeable=yes);return false;"" title="" target="popup" href="">consectetuer</a>
  • you use stripBaseURL only in ie (as in the original getHTML) - as Mozilla doesn't make absolute URLs out of relative.

BUT, try this:

<a href="/test.html">foo</a>

it will be converted into

<a href="http://thedomain/test.html">foo</a>

when baseURL = "http://thedomain"; it will be stripped again (just like IE)

so imho you should use stripBaseURL for Mozilla too, what do you think?
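To make the semi-absolute case concrete, here is a rough sketch of stripping a base URL from href and src values. The helpers stripBase and stripBaseFromLinks are hypothetical and much simpler than Xinha's own stripBaseURL; they only illustrate the round trip described above, using the example domain from this comment.

```javascript
// Turn "http://thedomain/test.html" back into "/test.html".
function stripBase(url, baseUrl) {
  // Ignore a trailing slash so both spellings of the base behave alike.
  var base = baseUrl.replace(/\/$/, "");
  if (url.indexOf(base + "/") === 0) {
    return url.substring(base.length);
  }
  return url; // different host, or already relative
}

// Apply stripBase to every double-quoted href/src attribute in the markup.
function stripBaseFromLinks(html, baseUrl) {
  return html.replace(/(\s(?:href|src)=")([^"]*)(")/gi, function (match, pre, url, post) {
    return pre + stripBase(url, baseUrl) + post;
  });
}

console.log(stripBaseFromLinks('<a href="http://thedomain/test.html">foo</a>', "http://thedomain"));
```

Running the same step in every browser gives the same output whether or not the browser expanded the link in the first place.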

comment:27 Changed 13 years ago by derekcopelin@…


Is there a way of altering this plug in so that it ignores particular tags? I previously used php to insert a style sheet external link to replicate the format used on the site in the editor and then stripped it with php on save. At the moment it is being stripped out by the plug in and I can't see exactly where to change it.



comment:28 Changed 13 years ago by wymsy

Hmmm, this onclick thing could get messy. It's really more general than that, what we really need to do is isolate all onxxxx="(javascript)" event handlers and pass them through unmodified.
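A rough sketch of that isolation step: swap each quoted onxxxx handler for a placeholder, run the cleaning regexps, then swap the handlers back. The function name, the __handlerN__ token and the lowercasing stand-in for the cleaner are all assumptions made for this illustration.

```javascript
// Clean one tag while passing quoted on* event handlers through unmodified.
function cleanTagKeepingHandlers(tag, cleanTag) {
  var saved = [];
  // Mask: replace each handler attribute with a numbered token.
  var masked = tag.replace(/\son\w+\s*=\s*("[^"]*"|'[^']*')/gi, function (handler) {
    saved.push(handler);
    return " __handler" + (saved.length - 1) + "__";
  });
  var cleaned = cleanTag(masked);
  // Unmask: restore the original handler text, leading whitespace included.
  return cleaned.replace(/ __handler(\d+)__/g, function (token, index) {
    return saved[Number(index)];
  });
}

// Stand-in for the real cleaner: lowercases the whole tag.
function lowercaseTag(tag) {
  return tag.toLowerCase();
}

console.log(cleanTagKeepingHandlers('<A ONCLICK="Foo(\'Bar=Baz\')" HREF="x.html">', lowercaseTag));
```

The token is lowercase and contains no quotes, spaces or equals signs, so the kinds of regexps discussed in this ticket should leave it alone.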

As for the semi-absolute links, that's getting complicated, too. I'm seeing some of the same behavior in standard xinha, but I haven't quite figured out what's going on yet.

Comments and suggestions welcome....

comment:29 Changed 13 years ago by wymsy

Derek, if you use the xinha_config.pageStyle or xinha_config.pageStyleSheets configuration option, the style sheet won't be visible to the plugin and won't be in the saved content, so stripping isn't a problem. (This assumes you are not using full-page mode.)

comment:30 Changed 13 years ago by niko

the stripBaseURL stuff is perfect now! It even works with Semi-Absolute-Links and Relative Links in both IE and FF (at least what my limited testing showed)

Mozilla fixes the links in fixRelativeLinks - so it is not needed in getHTML again.

....and it would be a killer-feature to leave php-tags alone! i hope it is possible :D

comment:31 Changed 13 years ago by niko

wow, these regexp's are difficult!
i didn't know that it is possible to use \1 within an expression!
(phps preg_match doesn't support that :( )

my suggestion for the onxxx-problem is to divide the add attribute quotes into two reg-exprs. one that fixes <tag prop=val> and <tag selected> (which doesn't effect the onxxx-properties)

and one reg-expr that looks for <tag prop="blah"> - where you don't have to use the space in [>" ] to get the end value (which is currently the problem i think)

....and for the php-code-problem: is it enough to check for <\?.*\?>$ in cleanHTML? if it matches just return the tag as it is! Or does the browser then mess up the code? the same possible for javascript-code within the html-code?
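The php check suggested here could look roughly like the sketch below. cleanTagsSkippingPhp is a made-up wrapper and the lowercasing cleaner is a stand-in; whether the browser keeps the php block intact before this code ever sees it is a separate question.

```javascript
// Clean every tag except <? ... ?> blocks, which are returned as they are.
function cleanTagsSkippingPhp(html, cleanTag) {
  // The php alternative comes first so a ">" inside the block does not end the match early.
  return html.replace(/<\?[\s\S]*?\?>|<[^>]*>/g, function (chunk) {
    if (/^<\?[\s\S]*\?>$/.test(chunk)) {
      return chunk; // php block: pass through untouched
    }
    return cleanTag(chunk);
  });
}

// Stand-in cleaner: lowercases the tag.
function lowercaseChunk(tag) {
  return tag.toLowerCase();
}

console.log(cleanTagsSkippingPhp("<P><?php echo $Obj->Name; ?></P>", lowercaseChunk));
```

The same pattern could be extended with further alternatives for other blocks that should not be touched.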

comment:32 Changed 13 years ago by wymsy

Niko, did you change something to get stripBaseUrl working? I didn't...?

comment:33 Changed 13 years ago by mharrisonline

I noticed that when you use this in IE with the Full Page plugin, body attributes are removed. Also, this JavaScript error popped up when the editor initialized:

Line: 155
Error: Could not set the innerHTML property. Invalid target element for this operation.

this._iframe.contentWindow.document.documentElement.innerHTML = this.inwardHtml(this._textArea.value);

comment:34 Changed 13 years ago by niko

to get the semi-absolute-links working i added a config.baseHref = ''; (note: no slash at the end! if you add a slash at the end, you will get relative links)

that's all i have changed (and this was needed for the original xinha without the getHTML-plugin too)

comment:35 Changed 13 years ago by wymsy

I have uploaded another revision, v4 above. Improvements since v3:

  • fixed the onxxxx= problem. (modified regexps c[0] and c[2], and added c[11])
  • added code to ignore php tags.... but they still get stripped out somewhere else :(
  • eliminated the javascript error in full-page mode.
  • cleaned up some code for readability.

I don't know why body attributes are being stripped - it's not in this code. The other problem in full-page mode is that list items have no closing tags (</li>) in IE except for the last item in the list. The code that generated the error message was intended to fix that, but it doesn't work (obviously). It's not fatal, because browsers generally render the lists ok, but it's not right either.

Other than these issues, can anyone find any other problems?

Changed 13 years ago by wymsy

GetHtml plugin (v4)

comment:36 Changed 13 years ago by wymsy

Well, in my previous post I was half right on a couple of things. The attributes in body tags are being stripped out in Firefox before getting to this function. But in IE it was due to a subtle bug in the contenteditable regular expression, now fixed (v5 above).

Likewise, php tags are removed in FF, but not in IE. However, even in IE line breaks are lost.

Changed 13 years ago by wymsy

GetHtml plugin (v5)

comment:37 Changed 13 years ago by wymsy

Hmmm, did a little more testing and now body attributes are not getting stripped in Firefox either! I must have set something up wrong....

mharrisonline, can you confirm that it now works in full-page mode for you (in both browsers)? If so, then the only outstanding issue I know of is the </li> problem in IE. (I don't use php, so I won't be putting a lot of effort into making that work. And it goes in the "enhancement" category anyway.)

(btw, I may have left a couple of debugging alerts in the latest version, in lines 58 and 77.)

comment:38 Changed 13 years ago by wymsy

I fixed the </li> problem in IE, the right way this time, so it works in full-page mode also and doesn't rely on a mysterious hack to initIframe(). Attachment v5 above....

Changed 13 years ago by wymsy

GetHtml plugin (v5)

comment:39 Changed 13 years ago by wymsy

Oops, it's actually GetHtml?

comment:40 Changed 13 years ago by mharrisonline

With tonight's Xinha download and GetHtml used with the fullpage plugin, there is a closing li tag appearing in the head, right after the opening head tag. This only happens in IE. If there is a title, the </li> appears after the title.

Verbose script tags work fine, attributes are kept as well as formatting.

Noscript works fine.

Flash works fine as before, although the object parameters

<param name="_cx" value="12250" />
<param name="_cy" value="6641" />

are kind of odd. They don't seem to hurt anything though.

I didn't implement the full page plugin because I've seen too much background color abuse in online courses where the system provided easy access to changing the color. Absolutely unreadable pages with hideous dark dark brown backgrounds with black text seemed to be the most popular. So I only use the this.fullPage parameter set to true. I'm happy to report that everything I tried with the full page plugin worked the same without it and full page set to true.

This is a huge improvement over the current Xinha getHTML.

comment:41 Changed 13 years ago by mharrisonline

I think that a challenge for the future is to make body tags and body events not be corrupted in either browser.

Also there is the issue of allowing javascript to exist in the code. If a document.write statement exists, it will immediately write into the source code.

I handle that in my implementation by changing the word javascript to freezescript, onload to onplaceholder, and convert <script> to <script language="FreezeScript" type="text/freezescript">. On submit I change everything back. It works perfectly, although it could be confusing to someone viewing the source.
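A rough sketch of that freeze/unfreeze round trip, using the replacement words given above. The function names freezeScripts and unfreezeScripts are invented, and the matching is deliberately case-sensitive so that the reverse step restores the input exactly; the real implementation may differ.

```javascript
var FROZEN_SCRIPT_TAG = '<script language="FreezeScript" type="text/freezescript">';

// Before loading into the editor: make scripts and handlers inert.
function freezeScripts(html) {
  return html
    .replace(/<script>/g, FROZEN_SCRIPT_TAG)
    .replace(/javascript/g, "freezescript")
    .replace(/onload/g, "onplaceholder");
}

// On submit: undo the substitutions in the opposite order.
function unfreezeScripts(html) {
  return html
    .split(FROZEN_SCRIPT_TAG).join("<script>")
    .replace(/freezescript/g, "javascript")
    .replace(/onplaceholder/g, "onload");
}

var page = '<body onload="init()"><script>document.write("hi")</script>' +
           '<a href="javascript:void(0)">x</a></body>';
var frozen = freezeScripts(page);
console.log(frozen);
console.log(unfreezeScripts(frozen) === page);
```

Content that already contains the words freezescript or onplaceholder would be changed on the way back, so the placeholder words need to be ones that never occur in real pages.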

comment:42 Changed 13 years ago by mharrisonline

Body attributes in full page mode are working normally. Currently in core Xinha the body tag loses all attributes in Firefox when the full page config is set to true and the full page plugin is not used. This is a bug that was introduced a few months ago into core Xinha, and separately into the full-page plugin. It was fixed for the full-page plugin only; core Xinha is still broken.

comment:43 Changed 13 years ago by wymsy

The stray </li> tag when using the FullPage plugin comes from one of the regexps erroneously picking up a <link> tag in the head. Change line 83 to

	sHtml = sHtml.replace(/<li( [^>]*)?>/g,'</li><li$1>').

to fix it.
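The effect of the stricter pattern can be checked in isolation. In this sketch STRICT is the regexp from the line above, while LOOSE is only a guess at what a pattern that also catches <link> might have looked like; the earlier plugin code is not shown in this ticket.

```javascript
var LOOSE = /<li([^>]*)>/g;    // hypothetical: "<li" followed by anything
var STRICT = /<li( [^>]*)?>/g; // from the fix: "<li" then ">" or a space

function closeListItems(html, pattern) {
  return html.replace(pattern, "</li><li$1>");
}

var head = '<head><link rel="stylesheet" href="a.css" /></head>';
var list = '<ul><li>a<li class="x">b</ul>';

console.log(closeListItems(head, LOOSE));  // a stray </li> lands in the head
console.log(closeListItems(head, STRICT)); // head is left alone
console.log(closeListItems(list, STRICT)); // each <li> gets a closing tag before it
```

The replacement also puts a </li> in front of the first item, so the surrounding code still has to tidy that up; how the plugin does so is not shown here.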

comment:44 Changed 13 years ago by mharrisonline

It works perfectly!

comment:45 Changed 13 years ago by niko

you could probably commit this plugin - so more people would test it

comment:46 Changed 13 years ago by wymsy

Committed in changeset 345.

Also added support for only7BitPrintablesInURLs, and tidied up the code a bit more.

comment:47 Changed 13 years ago by kimss

Pretty amazing speed it seems. I just did a benchmark test from your page with which is a load of HTML - too much for a "live example", but not far-fetched if you paste from Word...

Anyway, the test results are:

Test A : 10946 ms
Test B : 3324 ms
Test C : 361 ms

Mind you, I'm sitting on a lousy 1.2 GHz AMD... However, I'm not the only one with a slow, old-fashioned CPU. I have quite a few clients who have a lot of bloated code in their WYSIWYG areas, which on my machine takes 4-5 seconds to complete when pressing submit - now I know why :D

I will absolutely activate this plugin pronto! Great work, interesting reading the sitepoint thread.

Changed 13 years ago by mharrisonline

Here is an updated example menu with this plugin and the rest that are in the folder

comment:48 Changed 13 years ago by mharrisonline

I attached an updated full_example-menu.html that has this plugin, and some other new ones that are in the plugins folder.

comment:49 Changed 13 years ago by mharrisonline

New bug in checked-in version of get-html.js...

The head now looks like this:

  <head><link id="IA-style" href="function(str,p1,p2,p3){return this.stripBaseURL(p3)}" rel="stylesheet" />

Something was changed that made it not work with the full-page plugin

comment:50 Changed 13 years ago by wymsy

Sorry about that, another IE quirk. It's fixed now.

comment:51 Changed 13 years ago by mharrisonline

Hmmm... I wonder how a plugin using the extended encode function to fix the problem of IE corrupting HTML entities could work both with and without this, since this plugin no longer calls the encode function. Can one plugin detect if another plugin is being used?

I had started making such a plugin some time ago, making the extended replaces a separate function, but I never got it to work properly (whereas switching out the encode function in htmlarea.js worked perfectly, so I stuck with that...)

comment:52 Changed 13 years ago by wymsy

To use htmlEncode() with this plugin, you could do a simple override of getHTML(). Something like:

HTMLArea._origGetHtml = HTMLArea.getHTML;
HTMLArea.getHTML = function() {
  var html = HTMLArea._origGetHtml.apply(this, arguments); // markup built by the plugin
  // apply your extended entity replacements to html here
  return html;
};

Put the above code in your config file, or in your plugin if you go that way (make sure it loads after the GetHtml plugin). (DISCLAIMER: I haven't tested this code!)

comment:53 Changed 13 years ago by anonymous

Tested in the Xinha example with IE, and while it seems to run nicely, when this plugin is enabled it strips the <embed> tags out of flash objects. When the GetHTML plugin is disabled it works as it should.

Any thoughts? I noticed some previous work, but nothing concrete about them

comment:54 Changed 13 years ago by anonymous

I can confirm the CVS version of the getHTML plugin removes the optional <param> tags, but still leaves the object tags in place. I'm guessing that's what the user above meant??

Here's the code I tested with:

<object type="application/x-shockwave-flash" height="600" width="160" data="testimage.swf"> 
<param value="testimage.swf" name="movie" /> 
<param value="high" name="quality" />

This gets parsed and returned as:

<object type="application/x-shockwave-flash" height="600" width="160" data="testimage.swf">

With getHTML disabled, it remains untouched

comment:55 Changed 13 years ago by wymsy

anonymous # 2, you appear to be using the 'satay' method for invoking flash. I looked at that briefly but found that the innerHTML property in IE stripped the params, as you report. This is a problem in IE that I haven't found a way around. The GetHtml plugin works fine with flash using the traditional, non-standards-compliant-but-works-best-with-IE method.

comment:56 Changed 13 years ago by wymsy

I noticed yesterday that this plugin breaks a few of the other plugins in Gecko, specifically those that call getSelectedHTML() (FindReplace, UnFormat, and Stylist among those I checked). getSelectedHTML() calls getHTML() to get the html of the selected document fragment, but this plugin doesn't work correctly on document fragments (node type 11). I have been working on a fix, but I haven't found a way to derive the html associated with a node, short of reverting to the original getHTML() function.

Does anyone out there know enough about the DOM to suggest a way to make this work?

comment:57 Changed 13 years ago by wymsy

I fixed this to make it work with calls from getSelectedHTML(). I also added support for the htmlRemoveTags config option.

Applied in changeset:384

comment:58 Changed 13 years ago by mharrisonline

I just noticed that in IE and Firefox when this plugin sees a title node in this format:

<title="Here is the title">


<title="Here is the title" />

It replaces it with


comment:59 Changed 13 years ago by niko

is this valid html?

<title="Here is the title">

i don't think so....

comment:60 Changed 13 years ago by mharrisonline

Actually, that is what it should do. It seems to be perfect!

comment:61 Changed 13 years ago by mharrisonline

The plugin is still losing some formatting within <script> nodes:

If you have javascript that looks like this:

 <script language="JavaScript" type="text/javascript">
        <!-- Hide script from old browsers
        //Choose stylesheet based on browser type
        if (navigator.appName == "Netscape") {
            document.write("<link rel='STYLESHEET' type='text/css' href='style_ns.css'>")
        else {

It will become:

    <script language="JavaScript" type="text/javascript">
        <!-- Hide script from old browsers //Choose stylesheet based on browser type if (navigator.appname="=" "Netscape") { document.write("<link rel="STYLESHEET" type="text/css" href="style_ns.css" />")
        else {
            document.write("<link rel="STYLESHEET" type="text/css" href="style.css" />")


<!-- Hide script from old browsers
        //Choose stylesheet based on browser type
        if (navigator.appName == "Netscape") {

all became one line.

comment:62 Changed 13 years ago by mharrisonline

Or in other words, formatting is not preserved in comments starting with <!--

comment:63 Changed 13 years ago by mharrisonline

In the original getHTML, comments retained formatting and were handled here:

    case 8: // Node.COMMENT_NODE
    html = "<!--" + root.data + "-->";

Could this plugin retain comment formatting also? Without it, any JavaScript that contains comments is breaking. I've had to roll back to my patch in #287 because of this.

comment:64 Changed 13 years ago by wymsy

It now ignores comments, as of changeset:406
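
A rough sketch of one way to leave comments alone, using the "find content NOT inside comments" pattern that appears as /*20*/ in the RegExpCache listing further down this ticket. How the plugin actually wires that pattern up is an assumption here; cleanOutsideComments() and collapseWhitespace() are illustrative names, not plugin functions.

```javascript
// Pattern /*20*/ from the RegExpCache listing: content NOT inside comments.
var outsideComments = /(^|<!--(\s|\S)*?-->)((\s|\S)*?)(?=<!--(\s|\S)*?-->|$)/g;

// Apply clean() to the text between comments; the comments themselves are
// passed through verbatim, so their line breaks and spacing survive.
function cleanOutsideComments(html, clean) {
  return html.replace(outsideComments, function (match, comment, unused, content) {
    return comment + clean(content);
  });
}

// Stand-in for the plugin's whitespace trimming (pattern /*10*/).
function collapseWhitespace(text) {
  return text.replace(/\s{2,}/g, ' ');
}

var sample = 'a  b<!-- keep\n    this -->c  d';
var cleaned = cleanOutsideComments(sample, collapseWhitespace);
```

In this sample the double spaces outside the comment are collapsed, while the newline and indentation inside it are untouched.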

comment:65 Changed 13 years ago by mharrisonline

That's great! It doesn't change JavaScript in comments anymore. Thanks for doing that!

comment:66 Changed 13 years ago by mharrisonline

I just noticed that in full page mode, the plugin adds the doctype twice: Once before the opening <HTML> tag, and once directly after. So, if you start with

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
<html xmlns="" xml:lang="en" lang="en">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>My Test Page</title>

You get:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" ""> 
  <head><title>EDU 717 Excersize 2B</title> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

comment:67 Changed 13 years ago by mharrisonline

One more thing (sorry), within a script node if there is an HTML comment like this

<!--this script filters the foo foo-->

that closes, all the javascript in the script node that comes after the --> is on a single line. Like the earlier problem within comments, this can easily break the script.

comment:68 Changed 13 years ago by wymsy

Why would you be using html comments within a script?

comment:69 Changed 13 years ago by mharrisonline

I wouldn't, but others have.

comment:70 Changed 13 years ago by wymsy

My point is, an html comment within a script is not a comment as far as the script is concerned, so whatever is in the comment is part of the script. I can't think of a situation where this would not break the script anyway, so I don't think this is a problem in GetHtml that needs to be fixed.

comment:71 Changed 13 years ago by wymsy

On the other hand, the same problem exists for comments within pre tags, which is valid, and the fix (for both) turns out to be easy. It's in changeset:425.

comment:72 Changed 13 years ago by mharrisonline


comment:73 Changed 13 years ago by mharrisonline

Have you thought about including areas? They're a lot like params.

I have added area to regexps 04, 14, and 17 below, so that area tags get closed and also formatted.

HTMLArea.RegExpCache = [
/*00*/  new RegExp().compile(/<\s*\/?([^\s\/>]+)[\s*\/>]/gi),//lowercase tags
/*01*/  new RegExp().compile(/(\S*\s*=\s*)?_moz[^=>]*(=\s*[^>]*)?/gi),//strip _moz attributes
/*02*/  new RegExp().compile(/\s*=\s*(([^'"][^>\s]*)([>\s])|"([^"]+)"|'([^']+)')/g),// find attributes
/*03*/  new RegExp().compile(/\/>/g),//strip singlet terminators
/*04*/  new RegExp().compile(/<(area|br|hr|img|input|link|meta|param|embed)([^>]*)>/g),//terminate singlet tags
/*05*/  new RegExp().compile(/(checked|compact|declare|defer|disabled|ismap|multiple|no(href|resize|shade|wrap)|readonly|selected)([\s>])/gi),//expand singlet attributes
/*06*/  new RegExp().compile(/(="[^']*)'([^'"]*")/),//check quote nesting
/*07*/  new RegExp().compile(/&(?=[^<]*>)/g),//expand query ampersands
/*08*/  new RegExp().compile(/<\s+/g),//strip tagstart whitespace
/*09*/  new RegExp().compile(/\s+(\/)?>/g),//trim whitespace
/*10*/  new RegExp().compile(/\s{2,}/g),//trim extra whitespace
/*11*/  new RegExp().compile(/\s+([^=\s]+)(="[^"]+")/g),// lowercase attribute names
/*12*/  new RegExp().compile(/(\S*\s*=\s*)?contenteditable[^=>]*(=\s*[^>\s\/]*)?/gi),//strip contenteditable
/*13*/  new RegExp().compile(/((href|src)=")([^\s]*)"/g), //find href and src for stripBaseHref()
/*14*/  new RegExp().compile(/<\/?(div|p|h[1-6]|area|table|tr|td|th|ul|ol|li|blockquote|object|br|hr|img|embed|param|pre|script|html|head|body|meta|link|title)[^>]*>/g),
/*15*/  new RegExp().compile(/<\/(div|p|h[1-6]|table|tr|td|th|ul|ol|li|blockquote|object|html|head|body|script)( [^>]*)?>/g),//blocklevel closing tag
/*16*/  new RegExp().compile(/<(div|p|h[1-6]|table|tr|td|th|ul|ol|li|blockquote|object|html|head|body|script)( [^>]*)?>/g),//blocklevel opening tag
/*17*/  new RegExp().compile(/<(area|br|hr|img|embed|param|pre|meta|link|title)[^>]*>/g),//singlet tag
/*18*/  new RegExp().compile(/(^|<\/(pre|script)>)(\s|[^\s])*?(<(pre|script)[^>]*>|$)/g),//find content NOT inside pre and script tags
/*19*/  new RegExp().compile(/(<pre[^>]*>)(\s|[^\s])*?(<\/pre>)/g),//find content inside pre tags
/*20*/  new RegExp().compile(/(^|<!--(\s|\S)*?-->)((\s|\S)*?)(?=<!--(\s|\S)*?-->|$)/g)//find content NOT inside comments
];
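
A short sketch of what adding area to pattern 04 buys you. It copies patterns 03, 09, and 04 from the list above; the replacement strings and the order of the steps are assumptions for illustration, not taken from the plugin source.

```javascript
var stripTerminators = /\/>/g;                                               // 03
var trimWhitespace = /\s+(\/)?>/g;                                           // 09
var singletTags = /<(area|br|hr|img|input|link|meta|param|embed)([^>]*)>/g;  // 04

// Normalise every singlet tag to the XHTML " />" form, whether or not it
// was already self-closed in the input.
function terminateSinglets(html) {
  return html
    .replace(stripTerminators, '>')   // drop any existing "/>" terminators
    .replace(trimWhitespace, '$1>')   // remove whitespace left before ">"
    .replace(singletTags, '<$1$2 />'); // re-terminate the singlet tags
}

var closedArea = terminateSinglets('<area shape="rect" href="a.html">');
var closedBreak = terminateSinglets('<br />');
```

Here an unclosed area tag comes back self-closed, and an already closed br keeps a single " />".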

comment:74 Changed 13 years ago by mharrisonline

I don't know if this is fixable, but it looks like any FlashVars added to the Flash object's parameter are stripped out. You can add them to the embed and they stay intact, but the param value for FlashVars is always "".

comment:75 Changed 13 years ago by mharrisonline

After changeset:425 I can't find any JavaScript, no matter how old, that breaks with this plugin.

comment:76 Changed 12 years ago by wymsy

I found a bug where the regexp that finds tags to be cleaned would crash the browser on comments containing an unmatched '<'. I fixed it in changeset:468.

comment:77 Changed 12 years ago by wymsy

Due to the way innerHTML works in IE, there was a problem with fixRelativeLinks() where the url IE adds to self-named anchors was not being stripped off if the location.href of the page containing xinha had a query string with at least one '&' in it. I added an override function in changeset:478 to fix it.

comment:78 Changed 12 years ago by gogo

  • Resolution set to fixed
  • Status changed from new to closed

I think this can be closed now.
