Conflicts between URL query strings and HTML entities, and the problems they can cause (2 articles)

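Background on this issue (if I don't put this first, some readers seem unable to find it):

IE10+, Safari 5.1.7+, Firefox 4.0+, Opera 12+, and Chrome 7+ already implement the new standard, so they no longer have this problem.

Reference standard: http://www.w3.org/html/ig/zh/wiki/HTML5/tokenization . The new standard explicitly states that if an entity name is not followed by ";" and the next character is "=", the entity is left unprocessed. The rule exists precisely to fix this nasty problem.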

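Let's look at a demo:

Some browsers (the versions of each browser older than those listed above as implementing the new standard):

Clicking the link above automatically converts &reg to ® (some browsers then also URL-encode the converted character).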
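To make the consequence concrete, here is a minimal sketch of what happens to the query string once an older parser has turned &reg into ®. The `simulateLegacyDecode` helper is hypothetical and only handles &reg; it imitates the observable result, not any browser's real implementation. The percent-encoding shown assumes UTF-8, which is what `encodeURI` produces.

```javascript
// Hypothetical helper: imitates a legacy HTML parser decoding a
// semicolon-less "&reg" inside an attribute value. Only "&reg" is handled.
function simulateLegacyDecode(attributeValue) {
    return attributeValue.replace(/&reg(?![a-zA-Z0-9;])/g, '\u00AE');
}

var written = 'a=1&reg=2';                    // query string as written in the HTML source
var decoded = simulateLegacyDecode(written);  // "&reg" has become the ® character
var encoded = encodeURI(decoded);             // what a browser that encodes it would send

var params = new URLSearchParams(decoded);
console.log(encoded);            // a=1%C2%AE=2
console.log(params.has('reg'));  // false: the "reg" parameter is gone
console.log(params.get('a'));    // 1®=2
```

The server never sees a `reg` parameter at all; its name and value are swallowed into the value of `a`.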

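The essence of this bug is that whenever an HTML character entity appears in HTML source, it is decoded automatically. So in theory, resources created dynamically by script do not have this problem, for example new Image().src = 'http://www.baidu.com?a=1&reg=2'; the same holds even for a dynamically created iframe.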

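IE9 and below have two problems that are more serious than in other browsers:

1. Navigating the current page by script, for example location.href = xxx or location.replace(xxx), or calling window.open(xxx), still triggers the problem if the query string contains these HTML entities…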

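2. The standard tells you which "other characters", when attached directly after an entity name, avoid the problem. For example, in &rega and &reg1 the a and 1 are attached to &reg, so nothing should be decoded. (The tokenization rule only exempts "=" and ASCII letters and digits, so by the standard something like &reg_a is still decoded.) IE9 and below defeat us here once again and do not follow this rule. Other special characters such as # and ~ behave differently across browsers. Since field names are unlikely to contain those characters, we won't dwell on how the other browsers differ here.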
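The rule from the standard can be sketched as a small predicate. This is only an illustration of the attribute-value rule in the HTML5 tokenizer (an unterminated named reference is left alone when the next character is "=" or an ASCII letter or digit). The function name `isDecodedInAttribute` and the four-entry `LEGACY_NAMES` list are assumptions made for the example, and IE9 and below do not behave this way.

```javascript
// Small illustrative subset of the entity names that may appear without ";".
// The subset has no names that are prefixes of each other, so no
// longest-match handling is needed here.
var LEGACY_NAMES = ['reg', 'copy', 'lt', 'amp'];

// text is expected to start with "&", e.g. "&reg=2".
// Returns true if a standards-following parser would decode the reference
// when it appears inside an attribute value.
function isDecodedInAttribute(text) {
    for (var i = 0; i < LEGACY_NAMES.length; i++) {
        var name = '&' + LEGACY_NAMES[i];
        if (text.indexOf(name) !== 0) {
            continue;
        }
        var next = text.charAt(name.length);
        if (next === ';') {
            return true; // properly terminated reference: always decoded
        }
        // Unterminated: decoded unless followed by "=" or an alphanumeric.
        return !/^[=a-zA-Z0-9]$/.test(next);
    }
    return false; // not one of the known names
}

console.log(isDecodedInAttribute('&reg=2'));   // false
console.log(isDecodedInAttribute('&rega'));    // false
console.log(isDecodedInAttribute('&reg1'));    // false
console.log(isDecodedInAttribute('&reg;'));    // true
console.log(isDecodedInAttribute('&reg&b=1')); // true
```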

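So in theory, this is mainly something back-end developers need to watch for when they output HTML. Front-end developers need to check whether a URL used for navigation or a popup contains a field name that forms an HTML entity without a semicolon.

As for why IE is so special… I haven't figured that out either…

Whether you work on the back end or the front end, changing field names that have already been agreed on can be costly. So the safest approach is really the following (thanks to @辰光未然 for the reminder):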

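var fixURL = function (url) {
    return url.replace(/&/g, '&amp;');
};
// Use fixURL to replace every & in the URL with &amp; before writing it into HTML, navigating to it, or opening it in a popup. Front-end code only has to do this in JavaScript because IE drags us down.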
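One caveat with a plain global replace is that running it twice on the same URL turns &amp; into &amp;amp;. Below is a minimal sketch of a variant that leaves an existing &amp; alone. The name `fixURLOnce` is hypothetical, and the sketch assumes that a literal "&amp;" in the input is an ampersand that was already escaped rather than a parameter really named "amp;".

```javascript
// Hypothetical variant of fixURL: escape "&" for HTML output, but skip
// ampersands that already begin "&amp;" so repeated calls are harmless.
var fixURLOnce = function (url) {
    return url.replace(/&(?!amp;)/g, '&amp;');
};

var url = '//www.baidu.com?a=1&amp=2&lt=3&reg=4';
var once = fixURLOnce(url);
var twice = fixURLOnce(once);

console.log(once);           // //www.baidu.com?a=1&amp;amp=2&amp;lt=3&amp;reg=4
console.log(once === twice); // true
```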

So, presumably, many HTML entities can cause this problem:

http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references

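In this table, every entry that does not end with a semicolon is a hazard… that is, the following 106 names. (Thanks to @kenny for the address of the latest list. I spent some time writing a script that extracted the ones that need handling.)

We can use the following script to help detect them: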

var checkURL = function () {
    var list = [ //106
            '&Aacute',
            '&aacute',
            '&Acirc',
            '&acirc',
            '&acute',
            '&AElig',
            '&aelig',
            '&Agrave',
            '&agrave',
            '&AMP',
            '&amp',
            '&Aring',
            '&aring',
            '&Atilde',
            '&atilde',
            '&Auml',
            '&auml',
            '&brvbar',
            '&Ccedil',
            '&ccedil',
            '&cedil',
            '&cent',
            '&COPY',
            '&copy',
            '&curren',
            '&deg',
            '&divide',
            '&Eacute',
            '&eacute',
            '&Ecirc',
            '&ecirc',
            '&Egrave',
            '&egrave',
            '&ETH',
            '&eth',
            '&Euml',
            '&euml',
            '&frac12',
            '&frac14',
            '&frac34',
            '&GT',
            '&gt',
            '&Iacute',
            '&iacute',
            '&Icirc',
            '&icirc',
            '&iexcl',
            '&Igrave',
            '&igrave',
            '&iquest',
            '&Iuml',
            '&iuml',
            '&laquo',
            '&LT',
            '&lt',
            '&macr',
            '&micro',
            '&middot',
            '&nbsp',
            '&not',
            '&Ntilde',
            '&ntilde',
            '&Oacute',
            '&oacute',
            '&Ocirc',
            '&ocirc',
            '&Ograve',
            '&ograve',
            '&ordf',
            '&ordm',
            '&Oslash',
            '&oslash',
            '&Otilde',
            '&otilde',
            '&Ouml',
            '&ouml',
            '&para',
            '&plusmn',
            '&pound',
            '&QUOT',
            '&quot',
            '&raquo',
            '&REG',
            '&reg',
            '&sect',
            '&shy',
            '&sup1',
            '&sup2',
            '&sup3',
            '&szlig',
            '&THORN',
            '&thorn',
            '&times',
            '&Uacute',
            '&uacute',
            '&Ucirc',
            '&ucirc',
            '&Ugrave',
            '&ugrave',
            '&uml',
            '&Uuml',
            '&uuml',
            '&Yacute',
            '&yacute',
            '&yen',
            '&yuml'
        ];
        
    return function (url) {
        var l = list;
        var i = l.length;
        var matchIndex;
        var current;
        var nextchar;
        var isEscapedAmp;
        var errors = [];
        for (; i--;){
            current = l[i];
            // Check every occurrence of this entity name, not only the first one.
            matchIndex = url.indexOf(current);
            while (matchIndex > -1) {
                nextchar = url.charAt(matchIndex + current.length);
                // "&amp;" or "&AMP;" is treated as an intentional "&", for example in a URL
                // already fixed by fixURL, which replaces each & with &amp;. Skip it.
                isEscapedAmp = (current === '&amp' || current === '&AMP') && nextchar === ';';
                if (!isEscapedAmp && !/[a-zA-Z0-9]/.test(nextchar)) {
                    // Report any name, such as &reg, whose next character is outside a-z, A-Z, 0-9.
                    // This differs slightly from the standard and from what browsers do, but it is
                    // simple and flags anything that could be a threat in any browser.
                    errors.push(current + nextchar);
                }
                matchIndex = url.indexOf(current, matchIndex + current.length);
            }
        }
        if(errors.length){
            throw Error('contains : \n' + errors.join('\n'));
        }
    };
}();
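A different way to look at the same check is to inspect parameter names instead of scanning the raw string. The sketch below is only an illustration: `findRiskyParams` and the six-entry `SAMPLE_NAMES` list are assumptions made for the example (a real check would use the full 106-entry list), and it expects a raw URL that has not yet been escaped to &amp;.

```javascript
// Illustrative subset of the semicolon-less entity names.
var SAMPLE_NAMES = ['amp', 'copy', 'gt', 'lt', 'not', 'reg'];

// Hypothetical helper: returns the query parameter names that start with an
// entity name which is followed by the end of the name or by a
// non-alphanumeric character.
function findRiskyParams(url) {
    var queryStart = url.indexOf('?');
    if (queryStart === -1) {
        return [];
    }
    var query = url.slice(queryStart + 1).split('#')[0];
    var risky = [];
    query.split('&').forEach(function (pair, index) {
        if (index === 0) {
            return; // the first parameter follows "?", not "&", so it cannot form "&name"
        }
        var key = pair.split('=')[0];
        for (var i = 0; i < SAMPLE_NAMES.length; i++) {
            var name = SAMPLE_NAMES[i];
            var rest = key.slice(name.length);
            if (key.indexOf(name) === 0 && !/^[a-zA-Z0-9]/.test(rest)) {
                risky.push(key);
                break;
            }
        }
    });
    return risky;
}

console.log(findRiskyParams('//www.baidu.com?a=1&amp=2&lt=3&reg=4&region=5'));
// [ 'amp', 'lt', 'reg' ]
```

Returning the offending names rather than throwing makes it easier to use the check in a build step or a test, where the caller decides how to report the problem.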

test case 1:

var url  = '//www.baidu.com?a=1&amp=2&lt=3&reg=4';
document.onclick = function () { // Works in IE9 and below, which shows that our fix is OK.
      window.open(fixURL(url));
};

test case 2:

var url  = '//www.baidu.com?a=1&amp=2&lt=3&reg=4';     
  try{
      checkURL(url);
  }catch(e){
      alert(e.message)
  }

The above is a reprint. Source: http://www.cnblogs.com/_franky/archive/2012/09/28/2706512.html

What’s the difference between this HTML snippet:

<a href="http://www.google.com/search?q=html&foo=0">foo=0</a>

and this?

<a href="http://www.google.com/search?q=html&copy=0">copy=0</a>

Both of them look like simple Google searches (though they could have been anything; Google is just an example). One of them appends an extra “&foo=0” to the end of the URL; the other appends “&copy=0” instead.

Only the second snippet is valid in HTML 4.01 Strict, but that snippet doesn’t work the way you might expect. Neither snippet is valid in XHTML.

Give up? Click on these:

Two weird things are happening here.

Note that “&copy;” is an HTML entity for the copyright symbol “©”. It would have been more obvious if the URL had used a semicolon, like this:

<a href="http://www.google.com/search?q=html&copy;=0">&amp;copy;=0</a>

or if we’d used a more traditional HTML entity like this:

<a href="http://www.google.com/search?q=html&quot;=0">&amp;quot;=0</a>

The second weird thing is a quirk in the HTML specification on character references:

Note. In SGML, it is possible to eliminate the final “;” after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the “;” in all cases to avoid problems with user agents that require this character to be present.

As a result, all modern browsers (FF3, IE7, Opera 9, Safari 3.1) will helpfully notice possible entities like “&copy” and “&lt” and replace them with “©” and “<” … they assume you forgot the semicolon. This applies to all of the HTML entities, even the obscure ones like &empty “∅”, &not “¬”, &reg “®”, &sub “⊂”, and &lang “⟨”. (Bizarrely, &Copy is left alone as “&Copy” but &COPY is replaced with “©”.)

We think there are two valuable lessons to learn from this story. The first lesson you may already know:

The correct way to write a URL with a query parameter is to HTML escape the URL, replacing all &s with &amp; like this:

<a href="http://www.google.com/search?q=html&amp;copy=0">copy=0</a>

That’s also the only way to make the snippet XHTML compliant.
Don’t use URL query parameters whose names are HTML entities. Never create a web service that accepts a query parameter like “&lang=en”. After all, there’s no way to know when your users might want to copy & paste your URLs into a blog, forum, or HTML email. Even if developers are clever enough to HTML escape href links, not everyone will be, and you can save everybody some trouble by avoiding the dangerous entities altogether.
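As a rough illustration of the first lesson, here is a minimal sketch of escaping a URL at the moment it is written into an href. The `escapeHtml` and `buildLink` helpers are hypothetical names made up for this example, not part of any library, and the sketch assumes the attribute value is wrapped in double quotes.

```javascript
// Escapes the characters that are significant inside HTML text and
// double-quoted attribute values. "&" must be replaced first so the
// ampersands introduced by the later replacements are not escaped again.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

// Hypothetical helper that builds an anchor from a raw URL and a label.
function buildLink(url, label) {
    return '<a href="' + escapeHtml(url) + '">' + escapeHtml(label) + '</a>';
}

console.log(buildLink('http://www.google.com/search?q=html&copy=0', 'copy=0'));
// <a href="http://www.google.com/search?q=html&amp;copy=0">copy=0</a>
```

Escaping at output time means the URL can be stored and passed around in its raw form, and the parameter name no longer matters to the HTML parser.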

The above is a reprint. Source: http://blog.redfin.com/devblog/2008/10/url_query_parameters_and_html_entities_the_case_of_the_missing_semicolon.html
