Query strings that collide with HTML entities in URLs, and the problems this can cause (2 articles)

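Background on this issue (I am putting it at the very top, because some readers seemed unable to find it otherwise.)

IE10+, Safari 5.1.7+, Firefox 4.0+, Opera 12+ and Chrome 7+ already implement the new standard, so they no longer have this problem.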

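Reference standard: http://www.w3.org/html/ig/zh/wiki/HTML5/tokenization  The new standard explicitly says that, inside an attribute value, if an entity name is not followed by ; and the next character is = (or an ASCII letter or digit), the entity is not decoded. The rule exists precisely to fix this nasty problem.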
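To make the rule concrete, here is a minimal JavaScript sketch that mimics how an attribute value is decoded with and without the new exception. It is only an illustration: `decodeAttribute` and the tiny `ENTITIES` table are made-up names, the table covers just a handful of entities, and the "old" mode is a simplification, since real pre-HTML5 browsers differed from one another.

```javascript
// Illustrative subset of entities that browsers accept without a trailing ";".
const ENTITIES = { amp: "&", lt: "<", gt: ">", copy: "\u00A9", reg: "\u00AE", not: "\u00AC" };

// Decode named entities in an attribute value.
// html5 = false: assumed old behaviour, only a following letter/digit blocks decoding.
// html5 = true:  new rule, a following "=" blocks decoding as well.
function decodeAttribute(value, html5) {
  // Try longer names first so a short name never shadows a longer one.
  const names = Object.keys(ENTITIES).sort((a, b) => b.length - a.length);
  const keepLiteral = html5 ? /[=A-Za-z0-9]/ : /[A-Za-z0-9]/;
  let out = "";
  let i = 0;
  while (i < value.length) {
    if (value[i] !== "&") {
      out += value[i++];
      continue;
    }
    const rest = value.slice(i + 1);
    const name = names.find((n) => rest.startsWith(n));
    if (name === undefined) {
      out += value[i++]; // not a known entity, keep the "&" as-is
      continue;
    }
    const next = rest[name.length]; // character right after the name, if any
    if (next === ";") {
      out += ENTITIES[name]; // properly terminated entity
      i += 1 + name.length + 1;
    } else if (next !== undefined && keepLiteral.test(next)) {
      out += "&"; // leave the text untouched
      i += 1;
    } else {
      out += ENTITIES[name]; // unterminated entity gets decoded
      i += 1 + name.length;
    }
  }
  return out;
}

console.log(decodeAttribute("a=1&reg=2", false)); // old behaviour
console.log(decodeAttribute("a=1&reg=2", true));  // new behaviour
```

For the query `a=1&reg=2`, the old mode turns `&reg` into ®, while the new mode leaves the text exactly as written.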

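Let's look at a demo:

<a href="http://www.baidu.com?a=1&reg=2&reg_a=3">Tragedy</a>

In some browsers (every browser below the versions listed above that implement the new standard):

Clicking the link above automatically converts &reg into ® (some browsers additionally URL-encode the converted character).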

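The essence of this bug is that whenever an HTML character entity appears in the HTML source, it is decoded automatically. So in theory, resources created dynamically from script do not have this problem, for example new Image().src = 'http://www.baidu.com?a=1&reg=2'; and the same holds even for a dynamically created iframe.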

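IE9 and below have two problems that are more serious than in other browsers:

1. When script navigates the current page, for example location.href = xxx or location.replace(xxx), or calls window.open(xxx), the problem is still triggered if the query string contains these HTML entities…

2. If you read the standard, you will see which "other characters", when they directly follow an entity name, do not trigger the problem. For example, in &rega and &reg1 the a and the 1 attached to &reg mean there is no problem, and from the standard's point of view even &reg_a should be fine. But IE9 and below defeat us once again. Other special characters such as # and ~ behave differently from browser to browser. Since we are unlikely to use those characters when designing field names, we will not dwell on how other browsers differ here.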

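So, in theory, this is something backend developers need to pay more attention to when they output HTML. Frontend developers, for their part, need to check whether the URL used for a navigation or a popup contains a field name that, without a semicolon, reads as an HTML entity.

As for why IE is so special… I have not figured that out either…

Whether you work on the backend or the frontend, changing field names that have already been agreed on can be expensive. So the most robust approach is really this: (thanks to @辰光未然 for the reminder.)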

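So, roughly speaking, a lot of HTML entities will cause trouble:

http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references

In that table, every entry that does not end with a semicolon is a hazard… that is, the following 106: (thanks to @kenny for the up-to-date list URL. I spent a little time writing a script that extracts all the ones that need handling.)

We can use the following script to help with detection: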

test case 1: 

test case 2:
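A check along these lines can be sketched as follows. This is not the original author's script, only a rough stand-in: `findRiskyParams` is a hypothetical name, `LEGACY_NAMES` is a small illustrative subset of the 106 names rather than the full list, and the function assumes it receives a raw URL that has not been HTML-escaped yet.

```javascript
// Small illustrative subset of entity names that work without ";".
// The real list in the WHATWG table has 106 entries.
const LEGACY_NAMES = [
  "amp", "lt", "gt", "quot", "copy", "reg", "not", "para", "sect",
  "deg", "yen", "cent", "pound", "times", "micro", "nbsp", "shy", "uml",
];

// Return the query parameters whose names start with a legacy entity name.
// Prefix matching is deliberately conservative: it also flags names such as
// "reg_a", which the article reports as a problem in IE9 and below.
function findRiskyParams(url) {
  const start = url.indexOf("?");
  if (start === -1) return [];
  const query = url.slice(start + 1).split("#")[0];
  const risky = [];
  // The first pair follows "?" rather than "&", so it can never form "&name".
  for (const pair of query.split("&").slice(1)) {
    const key = pair.split("=")[0];
    const entity = LEGACY_NAMES.find((name) => key.startsWith(name));
    if (entity !== undefined) risky.push({ key, entity });
  }
  return risky;
}

console.log(findRiskyParams("http://www.baidu.com?a=1&reg=2&reg_a=3"));
console.log(findRiskyParams("http://www.baidu.com?a=1&b=2"));
```

Entity names are case-sensitive, so the comparison is too. A stricter tool would load the full list from the table linked above.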

The above is a repost. Source: http://www.cnblogs.com/_franky/archive/2012/09/28/2706512.html

What’s the difference between this HTML snippet:

and this?

Both of them look like simple Google searches (though they could have been anything; Google is just an example). One of them appends an extra “&foo=0” to the end of the URL; the other appends “&copy=0” instead.

Only the second snippet is valid in HTML 4.01 Strict, but that snippet doesn’t work the way you might expect. Neither snippet is valid in XHTML.

Give up? Click on these:

The first URL searches for “html,” but the other URL searches for “html©=0.”

Two weird things are happening here.

  • Note that “&copy;” is an HTML entity for the copyright symbol “©.” It would have been more obvious if the URL had used a semicolon, like this:

    or if we’d used a more traditional HTML entity like this:

  • The second weird thing is a quirk in the HTML specification on character references:

    Note. In SGML, it is possible to eliminate the final “;” after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the “;” in all cases to avoid problems with user agents that require this character to be present.

    As a result, all modern browsers (FF3, IE7, Opera 9, Safari 3.1) will helpfully notice possible entities like “&copy” and “&lt” and replace them with “©” and “<” … they assume you forgot the semicolon. This applies to all of the HTML entities, even the obscure ones like &empty “∅”, &not “¬”, &reg “®”, &sub “⊂”, and &lang “⟨”. (Bizarrely, &Copy is left alone as “&Copy” but &COPY is replaced with “©”.)

We think there are two valuable lessons to learn from this story. The first lesson you may already know:

  1. The correct way to write a URL with a query parameter is to HTML-escape the URL, replacing every & with &amp; like this:

    That’s also the only way to make the snippet XHTML compliant.

  2. Don’t use URL query parameters whose names are HTML entities. Never create a web service that accepts a query parameter like “&lang=en”. After all, there’s no way to know when your users might want to copy & paste your URLs into a blog, forum, or HTML email. Even if developers are clever enough to HTML escape href links, not everyone will be, and you can save everybody some trouble by avoiding the dangerous entities altogether.
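The first lesson can be sketched in a few lines of JavaScript. This is a minimal illustration, not code from either article: `escapeHtmlAttr` and `buildLink` are hypothetical helper names, and a real project would more likely rely on the escaping built into its template engine.

```javascript
// Escape the characters that are special inside a double-quoted HTML attribute.
// "&" must be replaced first, otherwise the "&" of the other escapes
// would be escaped a second time.
function escapeHtmlAttr(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build an anchor tag whose href is safe to embed in HTML source.
function buildLink(url, label) {
  return '<a href="' + escapeHtmlAttr(url) + '">' + escapeHtmlAttr(label) + "</a>";
}

console.log(buildLink("http://www.google.com/search?q=html&copy=0", "search"));
```

Once the separator is written as `&amp;`, the browser decodes it back to a plain `&` and never sees the text `&copy` in the markup, so the parameter name no longer matters.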

 

The above is a repost. Source: http://blog.redfin.com/devblog/2008/10/url_query_parameters_and_html_entities_the_case_of_the_missing_semicolon.html
