By default, both the ASP.Net and ASP.Net MVC frameworks reject form input that contains a “<” character, since this character may start a malicious HTML tag (i.e. HTML injection) if the web app outputs it unencoded.
If the input contains a “<”, ASP.Net will throw an exception:
A potentially dangerous Request.Form value was detected from the client
You can disable this validation routine in ASP.Net by adding the section
<configuration>
  <system.web>
    <pages validateRequest="false" />
  </system.web>
</configuration>
to your web.config, and in ASP.Net MVC by adding the
[ValidateInput(false)]
attribute to the controller method.
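As a rough sketch of the MVC variant, assuming the attribute meant here is ValidateInput(false) from System.Web.Mvc: the controller name, action name and parameter below are made up for illustration, and the code only compiles inside an ASP.Net MVC project.

```csharp
using System.Web.Mvc;

// Hypothetical controller, used only to show where the attribute goes.
public class CommentController : Controller
{
    // Request validation is switched off for this action only,
    // so "comment" may arrive containing "<" characters.
    [HttpPost]
    [ValidateInput(false)]
    public ActionResult Save(string comment)
    {
        // The raw value still has to be encoded or cleaned before it is rendered.
        ViewBag.Comment = comment;
        return View();
    }
}
```

Putting the attribute on a single action keeps validation active for the rest of the application.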
Depending on your application, you now need to encode the data whenever it is output in an HTML page, or you need to check whether it really contains malicious tags, and remove them.
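For the first option, encoding on output, a minimal sketch could look like the following. It uses WebUtility.HtmlEncode from System.Net; the class and method names are hypothetical and not part of any framework.

```csharp
using System;
using System.Net;

public static class OutputEncodingSketch
{
    // Hypothetical helper: encodes untrusted text before it is written into an HTML page.
    public static string ToSafeHtml(string untrusted)
    {
        // HtmlEncode replaces <, >, & and quotes with character entities.
        return WebUtility.HtmlEncode(untrusted ?? string.Empty);
    }

    public static void Main()
    {
        string input = "<script>alert(1)</script>";
        // prints: &lt;script&gt;alert(1)&lt;/script&gt;
        Console.WriteLine(ToSafeHtml(input));
    }
}
```

Encoded this way, the browser displays the tag as text instead of interpreting it.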
I created two regexes for this purpose. The first one removes all attributes and values from HTML tags:
// requires: using System.Text.RegularExpressions;
string html = "<p style=\"font-size: 2em\">hello. this is some html text</p>";
var rexTagAttrs = new Regex(@"<(\w+)\s.*?>");
html = rexTagAttrs.Replace(html, "<$1>");
The $1 evaluates to the tag name (\w+) the regular expression finds, and everything between the tag name and the next “>” is removed. This gets rid of any custom formatting, such as that created by rich text editors.
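The next regex removes any HTML tags except those explicitly allowed. The allowed names go into a negative lookahead rather than a negated character class, because [^...] excludes single characters, not whole tag names:
var rexAllowedOnly = new Regex(@"<(?!/?(?:p|ol|ul|li|span|i|b|br)\s*/?>)[^>]*>", RegexOptions.IgnoreCase);
html = rexAllowedOnly.Replace(html, "");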
The result of these two operations is HTML text that contains only the allowed tags, which are safe to display without encoding.
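The two steps can be combined into one helper, sketched below. The class and method names are hypothetical. The second pattern is my own assumption: it uses a negative lookahead so that whole tag names are compared against the allowed list, and it is meant as an illustration, not a complete HTML sanitizer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlWhitelistSketch
{
    // Step 1: reduce an opening tag that carries attributes to its bare tag name.
    private static readonly Regex TagAttrs = new Regex(@"<(\w+)\s.*?>");

    // Step 2: match every tag that is not a bare allowed tag such as <p>, </p> or <br/>.
    private static readonly Regex NotAllowed = new Regex(
        @"<(?!/?(?:p|ol|ul|li|span|i|b|br)\s*/?>)[^>]*>",
        RegexOptions.IgnoreCase);

    public static string Clean(string html)
    {
        html = TagAttrs.Replace(html, "<$1>");
        return NotAllowed.Replace(html, "");
    }

    public static void Main()
    {
        string html = "<p style=\"font-size: 2em\">hello <script>alert(1)</script><b>world</b></p>";
        // prints: <p>hello alert(1)<b>world</b></p>
        Console.WriteLine(Clean(html));
    }
}
```

Note that only the script tags are removed, not the text between them, so the remaining text is displayed as plain content.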