Removing Certain HTML Markup from String using Regex

By default, both ASP.Net and ASP.Net MVC reject form input that contains a “<” character, because that character may start a malicious HTML tag (HTML injection) if the web app outputs it unencoded.

If the input contains a “<”, ASP.Net will throw an exception:

A potentially dangerous Request.Form value was detected from the client

You can disable this validation routine in ASP.Net by adding the section

<configuration> 
  <system.web> 
    <pages validateRequest="false" /> 
  </system.web> 
</configuration>

to your web.config, and in ASP.Net MVC by adding the

[ValidateInput(false)]

attribute to the controller method.

Depending on your application, you now need to encode the data whenever it is output in an HTML page, or you need to check whether it really contains malicious tags, and remove them.
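
For the first option, encoding on output, a minimal sketch could look like the following. It assumes System.Net.WebUtility is available in your framework version; the class and method names are made up for illustration.

```csharp
using System.Net;

public static class OutputEncodingSketch
{
    // Hypothetical helper: encodes user input so the browser renders it
    // as literal text instead of interpreting it as markup.
    public static string Encode(string userInput)
    {
        return WebUtility.HtmlEncode(userInput);
    }
}
```

Calling OutputEncodingSketch.Encode("<b>hi</b>") turns the angle brackets into &lt; and &gt;, so the tag is displayed rather than executed.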

I created two regexes for this purpose. The first one removes all attributes and their values from HTML tags:

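using System.Text.RegularExpressions;

string html = "<p style=\"font-size: 2em\">hello. this is some html text</p>";
// Capture the tag name; [^>]* consumes the attributes, even across line breaks.
var rexTagAttrs = new Regex(@"<(\w+)\s[^>]*>");
html = rexTagAttrs.Replace(html, "<$1>");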

The $1 expands to the tag name captured by (\w+), and everything between that name and the next “>” is removed. This gets rid of any custom formatting, such as the inline styles created by rich text editors.

The next regex removes any HTML tags except those explicitly allowed:

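// A negated character class like [^(p|ol|...)] excludes single characters, not tag names,
// so a negative lookahead is used to skip opening and closing tags on the allow-list.
var rexAllowedOnly = new Regex(@"<(?!/?(?:p|ol|ul|li|span|i|b|br)\s*/?>)[^>]*>");
html = rexAllowedOnly.Replace(html, "");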

The result of these two operations is HTML text that contains only the allowed tags, stripped of their attributes, which are safe to display without encoding.
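
The two steps can be combined into one helper, sketched below under some assumptions. The class and method names are hypothetical. The allow-list is expressed with a negative lookahead, because a character class such as [^(p|ol|ul)] excludes individual characters rather than whole tag names. The attribute pattern uses [^>]* so that tags spanning several lines are handled too.

```csharp
using System.Text.RegularExpressions;

public static class HtmlCleanupSketch
{
    // Step 1: keep the tag name, drop attributes and their values.
    private static readonly Regex TagAttrs = new Regex(@"<(\w+)\s[^>]*>");

    // Step 2: match every tag that is not an allowed opening, closing,
    // or self-closing tag. The lookahead sits directly after "<" so that
    // closing tags such as </p> are protected as well.
    private static readonly Regex DisallowedTags = new Regex(
        @"<(?!/?(?:p|ol|ul|li|span|i|b|br)\s*/?>)[^>]*>",
        RegexOptions.IgnoreCase);

    public static string Clean(string html)
    {
        html = TagAttrs.Replace(html, "<$1>");
        return DisallowedTags.Replace(html, "");
    }
}
```

Note that only the tags themselves are removed. Text between a removed opening and closing tag, such as the body of a script element, stays in the output as plain text.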

1 thought on “Removing Certain HTML Markup from String using Regex”

  1. wow, you are my hero! This is by far the best and only regex example i have found that exactly does what I was searching for. Removing unwanted Word Html Markup but leaving valid html code in place!

    Well done.
