Removing Certain HTML Markup from String using Regex

By default, both ASP.NET and ASP.NET MVC protect your web application by rejecting form input that contains a “<” character, since this character may start a malicious HTML tag (i.e. HTML injection) if the web app outputs it unencoded.

If the input contains a “<”, ASP.NET will throw an exception:

A potentially dangerous Request.Form value was detected from the client

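You can disable this validation routine in ASP.NET by adding the section

    <pages validateRequest="false" />

to your web.config, and in ASP.NET MVC by adding the

    [ValidateInput(false)]

attribute to the controller method. On ASP.NET 4.0 and later, the web.config setting only takes effect if <httpRuntime requestValidationMode="2.0" /> is also set.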

Depending on your application, you now need to either encode the data whenever it is output in an HTML page, or check whether it contains malicious tags and remove them.
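For the first option, encoding on output, a minimal sketch could look like this. It assumes the standard System.Net.WebUtility.HtmlEncode method (available from .NET 4.0 on); the OutputEncoding class and its Render method are names I made up for illustration.

```csharp
using System;
using System.Net;

public static class OutputEncoding
{
    // Hypothetical helper: encode user-supplied text before writing it into an HTML page.
    public static string Render(string userInput)
    {
        // HtmlEncode replaces characters such as < > & " with entities,
        // so the browser displays them as text instead of parsing them as markup.
        return WebUtility.HtmlEncode(userInput ?? string.Empty);
    }
}

// OutputEncoding.Render("<b>hi</b>") returns "&lt;b&gt;hi&lt;/b&gt;"
```

With this approach the user sees the tags literally, which is fine for plain-text fields but not for rich text.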

I created two regexes for this purpose. The first one removes all attributes and their values from HTML tags:

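using System.Text.RegularExpressions;

string html = "<p style=\"font-size: 2em\">hello. this is some html text</p>";
// Match a tag name followed by whitespace and everything up to the next ">"
var rexTagAttrs = new Regex(@"<(\w+)\s.*?>");
html = rexTagAttrs.Replace(html, "<$1>");  // html is now "<p>hello. this is some html text</p>"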

The $1 evaluates to the tag name (\w+) that the regular expression captures, and everything up to the next “>” is removed. Closing tags and tags without attributes are left untouched. This gets rid of any custom formatting, such as that created by rich text editors.
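To see how this behaves on a few different tag shapes, here is a small sketch that wraps the same pattern in a helper. The AttributeStripper class and Strip method are hypothetical names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeStripper
{
    // Same pattern as above: tag name, one whitespace character,
    // then (lazily) everything up to the next ">".
    private static readonly Regex TagAttrs = new Regex(@"<(\w+)\s.*?>");

    public static string Strip(string html)
    {
        return TagAttrs.Replace(html, "<$1>");
    }
}

// AttributeStripper.Strip("<span class=\"x\">a</span>")  returns "<span>a</span>"
// AttributeStripper.Strip("<br />")                       returns "<br>"
// AttributeStripper.Strip("<b>bold</b>")                  returns "<b>bold</b>"
```

Note that a self-closing tag written with a space, such as "<br />", loses its trailing slash, because the slash is part of what the pattern treats as attributes.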

The next regex removes any HTML tags except those explicitly allowed:

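// [^...] would be a character class, so the whitelist goes into a negative lookahead instead
var rexAllowedOnly = new Regex(@"<(?!/?(?:p|ol|ul|li|span|i|b|br)\s*/?>)[^>]*>");
html = rexAllowedOnly.Replace(html, "");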

The result of these two operations is HTML text that contains only the allowed tags, which are safe to display without encoding.
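Putting the two steps together, a reusable helper might look like the sketch below. The names HtmlCleaner and Clean are hypothetical, and the whitelist is written as a negative lookahead, which is my assumption about how best to express "any tag except these" in .NET regex syntax.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Step 1: drop everything after the tag name in a tag that has attributes.
    private static readonly Regex TagAttrs = new Regex(@"<(\w+)\s.*?>");

    // Step 2: match any tag that is not on the whitelist.
    // The lookahead (?!...) blocks the match when "<" is followed by an optional "/",
    // an allowed name, optional whitespace, an optional "/" and then ">".
    private static readonly Regex NotAllowed = new Regex(
        @"<(?!/?(?:p|ol|ul|li|span|i|b|br)\s*/?>)[^>]*>");

    public static string Clean(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        html = TagAttrs.Replace(html, "<$1>");   // strip attributes first
        return NotAllowed.Replace(html, "");     // then remove non-whitelisted tags
    }
}

// HtmlCleaner.Clean("<p style=\"color:red\">Hi <script>alert(1)</script><b>there</b><br /></p>")
// returns "<p>Hi alert(1)<b>there</b><br></p>"
```

Only the tags themselves are removed: the text between a removed pair, such as the body of a script element, stays in the output as plain text. The pattern is also case-sensitive, so an uppercase "<P>" would be removed along with the disallowed tags.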

One Response to Removing Certain HTML Markup from String using Regex

  1. Dave says:

    wow, you are my hero! This is by far the best and only regex example i have found that exactly does what I was searching for. Removing unwanted Word Html Markup but leaving valid html code in place!

    Well done.
