Venexus DotNetNuke Blog - Tags within tags regular expression
DotNetNuke Articles, Code Snippets, Errors, and News
 
 Wednesday, March 14, 2007

We ran into an issue parsing some HTML for a DNN content migration project we are working on. We needed to extract the actual content of each page, without the surrounding look and feel. Luckily, we found a fairly reliable marker: an opening and closing div tag that wrapped the entire content of the page. At first we used a basic regular expression to find those div tags:

Dim sRegEx As String = "<div align=""center""[\d\D]*?</div>"

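This worked fine until we ran into some pages that had div tags within the div tags. Because the lazy [\d\D]*? stops at the first </div> it finds, the match ends at the closing tag of the inner div and the rest of the content is lost. Given the example below, the regex returns everything up to and including the inner </div>, but not the text after it.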

Example:

<div align="center">

Some text here

<div> this is between another div</div>

Here is more text that should be in the content we are ripping.

</div>
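
To see the truncation concretely, the short sketch below runs the first expression against a condensed copy of the example. The module name, the LazyMatch helper, and the sample string are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyMatchDemo
    ' Hypothetical helper: applies the first, lazy expression from this post.
    Function LazyMatch(ByVal html As String) As String
        Dim sRegEx As String = "<div align=""center""[\d\D]*?</div>"
        Return Regex.Match(html, sRegEx).Value
    End Function

    Sub Main()
        ' Condensed version of the example above (assumed sample data).
        Dim html As String = _
            "<div align=""center"">Some text here" & _
            "<div> this is between another div</div>" & _
            "Here is more text.</div>"
        ' The lazy quantifier stops at the first </div>,
        ' so the text after the inner div is cut off.
        Console.WriteLine(LazyMatch(html))
    End Sub
End Module
```

With this input the printed match should end at the inner closing tag, leaving out "Here is more text." and the outer closing tag.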

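After some digging, the following regex does the trick for our pages. Note that its last group ends in a greedy .*</div>, so the match runs through the last </div> in the input. That is what we want when the content div is the last one to close: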

' Requires Imports System.Text.RegularExpressions at the top of the file.
Dim regexp As Regex = New Regex( _
    "(<[^>]*?div[^>]*?(?:center)[^>]*>)((?:.*?(?:<[ \r\t]*div[^>]*>?.*?(?:<.*?/.*?div.*?>)?)*)*)(<[^>]*?/[^>]*?div[^>]*?>.*</div>)", _
    RegexOptions.IgnoreCase _
    Or RegexOptions.Singleline _
)
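
For pages where more div tags follow the content block, an alternative worth considering is the .NET balancing group construct, which lets the regex engine keep a count of opening and closing tags. The following is only a minimal sketch and not what we used on the project. The module and function names are hypothetical, and it assumes the wrapper is written as align="center" with double quotes and that no div-like text appears inside comments or scripts.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NestedDivSketch
    ' Hypothetical helper: returns the outer centered div including
    ' any nested divs, or Nothing when no balanced block is found.
    Function ExtractCenteredDiv(ByVal html As String) As String
        Dim pattern As String = _
            "<div\b[^>]*\balign\s*=\s*""center""[^>]*>" & _
            "(?>" & _
                "<div\b[^>]*>(?<depth>)" & _
                "|</div\s*>(?<-depth>)" & _
                "|(?!</?div\b)." & _
            ")*" & _
            "(?(depth)(?!))" & _
            "</div\s*>"
        ' (?<depth>) pushes a capture for each nested opening div.
        ' (?<-depth>) pops one for each nested closing div and fails
        ' when nothing is left to pop, which ends the loop at the outer </div>.
        ' (?(depth)(?!)) rejects the match if any nested div is still open.
        Dim m As Match = Regex.Match(html, pattern, _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        If m.Success Then
            Return m.Value
        End If
        Return Nothing
    End Function

    Sub Main()
        ' Assumed sample: the content div followed by an unrelated div.
        Dim html As String = _
            "<div align=""center"">Some text here" & _
            "<div> this is between another div</div>" & _
            "Here is more text.</div><div>footer</div>"
        Console.WriteLine(ExtractCenteredDiv(html))
    End Sub
End Module
```

With this input the sketch should print the centered div and its nested div, stopping at the matching closing tag and leaving out the trailing footer div. For anything beyond a predictable wrapper like ours, a real HTML parser is likely the safer choice.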

Wednesday, March 14, 2007 1:59:48 AM (US Eastern Standard Time, UTC-05:00)
Copyright © 2009 Venexus, Inc. All rights reserved.