We ran into an issue parsing some HTML for a DNN content migration project we are working on. We needed to extract the actual content of each page, without all of the look and feel. Luckily, the content was reliably wrapped in a single pair of opening and closing div tags. At first we used a basic regular expression to find those div tags:
Dim sRegEx As String = "<div align=""center""[\d\D]*?</div>"
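This worked fine until we ran into pages that had div tags within the div tags. The lazy [\d\D]*? stops at the first </div> it finds, so the match ends at the inner closing tag and the rest of the content is dropped. In the following example, the regex returns only the text up to and including the inner </div>.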
Example:
<div align="center">
Some text here
<div> this is between another div</div>
Here is more text that should be in the content we are ripping.
</div>
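To make the failure concrete, here is a minimal sketch that runs the first pattern against the example above. The module name LazyDivDemo and the helper MatchLazy are made up for illustration and are not part of the migration code.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LazyDivDemo
    ' Hypothetical helper: applies the first (lazy) pattern from this post.
    Function MatchLazy(html As String) As String
        Dim sRegEx As String = "<div align=""center""[\d\D]*?</div>"
        Dim m As Match = Regex.Match(html, sRegEx)
        Return If(m.Success, m.Value, String.Empty)
    End Function

    Sub Main()
        Dim nl As String = Environment.NewLine
        Dim html As String =
            "<div align=""center"">" & nl &
            "Some text here" & nl &
            "<div> this is between another div</div>" & nl &
            "Here is more text that should be in the content we are ripping." & nl &
            "</div>"

        Dim result As String = MatchLazy(html)
        Console.WriteLine(result)

        ' The lazy quantifier stops at the first </div>, which closes the inner div.
        Console.WriteLine(result.EndsWith("another div</div>", StringComparison.Ordinal)) ' True
        ' Everything after the inner div is missing from the match.
        Console.WriteLine(result.Contains("Here is more text")) ' False
    End Sub
End Module
```

The last line of text never makes it into the match, which is exactly the content we were trying to keep.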
After some digging, the following regex does the trick:
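' Requires: Imports System.Text.RegularExpressions
' Singleline lets "." match line breaks; IgnoreCase also matches DIV, Div, etc.
Dim regexp As Regex = New Regex( _
    "(<[^>]*?div[^>]*?(?:center)[^>]*>)((?:.*?(?:<[ \r\t]*div[^>]*>?.*?(?:<.*?/.*?div.*?>)?)*)*)(<[^>]*?/[^>]*?div[^>]*?>.*</div>)", _
    RegexOptions.IgnoreCase _
    Or RegexOptions.Singleline _
)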
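For comparison, the .NET regex engine also supports balancing groups, which can count nested tags explicitly. The sketch below is an alternative and not the pattern we ended up using. It assumes the wrapper starts with exactly <div align="center", and the names NestedDivDemo, ExtractCenterDiv and the group name depth are made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module NestedDivDemo
    ' Hypothetical helper: returns the outer <div align="center"> block,
    ' keeping nested divs balanced. Returns an empty string if nothing matches.
    Function ExtractCenterDiv(html As String) As String
        Dim pattern As String =
            "<div\s+align=""center""[^>]*>" &
            "(?>" &
                "<div\b[^>]*>(?<depth>)" &
                "|</div\s*>(?<-depth>)" &
                "|(?!</?div\b)." &
            ")*" &
            "(?(depth)(?!))" &
            "</div\s*>"
        ' Alternative 1 pushes onto the "depth" stack for each nested opening div.
        ' Alternative 2 pops for each closing div and fails when the stack is empty,
        ' so it can never swallow the outer closing tag.
        ' Alternative 3 consumes any character that does not start a div tag.
        ' (?(depth)(?!)) rejects the match if a nested div was left unclosed.
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        Dim m As Match = Regex.Match(html, pattern, options)
        Return If(m.Success, m.Value, String.Empty)
    End Function

    Sub Main()
        Dim nl As String = Environment.NewLine
        Dim content As String =
            "<div align=""center"">" & nl &
            "Some text here" & nl &
            "<div> this is between another div</div>" & nl &
            "Here is more text that should be in the content we are ripping." & nl &
            "</div>"
        ' A sibling div after the content block should not be included.
        Dim page As String = content & nl & "<div>footer</div>"

        Console.WriteLine(ExtractCenterDiv(page) = content) ' True
    End Sub
End Module
```

Unlike a pattern that runs to the last </div> in the input, this one stops at the closing tag that pairs with the opening one, so trailing sibling divs are left out. It still treats the HTML as plain text, so div-like strings inside comments or scripts would confuse it.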