Parsing HTML with SQL

Fun with XPath

July 12, 2013

A friend sent me the following request earlier this week:

I was wondering if you knew how to use SQL to parse a column and return multiple individual records where text in that column is started with <B>This is a test</B>. The column might have these tags multiple times.

Sample Column Text…

This is a test
<B>Test</B>
How is it going
<B>Get this Text Also</B>

It should return…

Test
Get this Text also

Challenge accepted! Here is my initial reply:


You’re going to hate me, but you can do this with SQL’s XML functionality. I have built a test case that works on your sample set.

NOTE: ALL XML FUNCTION NAMES ARE CASE-SENSITIVE (lowercase), EVEN THOUGH SQL IN GENERAL IS NOT.

Consider the following table, which represents your existing table, which I presume has the HTML stored as a VARCHAR:

CREATE TABLE dbo.ParseExample
(
    HTMLText VARCHAR(4000) NULL
)

Now we insert your test case into it:

INSERT INTO ParseExample VALUES('This is a test
<B>Test</B>
How is it going
<B>Get this Text Also</B>')

SELECT * will bring back what you’d expect.

So, we can convert the HTML into XML (even though it’s not valid XML) on the fly using a common table expression:

WITH Converted
AS
(SELECT CAST(HTMLText AS XML) AS X FROM ParseExample)
SELECT * FROM Converted

Now you can do fun things with XPath and XML functions. For example, to just get the <B> elements and their values (the XPath //B will look for bold elements wherever they are in the XML DOM), you can use the query function:

WITH Converted
AS
(SELECT CAST(HTMLText AS XML) AS X FROM ParseExample)
SELECT X.query('//B') FROM Converted

But this isn’t useful, because the result still has the <B> tags and also returned everything in one column, which isn’t what you want. So then you use the nodes function to get individual rows for each node that matches the XPath query, and then the value function to extract just the text (and not the tags) from that:

WITH Converted
AS
(SELECT CAST(HTMLText AS XML) AS X FROM ParseExample)
SELECT nref.value('.', 'nvarchar(max)') BoldText
FROM Converted
CROSS APPLY X.nodes('//B') AS R(nref)

The above returns exactly what you were looking for. I look forward to hearing you explain how it works to the client. :)


Of course, there was one more wrinkle – his actual live data had &nbsp; entities in it, so the XML parser was choking on that with:

XML parsing: line 1, character 9, well formed check: undeclared entity

In follow-up discussions he decided to just do this in the common table expression:

(SELECT CAST(REPLACE(HTMLText, '&nbsp;', ' ') AS XML) AS X FROM ParseExample)

I personally would prefer something more like:

(SELECT CAST(REPLACE(HTMLText, '&nbsp;', '&amp;nbsp;') AS XML) AS X FROM ParseExample)

Anyway, I thought I would document this just for fun.