A friend sent me the following request earlier this week:
I was wondering if you knew how to use SQL to parse a column and return multiple individual records where text in that column is started with
<B>This is a test</B>. The column might have these tags multiple times.
Sample Column Text…
This is a test <B>Test</B> How is it going <B>Get this Text Also</B>
It should return…
Test Get this Text also
Challenge accepted! Here is my initial reply:
You’re going to hate me, but you can do this with SQL’s XML functionality. I have built a test case that works on your sample set.
NOTE: ALL XML FUNCTION NAMES ARE CASE-SENSITIVE (lowercase), EVEN THOUGH SQL IN GENERAL IS NOT.
Consider the following table, which represents your existing table, which I presume has the HTML stored as a
CREATE TABLE dbo.ParseExample ( HTMLText VARCHAR(4000) NULL )
Now we insert your test case into it:
INSERT INTO ParseExample VALUES('This is a test <B>Test</B> How is it going <B>Get this Text Also</B>')
SELECT * will bring back what you’d expect.
So, we can convert the HTML into XML (even though it’s not valid XML) on the fly using a common table expression:
WITH Converted AS (SELECT CAST(HTMLText AS XML) AS X FROM ParseExample) SELECT * FROM Converted
Now you can do fun things with XPath and XML functions. For example, to just get the
<B> elements and their values (the XPath
//B will look for bold elements wherever they are in the XML DOM), you can use the
WITH Converted AS (SELECT CAST(HTMLText AS XML) AS X FROM ParseExample) SELECT X.query('//B') FROM Converted
But this isn’t useful, because the result still has the
<B> tags and also returned everything in one column, which isn’t what you want. So then you use the
nodes function to get individual rows for each node that matches the XPath query, and then the
value function to extract just the text (and not the tags) from that:
WITH Converted AS (SELECT CAST(HTMLText AS XML) AS X FROM ParseExample) SELECT nref.value('.', 'nvarchar(max)') BoldText FROM Converted CROSS APPLY X.nodes('//B') AS R(nref)
The above returns exactly what you were looking for. I look forward to hearing you explain how it works to the client. :)
Of course, there was one more wrinkle – his actual live data had
entities in it, so the XML parser was choking on that with:
XML parsing: line 1, character 9, well formed check: undeclared entity
In follow-up discussions he decided to just do this in the common table expression:
(SELECT CAST(REPLACE(HTMLText, ' ', ' ') AS XML) AS X FROM ParseExample)
I personally would prefer something more like:
(SELECT CAST(REPLACE(HTMLText, ' ', '&nbsp;') AS XML) AS X FROM ParseExample)
Anyway, I thought I would document this just for fun.