Pull Out Text from HTML Code


ScrapeText is a simple function that I use to pull plain text out of strings that contain markup-language formatting. I wrote this function to help build a report for a questionnaire system. The report needed to display only the plain text of each question. However, the text of each question was stored in heavily formatted HTML.

I searched online for a solution but didn’t find anything useful.Then, in a moment of clarity,a solution came to me:Write a function that accepts an HTML string as input and returns a plain text string by simply skipping over any text contained within the <and> tags. For example, if you run the code

SELECT dbo.ScrapeText
  (‘<a href=”www.sqlserver.com”>
  SQL Server</a>

the result is "SQL Server".

I wrote the ScrapeText function for SQL Server 2005 and SQL Server 2000. Listing 1 shows the code that does the main processing.This code first wraps the input string (which can be a string or an alphanumeric column in a table) with the > and <tags. The code then searches the input string for any nonbreaking spaces (&nbsp;), replacing them with standard spaces.

After the initial data cleansing is complete, the code examines the input string one character at a time from left to right. An imaginary pen is ready to start copying the desired characters to an output string. When a > tag is encountered, the pen is put to paper because it has reached the beginning of a desired string of text. The pen then writes each character to the output string until a <tag is encountered, at which point the pen stops writing. This process continues until the end of the input string is reached.
—Bill McEvoy

Hide comments


  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.