Wednesday, January 27, 2016

Removing RTF codes from SQL data

A client requested data be migrated from one system to another.  Data in the source system included text data that had embedded RTF codes as below.  These codes were used for application rendering in what was likely an unstructured note field.

{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Microsoft Sans Serif;}}
{\colortbl ;\red0\green0\blue0;}\viewkind4\uc1\pard\cf1\f0\fs16 bunch of text here \par more text here. \cf0\fs17\par
}


Searching did not yield much in the way of help on this so I had to figure this out with my own noodle.  The way I solved this was by cataloging the list of control codes that start with the backslash.  I then removed them by using the SQL REPLACE function.  For example:

UPDATE <table> SET <column> = REPLACE(<column>,'\rtf1','')

That will remove the "\rtf1" code from the data completely.  I ended up with a very long list of control codes and REPLACE statements - you can see from the above snippet there are a lot of codes and this is just a short example.

What was also important to watch out for were short codes that could cause issues.  For example if I found a code "\fs1" and I removed that before removing "\fs16" and "\fs17" then I would have left spurious characters behind - not good.  So I sorted the REPLACE functions by length of control code from longest to shortest.

Of course I also had to remove the open and closing brackets as well as an assortment of font names and semicolons.  Especially trick was getting rid of an accumulation of CR/LF codes at the start of the text once the control codes were removed.  The SQL SUBSTRING function took care of those.

The end result was over 200 REPLACE statements and some very clean text to migrated.

Another happy client!

No comments:

Post a Comment