VARCHAR and CHAR datatypes are attractive. On the surface they offer flexibility and accommodate a variety of textual data. There is another side to VARCHAR that data architects need to be mindful of. Data requirements for free-form textual data are common. They may include data that appears to meet those criteria on the surface but is something totally different under the covers.
Good data analysis takes a deeper dive and analyzes the contents of a VARCHAR data element. Over the years I have yielded to the temptation of just defining field as VARCHAR since the analysis was “fuzzy” at the time, and the business user insisted that data was needed in the application. I have encountered many misuses of VARCHAR in data designs. Here are some of my favorite scenarios where a square VARCHAR is put into a round hole.
- Big Block Data – The application needs to transfer a block of data to other applications. The logic is that defining dedicated columns is time consuming and a maintenance nightmare. The conversation is usually “Just give me a VARCHAR (256). I will handle everything in the application.” Time marches on and more and more applications need this data. It proliferates. It changes. Boom! It blows up. Everything is out of sync.
- Ten-columns-in-One – This was rampant at a former employer. Similar to the big block data, they would define a CHAR (10) column that actually contained 10 CHAR (1) switches that controlled application use of the data. It became very problematic when the data became of enterprise interest and other technologies came knocking on the database’s door. The columns were hidden. The data was difficult to use.
- The Versatile Column – The business user needs a new column. Time is short; too short for data modeling. The developer reuses a column for another purpose. VARCHAR columns are particularly attractive targets. There is a VARCHAR (50) column that never exceeds 20 characters. Just use the last 10 characters. What can go wrong?! Need I say more? This data quality issue is often done behind the data architect’s back. It is uncovered in data profiling.
- Non-numeric Numeric – Do you define every numeric data element as numeric or decimal? There are some specific cases where it is best defined as a character column. An example might be a state of province number that will never be involved in a calculation. Unfortunately, I come across numbers defined as VARCHAR or CHAR because the defining application only used it in a non-computational manner. Data analysis should always look at a data element from an enterprise and full lifecycle point of view.
- Painting the Wrong Picture – This misuse builds on the non-numeric numeric misuse. Data elements were defined as VARCHAR or CHAR for an application’s ease of use. This might include monetary amounts, dates, and timestamps. The application requirements did not delve deeper than the immediate need for the data. This is more common in legacy applications that came from the days of keypunched data. The application rewrite did not address the need to set appropriate datatypes. Application remodeling should always include analysis of data types.