
Nasty data corruption getting exponentially worse with the size of your data

The issue with truly big data is that sooner or later a field separator will show up inside an actual data value (text data). What are the chances of finding a double tab in a 1 GB file? Not that high. In a 100 TB file, the chance is very high. The question is whether this is a big problem, or whether it is acceptable as long as less than 0.01% of the data is impacted. In some cases, once the glitch occurs, ALL the data after the glitch is corrupted, because it is no longer read correctly; this is especially true when a data value contains text that is identical to a row or field separator, such as CR / LF (carriage return / line feed). The problem gets worse when data is exported from UNIX or Mac to Windows, or even from Access to Excel.
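To make the failure mode concrete, here is a tiny illustration (Python, with a made-up record) of how a single newline embedded in a text value turns one record into two broken ones when the file is read line by line:

    # One record whose text field contains a newline: read line by line,
    # it silently becomes two broken records (the second one misaligned).
    record = "2013-07-07\tuser wrote: hello\nworld\t42\n"
    rows = [line.split("\t") for line in record.splitlines()]
    print(rows)   # [['2013-07-07', 'user wrote: hello'], ['world', '42']]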

How do you handle this issue when working with very large data?

Follow these 3 steps:

  • Through exploratory analysis, check whether data values contain characters such as [ ] { } \ / ? < > ; . & * ( ) + - = % $ # @ ! ^ / , etc. Also look for tricks such as invisible or control characters: something that looks like a <SPACE> but actually is not one, even though it is rendered as a <SPACE> in Notepad or whatever tool you use. Typically these characters have an ASCII code below 32 or above 127. This type of issue is widespread with web log data.
  • Identify all special characters that are causing problems (see above), starting with the most widespread value/separator collisions, until most are resolved. For instance, if a tab is found inside a value, replace it with a different symbol, maybe something like __||__, which is highly unlikely to be found except in very, very, very large data sets.
  • Identify misplaced fields by looking for glitches such as this one: on the first 5,565,897 rows, field #24 is a date; starting on row 5,565,898, it is no longer a date but a text string. Identify which special character on row 5,565,898, column 24, is causing the problem (see the sketch after this list).
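Here is a minimal Python sketch covering steps 1 and 3, assuming a tab-separated file. The file name weblog.tsv, the expected field count, the date column number, and the date pattern are all placeholders to adjust to your own data. It flags control/invisible characters, rows whose field count drifts (a symptom of a separator hiding inside a value), and rows where a column stops looking like a date.

    import re

    DATE_LIKE = re.compile(r"^\d{4}-\d{2}-\d{2}$")   # placeholder pattern; adjust to your date format

    def suspicious_chars(value):
        # Characters the article warns about: ASCII below 32, above 127,
        # or the replacement char produced by undecodable bytes.
        return [c for c in value if ord(c) < 32 or ord(c) > 127 or c == "\ufffd"]

    def scan_file(path, sep="\t", expected_fields=None, date_column=None):
        # Report rows with control/invisible characters, a wrong field count,
        # or a value that stops looking like a date in one column.
        with open(path, encoding="utf-8", errors="replace") as f:
            for line_no, line in enumerate(f, start=1):
                row = line.rstrip("\r\n").split(sep)

                # Step 1: invisible or control characters inside values.
                for col_no, value in enumerate(row, start=1):
                    bad = suspicious_chars(value)
                    if bad:
                        print("row %d, field %d: suspicious chars %r" % (line_no, col_no, bad))

                # A separator hidden inside a value shifts every field after it.
                if expected_fields is not None and len(row) != expected_fields:
                    print("row %d: %d fields, expected %d" % (line_no, len(row), expected_fields))

                # Step 3: a column that used to be a date is suddenly something else.
                if date_column is not None and len(row) >= date_column:
                    value = row[date_column - 1]
                    if value and not DATE_LIKE.match(value):
                        print("row %d, field %d: not a date: %r" % (line_no, date_column, value))

    # scan_file("weblog.tsv", expected_fields=30, date_column=24)   # placeholders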

As a rule of thumb, a tab-separated file format is much better than CSV (comma-separated), since tabs are far rarer than commas in free text. Also, the creation of a data dictionary, as an exploratory tool, helps pinpoint issues.
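A data dictionary can be as simple as counting, per column, how many values look numeric, date-like, or like free text, plus a few example values; a column that is 99% dates and 1% text is exactly the kind of glitch described in step 3. Below is a minimal sketch, assuming a tab-separated file with a header row (the file name export.tsv and the date pattern are placeholders):

    import re
    from collections import Counter, defaultdict

    def data_dictionary(path, sep="\t", max_examples=3):
        # Per column: how many values look empty, date-like, numeric, or free text,
        # plus a few example values. Column names are taken from the header row.
        date_re = re.compile(r"^\d{4}-\d{2}-\d{2}")   # adjust to your date format
        kinds = defaultdict(Counter)
        examples = defaultdict(set)
        with open(path, encoding="utf-8", errors="replace") as f:
            header = f.readline().rstrip("\r\n").split(sep)
            for line in f:
                for name, value in zip(header, line.rstrip("\r\n").split(sep)):
                    if value == "":
                        kind = "empty"
                    elif date_re.match(value):
                        kind = "date"
                    else:
                        try:
                            float(value)
                            kind = "numeric"
                        except ValueError:
                            kind = "text"
                    kinds[name][kind] += 1
                    if len(examples[name]) < max_examples:
                        examples[name].add(value)
        for name in header:
            print(name, dict(kinds[name]), sorted(examples[name]))

    # data_dictionary("export.tsv")   # "export.tsv" is a placeholder file name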


Comment by Jean Michel LeTennier on July 7, 2013 at 6:30am

Depending on the tool... quite easily.

Comment by Alex Mylnikov on July 7, 2013 at 7:12am

The CSV file format works well only with numeric data and text fields that do not contain the special symbols used to separate values. If you have no control over the text values, you have to encode the data. One of the standard ways is HTTP (URL) encoding, also known as percent-encoding, which is supported on almost any programming platform. For example, the encoded value for the space character is %20: "%" is the symbol that marks the start of an encoded value, and 20 is the encoded value of the space character. You can also create your own custom encoding. In such a scheme you should assign a special code to each value separator, and to the symbol you use as the start indicator of an encoded value.
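For what it is worth, here is a minimal Python sketch of that idea, using the standard urllib.parse percent-encoding rather than a custom scheme (the sample record is made up):

    from urllib.parse import quote, unquote

    def encode_field(value):
        # Percent-encode everything non-alphanumeric, so separators, CR/LF and "%"
        # itself can never appear inside a stored value.
        return quote(value, safe="")

    record = ["2013-07-07", "free text with a\ttab and a\nnewline", "42"]
    line = "\t".join(encode_field(v) for v in record)      # now safe to write as one TSV row
    print(line)
    print([unquote(v) for v in line.split("\t")])          # decodes back to the original values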

Comment by Vincent Granville on July 7, 2013 at 5:39pm

One cause of problems is when data is gathered from different sources - say Windows and UNIX, and blended together.

Comment by jaap Karman on July 9, 2013 at 10:59am

We live in the present, but may have forgotten the ancient times of computer technology.
It started with the typewriter and the Telex, which needed an explicit indication of the carriage return and the newline. In those days of limited hardware, the limitation of representing a character in one byte (8 bits) was also set, with the 8th bit normally used to validate transfer errors. That is old ASCII: 128 characters, with the lowest ones used as control signals.

Nothing is the same anymore. http://en.wikipedia.org/wiki/%5Cn#In_programming_languages

Old Mac, Unix (and the new Mac), and Windows have each implemented a different combination of the Telex-style record markers \r and \n.

At first we had extended character sets with code pages (255 characters). With Unicode in use, old ASCII has been replaced, and one character can take 1 to 4 bytes, or even more in some schemes (as of 2012). All of these are still in use. https://en.wikipedia.org/wiki/Unicode.


IBM introduced its own standards with its mainframes (System/360): EBCDIC instead of ASCII, and other types of record indication (fixed and variable length). The variable type depends on binary pointers, introducing the common limits of 255 (text editors), 32K, 16M, and 2G (one bit being reserved as a sign, stop, or error indicator).

 

Then we have all the country differences: dates have different notations, and the comma and decimal point are used differently.

The same has happened with DBMS fields: many special formats that must be translated, ideally without loss of information. And there is more: not all of that kind of information is available.

All these interpretations of the meaning of bytes are very nasty, as they can cause unexpected issues.

The three steps mentioned are good starting points. A good systems analysis of the possible issues in a data-gathering project, and an understanding of the risks, is the next step.
