The issue with truly big data is that you end up with field separators that also appear inside data values (text data). What are the chances of finding a double tab in a 1 GB file? Not that high. In a 100 TB file, the chance is very high. Now the question is: is it a big issue, or is it acceptable as long as less than 0.01% of the data is impacted? In some cases, once the glitch occurs, ALL the data after the glitch is corrupted, because nothing past that point is read correctly. This is especially true when a data value contains text identical to a row or field separator, such as CR/LF (carriage return / line feed). The problem gets worse when data is exported from Unix or Mac to Windows, or even from Access to Excel.
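For instance, a quick integrity scan can reveal where such a glitch first occurs. The sketch below is a minimal illustration, not a definitive tool; the file name data.tsv and the helper find_bad_rows are hypothetical. It flags rows whose field count differs from the header's, the typical symptom of a separator embedded in a value or a stray CR/LF splitting a record in two.

```python
import csv

def find_bad_rows(path, delimiter="\t"):
    """Report rows whose field count differs from the header's,
    the typical symptom of a separator embedded in a data value
    or of a stray CR/LF splitting one record into two."""
    bad = []
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f, delimiter=delimiter)
        expected = len(next(reader))  # field count in the header row
        for line_no, row in enumerate(reader, start=2):
            if len(row) != expected:
                bad.append((line_no, len(row)))
    return bad

# Hypothetical file name, for illustration only:
for line_no, n in find_bad_rows("data.tsv"):
    print(f"line {line_no}: {n} fields")
```

Note that once a row splits this way, every field after the break lands in the wrong column, so even a tiny glitch rate can corrupt a large share of the downstream records.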
How do you handle this issue when working with very large data?
Two rules of thumb help. First, a tab-separated format is much better than CSV (comma-separated), because tabs occur far less often inside text values than commas do. Second, creating a data dictionary as an exploratory tool helps pinpoint these issues; a minimal sketch is given below.
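As one way to build such a data dictionary (again assuming a hypothetical tab-separated file data.tsv), the sketch below tallies value frequencies per column. Fragments created by a misread separator tend to surface as rare, garbled values, so scanning the low-frequency tail of each column is a fast way to spot trouble.

```python
from collections import Counter

def build_data_dictionary(path, delimiter="\t", top=5):
    """Tally value frequencies per column; print the most common
    values of each field as a rough exploratory data dictionary."""
    with open(path, encoding="utf-8", errors="replace") as f:
        header = f.readline().rstrip("\r\n").split(delimiter)
        counters = [Counter() for _ in header]
        for line in f:
            fields = line.rstrip("\r\n").split(delimiter)
            # zip truncates short rows instead of crashing on them
            for counter, value in zip(counters, fields):
                counter[value] += 1
    for name, counter in zip(header, counters):
        print(f"{name}: {counter.most_common(top)}")

# Hypothetical file name, for illustration only:
build_data_dictionary("data.tsv")
```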