When the data scientist goes after structured and machine-generated data, experience has shown that there are not many positive results. Instead, the most fertile grounds are textual data and spreadsheet data. Fig 1 depicts textual and spreadsheet data.
But – as has been discussed – there is a barrier to accessing and analyzing textual and spreadsheet data. Textual and spreadsheet data is not “well behaved”. Spreadsheet and textual data is erose, and common data base management systems do not hold or interact well with erose data.
But it is noted that just because textual data forms a basis for business value does not mean that ALL textual data is useful for finding business value. Fig 2 notes that there is some amount of textual data that is not fit to serve as a basis for finding business value.
Some textual data is informal. Some textual data is hearsay. Some textual data is casual. So textual data must be vetted as to its suitability to serve as a basis for business value.
The same is true for spreadsheet data. Fig 3 shows that some spreadsheets are not suitable to serve as a basis for finding business value.
Some spreadsheets are informal. Some spreadsheets are casual. Some spreadsheets are created at 9:00 am and are deleted at 10:00 am. There are many reasons why a spreadsheet may not be a good candidate to serve as a basis for finding business value.
Fig 4 shows that there is a continuum of spreadsheets.
In actuality probably only 10% or less of the spreadsheets the corporation has are fit to serve as a basis for finding business value.
Once the organization has vetted both textual data and spreadsheet data, the next step is to employ technology that allows the data to be transformed into a standard data base management system. There are two very different technologies that are required. For text, there is textual disambiguation, as seen by Fig 5.
And for spreadsheets there is spreadsheet disambiguation, as seen in Fig 6.
At a high level, textual disambiguation and spreadsheet disambiguation appear to be similar, because they both achieve the same function. They both convert unstructured data into a standard data base management system. But once you look inside the two technologies they are nothing alike.
Textual disambiguation deals with the vagaries of language and text, while spreadsheet deals with the idiosyncrasies of spreadsheets.
Once text and/or spreadsheets have been disambiguated, they are turned into a standard data base. And after they have been turned into a standard data base, then (and only then) the data scientist can stat to do his/her analysis.
It is disambiguation technology that breaks down the shield of opaqueness that surrounds text and spreadsheets.
The post Textual & Spreadsheet Data – Effective Data Science Series: 5 of 5 appeared first on Analytics India Magazine.