Textual & Spreadsheet Data – Effective Data Science Series: 5 of 5

Experience has shown that when the data scientist goes after structured and machine-generated data, positive results are few. The most fertile grounds are instead textual data and spreadsheet data. Fig 1 depicts textual and spreadsheet data.

But, as has been discussed, there is a barrier to accessing and analyzing textual and spreadsheet data. Textual and spreadsheet data is not “well behaved”. It is erose (irregular and uneven in shape), and common data base management systems do not hold or interact well with erose data.

But the fact that textual data forms a basis for business value does not mean that ALL textual data is useful for finding business value. Fig 2 notes that some textual data is not fit to serve as a basis for finding business value.

Some textual data is informal. Some textual data is hearsay. Some textual data is casual. So textual data must be vetted as to its suitability to serve as a basis for business value.

The same is true for spreadsheet data. Fig 3 shows that some spreadsheets are not suitable to serve as a basis for finding business value.

Some spreadsheets are informal. Some spreadsheets are casual. Some spreadsheets are created at 9:00 am and are deleted at 10:00 am. There are many reasons why a spreadsheet may not be a good candidate to serve as a basis for finding business value.

Fig 4 shows that there is a continuum of spreadsheets.

In actuality, probably only 10% or less of the corporation's spreadsheets are fit to serve as a basis for finding business value.

Once the organization has vetted both textual data and spreadsheet data, the next step is to employ technology that transforms the data into a form that a standard data base management system can hold. Two very different technologies are required. For text, there is textual disambiguation, as seen in Fig 5.

And for spreadsheets there is spreadsheet disambiguation, as seen in Fig 6.

At a high level, textual disambiguation and spreadsheet disambiguation appear to be similar, because they both achieve the same function. They both convert unstructured data into a form fit for a standard data base management system. But once you look inside the two technologies, they are nothing alike.

Textual disambiguation deals with the vagaries of language and text, while spreadsheet disambiguation deals with the idiosyncrasies of spreadsheets.

Once text and/or spreadsheets have been disambiguated, they are turned into a standard data base. And after they have been turned into a standard data base, then (and only then) the data scientist can start to do his/her analysis.
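To make the idea of turning a spreadsheet into a standard data base a little more concrete, here is a minimal sketch in Python. It is not the spreadsheet disambiguation technology described in this series. It only shows the end state: an irregular sheet reduced to uniform rows that a data base can hold and query. The sample sheet, the table name `cell`, and the helper names `sheet_to_cells` and `load_cells` are all hypothetical, chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet export: a title row, a blank row, a header row,
# and data rows of uneven length (the "erose" shape the article describes).
RAW_SHEET = """Q1 Sales Summary,,
,,
Region,Units,Revenue
North,120,2400.50
South,,1810.00
West,95
"""


def sheet_to_cells(text, sheet_name):
    """Yield one (sheet, row, column, value) tuple per non-empty cell."""
    reader = csv.reader(io.StringIO(text))
    for row_num, row in enumerate(reader, start=1):
        for col_num, value in enumerate(row, start=1):
            value = value.strip()
            if value:  # skip empty cells instead of storing blanks
                yield (sheet_name, row_num, col_num, value)


def load_cells(conn, cells):
    """Store the cell tuples in a plain relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cell ("
        "sheet TEXT, row_num INTEGER, col_num INTEGER, value TEXT)"
    )
    conn.executemany("INSERT INTO cell VALUES (?, ?, ?, ?)", cells)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_cells(conn, sheet_to_cells(RAW_SHEET, "q1_sales"))

# Once the data is in a table, ordinary SQL applies:
# list everything found in the first column, top to bottom.
rows = conn.execute(
    "SELECT row_num, value FROM cell WHERE col_num = 1 ORDER BY row_num"
).fetchall()
for row_num, value in rows:
    print(row_num, value)
```

The sketch deliberately stores every cell as text and makes no attempt to decide which row is a title, which is a header, and which values are numbers. Those decisions are where the real work of disambiguation lies.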

It is disambiguation technology that breaks down the shield of opaqueness that surrounds text and spreadsheets.

The post Textual & Spreadsheet Data – Effective Data Science Series: 5 of 5 appeared first on Analytics India Magazine.
