
Why Data Scientists Are Heavily Weighted Towards Open Source Frameworks: CG Venkatesh of LTI Explains


With 20+ years of expertise in statistics & data science, CG Venkatesh has spearheaded several industry-specific strategic advanced analytics solutions. With his broad experience, he has designed and implemented data analytics solutions for Fortune 500 clients across industry verticals. He is also associated with MIT Sloan, IEEE, AICTE and various other universities, driven by his passion for extended interactions with budding data scientists and academicians.

An acclaimed analytics thought leader, he heads the data science practice at LTI. AIM got in touch with CG Venkatesh, better known as CG, to get his insights on the leading tools and techniques currently used by analytics, AI and data science practitioners. In this detailed interview, CG gives us the lowdown on the tools most preferred by his team, his preference between open source and paid tools, cloud providers, LTI’s in-house tool and more.

Analytics India Magazine: What are the most commonly used tools in analytics, AI, data science?

CG: These are some of the most popular and commonly used tools, in my view.

Data Science & Applied Statistics:

  • Commercial products: SAS, IBM SPSS, STATISTICA
  • Open source: R, Python (pandas, NumPy, scikit-learn) and the libraries built on these tools.

AI/ ML:

  • Commercial: IBM Watson, Amazon SageMaker, Baidu, Cloudera, Confluent, Databricks, Google ML, Microsoft’s Cognitive and Computer Vision suites
  • Open source: R, Python, TensorFlow-based libraries

NLP:

  • Open source frameworks like Stanford NLP, GATE, Python’s NLP libraries

AIM: What is the most productive tool that you have come across?

CG: Python’s pandas and R are the most productive tools for data preparation before modelling, along with SQL-supporting libraries that offer efficient manipulation and transformation features based on matrix-algebra-driven data structures like data frames and data sets.
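
For illustration, here is a minimal pandas sketch of this kind of pre-modelling preparation; the file name and column names are hypothetical, not from the interview:

```python
import pandas as pd

# Hypothetical raw extract; the file and columns are assumptions for illustration.
df = pd.read_csv("transactions.csv", parse_dates=["order_date"])

# Typical data-frame transformations before modelling:
df["revenue"] = df["quantity"] * df["unit_price"]   # derived feature
df = df.dropna(subset=["customer_id"])              # drop rows unusable for modelling

# A SQL-style group-by/aggregate, expressed directly on the data frame:
monthly = (
    df.groupby([df["order_date"].dt.to_period("M"), "region"])
      .agg(total_revenue=("revenue", "sum"), orders=("order_id", "count"))
      .reset_index()
)
```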

AIM: Do you prefer tools that are open source or paid? Please elaborate on the benefits, some open source and paid tools that you prefer.

CG: The factors that impact the choice of tools are as follows:

  1. Availability of the required skill sets among the resources at hand
  2. Client’s data maturity
  3. Client mindset towards open source tools vs licensed tools
  4. Scalability of the solution design
  5. Build vs buy: cost-benefit analysis

Given a choice, I would prefer open source, for the following reasons:

  1. Ease of availability
  2. Portability across systems
  3. Big data capability and handling dynamic volume, velocity, variety of data
  4. Absence of licensing protocols, due to free availability
  5. Scalability in terms of resources, as they can be quickly trained
  6. Ease of proving feasibility & capability for a quick start and client buy-in

AIM: Is open source considered an important attribute when choosing the tool of your choice?

CG: Yes, most definitely.

AIM: What are the most common issues you face while dealing with data? How is selecting the right tool critical for problem-solving?

CG: Common issues faced in terms of data quality are:

  1. History
  2. Granularity
  3. Logical connectivity between multiple data sets and sources
  4. Availability
  5. Sufficiency
  6. Multiple forms and formats of data

The right tool combinations are crucial for problem-solving, as they are the key to extracting and transforming the data to reach the algorithmic stage.

AIM: How do you select tools for a given task?

CG: Data analysis tasks start with first checking what kind of technologies & processes can interact with the relevant data set, so that an initial analysis, profiling and sampling can be done. This analysis typically involves tools with SQL querying capabilities and tools that can convert data from one format to another quite easily. The tool should also be able to easily explore, summarise and visualise the univariate statistical measures. Additionally, if the tool offers features for imputing values & selecting samples, that’s an added advantage. After the initial profiling, the next important step is to choose a tool for data modelling. Often, the decision is driven by the set of algorithms we choose for the modelling, and by checking which tools implement those algorithms with great coverage of statistical scenarios, i.e. flexibility to tweak modelling parameters and metrics.
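
As a rough sketch of that initial profiling step in Python (the file and column names below are hypothetical, chosen only to illustrate the workflow described above):

```python
import pandas as pd

# Hypothetical data set; the file name is an assumption for illustration.
df = pd.read_parquet("customer_data.parquet")

# Initial profiling: shape, types and univariate statistical measures.
print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))   # per-column summary statistics
print(df.isna().mean())             # missing-value ratio per column

# Draw a reproducible sample for quicker exploratory analysis.
sample = df.sample(frac=0.1, random_state=42)

# Simple median imputation of a (hypothetical) numeric column.
df["income"] = df["income"].fillna(df["income"].median())
```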

AIM: What are the most user-friendly languages and tools that you have come across?

CG: R & Python are brilliant for processing data with respect to analytical goals. The Azure Machine Learning Studio interface is fast catching up on user-friendliness.

AIM: What does the ideal data scientist’s toolkit look like?

CG: An ideal data scientist’s toolkit should include:

  • A SQL-heavy tool/library to query environments hosting structured data (see the sketch after this list)
  • A tool with easy/intuitive syntax to query environments hosting unstructured data
  • A Studio/Client GUI that helps in analysis & model brain-storming
  • A visualization tool that visualizes insights without having to code too much
  • A spreadsheet application of course!
  • A tool that helps deploy and unit-test the models, before they can move to production
  • An ever-doubting mind
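
As a minimal illustration of the first item, querying structured data straight into a data frame; the database and table here are hypothetical stand-ins for a real warehouse connection:

```python
import sqlite3

import pandas as pd

# Hypothetical local database; in practice this would be a warehouse connection.
conn = sqlite3.connect("analytics.db")

# SQL-heavy querying of structured data directly into a data frame.
orders = pd.read_sql(
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region",
    conn,
)
print(orders.head())
conn.close()
```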

AIM: What is the most preferred language used by the team?

CG: R & Python are undoubtedly most preferred for ease of coding and the depth of libraries they offer.

AIM: Can you give us the percentage of data scientists and percentage of developers that use a particular language/data visualization tool etc.?

CG: Roughly 50-50 between R & Python, for both data scientists & developers.

AIM: What is the most preferred cloud provider: AWS, Google or Azure?

CG: Azure, due to its friendliness and easy integration with the vast set of other Microsoft products.

AIM: What are some of the tools used for scaling data science workloads; for example, Docker is gaining popularity vis-à-vis Spark?

CG: Clearly, Docker and similar self-contained packages, where services/APIs and applications run as processes/threads hosted in a micro OS like CoreOS, are the future when it comes to delivering millions of insights at scale. But for batch outcomes, in-memory distributed computing environments like Spark still take the cake.
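
For the batch side, a minimal PySpark aggregation job gives a feel for the in-memory distributed model; the paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-insights").getOrCreate()

# Hypothetical input path and schema; adjust to the real data set.
events = spark.read.parquet("hdfs:///data/events")

# Distributed in-memory aggregation across the cluster.
daily = (
    events.groupBy(F.to_date("event_time").alias("day"), "event_type")
          .agg(
              F.count("*").alias("events"),
              F.countDistinct("user_id").alias("users"),
          )
)

daily.write.mode("overwrite").parquet("hdfs:///output/daily_event_counts")
spark.stop()
```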

AIM: What are some of the proprietary tools developed in-house by the company?

CG: LTI’s Mosaic is a unique offering that leverages the power of data, AI & automation to overcome the challenges of data-driven decision management. The foundation of the platform is equipped with state-of-the-art data engineering and advanced analytics capabilities such as data ingestion, storage and governance, advanced analytics, processing, and consumption adaptors, extending a single interface for ‘Data to Decisions’.


