Big Data: Where to start?


The big data world is so big that it is humongous. Big data engineer, big data analyst, big data scientist – are these different names for the same role? It is overwhelming to figure out which strand to take hold of, and how to climb that big mountain! On top of that: which algorithm to use, which tooling, which language?

Do I learn Hadoop, Kafka, or AWS – and if AWS, which parts of the stack?

Let us start by understanding the roles and responsibilities behind these job titles. This also serves as a reference for the skill set we need if we want to do all of it ourselves. Later, we shall dive deeper into which stack is usually recommended. Spoiler alert – there is no single recipe.

[Image: a satellite hovering above the Earth, observing the weather. Weather stations continuously use big data to predict the future.]

Data Analyst

The process of extracting information from a given pool of data is called data analytics. A data analyst extracts information through methodologies such as data cleaning, data conversion, and data modeling. Data analytics is used across many industries – technology, medicine, social science, business, and more. Industries can now make careful data-driven decisions because they are able to analyze market trends, understand their clients' requirements, and review their own performance.

A data analyst is also well versed in several visualization techniques and tools. Presentation skills are essential for a data analyst, as they allow results to be communicated to the team and help it reach sound decisions.

Data analytics allows industries to run fast queries that produce actionable results on short notice. This focuses data analytics on the short-term needs of the business, where quick action is required.
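As a minimal sketch of what this looks like in practice, the following Python snippet cleans a small dataset with pandas and answers a quick business question. The data and column names (region, revenue, order_date) are invented for illustration.

```python
# Minimal analyst-workflow sketch with pandas.
# The dataset and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", None, "South"],
    "revenue": ["1200", "950", "1100", "700", None],
    "order_date": ["2023-01-05", "2023-01-07", "2023-02-01",
                   "2023-02-03", "2023-02-10"],
})

# Data cleaning and conversion: drop incomplete rows, fix types.
df = df.dropna()
df["revenue"] = df["revenue"].astype(float)
df["order_date"] = pd.to_datetime(df["order_date"])

# A fast, actionable query: monthly revenue per region.
monthly = df.groupby([df["order_date"].dt.to_period("M"), "region"])["revenue"].sum()
print(monthly)
```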

Data Engineer

A Data Engineer specializes in preparing data for analytical use. They develop the foundation for various data operations and are responsible for designing the formats and structures that data scientists and analysts work with.

They work with both structured and unstructured data, which enables data scientists to carry out their data operations. Dealing with big data, they engage in numerous operations such as data cleaning, management, transformation, and deduplication.

A Data Engineer is more experienced with core programming concepts and algorithms, and the role closely follows that of a software engineer: a data engineer develops platforms and architecture according to software development guidelines. For example, building a cloud infrastructure that facilitates real-time analysis of data requires various development principles. Building interface APIs is therefore one of the responsibilities of a data engineer.

Furthermore, a data engineer has good knowledge of engineering and testing tools. It is up to the data engineer to manage the entire pipeline architecture: logging and handling errors, agile testing, building fault-tolerant pipelines, administering databases, and keeping the pipeline stable.
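To make "fault-tolerant pipeline" concrete, here is a minimal Python sketch of an extract-transform-load step that retries transient failures and logs errors instead of crashing. The extract, transform, and load functions are hypothetical stand-ins for real connectors.

```python
# Minimal ETL sketch showing retries, error logging, and deduplication.
# extract(), transform(), and load() stand in for real connectors
# (e.g. a database reader and a warehouse writer).
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def extract():
    return [{"id": 1, "value": " 42 "}, {"id": 1, "value": "42"}, {"id": 2, "value": "7"}]

def transform(rows):
    seen, out = set(), []
    for row in rows:
        if row["id"] in seen:          # deduplication
            continue
        seen.add(row["id"])
        out.append({"id": row["id"], "value": int(row["value"].strip())})
    return out

def load(rows):
    log.info("loaded %d rows", len(rows))

def run(max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            load(transform(extract()))
            return
        except Exception:
            log.exception("attempt %d failed", attempt)
            time.sleep(2 ** attempt)   # exponential backoff
    raise RuntimeError("pipeline failed after retries")

run()
```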

Data Scientist

Nowadays, every company is looking for data scientists to boost its performance and optimize its production.

There is a massive explosion in data, driven by advancements in computational technologies such as high-performance computing. This has given industries a massive opportunity to unearth meaningful information from their data.

Companies extract data to analyze it and gain insights about various trends and practices. To do so, they employ specialized data scientists who possess statistical knowledge and programming skills. Moreover, a data scientist knows machine learning algorithms, which are used to predict future events. Data science can therefore be thought of as an ocean that includes all the data operations – extraction, processing, analysis, and prediction – needed to gain the necessary insights.

However, data science is not a singular field. It is a quantitative field that shares its background with math, statistics, and computer programming. With the help of data science, industries are equipped to make careful data-driven decisions.
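To make "predicting future events" concrete, here is a minimal scikit-learn sketch: fit a model on historical observations, then predict an outcome for new inputs. The features and figures are invented for illustration.

```python
# Minimal predictive-modeling sketch with scikit-learn.
# Features (ad_spend, site_visits) and all numbers are hypothetical.
from sklearn.linear_model import LinearRegression

# Historical data: [ad_spend, site_visits] -> monthly sales.
X = [[100, 2000], [150, 2600], [200, 3500], [250, 4100]]
y = [24, 30, 41, 48]

model = LinearRegression().fit(X, y)
print(model.predict([[300, 5000]]))  # forecast for a planned campaign
```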

The skills mentioned above can be summarized in the table below:

  • *** = Very important
  • ** = Important
  • * = Trivial
  Skill                                  Data Analyst   Data Engineer   Data Scientist
  Calculus and Linear Algebra            *              *               ***
  Data Intuition                         **             **              ***
  Data Visualization and Communication   ***            **              ***
  Data Wrangling                         *              ***             ***
  Machine Learning                       *              *               ***
  Programming Tools                      ***            ***             *
  Software Engineering                   *              ***             **
  Statistics                             **             **              ***

 

Big Data programming language comparison

There is a plethora of programming languages in use today, serving a variety of purposes.

We have compared a few across several aspects to make the decision-making process easier:

[Table: Scala, Python, R, Java, Go, and Julia compared on speed, ease of use, learning curve, data analysis capability, general-purpose use, big data support, interfacing with other languages, and production-readiness.]

A much more detailed list of pros and cons follows.

Python

Advantages

  • AI
  • Machine learning
  • Predictive analysis
  • Can be used with fast big data engines like Apache Spark via the available Python API (see the sketch at the end of this section)
  • Loads of libraries
  • Low learning curve
  • Popular libraries for cleaning and manipulating data – pandas, NumPy, SciPy – are Python-based
  • TensorFlow's primary API is Python
  • Interactive computing through Jupyter notebooks

Disadvantages

  • Community resources and datasets for statistical exploration and learning are not as extensive as R's
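As a minimal sketch of the Spark point above, here is a small PySpark word count; it assumes pyspark is installed (pip install pyspark) and runs Spark locally.

```python
# Minimal PySpark sketch: word count over an in-memory list.
# Assumes `pip install pyspark`; the data is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
lines = spark.sparkContext.parallelize(["big data", "big pipelines"])
counts = (lines.flatMap(lambda s: s.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())
spark.stop()
```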

R / Programming with Big Data in R (pbdR)

Advantages

  • Machine Learning
  • Data Science
  • Provides statistical models
  • Graphical capabilities – useful for visualizing patterns and associations
  • Packages like ggplot2 further enhance R's data visualization capabilities and generate high-quality graphs
  • Comes in handy for data visualization and modeling more than for raw analysis
  • Support for Jupyter notebooks

Disadvantages

  • Steep learning curve
  • Speed and efficiency issues
  • Code written in R is generally not production-deployable and usually has to be translated into another programming language such as Python or Java

Java

Advantages

  • Hadoop HDFS – storing and processing data for big data applications
  • ETL applications – Apache Camel, Apatar, Apache Kafka
  • Apache Hadoop is written in Java
  • Large ecosystem of tried and tested tools and libraries

SQL

SQL remains the standard for retrieving and aggregating structured data, and most big data engines (Hive, Spark SQL, Amazon Redshift, Presto) expose a SQL interface.
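As a minimal sketch of the retrieval pattern, here is a self-contained Python example using the built-in sqlite3 module; the orders table and its rows are invented.

```python
# Minimal SQL-retrieval sketch using Python's built-in sqlite3 module.
# The orders table and its contents are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, revenue REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("North", 1200.0), ("South", 950.0), ("North", 1100.0)])

# Retrieve and aggregate: total revenue per region.
for region, total in con.execute(
        "SELECT region, SUM(revenue) FROM orders GROUP BY region"):
    print(region, total)
con.close()
```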

Julia

Advantages

  • High-performance numerical analysis
  • Young but capable of general-purpose programming
  • Fast execution, suited to complex projects
  • Benchmarks often show it running many times faster than pure Python, with performance approaching C
  • Strong parallel computing support, focused on numerical computing

Disadvantages

  • Newer than Python and C

Scala

Advantages

  • Runs on JVM
  • High volume data sets
  • Full support for functional programming
  • The cluster computing framework Apache Spark is written in Scala – ideal for juggling data across a thousand-processor cluster and a pile of legacy Java code
  • Fewer lines of code compared to Java
  • Apache Kafka is written in Scala
  • Fast and robust

Disadvantages

  • Fewer libraries

MATLAB

Advantages

  • Quick and stable, with solid algorithms for numerical computing
  • Fourier transforms, signal processing, image processing, and matrix algebra
  • Used in statistical analysis

TensorFlow

Advantages

  • Software library for numerical computation
  • A machine learning framework suitable for large-scale data
  • The computation graph can be broken into chunks that run in parallel across multiple GPUs or CPUs (see the sketch below)
  • Supports distributed computing
  • Second-generation machine learning system from the Google Brain team
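A minimal TensorFlow 2.x sketch of the parallelism point above – the runtime places work on a GPU automatically when one is visible; installation (pip install tensorflow) is assumed.

```python
# Minimal TensorFlow 2.x sketch; assumes `pip install tensorflow`.
import tensorflow as tf

a = tf.random.normal((1000, 1000))
b = tf.random.normal((1000, 1000))

# TensorFlow places this matrix multiply on a GPU automatically
# if one is visible; otherwise it runs on the CPU.
c = tf.matmul(a, b)
print(c.shape, tf.config.list_physical_devices())
```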

Go

Advantages

  • Kubernetes and Docker are written in Go
  • Fast, easy to learn
  • Fairly easy to write applications in and deploy
  • Go-based systems are being used to integrate machine learning and parallel processing of data
  • Efficient distributed computing

AWS for a big data project

Before analyzing the data and making it useful, we need to set up the infrastructure. Setting up and managing data lakes involves many manual, time-consuming tasks such as loading, transforming, securing, and auditing access to data. AWS Lake Formation automates many of those manual steps and reduces the time required to build a successful data lake from months to days.

Some of the available AWS services, grouped by use case, are listed below.

Use cases for AWS services in Big Data

Data Warehousing

Run SQL and complex, analytic queries against structured and unstructured data in your data warehouse and data lake, without the need for unnecessary data movement.

  • Amazon Redshift

Big data processing

Quickly and easily process vast amounts of data in your data lake or on-premises for data engineering, data science development, and collaboration.

  • Amazon EMR

Real time analytics

Collect, process, and analyze streaming data, and load data streams directly into your data lakes, data stores, and analytics services so you can respond in real time.

  • Amazon MSK
  • Amazon Kinesis
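For a feel of the streaming side, here is a hedged boto3 sketch that puts a single record onto a Kinesis data stream; the stream name, region, and configured AWS credentials are assumptions.

```python
# Minimal boto3 sketch: send one record to a Kinesis data stream.
# Assumes AWS credentials are configured and a stream named
# "clickstream" already exists in the default region.
import json
import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="clickstream",  # hypothetical stream name
    Data=json.dumps({"page": "/home", "user": 42}).encode("utf-8"),
    PartitionKey="user-42",
)
```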

Operational analytics

Search, explore, filter, aggregate, and visualize your data in near real time for application monitoring, log analytics, and clickstream analytics.

  • Amazon Elasticsearch Service

Apart from AWS services, whether it is a trendy language like Python or a more conventional one like Java or R, choosing the right programming language for big data really comes down to you and your business's preferences.

When starting out, it can be helpful to take advantage of books and other free resources. Doing so allows beginners to become familiar with the terminology and build a strong foundation for future development. Those looking to make a more streamlined move into the field, however, should look for opportunities to gain and practice the skills needed to become an expert data analyst.

One of the most efficient ways to do this is through the numerous short- and long-term courses available online.