Big Data: Where to start?


The big data world is so big that it is humongous. Big data engineer, big data analyst, big data scientist – are these different names for the same role? It is overwhelming to figure out which strand to take hold of, and how to climb that big mountain! On top of that: which algorithm to use, which tooling, which language?

Do I learn Hadoop, Kafka, or AWS – and if AWS, which parts of the stack?

Let us start by understanding the roles and responsibilities behind these job titles. This also serves as a reference for the skill set we need if we want to do all of it ourselves. Later, we shall dive deeper into which stack is usually recommended. Spoiler alert – there is no single recipe.

[Image: a satellite hovering above the Earth, observing the weather. Weather stations continuously use big data to predict the future.]

Data Analyst

The process of extracting information from a given pool of data is called data analytics. A data analyst extracts information through methodologies such as data cleaning, data conversion, and data modeling. Data analytics is used across many industries – technology, medicine, social science, business, and more. Industries can now make careful data-driven decisions because they are able to analyze market trends, understand their clients' requirements, and review their own performance.

A data analyst is also well versed in several visualization techniques and tools. Presentation skills are essential for a data analyst, as they allow results to be communicated to the team and help it reach sound decisions.

Data analytics allows industries to run fast queries that produce actionable results on short notice. This focuses data analytics on the short-term needs of the business, where quick action is required.
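As a minimal sketch of what this looks like in practice, the following Python snippet cleans a small dataset with pandas and answers a quick business question. The data and column names (region, revenue, order_date) are invented for illustration.

```python
# Minimal analyst-workflow sketch with pandas.
# The dataset and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", None, "South"],
    "revenue": ["1200", "950", "1100", "700", None],
    "order_date": ["2023-01-05", "2023-01-07", "2023-02-01",
                   "2023-02-03", "2023-02-10"],
})

# Data cleaning and conversion: drop incomplete rows, fix types.
df = df.dropna()
df["revenue"] = df["revenue"].astype(float)
df["order_date"] = pd.to_datetime(df["order_date"])

# A fast, actionable query: monthly revenue per region.
monthly = df.groupby([df["order_date"].dt.to_period("M"), "region"])["revenue"].sum()
print(monthly)
```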

Data Engineer

A Data Engineer specializes in preparing data for analytical use. They develop the foundation for various data operations and are responsible for designing the formats and structures that data scientists and analysts work with.

They work with both structured and unstructured data, which enables data scientists to carry out their data operations. Dealing with big data, they engage in numerous operations such as data cleaning, management, transformation, and deduplication.

A Data Engineer is more experienced with core programming concepts and algorithms, and the role closely follows that of a software engineer: a data engineer develops platforms and architecture according to software development guidelines. For example, building a cloud infrastructure that facilitates real-time analysis of data requires various development principles. Building interface APIs is therefore one of the responsibilities of a data engineer.

Furthermore, a data engineer has good knowledge of engineering and testing tools. It is up to the data engineer to manage the entire pipeline architecture: logging and handling errors, agile testing, building fault-tolerant pipelines, administering databases, and keeping the pipeline stable.
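To make "fault-tolerant pipeline" concrete, here is a minimal Python sketch of an extract-transform-load step that retries transient failures and logs errors instead of crashing. The extract, transform, and load functions are hypothetical stand-ins for real connectors.

```python
# Minimal ETL sketch showing retries, error logging, and deduplication.
# extract(), transform(), and load() stand in for real connectors
# (e.g. a database reader and a warehouse writer).
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def extract():
    return [{"id": 1, "value": " 42 "}, {"id": 1, "value": "42"}, {"id": 2, "value": "7"}]

def transform(rows):
    seen, out = set(), []
    for row in rows:
        if row["id"] in seen:          # deduplication
            continue
        seen.add(row["id"])
        out.append({"id": row["id"], "value": int(row["value"].strip())})
    return out

def load(rows):
    log.info("loaded %d rows", len(rows))

def run(max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            load(transform(extract()))
            return
        except Exception:
            log.exception("attempt %d failed", attempt)
            time.sleep(2 ** attempt)   # exponential backoff
    raise RuntimeError("pipeline failed after retries")

run()
```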

Data Scientist

Nowadays, every company is looking for data scientists to boost its performance and optimize its production.

There is a massive explosion in data, driven by advancements in computational technologies such as high-performance computing. This has given industries a massive opportunity to unearth meaningful information from their data.

Companies extract data to analyze it and gain insights about various trends and practices. To do so, they employ specialized data scientists who possess statistical knowledge and programming skills. Moreover, a data scientist knows machine learning algorithms, which are used to predict future events. Data science can therefore be thought of as an ocean that includes all the data operations – extraction, processing, analysis, and prediction – needed to gain the necessary insights.

However, data science is not a singular field. It is a quantitative field that shares its background with math, statistics, and computer programming. With the help of data science, industries are equipped to make careful data-driven decisions.
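To make "predicting future events" concrete, here is a minimal scikit-learn sketch: fit a model on historical observations, then predict an outcome for new inputs. The features and figures are invented for illustration.

```python
# Minimal predictive-modeling sketch with scikit-learn.
# Features (ad_spend, site_visits) and all numbers are hypothetical.
from sklearn.linear_model import LinearRegression

# Historical data: [ad_spend, site_visits] -> monthly sales.
X = [[100, 2000], [150, 2600], [200, 3500], [250, 4100]]
y = [24, 30, 41, 48]

model = LinearRegression().fit(X, y)
print(model.predict([[300, 5000]]))  # forecast for a planned campaign
```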

The skills mentioned above can be summarized in the table below:

  • *** = Very important
  • ** = Important
  • * = Trivial
  Skill                                  Data Analyst   Data Engineer   Data Scientist
  Calculus and Linear Algebra            *              *               ***
  Data Intuition                         **             **              ***
  Data Visualization and Communication   ***            **              ***
  Data Wrangling                         *              ***             ***
  Machine Learning                       *              *               ***
  Programming Tools                      ***            ***             *
  Software Engineering                   *              ***             **
  Statistics                             **             **              ***

 

Big Data programming language comparison

There is a plethora of programming languages in use today, serving a variety of purposes.

We have compared a few across several aspects to make the decision-making process easier:

[Table: Scala, Python, R, Java, Go, and Julia compared on speed, ease of use, learning curve, data analysis capability, general-purpose use, big data support, interfacing with other languages, and production-readiness.]

A much more detailed list of pros and cons follows.

Python

Advantages

  • AI
  • Machine learning
  • Predictive analysis
  • Can be used with fast big data engines like Apache Spark via the available Python API (see the sketch at the end of this section)
  • Loads of libraries
  • Low learning curve
  • Popular libraries for cleaning and manipulating data – pandas, NumPy, SciPy – are Python-based
  • TensorFlow's primary API is Python
  • Interactive computing through Jupyter notebooks

Disadvantages

  • Community resources and datasets for statistical exploration and learning are not as extensive as R's
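As a minimal sketch of the Spark point above, here is a small PySpark word count; it assumes pyspark is installed (pip install pyspark) and runs Spark locally.

```python
# Minimal PySpark sketch: word count over an in-memory list.
# Assumes `pip install pyspark`; the data is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
lines = spark.sparkContext.parallelize(["big data", "big pipelines"])
counts = (lines.flatMap(lambda s: s.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())
spark.stop()
```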

R / Programming with Big Data in R (pbdR)

Advantages

  • Machine Learning
  • Data Science
  • Provides statistical models
  • Graphical capabilities – useful for visualizing patterns and associations
  • Packages like ggplot2 further enhance R's data visualization capabilities and generate high-quality graphs
  • Comes in handy for data visualization and modeling more than for raw analysis
  • Support for Jupyter notebooks

Disadvantages

  • Steep learning curve
  • Speed and efficiency issues
  • Code written in R is generally not production-deployable and usually has to be translated into another programming language such as Python or Java

Java

Advantages

  • Hadoop HDFS – storing and processing data for big data applications
  • ETL applications – Apache Camel, Apatar, Apache Kafka
  • Apache Hadoop is written in Java
  • Large ecosystem of tried and tested tools and libraries

SQL

SQL remains the standard for retrieving and aggregating structured data, and most big data engines (Hive, Spark SQL, Amazon Redshift, Presto) expose a SQL interface.
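As a minimal sketch of the retrieval pattern, here is a self-contained Python example using the built-in sqlite3 module; the orders table and its rows are invented.

```python
# Minimal SQL-retrieval sketch using Python's built-in sqlite3 module.
# The orders table and its contents are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, revenue REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("North", 1200.0), ("South", 950.0), ("North", 1100.0)])

# Retrieve and aggregate: total revenue per region.
for region, total in con.execute(
        "SELECT region, SUM(revenue) FROM orders GROUP BY region"):
    print(region, total)
con.close()
```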

Julia

Advantages

  • High-performance numerical analysis
  • Young but capable of general-purpose programming
  • Fast execution, suited to complex projects
  • Benchmarks often show it running many times faster than pure Python, with performance approaching C
  • Strong parallel computing support, focused on numerical computing

Disadvantages

  • Newer than Python and C

Scala

Advantages

  • Runs on JVM
  • High volume data sets
  • Full support for functional programming
  • The cluster computing framework Apache Spark is written in Scala – ideal for juggling data across a thousand-processor cluster and a pile of legacy Java code
  • Fewer lines of code compared to Java
  • Apache Kafka is written in Scala
  • Fast and robust

Disadvantages

  • Fewer libraries

MATLAB

Advantages

  • Quick and stable, with solid algorithms for numerical computing
  • Fourier transforms, signal processing, image processing, and matrix algebra
  • Used in statistical analysis

TensorFlow

Advantages

  • Software library for numerical computation
  • A machine learning framework suitable for large-scale data
  • The computation graph can be broken into chunks that run in parallel across multiple GPUs or CPUs (see the sketch below)
  • Supports distributed computing
  • Second-generation machine learning system from the Google Brain team
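A minimal TensorFlow 2.x sketch of the parallelism point above – the runtime places work on a GPU automatically when one is visible; installation (pip install tensorflow) is assumed.

```python
# Minimal TensorFlow 2.x sketch; assumes `pip install tensorflow`.
import tensorflow as tf

a = tf.random.normal((1000, 1000))
b = tf.random.normal((1000, 1000))

# TensorFlow places this matrix multiply on a GPU automatically
# if one is visible; otherwise it runs on the CPU.
c = tf.matmul(a, b)
print(c.shape, tf.config.list_physical_devices())
```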

Go

Advantages

  • Kubernetes and Docker are written in Go
  • Fast, easy to learn
  • Fairly easy to write applications in and deploy
  • Go-based systems are being used to integrate machine learning and parallel processing of data
  • Efficient distributed computing

AWS for a big data project

Before analyzing the data and making it useful, we need to set up the infrastructure. Setting up and managing data lakes involves many manual, time-consuming tasks such as loading, transforming, securing, and auditing access to data. AWS Lake Formation automates many of those manual steps and reduces the time required to build a successful data lake from months to days.

Some of the available AWS services, grouped by use case, are listed below.

Use cases for AWS services in Big Data

Data Warehousing

Run SQL and complex, analytic queries against structured and unstructured data in your data warehouse and data lake, without the need for unnecessary data movement.

  • Amazon Redshift

Big data processing

Quickly and easily process vast amounts of data in your data lake or on-premises for data engineering, data science development, and collaboration.

  • Amazon EMR

Real time analytics

Collect, process, and analyze streaming data, and load data streams directly into your data lakes, data stores, and analytics services so you can respond in real time.

  • Amazon MSK
  • Amazon Kinesis
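For a feel of the streaming side, here is a hedged boto3 sketch that puts a single record onto a Kinesis data stream; the stream name, region, and configured AWS credentials are assumptions.

```python
# Minimal boto3 sketch: send one record to a Kinesis data stream.
# Assumes AWS credentials are configured and a stream named
# "clickstream" already exists in the default region.
import json
import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="clickstream",  # hypothetical stream name
    Data=json.dumps({"page": "/home", "user": 42}).encode("utf-8"),
    PartitionKey="user-42",
)
```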

Operational analytics

Search, explore, filter, aggregate, and visualize your data in near real time for application monitoring, log analytics, and clickstream analytics.

  • Amazon Elasticsearch Service

Apart from AWS services, whether it is a trendy language like Python or a more conventional one like Java or R, choosing the right programming language for big data really comes down to you and your business's preferences.

When starting out, it can be helpful to take advantage of books and other free resources. Doing so allows beginners to become familiar with the terminology and build a strong foundation for future development. Those looking to make a more streamlined move into the field, however, should look for opportunities to gain and practice the skills needed to become an expert data analyst.

One of the most efficient ways to do this is through the numerous short- and long-term courses available online.