Hadoop Ecosystem

Petabyte 2018. 8. 6. 23:39

The Hadoop ecosystem is a platform or framework for solving big data problems. You can think of it as a suite of services for ingesting, storing, analyzing and maintaining data.

For storage we use HDFS (Hadoop Distributed File System). The main components of HDFS are the NameNode and the DataNode.

NameNode

It is the master daemon that maintains and manages the DataNodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, file sizes, permissions and the directory hierarchy. It also records every change made to the file system metadata.

For example, if a file is deleted in HDFS, the NameNode immediately records this in the EditLog. It regularly receives a heartbeat and a block report from every DataNode in the cluster to confirm that the DataNodes are alive. It keeps a record of all the blocks in HDFS and of the nodes on which they are stored.

DataNode

These are slave daemons that run on each slave machine. The actual data is stored on the DataNodes. They serve read and write requests from clients, and they create, delete and replicate blocks based on decisions taken by the NameNode.
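
To make the idea of blocks and replicas concrete, here is a minimal sketch of how a file could be cut into blocks and each block assigned to several DataNodes. The function `plan_blocks` and the round-robin placement are assumptions for illustration only; real HDFS placement is more involved (it takes racks into account, for example), and the sizes below are arbitrary.

```python
def plan_blocks(file_size, block_size, datanodes, replication):
    """Split a file into blocks and pick replica holders (hypothetical helper)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    num_blocks = -(-file_size // block_size)  # ceiling division
    plan = []
    for i in range(num_blocks):
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - i * block_size)
        # Simple round-robin placement; each replica lands on a different node.
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append({"block": i, "length": length, "replicas": replicas})
    return plan


plan = plan_blocks(file_size=300, block_size=128,
                   datanodes=["dn1", "dn2", "dn3", "dn4"], replication=3)
for entry in plan:
    print(entry)
```

With these numbers the file yields three blocks of 128, 128 and 44 units, each stored on three different nodes, so losing a single DataNode never loses a block.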

For processing, we use YARN (Yet Another Resource Negotiator). The components of YARN are the ResourceManager and the NodeManager.

ResourceManager

It is a cluster-level component (one per cluster) and runs on the master machine. It manages resources and schedules the applications running on top of YARN.

NodeManager

It is a node-level component (one on each node) and runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container. It also keeps track of node health and handles log management. It communicates continuously with the ResourceManager to stay up to date.

So, you can perform parallel processing on data in HDFS using MapReduce.

MapReduce

It is the core processing component of the Hadoop ecosystem, as it provides the processing logic. In other words, MapReduce is a software framework for writing applications that process large data sets using distributed and parallel algorithms inside a Hadoop environment. A MapReduce program has two functions, Map() and Reduce(). The Map function performs actions such as filtering, grouping and sorting, while the Reduce function aggregates and summarizes the results produced by the Map function. The Map function emits key-value pairs (K, V), which act as the input to the Reduce function.
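
The Map and Reduce roles can be illustrated with a word count that runs in a single Python process. This is a sketch of the programming model only, not of Hadoop's Java API: the names `map_fn`, `shuffle`, `reduce_fn` and `run_job` are made up for this example, and the explicit shuffle step stands in for the grouping by key that the framework performs between the two phases.

```python
from collections import defaultdict


def map_fn(line):
    # Map: emit one (word, 1) pair per word in the input line.
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    # Shuffle: collect all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    # Reduce: aggregate the values of one key into a single result.
    return (key, sum(values))


def run_job(lines):
    mapped = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


counts = run_job(["big data big cluster", "data node"])
print(counts)  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

Because each call to `map_fn` looks at one line and each call to `reduce_fn` looks at one key, both phases can be spread over many machines, which is what makes the model suitable for parallel processing.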

You can go through this video to understand Hadoop and its architecture in detail.

Install Hadoop Single Node and Multi Node Cluster

Then you can go through this Hadoop Ecosystem blog to learn about the Hadoop ecosystem in detail.

You can also go through this Hadoop Ecosystem tutorial video.

Pig

Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution environment. The relationship is similar to that between Java and the JVM: the runtime executes programs written in Pig Latin.

Not everyone comes from a programming background, and Apache Pig makes life easier for them. You might be curious to know how.

Well, I will tell you an interesting fact:

10 lines of Pig Latin = approx. 200 lines of MapReduce Java code

But don’t be shocked when I say that at the back end of a Pig job, a MapReduce job executes. The compiler internally converts Pig Latin to MapReduce. It produces a sequential set of MapReduce jobs, and that is an abstraction (it works like a black box). Pig was initially developed by Yahoo. It gives you a platform for building data flows for ETL (Extract, Transform and Load) and for processing and analyzing huge data sets.
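
To give a feel for what such a data flow looks like, here is a small Python stand-in for a load, filter, group and count pipeline. It is not Pig Latin and does not run on Hadoop; the sample records are invented, and the comments only name the Pig Latin operators that each step loosely corresponds to.

```python
from itertools import groupby
from operator import itemgetter

# LOAD: in Pig this would read from HDFS; here the records are inline.
records = [
    ("alice", "error"),
    ("bob", "ok"),
    ("alice", "error"),
    ("carol", "error"),
    ("bob", "error"),
]

# FILTER: keep only the error events.
errors = [record for record in records if record[1] == "error"]

# GROUP: bring together all events of the same user (groupby needs sorted input).
errors.sort(key=itemgetter(0))
grouped = groupby(errors, key=itemgetter(0))

# FOREACH ... GENERATE COUNT: one output row per user.
error_counts = {user: sum(1 for _ in rows) for user, rows in grouped}
print(error_counts)  # {'alice': 2, 'bob': 1, 'carol': 1}
```

Each step consumes the output of the previous one, which is the sense in which Pig describes a data flow rather than individual map and reduce functions.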

Hive

Facebook created Hive for people who are fluent in SQL, so Hive makes them feel at home while working in the Hadoop ecosystem. Basically, Hive is a data warehousing component that reads, writes and manages large data sets in a distributed environment through a SQL-like interface.

HIVE + SQL = HQL

The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL. Hive is highly scalable, as it can serve both purposes: large data set processing (i.e. batch query processing) and real-time processing (i.e. interactive query processing). Internally, Hive queries are converted into MapReduce programs.

It supports all the primitive data types of SQL. You can use predefined functions or write tailored user-defined functions (UDFs) to meet your specific needs.

You can store data in HBase based on your requirements.

HBase

HBase is an open source, non-relational distributed database. In other words, it is a NoSQL database. It supports all types of data, which is why it is capable of handling anything and everything inside a Hadoop ecosystem. It is modelled after Google’s BigTable, a distributed storage system designed to cope with large data sets.

HBase was designed to run on top of HDFS and provides BigTable-like capabilities. It gives us a fault-tolerant way of storing sparse data, which is common in most big data use cases. HBase is written in Java, whereas HBase applications can be written against its REST, Avro and Thrift APIs.
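
What "sparse data" means here can be shown with a toy table in which each row stores only the cells that actually exist. This is a conceptual sketch, not the HBase client API: the class `ToySparseTable`, its methods, the row keys and the `family:qualifier` column names are all assumptions for illustration.

```python
class ToySparseTable:
    """Toy row-key -> {column: value} store (hypothetical, not the HBase API)."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        # Only cells that are written take up space.
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def cell_count(self):
        return sum(len(cells) for cells in self.rows.values())


table = ToySparseTable()
table.put("cust#1", "mail:subject", "complaint about billing")
table.put("cust#1", "mail:from", "a@example.com")
table.put("cust#2", "profile:name", "Bo")

print(table.get("cust#2", "mail:subject"))  # None: the cell was never stored
print(table.cell_count())                   # 3 cells across two rows
```

A relational table with the same three columns would hold six cells, half of them empty; storing only the written cells is what keeps wide, mostly empty tables cheap.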

For better understanding, let us take an example. You have billions of customer emails and you need to find out how many customers have used the word "complaint" in their emails. The request needs to be processed quickly (i.e. in real time). So here we are handling a large data set while retrieving only a small amount of data. HBase was designed to solve this kind of problem.

Apache Flume

Ingesting data is an important part of the Hadoop ecosystem. Flume is a service that helps ingest unstructured and semi-structured data into HDFS. It provides a reliable, distributed solution for collecting, aggregating and moving large amounts of data. It helps us ingest online streaming data from various sources, such as network traffic, social media, email messages and log files, into HDFS.

Apache Sqoop

Another data ingestion service is Sqoop. The major difference between Flume and Sqoop is that Flume only ingests unstructured or semi-structured data into HDFS, while Sqoop can import structured data from an RDBMS or enterprise data warehouse into HDFS and export it back again.

Finally, I would recommend that you go through this Hadoop ecosystem blog and Hadoop ecosystem video to gain in-depth knowledge.

Edureka provides a good list of Hadoop tutorial videos. I would recommend that you go through this Hadoop tutorial video playlist as well as the Hadoop tutorial blog series. Your learning should be aligned with Hadoop certification.

Source: https://www.quora.com/What-is-a-Hadoop-ecosystem

License: Attribution, NonCommercial, NoDerivatives
