\chapter{Hadoop}
\label{chap:Hadoop}
Hadoop started in 2006, based on Google's work on GFS (its Hadoop counterpart is HDFS,
Sect.\ref{sec:HadoopFS}) and MapReduce. Nowadays, Hadoop 2.x is much bigger, with
many more components (Sect.\ref{sec:MR1_MR2}).
\section{Why was Hadoop developed?}
Big data is not just big in terms of bytes, but also type (e.g., a single hard
disk likely contains relations, text, images, and spreadsheets) and structure
(e.g., a large corpus of relational databases may have millions of unique
schemas). As a result, certain long-held assumptions --- e.g., that the database
schema is always known before writing a query --- are no longer useful guides
for building data management systems.
Hadoop has emerged from a solution for large-scale web-crawling and indexing
engine to a general-purpose computing platform for the next-generation of
data-based applications.
Hadoop uses a two-step, disk-based MapReduce implementation on top of HDFS. To
make the data on HDFS usable for other tasks, e.g. machine learning where iterative
algorithms are common, several packages have been developed, e.g.
Apache Spark (Sect.\ref{sec:apache_spark}).
\subsection{Hadoop is NOT}
\begin{itemize}
\item ESB
\item NoSQL
\item HPC
\item Relational
\item Real-time
\end{itemize}
Data is added into the HDFS system, but you cannot ask Hadoop to return a list
of all the data matching a specific query.
The primary reason for this is that Hadoop doesn't store, structure, or
understand the structure of the data that is being stored within HDFS. One
solution is to use HBase (a distributed database system on top of HDFS).
\subsection{Key Hadoop data types}
The following types of data are commonly added to HDFS (Sect.\ref{sec:HadoopFS})
\begin{itemize}
\item Sentiment
\item Clickstream
\item Sensor/Machine
\item Geographic
\item Server logs
\item Text
\end{itemize}
\section{Criticisms of Hadoop}
Hadoop was designed to run on-premises in data centers, with the advantage of using
low-cost machines.
But now everyone is moving to the cloud.
\begin{enumerate}
\item You can run Hadoop in the cloud, but it is not cost-effective.
\item Hadoop is getting fatter with so many different components.
Today, the architecture picture
of Hadoop looks like a zoo hosting HDFS, YARN, MapReduce, Tez, Pig, Hive,
Impala, Kudu, HBase, Accumulo, Flume, Sqoop, Falcon, Samza, etc.
Because the original Hadoop (with only HDFS and MapReduce) is not good enough to
meet customers' demands, the community has developed a lot of new tools in
the Hadoop ecosystem over the years.
This is rational and necessary to win the competition. However, it also
inevitably reaches the point of overshooting.
Customers rarely need all of them. If they are not needed, why bother running a
full-blown Hadoop cluster? So there are options to pick only the needed components, such as
\begin{itemize}
\item Parquet + Spark
\item Apache Samza, a stream processing engine that was originally designed
on top of Hadoop YARN.
Netflix has recently contributed the feature of static partition assignments,
which allows Samza to be used without YARN. This feature enables Netflix
to run Samza applications in AWS EC2 instances without any Hadoop/YARN
dependency.
\end{itemize}
\item HBase, the NoSQL engine of Hadoop, is not as competitive as other NoSQL
solutions.
Compared to other popular NoSQL solutions such as MongoDB and Cassandra, HBase
has a lot of functionality and unique features. It also has battle-proven
scalability and availability.
Moreover, Apache Trafodion, built on top of HBase, even provides fully ACID
SQL. However, it is only ranked at 15 on the DB-Engines Ranking, way behind
MongoDB and Cassandra. The biggest reason that HBase is left behind is that
Hadoop distributors' marketing commitment to HBase has never risen to nearly
the level of MongoDB's or DataStax's push behind their respective core
products.
Technically, HBase is also more complicated to set up and operate because of
its dependency on other Hadoop services.
\item Spark, Kafka, Mesos, Docker, etc. are better than their Hadoop
counterparts or fill gaps that Hadoop leaves open, and they get endorsements
from heavyweights like IBM.
\end{enumerate}
\url{https://haifengl.wordpress.com/2016/03/03/the-future-of-hadoop-is-misty/}
\section{Hadoop distributors}
Apache Hadoop is the main open-source repository for Hadoop development. Other
than that, there are 3 main Hadoop distributions, targeted at commercial licensing
\begin{enumerate}
\item Cloudera CDH: plans to provide an ``enterprise data hub'', i.e. there is no
need for a data warehouse at the company.
It has a free version and a commercial version. Some features are not open-source.
\item Hortonworks Hadoop: this is the 100\% open-source distribution, and the
main driver for Hadoop development at Apache, e.g. the YARN module.
\item MapR Hadoop: this one does not use HDFS, but uses MapR-FS (a proprietary
distributed file system).
The free version is MapR's M3 Edition. It provides proprietary modules to ease
the maintenance and usage of a Hadoop system at companies which do not have
programmers/experts in Hadoop.
\end{enumerate}
All of these products are rooted in Apache Hadoop.
\section{Apache Hadoop}
A Hadoop cluster is a special type of computational cluster designed
specifically for storing and analyzing huge amounts of unstructured data in a
distributed computing environment. As of early 2013, Facebook was recognized as
having the largest Hadoop cluster in the world. Other prominent users include
Google, Yahoo and IBM. These clusters use Hadoop framework.
Hadoop has its origin in Apache Nutch (Sect.\ref{sec:Nutch}).
The 2 components in Nutch were applicable beyond the realm of search, and thus
an independent project called Hadoop was created in Feb 2006. Then Doug Cutting
joined Yahoo and helped turn Hadoop into a system that ran at web scale, e.g.
a 10,000-core Hadoop cluster. In Jan 2008, Apache Hadoop became a top-level
project.
In April 2008, Hadoop broke a world record to become the fastest system to sort
a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in
209 seconds (just under $3\frac{1}{2}$ minutes), beating the previous year's winner
of 297 seconds. In November of the same year, Google reported that its MapReduce
implementation sorted one terabyte in 68 seconds. Then a team at
Yahoo! used Hadoop to sort one terabyte in 62 seconds.
The Hadoop framework is written in Java and enables running a Java app on large
clusters of commodity hardware (i.e. at low cost) as if it were running on a single
machine. Hadoop's HDFS is the distributed file system (Sect.\ref{sec:HadoopFS}).
It should be installed on a cluster. However, we can deploy it on a single machine
by configuring a pseudo-distributed single-node Hadoop cluster.
The project's creator, Doug Cutting, explains how the name came about:
\begin{verbatim}
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell
and pronounce, meaningless, and not used elsewhere: those are my naming
criteria. Kids are good at generating such. Googol is a kid's term.
\end{verbatim}
Subprojects and ``contrib'' modules in Hadoop also tend to have names that are
unrelated to their function, often with an elephant or other animal theme
(``Pig,'' for example). Smaller components are given more descriptive (and
therefore more mundane) names. This is a good principle, as it means you can
generally work out what something does from its name. For example, the
jobtracker keeps track of MapReduce jobs.
The Hadoop ecosystem includes all of the additional software packages that
can be installed on top of or alongside Hadoop, such as Apache Hive, Apache Pig
and Apache Spark. Hadoop is consistent and partition tolerant, i.e. it falls
under the CP category of the CAP theorem.
\begin{enumerate}
\item Hadoop Common
\item Hadoop HDFS: the distributed file system (Sect.\ref{sec:HadoopFS})
\item Hadoop YARN (added from Hadoop 2.0): resource-management platform
responsible for managing compute resources in clusters and using them for scheduling of users' applications.
\item Hadoop MapReduce: a programming model for large scale data processing.
\end{enumerate}
\subsection{Hadoop computing engine}
\label{sec:hadoop_node_structure}
For processing the data, Hadoop MapReduce ships code (typically a Java
.jar file) to the nodes that have the required data, and the nodes then
process the data in parallel. So, each node functions as both a data source and a
computing unit, which takes advantage of data locality. \textcolor{red}{This is
in contrast to HPC architectures, which split the data cluster and the compute
cluster but connect them through high-speed networking}.
Although Java code is common, any programming language can be used with ``Hadoop
Streaming'' to implement the ``map'' and ``reduce'' parts of the user's program. To
expose higher-level APIs, Apache Pig, Apache Hive and Apache Spark have been
added.
\url{http://en.wikipedia.org/wiki/Apache_Hadoop}
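As an illustration of Hadoop Streaming, the sketch below follows the classic
pattern of using any executable as the mapper and reducer. It is a minimal,
hedged example: the exact path of the streaming jar depends on your Hadoop
version and installation layout, and the HDFS input/output paths are hypothetical.
\begin{verbatim}
# Hadoop Streaming with plain Unix tools as mapper/reducer
# (jar path and HDFS paths below are examples only)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hduser/books \
    -output /user/hduser/books-out \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
\end{verbatim}
The framework still performs the shuffle/sort between the two commands, exactly
as it would for Java mappers and reducers.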
Next, you need to know the structure of a Hadoop cluster.
A small Hadoop cluster includes a single master (which functions as both
NameNode and JobTracker) and multiple worker nodes.
The master node can function as a JobTracker (ResourceManager), TaskTracker
(NodeManager), NameNode and DataNode.
A slave or worker node acts as both a DataNode and TaskTracker, though it is
possible to have data-only worker nodes and compute-only worker nodes. NOTE:
TaskTrackers are the compute nodes.
\begin{enumerate}
\item one machine as the NameNode, and one machine as the ResourceManager (a.k.a.
JobTracker).
These machines are the masters.
\begin{itemize}
\item The NameNode stores the HDFS filesystem information in a file named
\verb!fsimage!. Updates to the file system (adding/removing blocks) do not
update the fsimage file directly; instead they are logged into an edit log
file, so the I/O is fast, append-only streaming as opposed to random file
writes. Only when restarting, before it can serve client requests, does the
NameNode read the fsimage and then apply all the changes from the edit log to
bring the filesystem state up to date in memory (a new fsimage consisting of
the prior fsimage plus the application of all operations from the edit logs).
It remains in safe mode until a sufficient number of blocks have been reported
by DataNodes. This process takes time. Administrators typically access the
NameNode web UI at the first sign of trouble. Unfortunately, the NameNode
wouldn't start its HTTP server until after writing a new checkpoint. In a slow
startup situation, it could take multiple minutes or even more than an hour
after restarting the NameNode before the web UI becomes accessible, which
makes it appear as though the NameNode process had hung during startup.
Only an experienced Hadoop operator would be able to determine that the
NameNode is in fact making progress, by using relatively low-level
techniques such as inspecting thread dumps.
\footnote{\url{http://hortonworks.com/blog/understanding-namenode-startup-operations-in-hdfs/}}
Hadoop also provides a Secondary NameNode. What it does is help reduce this
start-up time by periodically merging the edit log into the fsimage; it is not
a full replica of the
NameNode.\footnote{\url{http://stackoverflow.com/questions/19970461/name-node-vs-secondary-name-node}}
\end{itemize}
\item The rest of the machines: each acts as both a DataNode and a NodeManager
(TaskTracker).
All these machines depend on the NameNode. If the NameNode fails, the cluster
goes down.
\item (optional) one machine as the Secondary NameNode: this machine does not
function as a standby for the NameNode machine. Instead, it periodically reads
the filesystem edit log and applies the changes to the fsimage file, thus bringing
the fsimage file up to date so that the NameNode starts up faster next time.
No slaves can connect to the Secondary NameNode; so if the NameNode is down,
the whole Hadoop cluster still fails.
\end{enumerate}
In a larger cluster, HDFS is managed through a dedicated NameNode server that
hosts the file system index, and a Secondary NameNode that can generate snapshots
of the NameNode's memory structures, thus preventing file-system corruption and
reducing loss of data. Similarly, a standalone JobTracker server (Resource
Manager) can manage job scheduling.
Your MapReduce jobs may be IO bound or CPU/memory bound; if you know which one
is more important (effectively how many CPU cycles/RAM MB are used per Map or
Reduce), you can make better decisions.
\subsection{Hadoop YARN}
\label{sec:YARN}
YARN was added in Hadoop 2.0.
With YARN, you can now run multiple applications in Hadoop, all sharing common
resource management. As of September 2014, YARN manages only CPU (number of
cores) and memory, but management of other resources such as disk, network
and GPU is planned for the future.
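To make the CPU/memory management concrete, the sketch below shows the kind of
properties typically set in \verb!yarn-site.xml! (Sect.\ref{sec:yarn-site.xml})
to tell each NodeManager how much memory and how many vcores it may hand out to
containers. The property names are standard YARN properties, but the values are
illustrative assumptions for a small node, not recommendations.
\begin{verbatim}
<!-- resources a NodeManager offers to YARN containers (example values) -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
<!-- upper bound for a single container request (example value) -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
\end{verbatim}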
\subsection{User authentication and authorization}
By default, Hadoop doesn't do any authentication of users. This is an important
realization to make, because it can have serious implications in a corporate data
center. The NameNode and the JobTracker don't require any authentication.
Hadoop has the ability to require authentication, in the form of Kerberos
principals (Sect.\ref{sec:Kerberos}).
Hadoop can use the Kerberos protocol to ensure that when someone
makes a request, they really are who they say they are. This mechanism is used
throughout the cluster. In a secure Hadoop configuration, all of the Hadoop
daemons use Kerberos to perform mutual authentication, which means that when two
daemons talk to each other, they each make sure that the other daemon is who it
says it is. Additionally, this allows the NameNode and JobTracker to ensure that
any HDFS or MR requests are being executed with the appropriate authorization
level.
HDFS implements a permission model much like the POSIX model. Each file and directory
is associated with an owner and a group. The file or directory has separate
permissions for the user that is the owner, for other users that are members of
the group, and for all other users. The differences from POSIX:
\begin{verbatim}
In contrast to the POSIX model, there are no setuid or setgid bits for files as
there is no notion of executable files.
For directories, there are no setuid or setgid bits, as a
simplification.
The Sticky bit can be set on directories, preventing anyone except the
superuser, directory owner or file owner from deleting or moving the files
within the directory. Setting the sticky bit for a file has no effect.
\end{verbatim}
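In practice these permissions are managed with the standard HDFS shell, which
mirrors the familiar Unix commands. A small sketch (the paths, user and group
names are hypothetical):
\begin{verbatim}
# inspect owner/group/permissions
hdfs dfs -ls /user/hduser/data
# restrict a file to owner read/write, group read
hdfs dfs -chmod 640 /user/hduser/data/part-00000
# hand a directory over to another user and the hadoop group
hdfs dfs -chown alice:hadoop /user/alice
\end{verbatim}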
In the context of MapReduce, the users and groups are used to determine who is
allowed to submit or modify jobs. In MapReduce, jobs are submitted via queues
controlled by the scheduler.
Administrators can define who is allowed to submit jobs to particular queues via
MapReduce ACLs (Access Control Lists).
The downside of doing this is that if the user and group really don't exist, no
one will be able to access that file except the superusers, which, by default,
include hdfs, mapred, and other members of the hadoop supergroup.
% \section{Experiences with Hadoop}
%
\subsection{Configure environment: hadoop-env.sh}
\label{sec:hadoop-env.sh}
Administrators should use the conf/hadoop-env.sh script to do site-specific
customization of the Hadoop daemons' process environment.
The file
\begin{verbatim}
/usr/local/hadoop/etc/hadoop/hadoop-env.sh
\end{verbatim}
controls the path to Java and the settings for the different
daemons: NameNode/DataNode and JobTracker/TaskTracker. Each daemon receives the
configuration options specified in the associated environment variable, which
maps to
\begin{verbatim}
NameNode HADOOP_NAMENODE_OPTS
DataNode HADOOP_DATANODE_OPTS
SecondaryNameNode HADOOP_SECONDARYNAMENODE_OPTS
JobTracker HADOOP_JOBTRACKER_OPTS
TaskTracker HADOOP_TASKTRACKER_OPTS
\end{verbatim}
%different locations which contaisn files that Hadoop needs to use, and the
% different system settings, e.g. maximum heap size
As the Hadoop daemons run as Java applications, these settings are
designed to control how the Java applications run and use resources
\begin{verbatim}
JAVA_HOME
HADOOP_OPTS (extra args to java-runtime, default:
-Djava.net.preferIPv4Stack=true)
HADOOP_CONF_DIR
JSVC_HOME
HADOOP_CLASSPATH
# max heap to use
HADOOP_HEAPSIZE (in MB, default:1000)
HADOOP_NAMENODE_INIT_HEAPSIZE (in MB, default: 1000)
##
HADOOP_NAMENODE_OPTS
"-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS}
-Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
HADOOP_DATANODE_OPTS
"-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
HADOOP_SECONDARYNAMENODE_OPTS
HADOOP_NFS3_OPTS
HADOOP_PORTMAP_OPTS (default: -Xmx512m)
## apply to commands such as: fs, dfs, fsck, distcp
HADOOP_CLIENT_OPTS (default: -Xmx512m)
HADOOP_JAVA_PLATFORM_OPTS
# secure DataNode
HADOOP_SECURE_DN_USER
# location of log file
HADOOP_LOG_DIR (default: $HADOOP_HOME/logs)
(we can use $HADOOP_LOG_DIR/$USER)
HADOOP_SECURE_DN_LOG_DIR (default: ${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER})
# location of PID file
HADOOP_PID_DIR (default: /tmp)
however should be set to a location that is accessible by only the
'hadoop' group's users, to avoid potential symlink attack
HADOOP_SECURE_DN_PID_DIR
# a string representation of the instance of hadoop
HADOOP_IDENT_STRING (default: $USER)
\end{verbatim}
Example: enable parallelGC on NameNode daemon
\begin{verbatim}
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
\end{verbatim}
Example: in a multi-user Hadoop cluster, you may need to configure the log dir
\begin{verbatim}
HADOOP_LOG_DIR
\end{verbatim}
Example: maximum heapsize (in MB) for each daemon (default: 1000MB)
\begin{verbatim}
HADOOP_HEAPSIZE
\end{verbatim}
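Putting these together, a minimal hadoop-env.sh customization might look like the
sketch below. The paths and sizes are illustrative assumptions for the cluster
layout used in this chapter, not recommended values.
\begin{verbatim}
# hadoop-env.sh (sketch; adjust paths and sizes to your installation)
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HEAPSIZE=2000                     # MB, max heap for the daemons
export HADOOP_LOG_DIR=/var/log/hadoop/$USER     # per-user log directory
export HADOOP_PID_DIR=/var/run/hadoop           # avoid world-writable /tmp
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
\end{verbatim}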
\section{Changes from Hadoop 1.x to Hadoop 2.x}
\label{sec:MR1_MR2}
Hadoop 1.x basically has 2 components: HDFS and MapReduce.
Hadoop 2.x has revised and broken MapReduce down into different projects, each of
which does a particular job. A new and important layer above HDFS is YARN (cluster
resource management), which allows new components to be added (Spark, BSP,
Hama, MapReduce, HBase (database system)) or existing programming models to be
used (e.g. MPI). \url{http://www.wiziq.com/blog/hadoop-1-vs-hadoop-2/}
\subsection{Hadoop 1}
\begin{enumerate}
\item Limited to about 4,000 nodes per cluster
\item The bottleneck scales with the number of tasks in a cluster: O(\# tasks
in a cluster)
\item The JobTracker does multiple things: resource management, job scheduling and
monitoring $\rightarrow$ which causes the bottleneck
\item Only one namespace for managing HDFS
\item Map and Reduce slots are static
\item The only job type that can run is MapReduce.
\item The namespaces for the APIs
\begin{verbatim}
org.apache.hadoop.mapreduce.Partitioner
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Reducer
org.apache.hadoop.mapreduce.Job
\end{verbatim}
\end{enumerate}
\url{http://www.slideshare.net/RommelGarcia2/hadoop-1x-vs-2}
\begin{figure}[hbt]
\centerline{\includegraphics[height=6cm,
angle=0]{./images/MR1_MR2_comparison_01.eps}}
\caption{MR1 to MR2}
\label{fig:MR1_MR2_comparison_01}
\end{figure}
\subsection{Hadoop 2}
Hadoop 2 is backward compatible with MR1.
\begin{enumerate}
\item Supports up to 10,000 nodes per cluster
\item The bottleneck scales with the cluster size: O(cluster size)
\item The function of the JobTracker in MR1 is different in MR2. To avoid
scaling issues, the JobTracker is split into different components, each with
its specialized purpose (see below). Resource management is carried out by the
new component: YARN.
\item Multiple namespaces for managing HDFS, Fig.\ref{fig:MR1_MR2_comparison_02}
\item YARN is used as the Resource Manager.
\begin{verbatim}
org.apache.hadoop.yarn.api.ApplicationClientProtocol
org.apache.hadoop.yarn.api.ApplicationMasterProtocol
org.apache.hadoop.yarn.api.ContainerManagementProtocol
\end{verbatim}
\item HDFS supports multiple storage tiers: disk, memory, SSD
\item Any job type can be integrated with Hadoop
\item Supports other languages (not only Java)
\end{enumerate}
\url{http://www.slideshare.net/tshooter/strata-conf2014}
\begin{figure}[hbt]
\centerline{\includegraphics[height=3cm,
angle=0]{./images/MR1_MR2_comparison_02.eps}}
\caption{MR1 to MR2 comparison: Data can reside on NameNodes of different
namespace}
\label{fig:MR1_MR2_comparison_02}
\end{figure}
The reason is that in Hadoop 2, MapReduce is split into 2
components: the cluster resource management capabilities have become YARN, and
the MapReduce-specific capabilities remain MapReduce. This is known as the MR2
architecture, while the old one is called MR1.
\begin{enumerate}
\item In the former MR1 architecture, the cluster was managed by a service
called the JobTracker.
\item In the MR2 architecture, the functions of the JobTracker are divided into
three services.
\begin{itemize}
\item The ResourceManager is a persistent YARN service that receives and
runs applications (a MapReduce job is an application) on the cluster
\item A per-application ApplicationMaster negotiates resources from the
ResourceManager and works with the NodeManagers to execute and monitor the
application's tasks
\item The JobHistoryServer keeps information about completed MapReduce jobs
\end{itemize}
\end{enumerate}
\subsection{List of deprecated keys}
\url{https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html}
{\small
\begin{verbatim}
[INFO] deprecation - mapred.jar is deprecated. Instead, use mapreduce.job.jar
[INFO] deprecation - mapred.map.child.log.level is deprecated. Instead, use mapreduce.map.log.level
[INFO] deprecation - mapred.reduce.child.log.level is deprecated. Instead, use mapreduce.reduce.log.level
[INFO] deprecation - mapred.output.value.groupfn.class is deprecated. Instead, use mapreduce.job.output.group.comparator.class
[INFO] deprecation - mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
[INFO] deprecation - mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
[INFO] deprecation - mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
[INFO] deprecation - mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
[INFO] deprecation - mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class
[INFO] deprecation - mapred.job.name is deprecated. Instead, use mapreduce.job.name
[INFO] deprecation - mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
[INFO] deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
[INFO] deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
[INFO] deprecation - mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
[INFO] deprecation - fs.checkpoint.edits.dir is deprecated. Instead, use dfs.namenode.checkpoint.edits.dir
[INFO] deprecation - dfs.data.dir is deprecated. Instead, use dfs.datanode.data.dir
[INFO] deprecation - fs.checkpoint.dir is deprecated. Instead, use dfs.namenode.checkpoint.dir
[INFO] deprecation - mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir
[INFO] deprecation - dfs.name.dir is deprecated. Instead, use dfs.namenode.name.dir
[INFO] deprecation - mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
[INFO] deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
[INFO] deprecation - mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
[INFO] deprecation - dfs.name.edits.dir is deprecated. Instead, use dfs.namenode.edits.dir
[INFO] deprecation - user.name is deprecated. Instead, use mapreduce.job.user.name
[INFO] deprecation - mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
[INFO] deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
[INFO] deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
[INFO] deprecation - mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
[INFO] deprecation - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
[INFO] deprecation - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
[INFO] deprecation - mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
[INFO] deprecation - mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
\end{verbatim}
}
\url{https://groups.google.com/forum/#!msg/scoobi-dev/OiWMwnxpoQE/H4fjQBMF3vcJ}
\section{Resources in Hadoop}
\label{sec:Hadoop_resources}
A resource is an XML-based file that contains a set of name/value pairs.
\subsection{Read-only configuration}
These files contain the read-only configuration
\begin{verbatim}
src/core/core-default.xml
src/hdfs/hdfs-default.xml
src/mapred/mapred-default.xml.
\end{verbatim}
They come with some default settings.
By default, Hadoop has 2 resources, loaded in the order they are specified
in the classpath
\begin{verbatim}
core-default.xml # read-only information
core-site.xml # site-specific information (for a given Hadoop installation)
\end{verbatim}
The configuration entries take the form
\begin{verbatim}
<property>
<name>property-name</name>
<value>property-value</value>
</property>
\end{verbatim}
Example: specify the DataNode directory
\begin{verbatim}
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop/data/datanode</value>
<description>DataNode directory</description>
</property>
\end{verbatim}
We can declare a key/value pair as {\it final} (i.e. no subsequently-loaded
resource can modify the value). Typically, the {\it final} key/value pairs are
stored in the \verb!core-site.xml! file so that user applications cannot alter them.
\begin{verbatim}
<property>
<name>dfs.client.buffer.dir</name>
<value>/tmp/hadoop/dfs/client</value>
<final>true</final>
</property>
\end{verbatim}
There are many predefined names in Hadoop. From your application, you can
read and modify these key/value pairs through the class
\begin{verbatim}
java.lang.Object
  org.apache.hadoop.conf.Configuration
\end{verbatim}
which provides the methods to handle them.
\url{http://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/conf/Configuration.html}
\subsection{Site-specific configuration}
These files contain configuration that you can modify; it is
site-specific, i.e. it can differ from one machine to another.
\begin{verbatim}
conf/core-site.xml
conf/hdfs-site.xml
conf/mapred-site.xml
\end{verbatim}
You can control the Hadoop scripts found in the bin/ directory of the
distribution by setting site-specific values via conf/hadoop-env.sh
(Sect.\ref{sec:hadoop-env.sh}).
\subsection{core-site.xml}
\label{sec:core-site.xml}
The URI of the NameNode can be defined in the key \verb!fs.defaultFS! (the
deprecated key, since Hadoop 2.0, is \verb!fs.default.name!).
\url{https://issues.apache.org/jira/browse/AMBARI-2789}
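A minimal core-site.xml entry for the 4-node cluster used later in this chapter
might look like the sketch below; the hostname and port are assumptions (port
9000 is a common choice, but any free port works).
\begin{verbatim}
<property>
<name>fs.defaultFS</name>
<value>hdfs://mynode1:9000</value>
</property>
\end{verbatim}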
\subsection{yarn-site.xml}
\label{sec:yarn-site.xml}
We need to specify the ResourceManager machine, here via its resource-tracker
address (the address that NodeManagers report to)
\begin{verbatim}
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>10.10.10.104:8025</value>
</property>
\end{verbatim}
Then we choose which shuffle implementation handles the map output during the
shuffle phase
\begin{verbatim}
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
\end{verbatim}
\url{https://developer.yahoo.com/hadoop/tutorial/module1.html}
\section{Disk configuration}
\url{http://cloud-dba-journey.blogspot.com/2013/08/hadoop-reference-architectures.html}
Hadoop is quite different from other services in conventional data centers, such
as web, mail, and database servers. Operators are learning that to achieve optimal
performance, you need to pay particular attention to configuring the underlying
hardware.
\url{http://hortonworks.com/blog/proper-care-and-feeding-of-drives-in-a-hadoop-cluster-a-conversation-with-stackiqs-dr-bruno/}
Having lots of disks per server gives you more raw IO bandwidth than having one
or two big disks. If you have enough that different tasks can be using different
disks for input and output, disk seeking is minimized, which is one of the big
disk performance killers. That said: more disks have a higher power budget; if
you are power limited, you may want fewer but larger disks.
\subsection{Ext3 or Ext4 or XFS or btrfs or zfs}
\textcolor{red}{Don't use LVM; it adds latency and causes a bottleneck.}
\begin{enumerate}
\item XFS: internally it uses B+ trees and extents to store data. It is fully
supported in Ubuntu. \url{https://wiki.ubuntu.com/XFS}
\item Ext3/Ext4: use HTrees
\end{enumerate}
Yahoo! has publicly stated they use ext3. Regardless of the merits of the
filesystem, that means that HDFS-on-ext3 has been publicly tested at a bigger
scale than any other underlying filesystem that we know of.
Ext4 has better performance with large files. However, the Ext4 Linux filesystem
uses delayed allocation of data, which makes it handle unplanned server
shutdowns/power outages less well than classic ext3. Consider turning off the
delalloc option in /etc/fstab unless you trust your UPS.
XFS offers better disk space utilization than ext3 and has much quicker disk
formatting times than ext3. This means that it is quicker to get started with a
data node using XFS. However, most often the limitation is not raw disk reads;
I/O and RAM limitations will be more important. XFS is not included in basic
Red Hat Enterprise Linux.
So, \textcolor{red}{ext3 is the most common choice.} However, since Ubuntu
12.04, the preferred order is XFS, ext4, ext3.
Create one single partition
\begin{verbatim}
fdisk /dev/sdb
\end{verbatim}
Format the partition with EXT4
\begin{verbatim}
mkfs.ext4 /dev/sdb1
\end{verbatim}
Automount it to /data/0 (and /data/1, etc., if we have more disks)
by modifying the /etc/fstab file
\begin{verbatim}
/dev/sdb1  /data/0  ext4  defaults,noatime,nodiratime  1 2
\end{verbatim}
NOTE: We use these options because we don't need this information, and skipping
it improves disk performance
\begin{verbatim}
noatime = skip writing file access time to disk
          every time a file is accessed
nodiratime = skip writing directory access time
\end{verbatim}
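Before the fstab entry can take effect, the mount point must exist and should be
owned by the Hadoop user; a short sketch (assuming the hduser/hadoop user and
group created later in this chapter):
\begin{verbatim}
sudo mkdir -p /data/0
sudo mount /data/0                  # picks up the /etc/fstab entry
sudo chown -R hduser:hadoop /data/0
\end{verbatim}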
\url{http://wiki.apache.org/hadoop/DiskSetup}
\url{http://hortonworks.com/kb/linux-file-systems-for-hdfs/}
\subsection{RAID or JBOD (non-RAID)}
You don't need RAID disk controllers for Hadoop DataNodes, as HDFS replicates data
across multiple machines instead.
\url{http://wiki.apache.org/hadoop/DiskSetup}
A set of separate disks is better than the same set managed as a
RAID-0 disk array. The reason is that read speed can vary from disk to disk, and
RAID-0 runs at the speed of the slowest disk. So, if you have multiple
disks, you should use a comma-delimited list in the configuration file to specify
multiple locations. Also, with RAID-0, a single disk failure will bring the whole
node's data down.
Without RAID, losing a single disk does mean losing the data on that disk.
While drives are much more reliable than they used to be, disk failures can still
happen at any time.
\begin{verbatim}
Our experience indicates that a 1,000 node cluster containing 12,000 drives for
a total raw storage capacity of 48 peta-bytes can expect about 3 drive failures
a day in its third year of operation. Drive failure rates rise as the devices
age. For a 500 node cluster, you're looking at a drive failure every 17 hours or
so.
\end{verbatim}
\url{http://hortonworks.com/blog/proper-care-and-feeding-of-drives-in-a-hadoop-cluster-a-conversation-with-stackiqs-dr-bruno/}
Without the right tools and methodology, it is very difficult for cluster
operators to manage clusters at scale. They typically have to write scripts to
scan the cluster, detect disk failures, and report them. Then, once the
offending drive has been replaced, commands must be run for the controller to
recognize the new drive, OS commands need to be executed to format the drive,
and then some Hadoop commands are required to add the disk back to the configuration.
StackIQ software configures the boot disk as RAID-1 (bootdisk0, plus a
mirror bootdisk1), and all other non-boot disks as individual disks.
While the Hadoop NameNode and Secondary NameNode can write to a list of drive
locations, they will stop functioning if they cannot write to ALL of the locations.
In this case a mirrored RAID is a good idea for higher availability.
To determine where on the local filesystem a DataNode should store its
blocks, we modify \verb!dfs.datanode.data.dir! (formerly \verb!dfs.data.dir!).
If this is a comma-delimited list of directories, then data will be stored in
all named directories, typically on different devices. Directories that do not
exist are ignored.
Another parameter, \verb!dfs.namenode.name.dir!, determines where on the
local filesystem the HDFS NameNode should store the name table (fsimage). If
this is a comma-delimited list of directories, then the name table is replicated
in all of the directories, for redundancy.
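A sketch of the corresponding hdfs-site.xml entries for a node with two data
disks mounted at /data/0 and /data/1 (the paths are assumptions matching the
layout used in this chapter):
\begin{verbatim}
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/0/dfs/dn,file:///data/1/dfs/dn</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/0/dfs/nn,file:///data/1/dfs/nn</value>
</property>
\end{verbatim}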
\subsection{Separate disk or separate partition}
Cloudera uses separate {\it disks} (not just partitions) for the OS and for the
data disks used by the DataNode (DN) and/or TaskTracker (TT).
\url{http://www.quora.com/What-is-the-best-disk-partitioning-scheme-for-a-Hadoop-DataNode}
\textcolor{red}{It is recommended to use one big giant partition for each disk,
i.e. the disk /dev/sdb becomes /dev/sdb1.}
You usually don't want Hadoop's logs going to one of the data disks either, as
that usually creates enough contention to degrade the performance of the drive,
not to mention the imbalance in space consumption it creates.
Paths to the DataNode and TaskTracker on the same disk: this is the most common
deployment choice. You partition each data disk into one giant partition that
encompasses the entire disk and mount it at \verb!/data/<disk number>!.
Under this directory, you create \verb!mapred/local! for the TT and \verb!dfs/dn!
for the DN. You configure HDFS to reserve space for MapReduce local data (set
\verb!dfs.datanode.du.reserved! in \verb!hdfs-site.xml! (Sect.\ref{sec:Hadoop_resources})
to the number of bytes to leave for mapred on each disk), as sketched below.
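A minimal sketch of that reservation (the value here, 10 GB per volume, is just
an example):
\begin{verbatim}
<property>
<name>dfs.datanode.du.reserved</name>
<value>10737418240</value>
</property>
\end{verbatim}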
Paths to the DataNode and TaskTracker on dedicated disks:
this is less common, but also has some merit to it. In this case, you still
create one large partition on each disk and mount them the same as above, but you
only give the DN some number of disks, leaving a few to the TT for MapReduce local
data. The thinking here is that the DNs have a different IO profile than the TT,
and wind up using the disks very differently. The downside is that you
sacrifice quite a bit of capacity and throughput, since the TT is usually given a
much smaller spindle count than the DN. The TT disks wind up seeing poor
utilization (in most cases), although it's usually more predictable.
Slave node partition configuration
\begin{verbatim}
/swap - 96 GB (for a 48GB memory system)
/root - 20GB (ample room for existing files, future log file growth, and OS upgrades)
/grid/0/ - [full disk GB] first partition for Hadoop to use for local storage
/grid/1/ - second partition for Hadoop to use
/grid/2/ - ...
\end{verbatim}
\subsection{NameNode fault-tolerance}
Use NameNode HA support with the Quorum Journal Manager (QJM) to tolerate disk
failure, or use RAID-1 (mirroring) for the NameNode's metadata disks.
\subsection{DataNode}
All data disks should always be JBOD (just a bunch of disks: each disk
with a discrete filesystem, mounted separately).
\section{Deploy 1 machine}
\url{http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/}
\section{Deploy a cluster}
Suppose the cluster has 4 machines
\begin{verbatim}
10.10.10.104 mynode1
10.10.10.105 mynode2
10.10.10.106 mynode3
10.10.10.108 mynode4
\end{verbatim}
each running Ubuntu 12.04 and Hadoop 2.5.2.
First, you need to configure a static IP for each machine (read the sys-admin
book). Notice there are some changes in how the DNS server IPs are configured
locally along with a static IP on Ubuntu 12.04.
Installing Hadoop on the cluster typically involves (1) unpacking the compiled
Hadoop distribution on all the machines in the cluster, i.e. putting it at
\begin{verbatim}
/usr/local/hadoop --> /usr/local/hadoop-2.5.2
\end{verbatim}
and then (2) modifying the configuration files properly. They are in
\begin{verbatim}
/usr/local/hadoop/etc/hadoop/
\end{verbatim}
We should access everything via environment variables, e.g.
\begin{verbatim}
HADOOP_HOME
\end{verbatim}
by putting them in the \verb!~/.profile! or \verb!~/.bashrc! file, as sketched
below.
All machines in the cluster usually have the same \verb!HADOOP_HOME! path.
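A minimal sketch of the environment setup in \verb!~/.bashrc! (the paths assume
the /usr/local/hadoop layout above and the Oracle JDK installed in the next
sections):
\begin{verbatim}
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
\end{verbatim}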
It is also important to know the structure of a Hadoop cluster.
Typically, one machine in the cluster is designated as the NameNode and another
machine as the JobTracker, exclusively. These are the masters. The rest of the
machines in the cluster act as both DataNode and TaskTracker. These are the slaves.
\subsection{Apply all machines - add new Hadoop users}
Before we talk about how to manage multiple users in a distributed Hadoop cluster
(Sect.\ref{sec:manage_users_Hadoop}), we first create one user with all the
privileges. A group called \verb!hadoop! should be used, and all users who want to
use Hadoop should belong to this group.
\begin{verbatim}
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
\end{verbatim}
\subsection{First node (public/private key for remote login)}
To avoid retyping the password for the newly created Hadoop user account, we
create a public/private key pair so that from the first node we can log in to all
other nodes easily.
\begin{verbatim}
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
\end{verbatim}
and copy the key material to all other nodes (here shown for mynode2; repeat for
mynode3 and mynode4)
\begin{verbatim}
scp -r ~/.ssh hduser@mynode2:~/
\end{verbatim}
\subsection{First node (NameNode) - packages for compiling Hadoop}
First, we need to understand the structure
of a Hadoop cluster: Sect.\ref{sec:hadoop_node_structure}.
The first and most important machine is the NameNode; we do the installation on
this node first.
\begin{enumerate}
\item install the libraries to compile Hadoop
\begin{verbatim}
sudo apt-get install maven build-essential zlib1g-dev cmake pkg-config
libssl-dev protobuf-compiler
\end{verbatim}
NOTE: You can also download the pre-compiled Hadoop binaries, but their native
libraries are 32-bit. You can still run them on a 64-bit O/S, but you may get
some warnings and may not be able to read files larger than 4GB.
NOTE: \verb!protobuf-compiler! may cause version problems depending on the
Ubuntu version. You can also download the source code of
protoc and compile it first.
\end{enumerate}
\subsection{Apply all machines (Java JDK) - for compiling and running Hadoop}
\label{sec:configure_Java}
\begin{enumerate}
\item Install Oracle Java JDK 7 or 8:
\begin{verbatim}
sudo add-apt-repository ppa:webupd8team/java -y
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
\end{verbatim}
\item some useful utilities
\begin{verbatim}
sudo apt-get install screen nmap
\end{verbatim}
\end{enumerate}
\subsection{First node (compile Hadoop to native)}
You don't need this step if you just use the pre-compiled (32-bit only) version of
Hadoop.
Hadoop provides native implementations of certain components for performance
reasons and because Java implementations are not available for them.
These components are available in a single, dynamically-linked native library
called the native hadoop library. On *nix platforms the library is named
\verb!libhadoop.so!.
Download the Hadoop source version that matches the Ubuntu O/S version being used
\url{http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/NativeLibraries.html}
\begin{verbatim}
wget
http://www.eu.apache.org/dist/hadoop/core/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz
\end{verbatim}
If the source is used, unpack and compile (make sure you have the required
utilities and libraries from the previous section)
\begin{verbatim}
tar -xvf hadoop-2.4.1-src.tar.gz
cd hadoop-2.4.1-src/
mvn package -Pdist,native -Dmaven.javadoc.skip=true -DskipTests -Dtar
\end{verbatim}
Possible error:
\begin{enumerate}
\item javah not found
\begin{verbatim}
/usr/lib/jvm/java-8-oracle/jre/bin/javah
\end{verbatim}
SOLUTION: we may need to change \verb!JAVA_HOME! to
\begin{verbatim}
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
\end{verbatim}
instead of \verb!/usr/lib/jvm/java-8-oracle/jre!.
\item Firewall problem
If the machine is behind a firewall, Maven (mvn) needs an XML configuration
file with information about the proxy settings, as sketched below.
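A sketch of such a proxy configuration in Maven's \verb!~/.m2/settings.xml! (the
proxy id, host, and port are placeholders for your site's values):
\begin{verbatim}
<settings>
  <proxies>
    <proxy>
      <id>corp-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.example.com</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>
\end{verbatim}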