Hadoop in Action

4.2 The building blocks of Hadoop

Running Hadoop means running a set of daemons, or resident programs, on the various servers in the network. Each daemon has a specific role; some exist only on a single server, while others run across several servers. The daemons are the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker (Chuck, 2014). These daemons and their roles within Hadoop form a significant part of this article, as the following discussion illustrates.

4.2.3 Secondary NameNode

The Secondary NameNode (SNN) is an assistant daemon that monitors the state of the cluster's HDFS. Like the NameNode, each cluster has a single SNN, and it typically resides on its own machine; no DataNode or TaskTracker daemon runs on the same server. The SNN differs from the NameNode in that it does not receive or record real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration. Because the NameNode is a single point of failure for a Hadoop cluster, the SNN snapshots help minimize downtime and data loss. Nevertheless, a NameNode failure requires human intervention to reconfigure the cluster to use the Secondary NameNode as the primary NameNode.
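
To make the checkpoint interval mentioned above concrete, the following sketch sets it programmatically. The property name "fs.checkpoint.period" is the classic (Hadoop 1.x) name and the one-hour value is an illustrative assumption; in practice this setting normally lives in the cluster configuration files rather than in code.

import org.apache.hadoop.conf.Configuration;

public class CheckpointInterval {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Interval, in seconds, between Secondary NameNode snapshots of the
        // HDFS metadata; newer releases call it "dfs.namenode.checkpoint.period".
        conf.setLong("fs.checkpoint.period", 3600L);
        System.out.println("SNN checkpoint every "
                + conf.getLong("fs.checkpoint.period", 3600L) + " seconds");
    }
}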

4.2.4 JobTracker

The JobTracker daemon is the liaison between an application and Hadoop. Once code is submitted to the cluster, the JobTracker determines an execution plan by deciding which files to process, assigns tasks to different nodes, and monitors all running tasks. In case a task fails, the JobTracker automatically relaunches it, possibly on a different node. There is only one JobTracker daemon per Hadoop cluster; it typically runs on a server acting as the master node of the cluster.
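
To make the JobTracker's role concrete, the following sketch submits a job to it using the classic (pre-YARN) MapReduce API. The input and output paths are illustrative assumptions, and no mapper or reducer classes are set so that Hadoop's identity defaults keep the example self-contained.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitToJobTracker {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitToJobTracker.class);
        conf.setJobName("pass-through");
        // With no mapper/reducer set, the identity classes copy input to output.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path("/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/output"));
        // JobClient hands the job to the JobTracker, which plans, assigns,
        // and monitors the individual tasks.
        JobClient.runJob(conf);
    }
}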

4.2.5 TaskTracker

As with the storage daemons, the computing daemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job, whereas the TaskTrackers manage the execution of individual tasks on each slave node. This interaction is illustrated in Figure x.x. Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel. One responsibility of the TaskTracker is to constantly communicate with the JobTracker by sending heartbeats. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it assumes the TaskTracker has crashed and resubmits the corresponding tasks to other nodes in the cluster.

5 HDFS

5.x File Read and Write

An application adds data to HDFS by creating a new file and writing data to it. After the file is closed, the bytes written cannot be altered or removed, although new data can be added by reopening the file for append. HDFS implements a single-writer, multiple-reader model. An HDFS client that opens a file for writing is granted a lease for the file; no other client can then write to that file. The writing client renews the lease periodically by sending heartbeats to the NameNode, and the lease is revoked when the file is closed.
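
The single-writer model can be seen directly in the client API. The following sketch creates a file, closes it, and then reopens it for append; the path and record contents are illustrative assumptions, and append requires a file system and release that support it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteThenAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/log.txt");

        // Creating the file grants this client the write lease.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first record\n");
        } // closing the file revokes the lease

        // Reopening for append acquires a new lease; the existing bytes stay immutable.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended record\n");
        }
        fs.close();
    }
}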

The duration of a lease is bound by a soft limit and a hard limit. Until the soft limit expires, the writer has exclusive write access to the file. If the soft limit expires and the client has neither closed the file nor renewed the lease, another client can preempt the lease. If the hard limit expires and the client has still failed to renew the lease, HDFS assumes that the client has quit and automatically closes the file on the writer's behalf to recover the lease. A writer's lease does not prevent other clients from reading the file; a file may have many concurrent readers.

An HDFS file consists of blocks. When a new block is needed, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host replicas of the block. The DataNodes form a pipeline whose order minimizes the total network distance from the client to the last DataNode. Bytes are pushed through the pipeline as a sequence of packets. The bytes an application writes are first buffered on the client side; once a packet buffer fills (typically 64 KB), the packet is pushed through the pipeline. The next packet can be pushed before the acknowledgement for the previous packet arrives, but the number of outstanding packets is limited by the client's outstanding-packet window size (Konstantin, Hairong, Sanjay and Robert, 2010).

When data is written to an HDFS file, HDFS does not guarantee that the data is visible to new readers until the file is closed. If a user application needs that visibility guarantee, it can explicitly call the hflush operation. The current packet is then immediately pushed through the pipeline, and the hflush operation waits until every DataNode in the pipeline acknowledges the transmission of the packet. All data written before the hflush operation is then guaranteed to be visible to readers.
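
The following is a minimal sketch of the visibility guarantee just described, using a hypothetical path: after hflush() returns, the bytes written so far are visible to new readers even though the file is still open.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/data/stream.log"))) {
            out.writeBytes("event 1\n");
            // Push the current packet down the pipeline and wait for the
            // DataNodes to acknowledge it; readers can now see "event 1".
            out.hflush();
            out.writeBytes("event 2\n"); // not guaranteed visible until the next hflush or close
        }
        fs.close();
    }
}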

 

 

When no errors occur, block construction goes through three stages, as shown in Figure 2, which illustrates a pipeline of three DataNodes and a block of five packets. In the figure, bold lines represent data packets, dashed lines represent acknowledgement messages, and thin lines represent the control messages that set up and close the pipeline. Vertical lines represent activity at the client and at the three DataNodes, with time proceeding from top to bottom. The interval t0 to t1 is the pipeline setup stage. The interval t1 to t2 is the data streaming stage, where t1 is the time the first data packet is sent and t2 is the time the acknowledgement for the last packet is received; here an hflush operation transmits the second packet, and the hflush indication travels together with the packet data rather than as a separate operation. Finally, the interval t2 to t3 is the pipeline close stage for the block.

In a cluster of thousands of nodes, node failures (mostly storage faults) occur daily. A replica stored on a DataNode can become corrupted because of faults in memory, disk, or the network. HDFS therefore generates and stores checksums for each data block of an HDFS file. Checksums are verified by the HDFS client while reading, to detect any corruption caused by the client, the DataNodes, or the network. When a client creates an HDFS file, it computes the checksum sequence for each block and sends it to the DataNodes along with the data. When HDFS reads a file, each block's data and checksums are shipped to the client; the client computes checksums for the received data and verifies that they match the checksums it received. If the verification fails, the client notifies the NameNode of the corrupt replica and then fetches a different replica of the block from another DataNode.

When a client opens a file to read, it fetches the list of blocks and the locations of each block's replicas from the NameNode. The locations of each block are ordered by their distance from the reader. If a read attempt fails, the client tries the next replica in the sequence. A read fails if the target DataNode is unavailable, the node no longer hosts a replica of the block, or the replica is found to be corrupt when the checksums are tested. HDFS also permits a client to read a file that is open for writing. In that case, the length of the last block, which is still being written, is unknown to the NameNode, so the client asks one of the replicas for the latest length before reading its content. The design of HDFS I/O is particularly optimized for batch processing systems, such as MapReduce, which require high throughput for sequential reads and writes. However, ongoing improvements to read and write response times support applications such as Scribe, which provides real-time data streaming into HDFS, and HBase, which provides random access to large tables.
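
The following is a minimal sketch of the client read path just described, assuming a hypothetical file path. Checksum verification is enabled by default; setVerifyChecksum(true) merely makes that explicit here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksummedRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.setVerifyChecksum(true); // verify block checksums while reading

        try (FSDataInputStream in = fs.open(new Path("/data/log.txt"))) {
            byte[] buffer = new byte[8192];
            int n;
            // The client reads from the closest replica; on a checksum failure it
            // reports the corrupt replica to the NameNode and tries another one.
            while ((n = in.read(buffer)) != -1) {
                System.out.write(buffer, 0, n);
            }
            System.out.flush();
        }
        fs.close();
    }
}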

5.x Block Placement

For a large cluster, it is not practical to connect all nodes in a flat topology. Common practice is to spread the nodes across multiple racks. Nodes on the same rack share a switch, and rack switches are connected to one or more core switches. Communication between two nodes on different racks has to pass through multiple switches. In most cases, the network bandwidth between nodes on the same rack is greater than the network bandwidth between nodes on different racks. Figure x.x describes a cluster with two racks, each of which contains three nodes.

Figure x.x cluster topology example

 

HDFS estimates the network bandwidth between two nodes by their distance. The distance from a node to its parent node is one, and the distance between two nodes is the sum of their distances to their closest common ancestor. A shorter distance between two nodes means greater bandwidth available for transferring data (Konstantin, Hairong, Sanjay and Robert, 2010).
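
To make the distance rule above concrete, the following self-contained sketch computes it for node locations written as "/rack/node" paths. The helper and the example paths are hypothetical illustrations, not the actual Hadoop NetworkTopology API.

public class TopologyDistance {

    // Distance = sum of each node's hops up to their closest common ancestor.
    static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int common = 0;
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        // Every remaining path component is one hop towards the common ancestor.
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/rack1/node1", "/rack1/node1")); // 0: same node
        System.out.println(distance("/rack1/node1", "/rack1/node2")); // 2: same rack
        System.out.println(distance("/rack1/node1", "/rack2/node3")); // 4: different racks
    }
}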

HDFS lets administrators configure a script that returns a node's rack identification given the node's address. The NameNode is the central place that resolves the rack location of every DataNode. When a DataNode registers with the NameNode, the NameNode runs the configured script to decide which rack the node belongs to. If no script is configured, the NameNode assumes that all nodes belong to a single default rack. The placement of replicas is critical to HDFS data reliability and read/write performance, and a good replica placement policy improves data reliability, availability, and network bandwidth utilization. In addition, HDFS provides a configurable block placement policy interface so that users and researchers can experiment with and test policies suited to their applications.
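
As an illustration of the script configuration just described, the following sketch points the NameNode at a rack-resolution script. The property name "topology.script.file.name" is the classic (Hadoop 1.x) name, newer releases use "net.topology.script.file.name", and the script path is a hypothetical example; in practice this is normally set in the cluster configuration files.

import org.apache.hadoop.conf.Configuration;

public class RackScriptConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Script that maps a DataNode address to a rack ID such as "/rack1".
        conf.set("topology.script.file.name", "/etc/hadoop/rack-topology.sh");
        System.out.println(conf.get("topology.script.file.name"));
    }
}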

The default HDFS block placement policy provides a trade-off between minimizing the write cost and maximizing data reliability, availability, and aggregate read bandwidth. When a new block is created, HDFS places the first replica on the node where the writer is located and the second and third replicas on two different nodes in a different rack. The remaining replicas are placed on random nodes, with the restrictions that no more than one replica is placed at any node and no more than two replicas are placed in the same rack, as long as the number of replicas is less than twice the number of racks. Placing the second and third replicas on a different rack distributes the block replicas of a single file across the cluster; if the first two replicas were placed on the same rack, then for any file two-thirds of its block replicas would share that rack (Konstantin, Hairong, Sanjay and Robert, 2010).

After the target nodes are selected, they are organized into a pipeline in the order of their proximity to the first replica. For reading, the NameNode first checks whether the client's host is located in the cluster; if so, the block locations are returned to the client ordered by distance, and the block is read from the closest DataNode. It is common for a MapReduce application to run on the cluster nodes, but as long as a host can connect to the NameNode and the DataNodes, it can execute the HDFS client. This policy reduces inter-rack and inter-node write traffic and generally improves write performance. Because the chance of a rack failure is far smaller than that of a node failure, the policy does not compromise data reliability or availability guarantees. In the usual case of three replicas, it also reduces the aggregate network bandwidth used when reading data, since a block is placed in only two racks rather than three. In summary, the default HDFS replica placement policy has two important properties: first, no DataNode contains more than one replica of any block; second, no rack contains more than two replicas of the same block, provided there are sufficient racks in the cluster.

Replication management

The NameNode endeavours to ensure that each block always has the intended number of replicas. It detects that a block has become under- or over-replicated when it receives a block report from a DataNode. When a block is over-replicated, the NameNode chooses a replica to remove: its first preference is not to reduce the number of racks that host replicas, and its second preference is to remove the replica from the DataNode with the least available disk space. The goal is to balance storage utilization across the DataNodes without reducing the block's availability.
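
The intended number of replicas the NameNode maintains is set per file. The following sketch changes it through the client API; the path and the factor of three are illustrative assumptions, and the actual re-replication or replica removal is carried out afterwards by the NameNode as described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactor {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events.log");
        // Ask HDFS to keep 3 replicas of every block of this file; the NameNode
        // adds or removes replicas until the block reports converge on this target.
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("Replication change accepted: " + accepted);
        fs.close();
    }
}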

When a block becomes under-replicated, it is placed in a replication priority queue. A block with only one replica has the highest priority, whereas a block whose number of replicas is greater than two-thirds of its replication factor has the lowest priority. A background thread scans the head of the replication queue and decides where to place the new replicas. Block replication follows a policy similar to new block placement: if only one replica exists, HDFS places the next replica on a different rack; if the block has two existing replicas on the same rack, the third replica is placed on a different rack; otherwise, the third replica is placed on a different node in the same rack as an existing replica. The goal here is to reduce the cost of creating new replicas (Konstantin, Hairong, Sanjay and Robert, 2010).

The NameNode also ensures that not all replicas of a block are located on a single rack. If the NameNode detects that a block's replicas all end up on one rack, it treats the block as under-replicated, and the block is replicated to a different rack using the same block placement policy. Once the NameNode receives the notification that the new replica has been created, the block becomes over-replicated, and the NameNode then decides to remove an old replica, because the over-replication policy prefers not to reduce the number of racks.

MapReduce

Reading and writing

MapReduce's distributed processing requires making certain assumptions about the data being processed, yet it remains flexible in handling a variety of data formats. Input data usually resides in large files, often a hundred gigabytes or more. One fundamental principle behind MapReduce's processing power is the splitting of the input data into chunks that can be processed in parallel on multiple machines. These chunks, called input splits, should be small enough to allow granular parallelization; if all the input data were placed in a single split, there would be no parallelization.

According to Chuck (2014), the splits should not be so small that the overhead of starting and stopping the processing of each split becomes a large fraction of the execution time. The principle of dividing the input data, that is, splitting one massive file for parallel processing, explains some of the design decisions behind Hadoop's FileSystem abstraction and HDFS in particular. The Hadoop FileSystem provides the class FSDataInputStream for reading files, rather than using Java's java.io.DataInputStream directly. FSDataInputStream extends DataInputStream with the random-access capability that MapReduce requires, because a machine may begin processing a split that sits in the middle of an input file; without random access it would have to read the file from the beginning up to the split, which would be extremely inefficient. HDFS is likewise designed to store data that MapReduce can split and process in parallel: it stores files in blocks spread over multiple machines, and since different machines hold different blocks, parallelization is automatic when each split (or block) is processed by the machine holding it. Moreover, because HDFS replicates blocks onto multiple nodes for reliability, MapReduce can choose any of the nodes that have a copy of a split or block. By default, Hadoop considers each line of an input file to be a record, and the key/value pair is the byte offset and the content of that line.
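
The following is a minimal sketch of the random access that FSDataInputStream provides, assuming a hypothetical HDFS path and an arbitrary split offset. A real MapReduce record reader would additionally resynchronize to the next record boundary after seeking.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekToSplit {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long splitStart = 64L * 1024 * 1024; // assume this split begins at 64 MB

        try (FSDataInputStream in = fs.open(new Path("/input/huge.log"))) {
            // Jump straight to the split boundary instead of reading from byte 0.
            in.seek(splitStart);
            byte[] buffer = new byte[4096];
            int read = in.read(buffer);
            System.out.println("Read " + read + " bytes starting at offset " + splitStart);
        }
        fs.close();
    }
}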

10 Hadoop security

Big Data is a challenge not only in storing, processing, and analysing data but also in managing and securing large data assets. Hadoop did not originally have any built-in security. As enterprises adopted Hadoop, a security model based on Kerberos evolved; however, the distributed nature of the ecosystem, together with the wide range of applications built on top of Hadoop, complicates the process of securing it in an enterprise.

A typical Big Data ecosystem involves multiple stakeholders interacting with the system. For example, experts within an organization may interact with the ecosystem through business intelligence and analytical tools, yet a business analyst in the finance department should not be able to access data belonging to the human resources department. Business intelligence tools therefore need well-defined levels of access to the Hadoop ecosystem, depending on the protocols and the data they use to communicate. One of the biggest challenges for an enterprise Big Data project is securing the integration of external data sources such as CRM systems, existing ERP systems, websites, and social blogs; external connectivity has to be established so that the data extracted from these external sources is available within the Hadoop ecosystem (Sudheesh, 2013).

10.1 Understanding the security challenges

Security was not a consideration during the initial development of Hadoop. The original objective was to manage large amounts of publicly available web data, so data security and privacy were not priorities, and it was assumed that Hadoop clusters would consist of cooperating, trusted machines used by trusted users in a secure environment. The early security model did not authenticate users or services and did not guarantee data privacy. Because Hadoop was designed to execute code over a distributed cluster of machines, anyone could submit code to be executed. Although file permissions and auditing were implemented in the earlier distributions, such authorization controls were easily circumvented, because any user could impersonate another simply by switching user names on the command line. Since impersonation was so easy, these security mechanisms were not effective (Alexey, Kevin, Boris, 2013).

In the past, organizations that were concerned about these security weaknesses responded by placing Hadoop clusters on private networks and restricting access to authorized users. However, because there were few security controls within Hadoop itself, accidents and security incidents were common in such environments. Even well-intentioned users could make mistakes, for example deleting data used by other users; a distributed delete can destroy a huge amount of data in a very short time. All users and programmers had the same level of access to all the data in the cluster, any job could access any data in the cluster, and any user could potentially read any data set. This was a serious concern for confidentiality. Because MapReduce had no concept of authentication or authorization, a mischievous user could lower the priority of other users' Hadoop jobs in order to make his own jobs complete faster.

As Hadoop gained popularity, data analysts and security experts began to express concern about the insider threat that malicious users posed to Hadoop clusters. A malicious developer could write code that impersonated other users of Hadoop services, for example by writing and registering a new TaskTracker with the cluster, or by impersonating the hdfs or mapred system users and deleting everything in HDFS. Because the DataNodes enforced no access control, a malicious user could read arbitrary data blocks from the DataNodes, or write garbage to them, undermining the integrity of the data being analysed. In addition, intruders could submit jobs to the JobTracker and have it execute arbitrary tasks.

As Hadoop's popularity grew, stakeholders realized that comprehensive security controls needed to be built into it. Security experts called for an authentication mechanism that would require users, client programs, and servers within a Hadoop cluster to prove their identities. Authorization was also cited as a necessity, along with further security concerns such as auditing, privacy, integrity, and confidentiality. However, none of these other concerns could be addressed without authentication, so authentication became the critical aspect that drove the redesign of Hadoop security.

This need for authentication led a team at Yahoo! to introduce Kerberos as the basis of Hadoop security. The strategy required, first, that users can access only the HDFS files for which they have permission; second, that users can access and modify only their own MapReduce jobs; third, that users are authenticated, to prevent unauthorized TaskTrackers, JobTrackers, DataNodes, and NameNodes; fourth, that services are authenticated, to prevent unauthorized services from joining the cluster; and finally, that Kerberos tickets and credentials are transparent to users and applications. Kerberos was integrated into Hadoop as the mechanism for securely authenticating users and services across the network and for controlling Hadoop processes. Since the introduction of Kerberos, Hadoop and the tools in its ecosystem have continued to evolve their security features to meet the needs of modern users (Alexey, Kevin, Boris, 2013).

 

10.2 Hadoop Kerberos security implementation

Enforcing security in a distributed system such as Hadoop is complex. The requirements for securing a Hadoop system were set out in the Hadoop security design. In summary, they cover user-level access controls, service-level access controls, user-service authentication, and three token types: the Delegation Token, the Job Token, and the Block Access Token.

User-level access controls

The user-level access controls include the following recommendations: first, users of Hadoop should be able to access only the data that belongs to them; second, only authenticated users can submit jobs to the Hadoop cluster; third, users can view, modify, and kill only their own jobs; fourth, only authenticated services should be able to register as DataNodes or TaskTrackers; and finally, data block access within a DataNode should be secured so that only authenticated users can access the data stored in the Hadoop cluster.

Service-level access controls

The service-level access controls include the following:

•  Scalable authentication: A Hadoop cluster consists of a large number of nodes, so the authentication model must be scalable to support authentication across such a large network.

•  Impersonation: A Hadoop service should be able to impersonate the user submitting a job so that user isolation is maintained.

•  Self-served: A Hadoop job may run longer than the lifetime of the user's delegated authentication, so the job must be able to renew its delegated credentials on its own in order to complete.

•  Secure IPC: Hadoop services should mutually authenticate one another and ensure that the communication between them is secure.

The preceding requirements can be met by having Hadoop leverage the Kerberos authentication protocol together with internally generated, secured tokens within the Hadoop cluster.

User and service authentication

User authentication to the NameNode and JobTracker services is provided through Hadoop's remote procedure call layer using the Simple Authentication and Security Layer (SASL) framework. Kerberos is used as the authentication protocol within SASL to verify the identity of users, and all Hadoop services support Kerberos authentication. However, when a client submits a MapReduce job to the JobTracker, the job itself later needs to access Hadoop resources on the user's behalf. This is achieved using three types of tokens: the Delegation Token, the Job Token, and the Block Access Token.
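
The following is a minimal sketch of a client authenticating to a Kerberos-secured cluster before talking to the NameNode or JobTracker. The principal and keytab path are hypothetical, and in practice the security settings usually come from core-site.xml rather than being set in code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Obtain Kerberos credentials from a keytab instead of an interactive kinit.
        UserGroupInformation.loginUserFromKeytab(
                "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
        System.out.println("Logged in as: "
                + UserGroupInformation.getCurrentUser().getUserName());
    }
}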

Delegation Token

Delegation Token authentication is a two-party authentication protocol based on Java SASL DIGEST-MD5. The Delegation Token is used between the user and the NameNode to authenticate the user: once the NameNode has authenticated the user via Kerberos, it issues a Delegation Token to the user, and a user holding a valid Delegation Token does not have to go through Kerberos authentication again. The user can also designate the JobTracker or ResourceManager as the renewer of the Delegation Token when requesting it. The Delegation Token is then sent securely to the JobTracker or ResourceManager along with the submitted job, and the JobTracker assumes the role of the user by using the Delegation Token to access HDFS resources. For long-running jobs, the JobTracker renews the Delegation Token.
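
The following is a minimal sketch of fetching an HDFS Delegation Token with a designated renewer, as described above. The renewer name "mapred" and the use of FileSystem.getDelegationToken are assumptions for recent Hadoop releases; job-submission clients normally obtain these tokens automatically on a secured cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.token.Token;

public class FetchDelegationToken {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The renewer is the principal allowed to renew this token, e.g. the JobTracker user.
        Token<?> token = fs.getDelegationToken("mapred");
        if (token != null) {
            System.out.println("Token kind: " + token.getKind()
                    + ", service: " + token.getService());
        } else {
            System.out.println("No token issued (security may be disabled on this cluster).");
        }
        fs.close();
    }
}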

Job Token

Similarly, tasks run on the task nodes, and user access to them must be secured. When a user submits a MapReduce job to the JobTracker, the JobTracker creates a secret key that is shared with the TaskTrackers that will run the MapReduce job; this secret key is the Job Token. The Job Token is stored on the local disk of each TaskTracker in a directory accessible only to the user who submitted the job. The TaskTracker starts the child JVM for each task using the identity of the user who submitted the job, so that the child JVM can read the Job Token from that directory and use it to communicate securely with the TaskTracker. The Job Token thus ensures that authenticated users submitting jobs in Hadoop access only the folders and jobs on the task nodes' local file systems that they are authorized to use, and it also secures the communication when a reduce task contacts the TaskTrackers that ran the map tasks to collect the mapper output files.

Block Access Token

When a client requests data from HDFS, it has to fetch the data blocks from the DataNodes after obtaining the block identities from the NameNode, so a secure mechanism is needed to pass the user's access privileges to the DataNodes. The main function of the Block Access Token is to ensure that only authorized users can access the data blocks stored in the DataNodes. When a client wants to access data stored in HDFS, it asks the NameNode for the block identities and the DataNode locations, and it then contacts the DataNodes to fetch the blocks. Because the user's authorization is enforced at the NameNode, Hadoop uses the Block Access Token to enforce it at the DataNodes as well: the NameNode issues Block Access Tokens to the Hadoop client, and the client presents them to the DataNodes to carry the data access authorization information.

The Block Access Token is implemented using symmetric key encryption, with the NameNode and all DataNodes sharing a common secret key. Each DataNode receives the secret key when it registers with the NameNode, and the key is regenerated periodically. A Block Access Token is lightweight and contains the access modes, a block ID, an owner ID, a key ID, and an expiration date; the access modes define the permissions available to the named user for the requested block. Block Access Tokens generated by the NameNode are not renewable and have to be fetched again once they expire. The tokens ensure that the security of the data blocks stored in the DataNodes is upheld and that only authorized users can access a data block (Sudheesh, 2013). The following figure shows the various interactions in a secured Hadoop cluster.

Figure x.x Interactions in a secured Hadoop cluster

The overall Kerberos operation in Hadoop involves several key steps. First, all Hadoop services authenticate themselves with the KDC; as part of this, DataNodes register with the NameNode, TaskTrackers register with the JobTracker, and NodeManagers register with the ResourceManager. Second, the client authenticates itself with the KDC and requests service tickets for the NameNode and for the JobTracker or ResourceManager. Third, to access an HDFS file, the client connects to the NameNode server; the NameNode authenticates the client and provides it with the authorization details along with the Block Access Tokens, which the DataNodes require in order to validate the client's authorization and grant access to the corresponding blocks. Finally, to submit MapReduce jobs to the Hadoop cluster, the client requests a Delegation Token from the JobTracker, and this Delegation Token is used for the subsequent submission of MapReduce jobs to the cluster (Sudheesh, 2013).

 

References

Alexey, Y., Kevin, T. & Boris, L. (2013). Professional Hadoop Solutions. Wrox/John Wiley & Sons.

Chuck, L. (2014). Hadoop in Action. Manning Publications.

Konstantin, S., Hairong, K., Sanjay, R. and Robert, C. (2010). The Hadoop Distributed File System. In Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST ’10), Incline Village, Nevada.

Sudheesh, N. (2013). Securing Hadoop. Birmingham: Packt Publishing.