Arth TASK

Nishant Saini
11 min read · May 19, 2021

According to popular articles, Hadoop uses the concept of parallelism to upload split data, thereby solving the Velocity problem of Big Data.
👉🏻 Research and conclude this statement with proper proof
✴️Hint: tcpdump

* To perform this task, we will follow the steps below -

<1>. NameNode Configuration

<2>. DataNode Configuration

<3>. Client Node Configuration

<4>. Find Who Uploads Data to the DataNode (Client or NameNode) & How Replication Works.

<5>. Check Whether “Hadoop Uses the Concept of Parallelism to Upload the Split Data While Fulfilling the Velocity Problem” Is Right or Not

* For this task, we have a Hadoop cluster on my local system -

<1>. NameNode Configuration (“NN1”)-

1 (A) Create “/nn” Directory -
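This step just creates the directory the NameNode will use to store its metadata (a minimal sketch; only the path “/nn” comes from this article):

```
# On NN1: directory that will hold the HDFS metadata
mkdir /nn
```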

1 (B) “hdfs-site.xml” file configuration -
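A minimal sketch of this file for a Hadoop 1.x NameNode using “/nn” (the exact file in my setup may differ slightly):

```xml
<!-- hdfs-site.xml on NN1 (Hadoop 1.x property name) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
```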

1 (C) “core-site.xml” configuration -
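And the matching “core-site.xml”, which makes the NameNode listen on port 9001, the port used throughout this article (binding to 0.0.0.0 is my assumption, so that any host can reach it):

```xml
<!-- core-site.xml on NN1 -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
```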

1 (D) Format NameNode -
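Formatting initializes the metadata store in “/nn” (standard Hadoop 1.x command; run once, on NN1 only):

```
hadoop namenode -format
```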

1 (E) Stop Firewalld -
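Stopping firewalld keeps the cluster ports (9001, 50010, …) reachable in this lab setup (on a real cluster you would open the specific ports instead):

```
systemctl stop firewalld
```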

1 (F) Start NameNode -
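Starting the daemon and confirming it is up (a sketch; “jps” is just one way to verify):

```
hadoop-daemon.sh start namenode
jps   # should list "NameNode"
```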

<2>. DataNode Configuration (“DN1”, “DN2”, “DN3”) -

2 (A) Create “/dn” -
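Same idea as on the NameNode, but this directory will store the actual HDFS blocks:

```
# On each of DN1, DN2, DN3
mkdir /dn
```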

2 (B) “hdfs-site.xml” file configuration -
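A minimal sketch for a Hadoop 1.x DataNode using “/dn”:

```xml
<!-- hdfs-site.xml on each DataNode (Hadoop 1.x property name) -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
```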

2 (C) “core-site.xml” file configuration -
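Here each DataNode must point at the NameNode’s address on port 9001 (the IP below is a placeholder, not the real address from my setup):

```xml
<!-- core-site.xml on each DataNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9001</value>  <!-- placeholder: use NN1's IP -->
  </property>
</configuration>
```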

2 (D) Stop Firewalld -

2 (E) Start DataNode -
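Stop firewalld the same way as on the NameNode, then start the daemon (sketch):

```
systemctl stop firewalld
hadoop-daemon.sh start datanode
jps   # should list "DataNode"
```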

<3>. Client Configuration (“Clt1”)-

3 (A) “hdfs-site.xml” configuration -
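Functionally, the client only needs to know where the NameNode listens; a minimal sketch of that property (placeholder IP; note that step <4> below refers to this setting living in the client’s “core-site.xml”):

```xml
<!-- client configuration: point at the NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9001</value>  <!-- placeholder: use NN1's IP -->
  </property>
</configuration>
```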

3 (B) Stop Firewalld -

3 (C) Check Whether the Client is Ready or Not -
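One quick check is to ask the NameNode for a cluster report; if the three DataNodes show up as live nodes, the client is ready (standard Hadoop 1.x command):

```
hadoop dfsadmin -report
```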

In my case, it is ready.

<4>. Find Who Uploads Data to the DataNode (“Client” or “NameNode”) & How Replication Works -

* The connection between the NameNode and the Client works on port 9001 because the NameNode listens on port 9001 & we used port “9001” in the Client’s “core-site.xml” file.

* Port “50010” is used to transfer data to a DataNode. Now we want to know whether the Client transfers data directly to the DataNode, or transfers it to the DataNode through the NameNode. The picture below makes clearer “what we want to know” -

* To perform this task, we will use three terminals on the Client node -

> “Client Terminal — 1” — to watch connections at the client on port 9001

> “Client Terminal — 2” — to watch connections at the client on port 50010

> “Client Terminal — 3” — to upload the file (the tcpdump commands for Terminals 1 and 2 are sketched below)
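A minimal sketch of these captures, assuming tcpdump is installed and the client’s network interface is “enp0s3” (replace with your interface name):

```
# Client Terminal - 1: watch control traffic between the client and the NameNode
tcpdump -n -i enp0s3 tcp port 9001

# Client Terminal - 2: watch data-transfer traffic between the client and the DataNodes
tcpdump -n -i enp0s3 tcp port 50010
```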

First we will check Case-2, and after that we will check Case-1.

4 > Case — 2

* In this case, we will watch connections on port “50010” at the NameNode, because in a Hadoop cluster data is transferred on port “50010” by default. If any packets pass through port 50010 at the NameNode, then we can say that the data is transferring through the NameNode.

* At the Client node, we have a file “dn.txt”.

4 > 2 (A). File content is -

Hey , How are you?

4 > 2 (B) Connection On Ports “9001” & “50010” -

* In “Client Terminal — 1” we will watch the client’s connections on port “9001”, from “Client Terminal — 3” we will upload the file, and on the NameNode we will watch connections on port “50010”.
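On the NameNode, the equivalent capture looks like this (same interface-name assumption as before):

```
# On NN1: if the NameNode relayed the file data, packets would appear here
tcpdump -n -i enp0s3 tcp port 50010
```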

4 > 2 (C) Now we are uploading the file -
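The upload itself is a standard HDFS shell command:

```
# Client Terminal - 3: upload dn.txt into the HDFS root directory
hadoop fs -put dn.txt /
```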

* When we upload the file “dn.txt”, no network packets pass through port “50010” at the NameNode, so we can say that in a Hadoop cluster the NameNode doesn’t upload the file to the DataNode.

4 > Case — 1

4 > 1 (A) Information About the File Which Will Be Uploaded by the Client -

* In this case, we will watch all connections at the Client node on ports “9001” & “50010”. For this, I have three terminals on the Client node.

* In “Client Terminal — 1” we will watch connections on port “9001”, in “Client Terminal — 2” we will watch connections on port “50010”, and from “Client Terminal — 3” we will upload the file “hello.txt”.

* Content of “hello.txt” file -

hello
what are you doing?

4 > 1 (B) Connections on Ports 9001 & 50010 -

* We use tcpdump’s “-X” flag to see the contents of the network packets.

* Now we run the commands in “Client Terminal — 1” & “Client Terminal — 2” -
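The same captures as before, now with “-X” added (a sketch; the interface name is again an assumption):

```
tcpdump -n -X -i enp0s3 tcp port 9001    # Client Terminal - 1
tcpdump -n -X -i enp0s3 tcp port 50010   # Client Terminal - 2
```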

4 > 1 (C) Upload File “hello.txt” -

* When we upload the file from “Client Terminal — 3”, we see that many network packets pass through ports “9001” & “50010”. We can see these packets in “Client Terminal — 1” & “Client Terminal — 2” respectively.

* When we look at “Client Terminal — 2”, we find that the Client connects to DataNode — “DN2” & transfers the data directly.

* Now we can say that the Client is the one who uploads data to the DataNode.

4 > 1 (D) Find How the Client Knows the IPs of the DataNodes -

* But here an issue arises: how does the Client node know “what the IPs of the DataNodes are”?

* To solve this issue, we look at the network packets on port “9001”, where the Client is connected to the NameNode. There we find that the Client node receives the IPs of the DataNodes from the NameNode.

* Now this issue is solved.

* Up to this point, we can draw a connection diagram -

4 > 1 (E) How the Replication Process Works -

* Another issue arises: when we look at the whole packet capture in the terminal, we find that the Client connects only to “DataNode — DN2”, yet the file is uploaded to all three DataNodes, because the default replication factor is 3. If the Client connects to only one DataNode, then “how is it possible that the file reaches all the remaining DataNodes?” (In other words: how does replication happen?) { Here the default block size is 64 MiB & the file is much smaller than 64 MiB, so only one block is created. }

* To solve this, we look again at the network packets in “Client Terminal — 2” for port “50010”, and we find that the Client sends the remaining DataNodes’ IPs to DataNode — “DN2”.

* Now we can suspect that DataNode “DN2” connects to another DataNode to upload the file.

* To verify this, we will upload another file, “gb.txt”, from “Client Terminal — 3”. At the same time, we will watch connections on port “50010” at all DataNodes and also in “Client Terminal — 2”.

* File “gb.txt” content -

Welcome to My Article

* We run the tcpdump command on DataNode — “DN1”, DataNode — “DN2”, DataNode — “DN3” & in “Client Terminal — 2”.

* Now we are uploading the “gb.txt” file -

* We can see in “Client Terminal — 2” that the Client again connects to “DN2” (it could equally have connected to a different DataNode).

* The Client also sends the remaining DataNodes’ IPs; we have already proved this. Now we want to know: if the Client does not connect to the other DataNodes, then who transfers the data to those two DataNodes?

* For this, when we look at the network packets on DataNode — “DN1”, we find that DataNode — “DN2” connects to DataNode — “DN1”.

* When we look at the network packets at DataNode — “DN1”, we find that DataNode — “DN2” also sends the IP of the remaining DataNode — “DN3”.

DataNode — “DN2” also sends the file data to DataNode — “DN1”.

* Up to this point, we can draw the connection diagram -

* Now, when we look at the network packets of DataNode — “DN3”, we find that DataNode — “DN1” connects to DataNode — “DN3”.

* Looking at more network packets, we find that DataNode — “DN1” also sends the file data to DataNode — “DN3”.

* Now we draw the connection diagram again -

* Replication between the DataNodes works according to the above connection diagram.

In this case the file occupies a single block, because our file is smaller than 64 MiB & we didn’t change the default block size. So, with the help of the above connection diagram, we can say that to store one block on the DataNodes, the Client connects to only one DataNode. That DataNode then creates the replicas of the block on the other DataNodes.

If the Client connects to another DataNode, then we can say that it is definitely uploading a new block, because the Client is the one who uploads each block directly to a DataNode.

<5>. Check Whether “Hadoop Uses the Concept of Parallelism to Upload the Split Data” Is Right or Not -

* According to the concept of parallelism, when we upload a file, the different blocks of the file should be uploaded in parallel to different DataNodes.

5 (A) Information About the File Which We Want to Upload -

* To check whether this concept is right or not, we will upload the file “hello.txt”. The size of this file is around 34 KiB. The default block size is 64 MiB; we will change it to 5120 bytes. The default replication factor is 3; we will not change it.

* We will put the “50010” connection data in a “Clt.txt” file at the Client node -

* "Clt1"  Node  ----   "Clt.txt"

5 (B) Upload File “hello.txt” -
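One way to upload with the smaller block size is the “-D” generic option of the HDFS shell (setting “dfs.block.size” in the client’s “hdfs-site.xml” would also work; 5120 is a valid value because it is a multiple of the 512-byte checksum chunk):

```
# Upload hello.txt with 5120-byte blocks instead of the 64 MiB default
hadoop fs -D dfs.block.size=5120 -put hello.txt /
```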

* Our client’s network packets on port “50010” are in “Clt.txt”. You can also see this file from the link below -

Click here to see the network packets for the “hello.txt” file; the “Clt.txt” file contains the network packets for this upload.

* We can see that 7 blocks are created for the “hello.txt” file (≈34 KiB / 5120 bytes per block rounds up to 7 blocks) -

5 (C) Prove That When the Client Uploads a New Block, the Value of “seq” Starts Again from “1” -

Now we want to show that whenever the Client uploads a new block, the values of “seq” & “ack” start again from “1”, as you can see in the file -

* To prove this, we will upload a file “seq.txt”; the size of this file is “874 bytes”, and we will store the network packets in “Cltblk.txt”. We will change the block size to 512 bytes, so only two blocks of the file “seq.txt” will be created.

* Now we are uploading the file “seq.txt” -
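Same pattern as before, now with 512-byte blocks (874 bytes / 512 bytes per block rounds up to 2 blocks):

```
# Upload seq.txt with 512-byte blocks -> exactly two blocks
hadoop fs -D dfs.block.size=512 -put seq.txt /
```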

You can see the network packets for the “seq.txt” file -

* In this file, we can see that the Client connects to only two DataNodes, and we also have two blocks. We also know that whenever the Client connects to a new DataNode on port 50010, it is uploading a new block.

* Now we can say that the Client uploads the two different file blocks to these two DataNodes, in line with the result of our previous “step — 4” -

* Now we can see that when the Client uploads a new file block, the value of “seq” starts from “1” -

* We know the file “seq.txt” has two blocks, and the Client has now uploaded both of them. We have now proved that whenever the Client uploads a new block, the values of “seq” & “ack” start from “1”: after a block is uploaded, the Client closes the connection with that system, and when it uploads the next block it establishes a new connection.

5 (D) Create a Table With the Upload Time of Each Block -

* Now we will look at the content of the “Clt.txt” file.

* To find out whether the Client uploads blocks in parallel or not, we will create a table in which we record the time at which the Client starts uploading each new block, along with the DataNode name, in the format below -

* When we look at the “Clt.txt” file, we find “seq 1” & “ack 1” for the first time -

* Next -

* Next -

* Next -

* Now we can’t find any more blocks being uploaded by the Client. The Client has uploaded all “7 blocks”.

5 (E) Conclusion -

With the help of the upload times & the “Clt.txt” file content, we can say that the Client does not start uploading the next block until the previous block is completely uploaded.

* With this setup, Hadoop doesn’t use the concept of parallelism (meaning Hadoop doesn’t upload blocks at the same time); it stores them serially, one block after another.

* So, at least with this setup, the Velocity problem in Big Data is not solved.
