
Sunday, September 14, 2014

Single disk issue in Hadoop Cluster.

Hi folks, I recently performed a simple test on our Hadoop cluster. We run a fairly large cluster with a lot of data; each datanode has around 24 TB of HDD (12 x 2 TB disks). Let me describe the general issue we faced and how we resolved it.

ISSUE:- One or two disks on a node would fill up to around 90% while the other disks on the same node stayed between 50-60%, which kept triggering alerts. It became a real pain for us because it happened frequently due to the large size of the cluster.
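For reference, this is roughly how the imbalance shows up. The /disk* mount points below simply reflect how our datanodes are laid out (as seen in the find commands later on); adjust for your own layout.

# per-disk usage on the datanode itself
df -h /disk*

# overall DFS usage per datanode, as reported by the namenode
hdfs dfsadmin -report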

Resolution:- We ran a few tests and finally found a way to cope with this situation. Here is what we did, step by step.

1. I created a text file of about 142 MB with 10,000,000 records and copied it into HDFS.

[hdfs@ricks-01 13:21:01 ~]$ hadoop fs -cat /user/hdfs/file | wc -l
10000000
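(If you want to reproduce this, something like the following will produce a comparable test file. This is just one way to generate it, not necessarily how our file was built.)

seq 1 10000000 > /tmp/file        # 10,000,000 records, one per line
hadoop fs -put /tmp/file /user/hdfs/file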

2. Set its replication factor to 1 so that only one replica of each block exists on the cluster.

[hdfs@ricks-01 12:13:07 ~]$ hadoop fs -setrep 1 /user/hdfs/file
Replication 1 set: /user/hdfs/file
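You can double-check the new replication factor with -stat, where %r prints the replication of the file:

hadoop fs -stat %r /user/hdfs/file
1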

3. Now run fsck to check on which datanode its blocks are located.

[hdfs@ricks-01 12:14:32 ~]$ hadoop fsck /user/hdfs/file -files -blocks -locations
/user/hdfs/file 148888897 bytes, 2 block(s):  OK
0. BP-89257919-1406754396842:blk_1073745304_4480 len=134217728 repl=1 [17.170.204.86:1004]
1. BP-89257919-1406754396842:blk_1073745305_4481 len=14671169 repl=1 [17.170.204.86:1004]

4. Check which host that IP belongs to; in our case the block is on the ricks-04 node.

[hdfs@ricks-01 12:14:38 ~]$ host 17.170.204.86
86.204.170.17.in-addr.arpa domain name pointer ricks-04.

5. Now log in to that node, find the block, and move it to another disk at the same path.

[hdfs@ricks-04 12:15:55 ~]$ find /ngs*/app/hdfs/hadoop/ -name blk_1073745305
/disk2/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305

[hdfs@ricks-04 12:16:03 ~]$ mv /disk2/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305*  /disk5/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/
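As a safety net, it is worth copying the block and its .meta file somewhere else before moving them (the /var/tmp/blk_backup directory here is just an example location):

mkdir -p /var/tmp/blk_backup
cp -p /disk2/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305* /var/tmp/blk_backup/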

6. Search for the block again to confirm that no other copy is present and that it has been moved to the new location, which is disk5.

[hdfs@ricks-04 12:17:48 ~]$ find /disk*/hdfs/ -name blk_1073745305*
/disk5/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305
/disk5/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305_4481.meta

7. Now run fsck again so that the block's new location gets registered with the namenode. Note that fsck only reports the datanode address, not the disk, so the output looks the same as before; the important thing is that the file is still reported as healthy.

[hdfs@ricks-01 13:02:21 ~]$ hadoop fsck /user/hdfs/file -files -blocks -locations
/user/hdfs/file 148888897 bytes, 2 block(s):  OK
0. BP-89257919-1406754396842:blk_1073745304_4480 len=134217728 repl=1 [17.170.204.86:1004]
1. BP-89257919-1406754396842:blk_1073745305_4481 len=14671169 repl=1 [17.170.204.86:1004]

8. Run the HDFS command again and check the record count of the file. If it returns the right count, the block is registered with the namenode; if it does not, wait for some time and try again.

[hdfs@ricks-01 13:21:01 ~]$ hadoop fs -cat /user/hdfs/file | wc -l
10000000


Points to remember
  1. Be extra careful while moving a block from one place to another; you may want to take a backup before moving it (see the copy sketched in step 5).
  2. Make sure that no jobs are running on that node at the time; you can stop the TaskTracker before doing this.
  3. You can restart the datanode service after performing this exercise (see the sketch below).
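For point 3, the exact command depends on how the daemons are managed on your cluster. On a stock Apache Hadoop 2.x install it would look something like this, run on the datanode itself; packaged distributions usually ship a service script instead:

# stop and start the datanode daemon so it re-scans its block directories
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode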