Hi folks, recently I performed a simple test on our Hadoop cluster. We have a pretty large cluster with a lot of data; each datanode has around 24 TB of HDD (12 x 2 TB disks). Let me describe the general issue we faced and how we resolved it.
ISSUE: One or two disks on a node get full, around 90%, while the other disks on the same node sit between 50-60%, so we kept getting continuous alerts. It became a real pain because it happened frequently, given the size of the cluster.
Resolution: We ran some tests and finally managed to cope with the situation. Here is what we did and how we resolved it.
1. I created a text file of 142 MB with 10,000,000 records and copied it into HDFS (a sketch of generating such a file appears right after this list).
2. Set its replication factor to 1 so that only one replica is present on the cluster.
3. Run fsck to check the location of its blocks on the datanodes.
4. Check the host where the block lives; it turned out to be the ricks-04 node.
5. Log in to that node, find the block, and mv it to another disk at the same path.
6. Search for the block again to confirm that no other replica exists and that it is now in the new location, which is disk5.
7. Run fsck again to verify the block; the datanode's block report is what registers the new location with the namenode.
8. Re-run the count on the file over HDFS. If it gives the right count, the new location has been registered with the namenode; if not, wait for some time and try again.
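For reference, step 1 can be reproduced with something like the following (a hypothetical sketch; the exact record format of our file isn't shown, so the generated size will differ from the 142 MB used here):

awk 'BEGIN { for (i = 1; i <= 10000000; i++) print i, "sample-record" }' > file
hadoop fs -put file /user/hdfs/file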
Here is the complete session from the test:
[hdfs@ricks-01 13:21:01 ~]$ hadoop fs -cat /user/hdfs/file | wc -l
10000000
[hdfs@ricks-01 12:13:07 ~]$ hadoop fs -setrep 1 /user/hdfs/file
Replication 1 set: /user/hdfs/file
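The new replication factor can also be confirmed from a listing; the second column of hadoop fs -ls shows it:

hadoop fs -ls /user/hdfs/file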
[hdfs@ricks-01 12:14:32 ~]$ hadoop fsck /user/hdfs/file -files -blocks -locations
/user/hdfs/file 148888897 bytes, 2 block(s): OK
0. BP-89257919-1406754396842:blk_1073745304_4480 len=134217728 repl=1 [17.170.204.86:1004]
1. BP-89257919-1406754396842:blk_1073745305_4481 len=14671169 repl=1 [17.170.204.86:1004]
[hdfs@ricks-01 12:14:38 ~]$ host 17.170.204.86
86.204.170.17.in-addr.arpa domain name pointer ricks-04.
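Before moving anything on ricks-04, it is worth checking per-disk usage to pick a destination; in our case disk2 was the full one and disk5 had headroom. A quick check, assuming each /diskN is its own mount point as in our layout:

df -h /disk*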
[hdfs@ricks-04 12:15:55 ~]$ find /ngs*/app/hdfs/hadoop/ -name blk_1073745305
/disk2/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305
[hdfs@ricks-04 12:16:03 ~]$ mv /disk2/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305* /disk5/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/
[hdfs@ricks-04 12:17:48 ~]$ find /disk*/hdfs/ -name blk_1073745305*
/disk5/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305
/disk5/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305_4481.meta
[hdfs@ricks-01 13:02:21 ~]$ hadoop fsck /user/hdfs/file -files -blocks -locations
/user/hdfs/file 148888897 bytes, 2 block(s): OK
0. BP-89257919-1406754396842:blk_1073745304_4480 len=134217728 repl=1 [17.170.204.86:1004]
1. BP-89257919-1406754396842:blk_1073745305_4481 len=14671169 repl=1 [17.170.204.86:1004]
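fsck still reports the block from the same node because fsck only reads the namenode's metadata; the datanode's own directory scanner and block report are what pick up the new on-disk path. If the final count check below lags, restarting the datanode forces a fresh block report (a sketch assuming a stock Apache Hadoop 2.x layout; managed clusters would use their own restart tooling):

$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode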
[hdfs@ricks-01 13:21:01 ~]$ hadoop fs -cat /user/hdfs/file | wc -l
10000000
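Once satisfied, remember to set the replication factor of the test file back (assuming the usual cluster default of 3):

hadoop fs -setrep 3 /user/hdfs/file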
Points to remember
- Be extra careful while moving a block from one place to another; you may want to take a backup before moving it (see the copy sketch after this list).
- Make sure no jobs are running on the node at that point in time; you can stop the TaskTracker there before doing this.
- You can restart the datanode service after performing this procedure.
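For the backup mentioned above, copying the block file and its .meta checksum file somewhere safe before the mv is cheap insurance (the backup directory here is arbitrary; the block paths are the ones from this test):

mkdir -p /var/tmp/blk-backup
cp -p /disk2/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305* /var/tmp/blk-backup/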