hadoop restart copy local files to dfs

December 5th, 2008

 

Copying local files to DFS is a really time-consuming step, and what's worse, the datanodes tend to die along the way. I couldn't figure out the reason, although there is this exception in the datanode.*.out file:

Exception in thread "DataNode: [/*/dfs/data]" java.lang.OutOfMemoryError: Java heap space

So once all the datanodes are dead, I have to restart DFS and continue to `fs -put` the files where I want them. Fortunately, it seems Hadoop can figure out by itself where to pick up and continue.

Update:

So I decided to increase the heap size anyway, bumping it from the default 1000m to 1500m. And then it just works!!!

How to change the heap size:

In conf/hadoop-env.sh:

  • change the value of HADOOP_HEAPSIZE (a number of megabytes, e.g. 1500 rather than 1500m); this increases the heap size for every daemon started through the Hadoop scripts (DataNode, TaskTracker, ...)
  • or add -Xmx1500m to the value of HADOOP_DATANODE_OPTS, so that it only affects the DataNode
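
One way to see what an -Xmx value does is to ask a JVM for its maximum heap. The following is only a minimal sketch and not part of Hadoop: the class name HeapCheck and the helper maxHeapMb are made up for illustration. It reports the heap of the JVM it runs in, so it has to be launched with the same flag (for example `java -Xmx1500m HeapCheck`) to mirror what the DataNode would get.

```java
// Hypothetical helper, not part of Hadoop: reports the max heap of the current JVM.
class HeapCheck {
    /** Returns the maximum heap this JVM will try to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

The reported number may come out a little below the -Xmx value, depending on the garbage collector in use.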

 

Also, how to restart a dead DataNode:

  1. manually find the DataNode process on that node and kill it (I also copied the exact process command line into my own start-datanode.sh)
  2. then run start-dfs.sh on the master (or run start-datanode.sh on the datanode itself)
Tags: java heap hadoop datanode Posted in Hadoop

hadoop write file

December 5th, 2008

 

                        FSDataOutputStream out = fs.create(outFile);

In Hadoop, I use the line above to open a file for writing. It's quite easy, and I don't have to worry about creating the parent directory if it isn't there.

Previously, when I wrote output files from multiple threads with plain java.io, I had to do:

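                        File parent = output.getParentFile();
                        // getParentFile() is null when output has no parent component
                        if (parent != null && !parent.exists()){
                            try {
                                // mkdirs() returns false if another thread created the
                                // directory first; the result is deliberately ignored
                                parent.mkdirs();
                            } catch (Exception e) {
                                // multiple threads might compete for this mkdirs()
                            }
                        }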
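
For comparison, on Java 7 and later the same guard can be written without the check-then-create race. This is a minimal sketch, assuming java.nio.file is available; the class and method names (ParentDirs, ensureParent) are hypothetical and not from my original code. Files.createDirectories does not fail when the directory already exists, so several threads can call it for the same parent.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
class ParentDirs {
    /**
     * Makes sure the parent directory of the given output path exists.
     * createDirectories() succeeds if the directory is already there; it
     * throws only if the path exists but is not a directory, or on other
     * I/O errors, so concurrent callers do not need their own try/catch.
     */
    static void ensureParent(Path output) throws IOException {
        Path parent = output.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
    }
}
```

Each writer thread would call ParentDirs.ensureParent(path) right before opening its output file.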

 

Tags: hadoop java Posted in Java Hadoop

Using micolog

December 4th, 2008

I had already decided to use Byteflow as my blogging tool the other day, and I started to deploy Django to Bluehost, but things got interrupted. Today I happened to notice Micolog, which is built on top of GAE; the developer also shares my family name. So in less than an hour, I am writing my blog here.

Tags: blog Posted in Python
