hadoop restart copy local files to dfs
This is a really time-consuming step, and what's worse, the datanodes tend to die along the way. I couldn't pin down the reason, although the datanode.*.out file contains this exception:
Exception in thread "DataNode: [/*/dfs/data]" java.lang.OutOfMemoryError: Java heap space
So once all the datanodes are dead, I have to restart dfs and continue to `fs -put` the files where I want them. Fortunately, Hadoop seems able to figure out by itself where to pick up and continue.
Update:
So I decided to increase the heap size anyway, bumping it from the default 1000m to 1500m. And then it just works!!!
How to change the heap size:
In conf/hadoop-env.sh:
- change the value of HADOOP_HEAPSIZE (given in MB); this increases the heap size for all Hadoop daemons (DataNode, TaskTracker, and so on)
- or add -Xmx1500m to the value of HADOOP_DATANODE_OPTS; then it only affects the DataNode
Also, how to restart a dead DataNode:
- manually find the DataNode process on that node and kill it (I also copied its exact command line into my own start-datanode.sh)
- then run start-dfs.sh on the master (or run start-datanode.sh on the datanode itself)
hadoop write file
FSDataOutputStream out = fs.create(outFile);
In Hadoop, I use the line above to open a file for writing. Quite easy, and I don't have to worry about creating the parent directory if it isn't there.
Previously, when I wrote output files from multiple threads with plain java.io.File, I needed to do:
File parent = output.getParentFile();
if (!parent.exists()) {
    try {
        parent.mkdirs();
    } catch (Exception e) {
        // multiple threads might compete for this mkdirs()
    }
}
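For comparison, here is a minimal sketch of how the same "create the parent directory, then open the file" step could be done with java.nio.file instead of File.mkdirs(). This is not what I used at the time; the class name ParentDirs and the helper openForWrite are made up for illustration. Files.createDirectories does not fail when the directory already exists, so several threads can call it for the same parent without the exists() check or the empty catch.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
class ParentDirs {

    // Creates any missing parent directories, then opens `output` for writing.
    // createDirectories is a no-op if the directory is already there, so
    // concurrent callers targeting the same parent do not need extra locking.
    static OutputStream openForWrite(Path output) throws IOException {
        Path parent = output.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.newOutputStream(output);
    }

    public static void main(String[] args) throws Exception {
        // Use a temp directory so the demo does not touch real data.
        Path base = Files.createTempDirectory("parent-dirs-demo");
        Path shared = base.resolve("a").resolve("b");

        // Four threads write into the same not-yet-existing directory.
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            final Path target = shared.resolve("part-" + i + ".txt");
            workers[i] = new Thread(() -> {
                try (OutputStream out = openForWrite(target)) {
                    out.write("hello".getBytes(StandardCharsets.UTF_8));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("wrote files under " + shared);
    }
}
```

This only covers the local file system; on HDFS the fs.create(outFile) call above already takes care of the parent directories.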
Using micolog
I had already decided to use byteflow as my blogging tool the other day, and I started to deploy django to bluehost, but things got interrupted. Today I happened to notice micolog, which is built on top of GAE; also, the developer shares my family name. So in less than an hour, I am writing my blog here.