I am working on an OS service that needs to watch an HDFS folder for new files. When a new file appears, it gets moved to a staging area for further processing.
Anyway, I just need to figure out a good way to pick up files as they show up. One big requirement is that the file must have finished copying to the HDFS mount point before I can move it. So I am thinking something along these lines:
1.) Look for new files (every 5000 ms).
2.) If a new file is found, go to step 3; else go back to step 1.
3.) Check the file size every 250 ms.
4.) If the file size after 250 ms equals the file size before the 250 ms, assume the copy is complete; else go back to step 3.
5.) Move the file to the staging area for processing.
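To make the size-check steps above concrete, here is a rough sketch of the "size has stopped changing" test. It is only an illustration: the class name CopyCompletionCheck, the method waitUntilStable, and the maxPolls cutoff are made up for this example, and the size reading is passed in as a LongSupplier so the logic does not depend on Hadoop. In the real service the supplier would presumably wrap something like FileSystem.getFileStatus(path).getLen().

```java
import java.util.function.LongSupplier;

public class CopyCompletionCheck {

    /**
     * Polls sizeProbe every intervalMs and returns true as soon as two
     * consecutive readings are equal. Returns false if the size is still
     * changing after maxPolls comparisons, or if the thread is interrupted.
     *
     * sizeProbe is a stand-in (assumption) for a real size lookup, e.g. a
     * lambda around FileSystem.getFileStatus(path).getLen().
     */
    public static boolean waitUntilStable(LongSupplier sizeProbe, long intervalMs, int maxPolls) {
        long previous = sizeProbe.getAsLong();
        for (int i = 0; i < maxPolls; i++) {
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up on this file for now.
                Thread.currentThread().interrupt();
                return false;
            }
            long current = sizeProbe.getAsLong();
            if (current == previous) {
                return true; // size unchanged over one interval: assume copy is done
            }
            previous = current;
        }
        return false; // still growing after maxPolls checks
    }
}
```

One limitation of this approach worth keeping in mind: a copy that stalls for longer than one interval looks identical to a finished copy, so a single matching pair of readings is a heuristic, not a guarantee.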
Of course, last night I was considering how long the 'checkFileSize()' method itself will take to run, and subtracting that from the 250 ms to make up the difference. I plan on making my timers configurable, either through an argument to the main method or through a YAML conf file external to the compiled jar.
Anyway, thoughts would be appreciated. Here is the main class I will be using: FileSystem
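For the timing idea, a minimal sketch of what I have in mind is below. The names PollTiming, remainingSleepMs, and intervalFromArgs are hypothetical, and the argument layout (intervals passed as plain millisecond values at fixed positions) is just an assumption for illustration; reading the same values from a YAML file would replace intervalFromArgs.

```java
public class PollTiming {

    /**
     * How long to sleep so that one check plus the sleep adds up to intervalMs.
     * If the check itself took longer than the interval, do not sleep at all.
     */
    public static long remainingSleepMs(long intervalMs, long elapsedMs) {
        return Math.max(0L, intervalMs - elapsedMs);
    }

    /**
     * Reads a positive millisecond interval from args[index], falling back to
     * defaultMs when the argument is missing, not a number, or not positive.
     */
    public static long intervalFromArgs(String[] args, int index, long defaultMs) {
        if (args == null || index < 0 || index >= args.length) {
            return defaultMs;
        }
        try {
            long value = Long.parseLong(args[index].trim());
            return value > 0 ? value : defaultMs;
        } catch (NumberFormatException e) {
            return defaultMs;
        }
    }
}
```

The elapsed time could be measured with System.nanoTime() around the size check. Alternatively, ScheduledExecutorService.scheduleAtFixedRate starts each run at a fixed period regardless of how long the previous run took (as long as the run is shorter than the period), which might make the manual subtraction unnecessary.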
http://archive.cloud...FileSystem.html