Hadoop Recovery


#1 Champion of Cyrodiil


Posted 05 November 2012 - 01:10 PM


Had a corrupted NameNode in a master node VM this morning at the office...

Apparently the 'edits' file could not be read at offset 126... However, when I looked at the edits file it was only about 4 bytes long (0x4), so an offset of 126 didn't even exist in it.

My SecondaryNameNode had the exact same edits file, so I could not use it for a recovery... At one point I think it also complained during init that it got version 3 when the NameNode was expecting version 4.

I decided to just roll the VM back to a snapshot it had (using ESXi 4.1)... but later read that I could have created a new NameNode and imported the previous.checkpoint.
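For the record, the checkpoint-import route, as I understand it from the docs (I have not actually run this; the directory paths below are just placeholders for whatever dfs.name.dir and fs.checkpoint.dir point at in your config), looks roughly like this:

    # stop HDFS before touching the metadata directories
    bin/stop-dfs.sh

    # move the damaged NameNode metadata (dfs.name.dir) out of the way and
    # recreate an empty directory for the import to populate
    mv /data/dfs/name /data/dfs/name.corrupt
    mkdir /data/dfs/name

    # start the NameNode from the SecondaryNameNode's latest checkpoint,
    # which it reads from fs.checkpoint.dir
    bin/hadoop namenode -importCheckpoint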

Anyone have any experience recovering Hadoop from a corrupted edits file? Or recovering any other part of Hadoop? I'm trying to get a better understanding of the SecondaryNameNode's role within the cluster, as well as the purpose behind the checkpoints and how often they are (or should be) created.
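From what I've dug up so far on the checkpoint question: the SecondaryNameNode periodically merges the edits log into fsimage, and the schedule is driven by two properties. The defaults below are what I recall for 0.20.x, so double-check them against your core-site.xml / core-default.xml:

    <!-- build a checkpoint at least this often (seconds) -->
    <property>
      <name>fs.checkpoint.period</name>
      <value>3600</value>
    </property>

    <!-- ...or sooner, once the edits log grows past this many bytes (64 MB) -->
    <property>
      <name>fs.checkpoint.size</name>
      <value>67108864</value>
    </property>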

#2 Champion of Cyrodiil


Posted 07 November 2012 - 02:17 PM

After speaking with one of the Apache developers for a while, I think we narrowed down the issue. Some of the developers were using hadoop-core-1.0.2.jar in the client build path, while the version of Hadoop supported in my environment (because it's what the customer environment supports) is Hadoop 0.20.2. Using the newer client would be okay for reading data and, in theory, for writing simple mutations, but for bulk imports and MapReduce jobs you want hadoop-core-0.20.2.jar in your client.

The developers are actually using Accumulo/Cloudbase, which sits on ZooKeeper, which sits on the Hadoop stack... The Cloudbase client jars depend on hadoop-core, and judging by the package names they most definitely make use of the bulk import and MapReduce capability.

So: use the client library version that matches the server you're running, even if the newer libraries claim to be backwards compatible.
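If the client builds happen to use Maven, pinning the dependency to the cluster's version is a cheap way to enforce that; the coordinates below are what I remember for the 0.20.2 release, so verify them against your repository:

    <!-- match the client-side Hadoop jar to the version the cluster runs -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.2</version>
    </dependency>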

#3 Champion of Cyrodiil


Posted 07 November 2012 - 02:19 PM

Another error that popped up from the same problem was:

WARN: Incorrect header or version mismatch from <ClientHost>:<ephemeral port> got version 3 expected version 4
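That looks like the Hadoop IPC layer rejecting a client that speaks a different RPC protocol version. A quick sanity check is to run the same command on the client box and on a cluster node and compare what it reports:

    # run on both the client machine and a cluster node; the reported
    # versions should match (e.g. 0.20.2 on both sides)
    bin/hadoop version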

#4 Champion of Cyrodiil


Posted 07 November 2012 - 02:20 PM

Also, the cluster was pseudo-distributed on a single VM, so it had only one DataNode... there weren't even enough slaves to make the cluster properly fault tolerant. You should have a minimum of 3 DataNodes in a cluster.
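Related to that: the default block replication factor is 3 (dfs.replication), so with a single DataNode every block sits under-replicated. In a one-node sandbox the usual workaround is to drop it to 1 in hdfs-site.xml; a real cluster keeps it at 3 and has at least three DataNodes to back it:

    <!-- hdfs-site.xml: single-node sandbox only; production stays at 3 -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>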

#5 K_N


Posted 07 November 2012 - 11:24 PM

From my experience, mysterious errors that come without you changing anything are almost always PEBKAC on the part of another user.

Rumors of my demise have been greatly exaggerated.


#6 Champion of Cyrodiil


Posted 08 November 2012 - 03:32 PM

Yea... I'm supporting a team of about 15 Java/Ozone developers that haven't used Hadoop... I'm also new to it, so it's understandable... this time.




