EMR HDFS data restore


Hello Experts,

Technically speaking, the EBS volumes attached to the EMR core nodes are persistent storage, and I specifically created them so that they are not deleted on cluster termination. I then attached the same volumes to a new EMR cluster and mounted them again.

After restarting the DataNode service, I got an exception stating that the "clusterID" does not match. To resolve this, I edited the VERSION file so that it matches the one on the master node. However, I am still unable to see the data that was stored on the HDFS volume. I'm not a Hadoop expert, but I think there is a gap somewhere.

My question is: why can't I reuse the EBS volumes in a new EMR cluster?

Thanks in advance.

Scott M
asked 2 months ago, 447 views
1 Answer
Accepted Answer

Hi,

Basically, the data is stored as HDFS blocks on these disks, and only the NameNode knows which files those blocks belong to (it stores the block details as metadata).
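To make the point above concrete, here is a minimal toy model in Python. It is not Hadoop code: `ToyNameNode`, its methods, and all identifiers are hypothetical, and the real block-report handling is considerably more involved. It only illustrates that the file-to-block mapping lives in the NameNode, so a fresh NameNode has nothing to match the old blocks against.

```python
# Toy model only: ToyNameNode and its methods are hypothetical, not Hadoop APIs.

class ToyNameNode:
    """Holds the only copy of the file path -> block ID mapping."""

    def __init__(self, block_pool_id):
        self.block_pool_id = block_pool_id
        self.namespace = {}  # file path -> ordered list of block IDs

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def process_block_report(self, block_pool_id, reported_blocks):
        """Split a DataNode's report into (recognized, unrecognized) block IDs."""
        reported = set(reported_blocks)
        # Blocks from a different block pool belong to another namespace.
        if block_pool_id != self.block_pool_id:
            return set(), reported
        known = {b for blocks in self.namespace.values() for b in blocks}
        return reported & known, reported - known


# Made-up identifiers for illustration.
old_nn = ToyNameNode("BP-old")
old_nn.add_file("/data/events.parquet", ["blk_1001", "blk_1002"])
new_nn = ToyNameNode("BP-new")  # fresh cluster, empty namespace

disk_blocks = ["blk_1001", "blk_1002"]  # what survives on the EBS volume

# The old NameNode recognizes both blocks; the new one recognizes none,
# whether or not the pool ID is forced to match.
print(old_nn.process_block_report("BP-old", disk_blocks))
print(new_nn.process_block_report("BP-old", disk_blocks))
print(new_nn.process_block_report("BP-new", disk_blocks))
```

In this sketch the block bytes are intact on disk in all three cases; what is missing in the new cluster is the metadata that says which file each block belongs to.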

Hypothetically, even if you were able to re-attach the EBS volumes to another cluster, the NameNode of the new cluster would be unaware of these HDFS blocks. Please note that each HDFS data block is tied to a specific Hadoop cluster and has its own unique block ID, and the block locations on disk differ from cluster to cluster. (HDFS blocks != OS blocks)
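As a rough illustration of how storage is tied to one cluster: to my knowledge, HDFS storage directories carry VERSION files in a simple key=value format, with identifiers such as clusterID, namespaceID and blockpoolID. The Python sketch below parses two such texts and lists the identifiers that still differ. The helper names and every sample value are made up, and the "old volume" text merges keys that normally sit in separate VERSION files, purely for brevity.

```python
def parse_version_file(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def mismatched_ids(reference, candidate,
                   keys=("clusterID", "namespaceID", "blockpoolID")):
    """Return {key: (reference_value, candidate_value)} for differing keys."""
    return {
        k: (reference[k], candidate[k])
        for k in keys
        if k in reference and k in candidate and reference[k] != candidate[k]
    }


# Hypothetical sample values; other keys are omitted.
NEW_NAMENODE_VERSION = """
namespaceID=222222222
clusterID=CID-new-cluster
blockpoolID=BP-222-10.0.0.2-1700000000000
storageType=NAME_NODE
"""

# clusterID was already hand-edited to match the new cluster.
OLD_VOLUME_VERSION = """
namespaceID=111111111
clusterID=CID-new-cluster
blockpoolID=BP-111-10.0.0.1-1600000000000
storageType=DATA_NODE
"""

diff = mismatched_ids(parse_version_file(NEW_NAMENODE_VERSION),
                      parse_version_file(OLD_VOLUME_VERSION))
for key, (new_value, old_value) in sorted(diff.items()):
    print(f"{key}: new cluster has {new_value}, old volume has {old_value}")
```

Under these assumptions, matching clusterID by hand removes the startup exception but leaves the other identifiers pointing at the old cluster, and aligning those as well still would not give the new NameNode the file metadata.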

Because of this framework limitation, it is not possible to switch EBS volumes between clusters: EBS is not a persistent store in EMR, meaning the volumes are deleted once the EMR cluster is terminated. Even though additional EBS volumes can be attached to a node in an EMR cluster, it is not possible to reuse them with another cluster. Please refer to this document for details. I hope this answers your question.

AWS
SUPPORT ENGINEER
answered 2 months ago
EXPERT
Leo K
reviewed 2 months ago
EXPERT
reviewed 2 months ago