
Upgrade Amazon EMR Hive Metastore from 5.X to 6.X


If you are currently running Amazon EMR 5.X clusters, consider moving to Amazon EMR 6.X, which includes new features that help you improve performance and optimize on cost. For instance, Apache Hive is two times faster with LLAP on Amazon EMR 6.X, and Spark 3 reduces costs by 40%. Additionally, Amazon EMR 6.x releases include Trino, a fast distributed SQL engine, and Iceberg, a high-performance open data format for petabyte-scale tables.

To upgrade Amazon EMR clusters from the 5.X to the 6.X release, a Hive Metastore upgrade is the first step before applications such as Hive and Spark can be migrated. This post provides guidance on how to upgrade the Amazon EMR Hive Metastore from 5.X to 6.X, as well as how to migrate the Hive Metastore to the AWS Glue Data Catalog. Because the Hive 3 Metastore is compatible with Hive 2 applications, you can continue to use Amazon EMR 5.X with the upgraded Hive Metastore.

Solution overview

In the following sections, we provide steps to upgrade the Hive Metastore schema using MySQL as the backend. For any other backend (such as MariaDB, Oracle, or SQL Server), update the commands accordingly.

There are two options to upgrade the Amazon EMR Hive Metastore:

  • Upgrade the Hive Metastore schema from 2.X to 3.X by using the Hive Schema Tool
  • Migrate the Hive Metastore to the AWS Glue Data Catalog

We walk through the steps for both options.

Pre-upgrade prerequisites

Before upgrading the Hive Metastore, you must complete the following prerequisite steps:

  1. Verify that the Hive Metastore database is running and accessible.
    You should be able to run Hive DDL and DML queries successfully. Any errors or issues must be fixed before proceeding with the upgrade process. Use the following sample queries to test the database:
    create table testtable1 (id int, name string);
    insert into testtable1 values (1001, "user1");
    select * from testtable1;

  2. To get the Metastore schema version in the current EMR 5.X cluster, run the following command on the primary node:
    sudo hive --service schematool -dbType mysql -info

    The following code shows our sample output:

    $ sudo hive --service schematool -dbType mysql -info
    Metastore connection URL: jdbc:mysql://xxxxxxx.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true
    Metastore Connection Driver : org.mariadb.jdbc.Driver
    Metastore connection User: admin
    Hive distribution version: 2.3.0
    Metastore schema version:  2.3.0

  3. Stop the Metastore service and restrict access to the Metastore MySQL database.
    It's critical that no one else accesses or modifies the contents of the Metastore database while you're performing the schema upgrade. To stop the Metastore, use the following commands:
    $ sudo stop hive-server2
    $ sudo stop hive-hcatalog-server

    For Amazon EMR release 5.30 and 6.0 onwards (Amazon Linux 2 is the operating system for the Amazon EMR 5.30+ and 6.x release series), use the following commands:

    $ sudo systemctl stop hive-server2.service
    $ sudo systemctl stop hive-hcatalog-server.service

    You can also note the total number of databases and tables present in the Hive Metastore before the upgrade, and then verify those numbers after the upgrade.

  4. To get the total number of tables and databases before the upgrade, run the following commands after connecting to the external Metastore database (assuming the Hive metadata DB name is hive); a scripted version of these checks appears after this list:
    $mysql -u <username> -h <mysqlhost> --password;
    $use hive;
    $select count(*) from DBS;
    $select count(*) from TBLS;

  5. Take a backup or snapshot of the Hive database.
    This allows you to revert any changes made during the upgrade process if something goes wrong. If you're using Amazon Relational Database Service (Amazon RDS), refer to Backing up and restoring an Amazon RDS DB instance for instructions.
  6. Take note of the Hive table storage location if data is stored in HDFS.

If all the table data is on Amazon Simple Storage Service (Amazon S3), then no action is required. If HDFS is used as the storage layer for Hive databases and tables, then take a note of them. You'll need to copy the files on HDFS to a similar path on the new cluster, and then verify or update the location attribute for databases and tables on the new cluster accordingly.
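As a convenience, the following shell sketch bundles the backup-related prerequisites: it captures the pre-upgrade database and table counts to a file for later comparison, then takes an RDS snapshot before any schema change. The host, user, and RDS instance identifier are placeholder values, and the snapshot command assumes the Metastore runs on Amazon RDS.

# Placeholder values -- replace with your Metastore host, user, and RDS instance ID
MYSQL_HOST=xxxxxxx.us-east-1.rds.amazonaws.com
MYSQL_USER=admin
RDS_INSTANCE=my-hive-metastore

# Capture the pre-upgrade counts so they can be compared after the upgrade
mysql -u "$MYSQL_USER" -h "$MYSQL_HOST" -p -e \
  "SELECT COUNT(*) AS db_count FROM hive.DBS; SELECT COUNT(*) AS tbl_count FROM hive.TBLS;" \
  > pre-upgrade-counts.txt

# Snapshot the RDS-hosted Metastore database before making any schema change
aws rds create-db-snapshot \
  --db-instance-identifier "$RDS_INSTANCE" \
  --db-snapshot-identifier hive-metastore-pre-upgrade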

Upgrade the Amazon EMR Hive Metastore schema with the Hive Schema Tool

In this approach, you use the persistent Hive Metastore on a remote database (Amazon RDS for MySQL or Amazon Aurora MySQL-Compatible Edition). The following diagram shows the upgrade procedure.

To upgrade the Amazon EMR Hive Metastore from 5.X (Hive version 2.X) to 6.X (Hive version 3.X), we can use the Hive Schema Tool. The Hive Schema Tool is an offline tool for Metastore schema manipulation. You can use it to initialize, upgrade, and validate the Metastore schema. Run the following command to show the available options for the Hive Schema Tool:

sudo hive --service schematool -help

Be sure to complete the prerequisites mentioned earlier, including taking a backup or snapshot, before proceeding with the next steps.

  1. Note down the details of the existing Hive external Metastore to be upgraded.
    This includes the RDS for MySQL endpoint host name, database name (for this post, hive), user name, and password. You can do this through one of the following options:
    • Get the Hive Metastore DB information from the Hive configuration file – Log in to the EMR 5.X primary node, open the file /etc/hive/conf/hive-site.xml, and note the four properties:
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://{hostname}:3306/{dbname}?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for the metastore database</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.mariadb.jdbc.Driver</value>
        <description>driver class name for the metastore database</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>{username}</value>
        <description>username to use against metastore database</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>{password}</value>
        <description>password to use against metastore database</description>
      </property>
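      To pull these values without opening the file in an editor, a quick grep (a minimal sketch; it prints each connection property name followed by its value line) also works:

      grep -A1 'javax.jdo.option.Connection' /etc/hive/conf/hive-site.xml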

    • Get the Hive Metastore DB information from the Amazon EMR console – Navigate to the EMR 5.X cluster, choose the Configurations tab, and note down the Metastore DB information.
      EMR Console for Configuration
  2. Create a new EMR 6.X cluster.
    To use the Hive Schema Tool, we need to create an EMR 6.X cluster. You can create a new EMR 6.X cluster through the Amazon EMR console or the AWS Command Line Interface (AWS CLI), without specifying external Hive Metastore details. This lets the EMR 6.X cluster launch successfully using the default Hive Metastore. For more information about EMR cluster management, refer to Plan and configure clusters.
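    The following AWS CLI sketch shows one way to launch such a cluster; the cluster name, release label, instance settings, and key pair are placeholder values to adapt to your environment:

    aws emr create-cluster \
      --name "hive-metastore-upgrade" \
      --release-label emr-6.9.0 \
      --applications Name=Hive \
      --instance-type m5.xlarge \
      --instance-count 3 \
      --use-default-roles \
      --ec2-attributes KeyName=my-key-pair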
  3. After your new EMR 6.X cluster is launched successfully and is in the waiting state, SSH to the EMR 6.X primary node and take a backup of /etc/hive/conf/hive-site.xml:
    sudo cp /etc/hive/conf/hive-site.xml /etc/hive/conf/hive-site.xml.bak

  4. Stop the Hive services:
    sudo systemctl stop hive-hcatalog-server.service
    sudo systemctl stop hive-server2.service

    Now you update the Hive configuration and point it to the old Hive Metastore database.

  5. Modify /etc/hive/conf/hive-site.xml and update the properties with the values you collected earlier:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://{hostname}:3306/{dbname}?createDatabaseIfNotExist=true</value>
      <description>JDBC connect string for the metastore database</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.mariadb.jdbc.Driver</value>
      <description>driver class name for the metastore database</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>{username}</value>
      <description>username to use against metastore database</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>{password}</value>
      <description>password to use against metastore database</description>
    </property>

  6. In the same or a new SSH session, run the Hive Schema Tool to check that the Metastore is pointing to the old Metastore database:
    sudo hive --service schemaTool -dbType mysql -info

    The output should look as follows (old-hostname, old-dbname, and old-username are the values you changed):

    Metastore connection URL:     jdbc:mysql://{old-hostname}:3306/{old-dbname}?createDatabaseIfNotExist=true
    Metastore Connection Driver :     org.mariadb.jdbc.Driver
    Metastore connection User:     {old-username}
    Hive distribution version:     3.1.0
    Metastore schema version:      2.3.0
    schemaTool completed

    You can upgrade the Hive Metastore by passing the -upgradeSchema option to the Hive Schema Tool. The tool figures out the SQL scripts required to initialize or upgrade the schema and then runs those scripts against the backend database.

  7. Run the upgradeSchema command with -dryRun, which only lists the SQL scripts needed during the actual run:
    sudo hive --service schematool -dbType mysql -upgradeSchema -dryRun

    The output should look like the following code. It shows the Metastore upgrade path from the old version to the new version. You can find the upgrade order in the GitHub repo. In case of failure during the upgrade process, these scripts can be run manually in the same order.

    Metastore connection URL:     jdbc:mysql://{old-hostname}:3306/{old-dbname}?createDatabaseIfNotExist=true
    Metastore Connection Driver :     org.mariadb.jdbc.Driver
    Metastore connection User:     {old-username}
    Hive distribution version:     3.1.0
    Metastore schema version:      2.3.0
    schemaTool completed

  8. To upgrade the Hive Metastore schema, run the Hive Schema Tool with -upgradeSchema:
    sudo hive --service schematool -dbType mysql -upgradeSchema

    The output should look like the following code:

    Starting upgrade metastore schema from version 2.3.0 to 3.1.0
    Upgrade script upgrade-2.3.0-to-3.0.0.mysql.sql
    ...
    Completed upgrade-2.3.0-to-3.0.0.mysql.sql
    Upgrade script upgrade-3.0.0-to-3.1.0.mysql.sql
    ...
    Completed upgrade-3.0.0-to-3.1.0.mysql.sql
    schemaTool completed

    In case of any issues or failures, you can run the preceding command with -verbose. This prints all the queries being run in order, along with their output.

    sudo hive --service schemaTool -verbose -dbType mysql -upgradeSchema

    If you encounter any failures during this process and want to upgrade your Hive Metastore by running the SQL yourself, refer to Upgrading Hive Metastore.

    If HDFS was used as storage for the Hive warehouse or any Hive DB location, you need to update the NameNode alias or URI with the new cluster's HDFS alias.

  9. Use the following command to update the HDFS NameNode alias (replace <new-loc> and <old-loc> with the HDFS root locations of the new and old clusters, respectively):
    hive --service metatool -updateLocation <new-loc> <old-loc>

    You can run the following command on any EMR cluster node to get the HDFS NameNode alias:

    hdfs getconf -confKey dfs.namenode.rpc-address

    At first, you can run with the -dryRun option, which displays all the changes without persisting them. For example:

    [hadoop@ip-172-31-87-188 ~]$ hive --service metatool -updateLocation hdfs://ip-172-31-50-80.ec2.internal:8020 hdfs://ip-172-31-87-188.ec2.internal:8020 -dryRun
    Initializing HiveMetaTool..
    Looking for LOCATION_URI field in DBS table to update..
    Dry Run of updateLocation on table DBS..
    old location: hdfs://ip-172-31-87-188.ec2.internal:8020/user/hive/warehouse/testdb.db new location: hdfs://ip-172-31-50-80.ec2.internal:8020/user/hive/warehouse/testdb.db
    old location: hdfs://ip-172-31-87-188.ec2.internal:8020/user/hive/warehouse/testdb_2.db new location: hdfs://ip-172-31-50-80.ec2.internal:8020/user/hive/warehouse/testdb_2.db
    old location: hdfs://ip-172-31-87-188.ec2.internal:8020/user/hive/warehouse new location: hdfs://ip-172-31-50-80.ec2.internal:8020/user/hive/warehouse
    Found 3 records in DBS table to update
    Looking for LOCATION field in SDS table to update..
    Dry Run of updateLocation on table SDS..
    old location: hdfs://ip-172-31-87-188.ec2.internal:8020/user/hive/warehouse/testtable1 new location: hdfs://ip-172-31-50-80.ec2.internal:8020/user/hive/warehouse/testtable1
    Found 1 records in SDS table to update

    However, if the new location needs to be changed to a different HDFS or S3 path, then use the following approach.

    First connect to the remote Hive Metastore database and run the following query to pull all the tables for a specific database and list their locations. Replace <HiveMetastore_DB> with the database name used for the Hive Metastore in the external database (for this post, hive) and the Hive database name (default):

    mysql> SELECT d.NAME as DB_NAME, t.TBL_NAME, t.TBL_TYPE, s.LOCATION FROM <HiveMetastore_DB>.TBLS t 
    JOIN <HiveMetastore_DB>.DBS d ON t.DB_ID = d.DB_ID JOIN <HiveMetastore_DB>.SDS s 
    ON t.SD_ID = s.SD_ID AND d.NAME='default';

    Identify the tables whose locations need to be updated. Then run the ALTER TABLE command to update the table locations. You can prepare a script or a chain of ALTER TABLE commands to update the locations for multiple tables, as in the sketch after this command.

    ALTER TABLE <table_name> SET LOCATION "<new_location>";
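    If many tables need to move, you can generate the ALTER TABLE statements from the Metastore itself. The following is a sketch of that approach, assuming the Metastore DB is named hive; the S3 prefix s3://my-bucket/warehouse/ is a placeholder. Run the query against MySQL, then run the generated statements in Hive:

    -- Emits one ALTER TABLE statement per table in the default database
    SELECT CONCAT('ALTER TABLE default.', t.TBL_NAME,
                  ' SET LOCATION "s3://my-bucket/warehouse/', t.TBL_NAME, '";')
    FROM hive.TBLS t
    JOIN hive.DBS d ON t.DB_ID = d.DB_ID
    WHERE d.NAME = 'default';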

  10. Start the Hive Metastore and HiveServer2 and check their status:
    sudo systemctl start hive-hcatalog-server.service
    sudo systemctl start hive-server2.service
    sudo systemctl status hive-hcatalog-server.service
    sudo systemctl status hive-server2.service

Post-upgrade validation

Perform the following post-upgrade steps:

  1. Confirm that the Hive Metastore schema is upgraded to the new version:
    sudo hive --service schemaTool -dbType mysql -validate

    The output should look like the following code:

    Starting metastore validation
    
    Validating schema version
    Succeeded in schema version validation.
    [SUCCESS]
    Validating sequence number for SEQUENCE_TABLE
    Succeeded in sequence number validation for SEQUENCE_TABLE.
    [SUCCESS]
    Validating metastore schema tables
    Succeeded in schema table validation.
    [SUCCESS]
    Validating DFS locations
    Succeeded in DFS location validation.
    [SUCCESS]
    Validating columns for incorrect NULL values.
    Succeeded in column validation for incorrect NULL values.
    [SUCCESS]
    Done with metastore validation: [SUCCESS]
    schemaTool completed

  2. Run the following Hive Schema Tool command to query the Hive schema version and verify that it's upgraded:
    $ sudo hive --service schemaTool -dbType mysql -info
    Metastore connection URL:        jdbc:mysql://<host>:3306/<hivemetastore-dbname>?createDatabaseIfNotExist=true
    Metastore Connection Driver :    org.mariadb.jdbc.Driver
    Metastore connection User:       <username>
    Hive distribution version:       3.1.0
    Metastore schema version:        3.1.0

  3. Run some DML queries against old tables and ensure they run successfully.
  4. Verify the table and database counts using the same commands mentioned in the prerequisites section, and compare them with the pre-upgrade counts.

The Hive Metastore schema migration process is complete, and you can start working on your new EMR cluster. If for some reason you want to relaunch the EMR cluster, you just need to provide the upgraded Hive Metastore remote database using the options on the Amazon EMR Configurations tab.
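If you relaunch through the AWS CLI instead, a minimal sketch of the hive-site configuration classification that points a new cluster at the upgraded Metastore looks like the following (the host name, database name, and credentials are placeholders); save it to a file and pass it with --configurations file://hive-config.json:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://{hostname}:3306/{dbname}?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "{username}",
      "javax.jdo.option.ConnectionPassword": "{password}"
    }
  }
]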

Migrate the Amazon EMR Hive Metastore to the AWS Glue Data Catalog

The AWS Glue Data Catalog is flexible and reliable, and can reduce your operational cost. Moreover, the Data Catalog supports different versions of EMR clusters. Therefore, when you migrate your Amazon EMR 5.X Hive Metastore to the Data Catalog, you can use the same Data Catalog with any new EMR 5.8+ cluster, including Amazon EMR 6.x. There are some factors you should consider when using this approach; refer to Considerations when using AWS Glue Data Catalog for more information. The following diagram shows the upgrade procedure.
EMR Hive Metastore Migrate to Glue Data Catalog
To migrate your Hive Metastore to the Data Catalog, you can use the Hive Metastore migration script from GitHub. The following are the major steps for a direct migration.

Make sure all the table data is stored in Amazon S3 and not HDFS. Otherwise, tables migrated to the Data Catalog will have their table location pointing to HDFS, and you can't query them. You can check your table data locations by connecting to the MySQL database and running the following SQL:

select SD_ID, LOCATION from SDS where LOCATION like 'hdfs%';
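Database locations can point to HDFS as well. A similar check against the DBS table (whose DB_LOCATION_URI column holds each database's location) looks like this:

select DB_ID, NAME, DB_LOCATION_URI from DBS where DB_LOCATION_URI like 'hdfs%';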

Make sure to complete the prerequisite steps mentioned earlier before proceeding with the migration. Ensure the EMR 5.X cluster is in a waiting state and all of its components' statuses are in service.

  1. Note down the details of the existing EMR 5.X cluster Hive Metastore database to be upgraded.
    As mentioned before, this includes the endpoint host name, database name, user name, and password. You can do this through one of the following options:
    • Get the Hive Metastore DB information from the Hive configuration file – Log in to the Amazon EMR 5.X primary node, open the file /etc/hive/conf/hive-site.xml, and note the following properties:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://{hostname}:3306/{dbname}?createDatabaseIfNotExist=true</value>
      <description>JDBC connect string for the metastore database</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>{username}</value>
      <description>username to use against metastore database</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>{password}</value>
      <description>password to use against metastore database</description>
    </property>

    • Get the Hive Metastore DB information from the Amazon EMR console – Navigate to the Amazon EMR 5.X cluster, choose the Configurations tab, and note down the Metastore DB information.

    EMR Console for Configuration

  2. On the AWS Glue console, create a connection to the Hive Metastore as a JDBC data source.
    Use the connection JDBC URL, user name, and password you gathered in the previous step. Specify the VPC, subnet, and security group associated with your Hive Metastore. You can find these on the Amazon EMR console if the Hive Metastore is on the EMR primary node, or on the Amazon RDS console if the Metastore is an RDS instance.
  3. Download two extract, transform, and load (ETL) job scripts from GitHub and upload them to an S3 bucket:
    wget https://raw.githubusercontent.com/aws-samples/aws-glue-samples/master/utilities/Hive_metastore_migration/src/hive_metastore_migration.py
    wget https://raw.githubusercontent.com/aws-samples/aws-glue-samples/master/utilities/Hive_metastore_migration/src/import_into_datacatalog.py
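    To upload both scripts, a minimal sketch follows; the bucket name is a placeholder:

    # Placeholder bucket -- replace with your own
    aws s3 cp hive_metastore_migration.py s3://my-glue-scripts/
    aws s3 cp import_into_datacatalog.py s3://my-glue-scripts/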

    If you configured AWS Glue to access Amazon S3 from a VPC endpoint, you must upload the scripts to a bucket in the same AWS Region where your job runs.

    Now you must create a job on the AWS Glue console to extract metadata from your Hive Metastore and migrate it to the Data Catalog.

  4. On the AWS Glue console, choose Jobs in the navigation pane.
  5. Choose Create job.
  6. Select Spark script editor.
  7. For Options, select Upload and edit an existing script.
  8. Choose Choose file and upload the import_into_datacatalog.py script you downloaded earlier.
  9. Choose Create.
    Glue Job script editor
  10. On the Job details tab, enter a job name (for example, Import-Hive-Metastore-To-Glue).
  11. For IAM Role, choose a role.
  12. For Type, choose Spark.
    Glue ETL Job details
  13. For Glue version, choose Glue 3.0.
  14. For Language, choose Python 3.
  15. For Worker type, choose G.1X.
  16. For Requested number of workers, enter 2.
    Glue ETL Job details
  17. In the Advanced properties section, for Script filename, enter import_into_datacatalog.py.
  18. For Script path, enter the S3 path you used earlier (just the parent folder).
    Glue ETL Job details
  19. Under Connections, choose the connection you created earlier.
    Glue ETL Job details
  20. For Python library path, enter the S3 path you used earlier for the file hive_metastore_migration.py.
  21. Under Job parameters, enter the following key-value pairs:
    • --mode: from-jdbc
    • --connection-name: EMR-Hive-Metastore
    • --region: us-west-2

    Glue ETL Job details

  22. Choose Save to save the job.
  23. Run the job on demand on the AWS Glue console.

If the job runs successfully, Run status should show as Succeeded. When the job is finished, the metadata from the Hive Metastore is visible on the AWS Glue console. Check the databases and tables listed to verify that they were migrated correctly.
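If you prefer to script the job creation rather than use the console, a rough AWS CLI equivalent is sketched below; the role, bucket, connection name, and Region are placeholders that mirror the console values above:

# Placeholder role, bucket, and Region -- replace with your own values
aws glue create-job \
  --name Import-Hive-Metastore-To-Glue \
  --role MyGlueServiceRole \
  --glue-version "3.0" \
  --worker-type G.1X \
  --number-of-workers 2 \
  --connections Connections=EMR-Hive-Metastore \
  --command '{"Name":"glueetl","ScriptLocation":"s3://my-glue-scripts/import_into_datacatalog.py","PythonVersion":"3"}' \
  --default-arguments '{"--mode":"from-jdbc","--connection-name":"EMR-Hive-Metastore","--region":"us-west-2","--extra-py-files":"s3://my-glue-scripts/hive_metastore_migration.py"}'

aws glue start-job-run --job-name Import-Hive-Metastore-To-Glue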

Known issues

In some cases, where the Hive Metastore schema version is on a very old release or some required metadata tables are missing, the upgrade process may fail. In this case, you can use the following steps to identify and fix the issue. Run the schemaTool upgradeSchema command with -verbose as follows:

sudo hive --service schemaTool -verbose -dbType mysql -upgradeSchema

This prints all the queries being run in order and their output:

jdbc:mysql://metastore.xxxx.us-west-1> CREATE INDEX PCS_STATS_IDX ON PART_COL_STATS (DB_NAME,TABLE_NAME,COLUMN_NAME,PARTITION_NAME) USING BTREE
Error: (conn=6831922) Duplicate key name 'PCS_STATS_IDX' (state=42000,code=1061)
Closing: 0: jdbc:mysql://metastore.xxxx.us-west-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
Underlying cause: java.io.IOException : Schema script failed, errorcode 2

Note down the query and the error message, then take the required steps to address the issue. For example, depending on the error message, you may have to create a missing table or alter an existing table. You can then either rerun the schemaTool upgradeSchema command, or manually run the remaining queries required for the upgrade. You can get the complete set of scripts that schemaTool runs from the path /usr/lib/hive/scripts/metastore/upgrade/mysql/ on the primary node, or from GitHub.
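For instance, for the duplicate-index error shown above, one possible fix (a sketch only; verify the existing index definition first, because it was likely left behind by an earlier partial run) is to drop the pre-existing index so the upgrade script can recreate it, then rerun schemaTool:

-- Drop the leftover index so upgrade-2.3.0-to-3.0.0.mysql.sql can recreate it
ALTER TABLE PART_COL_STATS DROP INDEX PCS_STATS_IDX;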

Clean up

Running additional EMR clusters to perform the upgrade activity in your AWS account may incur additional charges. When you complete the Hive Metastore upgrade successfully, we recommend deleting the additional EMR clusters to save cost.

Conclusion

To upgrade Amazon EMR from 5.X to 6.X and take advantage of features in Hive 3.X or Spark SQL 3.X, you must upgrade the Hive Metastore first. If you're using the AWS Glue Data Catalog as your Hive Metastore, you don't need to do anything, because the Data Catalog supports both Amazon EMR versions. If you're using a MySQL database as the external Hive Metastore, you can upgrade by following the steps outlined in this post, or you can migrate your Hive Metastore to the Data Catalog.

There are some functional differences between the different versions of Hive, Spark, and Flink. If you have applications running on Amazon EMR 5.X, make sure to test your applications on Amazon EMR 6.X and validate their functional compatibility. We will cover application upgrades for Amazon EMR components in a future post.


About the authors

Jianwei Li is a Senior Analytics Specialist TAM. He provides advisory services for AWS enterprise support customers to design and build modern data platforms. He has more than 10 years of experience in the big data and analytics domain. In his spare time, he likes running and hiking.

Narayanan Venkateswaran is an Engineer in the AWS EMR group. He works on developing Hive in EMR. He has over 17 years of work experience in the industry across several companies, including Sun Microsystems, Microsoft, Amazon, and Oracle. Narayanan also holds a PhD in databases with a focus on horizontal scalability in relational stores.

Partha Sarathi is an Analytics Specialist TAM at AWS based in Sydney, Australia. He brings 15+ years of technology expertise and helps Enterprise customers optimize analytics workloads. He has worked extensively on both on-premises and cloud big data workloads, along with various ETL platforms, in his previous roles. He also actively conducts proactive operational reviews around analytics services like Amazon EMR, Redshift, and OpenSearch.

Krish is an Enterprise Support Manager responsible for leading a team of specialists in EMEA focused on big data and analytics, databases, networking, and security. He is also an expert in helping enterprise customers modernize their data platforms and inspiring them to implement operational best practices. In his spare time, he enjoys spending time with his family, traveling, and video games.
