
Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 2


In this second post of a multi-part series, we share best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns. You can check out the first post of this series for guidance on planning, running, and validating a large-scale data warehouse migration from Greenplum to Amazon Redshift using the AWS Schema Conversion Tool (AWS SCT).

Choose your optimal Amazon Redshift cluster

Amazon Redshift has two types of clusters: provisioned and serverless. For a provisioned cluster, you need to set it up with the required compute resources. Amazon Redshift Serverless can run high-performance analytics in the cloud at any scale. For more information, refer to Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure.

An Amazon Redshift cluster consists of nodes. Each cluster has a leader node and one or more compute nodes. The leader node receives queries from client applications, parses the queries, and develops query run plans. The leader node then coordinates the parallel run of these plans with the compute nodes and aggregates the intermediate results from those nodes. It then returns the results to the client applications.

When determining your type of cluster, consider the following:

  • Estimate the size of the input data compressed, vCPU, and performance requirements. As of this writing, we recommend the Amazon Redshift RA3 instance with managed storage, which scales compute and storage independently for fast query performance.
  • Amazon Redshift provides an automated "Help me choose" cluster recommendation based on the size of your data.
  • A main advantage of a cloud Amazon Redshift data warehouse is that you're no longer stuck with hardware and commodities like old guard data warehouses. For faster innovation, you have the option to try different cluster options and choose the one optimized in terms of performance and cost.
  • At the time of development or pilot, you can usually start with a smaller number of nodes. As you move to production, you can adjust the number of nodes based on your usage pattern. When right-sizing your clusters, we recommend choosing the reserved instance type to cut down the cost even further. The public-facing utility Simple Replay can help you determine performance against different cluster types and sizes by replaying the customer workload. For provisioned clusters, if you're planning to use the recommended RA3 instance, you can compare different node types to determine the right instance type.
  • Based on your workload pattern, Amazon Redshift supports resize, pause and stop, and concurrency scaling of the cluster. Amazon Redshift workload management (WLM) enables effective and flexible management of memory and query concurrency.

Create data extraction tasks with AWS SCT

With AWS SCT extraction agents, you can migrate your source tables in parallel. These extraction agents authenticate using a valid user on the data source, allowing you to adjust the resources available for that user during the extraction. AWS SCT agents process the data locally and upload it to Amazon Simple Storage Service (Amazon S3) through the network (via AWS Direct Connect). We recommend having consistent network bandwidth between the Greenplum machine where the AWS SCT agent is installed and your AWS Region.

If you have tables with around 20 million rows or 1 TB in size, you can use the virtual partitioning feature in AWS SCT to extract data from those tables. This creates several sub-tasks and parallelizes the data extraction process for the table. Therefore, we recommend creating two groups of tasks for each schema that you migrate: one for small tables and one for large tables using virtual partitions.

For more information, refer to Creating, running, and monitoring an AWS SCT data extraction task.

Data architecture

To simplify and modernize your data architecture, consider the following:

  • Establish accountability and authority to enforce enterprise data standards and policies.
  • Formalize the data and analytics operating model between the enterprise and business units and functions.
  • Simplify the data technology ecosystem through rationalization and modernization of data assets and tools or technology.
  • Develop organizational constructs that facilitate more robust integration of the business and delivery teams, and build data-oriented products and solutions to address business problems and opportunities throughout the lifecycle.
  • Back up the data periodically so that if something goes wrong, you have the ability to replay.
  • During planning, design, execution, and throughout implementation and maintenance, ensure data quality management is in place to achieve the desired outcome.
  • Simple is the key to an easy, fast, intuitive, and low-cost solution. Simple scales much better than complex. Simple makes it possible to think big (Invent and Simplify is another Amazon leadership principle). Simplify the legacy process by migrating only the necessary data used in tables and schemas. For example, if you're performing truncate and load for incremental data, identify a watermark and only process incremental data.
  • You may have use cases that require record-level inserts, updates, and deletes for privacy regulations and simplified pipelines; simplified file management and near-real-time data access; or simplified change data capture (CDC) data pipeline development. We recommend using purposeful tools based on your use case. AWS offers the option to use Apache Hudi with Amazon EMR and AWS Glue.

Migrate stored procedures

In this section, we share best practices for stored procedure migration from Greenplum to Amazon Redshift. Data processing pipelines with complex business logic often use stored procedures to perform the data transformation. We advise using big data processing services like AWS Glue or Amazon EMR to modernize your extract, transform, and load (ETL) jobs. For more information, check out Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift. For a time-sensitive migration to a cloud-native data warehouse like Amazon Redshift, redesigning and developing the entire pipeline in a cloud-native ETL tool might be time-consuming. Therefore, migrating the stored procedures from Greenplum to Amazon Redshift stored procedures can be the right choice.

For a successful migration, make sure to follow Amazon Redshift stored procedure best practices:

  • Specify the schema name when creating a stored procedure. This helps facilitate schema-level security and lets you enforce grants or revoke access control.
  • To prevent naming conflicts, we recommend naming procedures using the prefix sp_. Amazon Redshift reserves the sp_ prefix exclusively for stored procedures. By prefixing your procedure names with sp_, you ensure that your procedure name won't conflict with any existing or future Amazon Redshift procedure names.
  • Qualify your database objects with the schema name in the stored procedure.
  • Follow the minimal required access rule and revoke unwanted access. For a similar implementation, make sure the stored procedure run permission is not open to ALL.
  • The SECURITY attribute controls a procedure's privileges to access database objects. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the default. For more information, refer to Security and privileges for stored procedures.
  • Managing transactions when it comes to stored procedures is important. For more information, refer to Managing transactions.
  • TRUNCATE issues a commit implicitly inside a stored procedure. It interferes with the transaction block by committing the current transaction and creating a new one. Exercise caution while using TRUNCATE to ensure it never breaks the atomicity of the transaction. This also applies to COMMIT and ROLLBACK.
  • Adhere to cursor constraints and understand the performance considerations of using cursors. You should use set-based SQL logic and temporary tables when processing large datasets.
  • Avoid hardcoding in stored procedures. Use dynamic SQL to construct SQL queries dynamically at runtime. Ensure appropriate logging and error handling of the dynamic SQL.
  • For exception handling, you can write RAISE statements as part of the stored procedure code. For example, you can raise an exception with a custom message or insert a record into a logging table. For unhandled exceptions like WHEN OTHERS, use built-in functions like SQLERRM or SQLSTATE to pass them on to the calling application or program. As of this writing, Amazon Redshift limits calling a stored procedure from the exception block. A minimal sketch that ties several of these practices together follows this list.
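The following sketch shows a schema-qualified procedure with the sp_ prefix, an explicit SECURITY attribute, and an exception handler that logs the error using SQLERRM. It is an illustration only; the schema, table, and procedure names (etl, daily_sales, stg_sales, proc_error_log, sp_load_daily_sales) are hypothetical.

create or replace procedure etl.sp_load_daily_sales(p_load_date date)
as $$
begin
	-- Qualify objects with the schema name; use DELETE rather than TRUNCATE
	-- to avoid an implicit commit inside the transaction block
	delete from etl.daily_sales where sale_date = p_load_date;

	insert into etl.daily_sales (sale_date, item, cost)
	select sale_date, item, cost
	from etl.stg_sales
	where sale_date = p_load_date;
exception
	when others then
		-- Log the failure with the built-in SQLERRM message
		raise info 'sp_load_daily_sales failed: %', sqlerrm;
		insert into etl.proc_error_log (proc_name, err_msg, logged_at)
		values ('sp_load_daily_sales', sqlerrm, getdate());
end;
$$ language plpgsql
security invoker;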

Sequences

You can use IDENTITY columns, system timestamps, or epoch time as options to ensure uniqueness. The IDENTITY column or a timestamp-based solution might have sparse values, so if you need a continuous number sequence, you need to use dedicated number tables. You can also use the RANK() or ROW_NUMBER() window function over the entire set. Alternatively, get the high-water mark from the existing ID column of the table and increment the values while inserting records.
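As an illustration, the following sketch shows both the IDENTITY approach and the high-water mark approach; the schema, table, and column names (sales.orders, order_sk, and so on) are hypothetical:

-- Option 1: let Amazon Redshift assign surrogate keys with an IDENTITY column
-- (values are unique but not necessarily contiguous)
create table sales.orders (
	order_sk bigint identity(1,1),
	order_ref varchar(32),
	order_date date
);

-- Option 2: continue from the current high-water mark while inserting new records
insert into sales.orders_manual (order_sk, order_ref, order_date)
select (select coalesce(max(order_sk), 0) from sales.orders_manual)
	+ row_number() over (order by s.order_ref),
	s.order_ref,
	s.order_date
from sales.stg_orders s;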

Character data type length

Greenplum char and varchar data type length is specified in terms of character length, including multi-byte characters. Amazon Redshift character types are defined in terms of bytes. For table columns using multi-byte character sets in Greenplum, the converted table column in Amazon Redshift should allocate adequate storage for the actual byte size of the source data.

A simple workaround is to set the Amazon Redshift character column length to four times the corresponding Greenplum column length.

A best practice is to use the smallest possible column size. Amazon Redshift doesn't allocate storage space according to the length of the attribute; it allocates storage according to the real length of the stored string. However, at runtime, while processing queries, Amazon Redshift allocates memory according to the length of the attribute. Therefore, avoiding a default size four times greater helps from a performance perspective.

An efficient solution is to analyze production datasets and determine the maximum byte size length of the Greenplum character columns. Add a 20% buffer to support future incremental growth of the table.

To arrive at the actual byte size length of an existing column, run the Greenplum data structure character utility from the AWS Samples GitHub repo.
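If you only need a quick check on a handful of columns, a query along the following lines, run on Greenplum with the standard octet_length() function, returns similar information; the table and column names (cust.customer_master, customer_name) are hypothetical:

-- Run on Greenplum: maximum byte length of a character column, plus a 20% growth buffer
select
	max(octet_length(customer_name)) as max_bytes,
	ceil(max(octet_length(customer_name)) * 1.2) as suggested_redshift_varchar_length
from
	cust.customer_master;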

Numeric precision and scale

The Amazon Redshift numeric data type can store up to a maximum precision of 38, whereas in a Greenplum database, you can define a numeric column without any defined length.

Analyze your production datasets and determine numeric overflow candidates using the Greenplum data structure numeric utility from the AWS Samples GitHub repo. For numeric data, you have options to handle this based on your use case. For numbers with a decimal part, you have the option to round the data based on the data type without any data loss in the whole number part. For future reference, you can keep a copy of the column in VARCHAR or store it in an S3 data lake. If you see an extremely small percentage of outlier overflow data, clean up the source data for a quality data migration.
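The following sketch illustrates one such transformation, run on the Greenplum side during extraction: rounding the value to fit within Amazon Redshift's maximum precision while keeping an exact copy as text for reference. The table and column names are hypothetical.

-- Run on Greenplum: round the decimal part to a scale that fits numeric(38,4),
-- and carry the original value as a varchar copy for future reference
select
	round(raw_amount, 4)::numeric(38, 4) as amount,
	raw_amount::varchar as amount_text
from
	finance.gp_financials;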

SQL queries and functions

While converting SQL scripts or stored procedures to Amazon Redshift, you may encounter unsupported functions, database objects, or code blocks for which you might have to rewrite the query, create user-defined functions (UDFs), or redesign. You can create a custom scalar UDF using either a SQL SELECT clause or a Python program. The new function is stored in the database and is available to any user with sufficient privileges to run it. You run a custom scalar UDF in much the same way as you run existing Amazon Redshift functions to match functionality of legacy databases. The following are some examples of alternate query statements and ways to achieve specific aggregations that might be required during a code rewrite.
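Before getting to those examples, the following is a minimal sketch of a scalar SQL UDF; the function name and logic (f_initials) are hypothetical and shown only to illustrate the syntax. Amazon Redshift reserves the f_ prefix for UDFs, so prefixing the name avoids naming conflicts, much like sp_ does for stored procedures.

-- Scalar SQL UDF: arguments are referenced by position ($1, $2)
create or replace function f_initials (varchar, varchar)
returns varchar
immutable
as $$
	select upper(left($1, 1) || left($2, 1))
$$ language sql;

-- Usage
select f_initials('Jane', 'Doe');  -- returns 'JD'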

AGE

The Greenplum function AGE() returns an interval by subtracting from the current date. You can accomplish the same using a combination of the MONTHS_BETWEEN(), ADD_MONTHS(), DATEDIFF(), and TRUNC() functions based on your use case.

The following example Amazon Redshift query calculates the gap between the dates 2001-04-10 and 1957-06-13 in terms of years, months, and days. You can apply this to any date columns in a table.

select
	trunc(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))/ 12) years,
	mod(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4, 12) months,
	'2001-04-10'::date - add_months('1957-06-13'::date,
	trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4) days;

COUNT

If you have a use case to get a distinct aggregation in the COUNT() window function, you can accomplish the same using a combination of the DENSE_RANK() and MAX() window functions.

The following example Amazon Redshift query calculates the distinct item count for a given date of sale:

select
	sale_date,
	item,
	cost,
	max(densernk) over (partition by sale_date order by item rows between unbounded preceding and unbounded following) as distinct_itemcount
from
	(
	select
		*, dense_rank() over (partition by sale_date order by item) as densernk
	from
		testaggr)
order by
	sale_date,
	item,
	cost;

ORDER BY

Amazon Redshift aggregate window functions with an ORDER BY clause require a mandatory frame.

The following example Amazon Redshift query creates a cumulative sum of cost by sale date and orders the results by item within the partition:

select
	*,
	sum(cost) over (partition by sale_date
order by
	item rows between unbounded preceding and unbounded following) as total_cost_by_date
from
	testaggr
order by
	sale_date,
	item,
	cost;

STRING_AGG

In Greenplum, STRING_AGG() is an aggregate function used to concatenate a list of strings. In Amazon Redshift, use the LISTAGG() function.

The following example Amazon Redshift query returns a semicolon-separated list of email addresses for each department:

select
	dept,
	listagg(email_address, ';')
within group (order by dept) as email_list
from
	employee_contact
group by
	dept
order by
	dept;

ARRAY_AGG

In Greenplum, ARRAY_AGG() is an aggregate function that takes a set of values as input and returns an array. In Amazon Redshift, use a combination of the LISTAGG() and SPLIT_TO_ARRAY() functions. The SPLIT_TO_ARRAY() function returns a SUPER datatype.

The following example Amazon Redshift query returns an array of email addresses for each department:

select
	dept,
	SPLIT_TO_ARRAY(email_list, ';') email_array
from
	(
	select
		dept,
		listagg(email_address, ';')
	within group (order by dept) as email_list
	from
		employee_contact
	group by
		dept
	order by
		dept);

To retrieve array elements from a SUPER expression, you can use the SUBARRAY() function:

select
	SUBARRAY(email_array, 0, 1) first_element,
	SUBARRAY(email_array, 1, 1) second_element,
	SUBARRAY(email_array, 0) all_element
from
	testarray
where
	dept = 'HR';

UNNEST

In Greenplum, you can use the UNNEST function to split an array and convert the array elements into a set of rows. In Amazon Redshift, you can use PartiQL syntax to iterate over SUPER arrays. For more information, refer to Querying semistructured data.

create temp table unnesttest as
select
	json_parse('{"scalar_array": [0,1,2,3,4,5.5,6,7.9,8,9]}') as data;

select
	element
from
	unnesttest as un,
	un.data.scalar_array as element at index;

WHERE

You can't use a window function in the WHERE clause of a query in Amazon Redshift. Instead, construct the query using the WITH clause and then refer to the calculated column in the WHERE clause.

The following example Amazon Redshift query returns the sale date, item, and cost from a table for the sale dates where the total sale is greater than 100:

with aggrcost as (
select
	sale_date,
	item,
	cost,
	sum(cost) over (partition by sale_date) as total_sale
from
	testaggr )
select
	*
from
	aggrcost
where
	total_sale > 100;

Refer to the following table for additional Greenplum date/time functions along with their Amazon Redshift equivalents to accelerate your code migration.

| # | Description | Greenplum | Amazon Redshift |
|---|---|---|---|
| 1 | The now() function returns the start time of the current transaction | now() | sysdate |
| 2 | clock_timestamp() returns the start timestamp of the current statement within a transaction block | clock_timestamp() | to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz |
| 3 | transaction_timestamp() returns the start timestamp of the current transaction | transaction_timestamp() | to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz |
| 4 | Interval – adds x years and y months to date_time_column and returns a timestamp type | date_time_column + interval 'x years y months' | add_months(date_time_column, x*12 + y) |
| 5 | Get the total number of seconds between two timestamp fields | date_part('day', end_ts - start_ts) * 24 * 60 * 60 + date_part('hours', end_ts - start_ts) * 60 * 60 + date_part('minutes', end_ts - start_ts) * 60 + date_part('seconds', end_ts - start_ts) | datediff('seconds', start_ts, end_ts) |
| 6 | Get the total number of minutes between two timestamp fields | date_part('day', end_ts - start_ts) * 24 * 60 + date_part('hours', end_ts - start_ts) * 60 + date_part('minutes', end_ts - start_ts) | datediff('minutes', start_ts, end_ts) |
| 7 | Extract a date part literal from the difference of two timestamp fields | date_part('hour', end_ts - start_ts) | extract(hour from (date_time_column_2 - date_time_column_1)) |
| 8 | Return the ISO day of the week | date_part('isodow', date_time_column) | TO_CHAR(date_time_column, 'ID') |
| 9 | Return the ISO year from a date/time field | extract(isoyear from date_time_column) | TO_CHAR(date_time_column, 'IYYY') |
| 10 | Convert epoch seconds to the equivalent datetime | to_timestamp(epoch_seconds) | TIMESTAMP 'epoch' + number_of_seconds * interval '1 second' |

Amazon Redshift utilities for troubleshooting or running diagnostics for the cluster

The Amazon Redshift Utilities GitHub repo contains a set of utilities to accelerate troubleshooting or analysis on Amazon Redshift. These utilities consist of queries, views, and scripts. They are not deployed by default onto Amazon Redshift clusters. The best practice is to deploy the needed views into the admin schema.

Conclusion

In this post, we covered prescriptive guidance around data types, functions, and stored procedures to accelerate the migration process from Greenplum to Amazon Redshift. Although this post describes modernizing and moving to a cloud warehouse, you should be augmenting this transformation process towards a full-fledged modern data architecture. The AWS Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purposeful data stores like Amazon S3, Amazon Redshift, Amazon Timestream, and others based on your use case.


About the Authors

Suresh Patnam is a Principal Solutions Architect at AWS. He is passionate about helping businesses of all sizes transform into fast-moving digital organizations, focusing on big data, data lakes, and AI/ML. Suresh holds an MBA degree from Duke University's Fuqua School of Business and an MS in CIS from Missouri State University. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Arunabha Datta is a Sr. Data Architect at AWS Professional Services. He collaborates with customers and partners to architect and implement modern data architectures using AWS Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.
