A Tale of Data Warehouses
• A luxury for the ‘fat cats’
• Complex
• Expensive
• Even before the first query!
• Time Factor
• Staffing
• DBAs
• IT
• Traditional databases will never cut it
• Expectations
Enter…
Amazon Redshift
Big Deal for Big Data?
• Fast, petabyte-scale data warehouse service
• Fully managed (setup, operation, scale)
• Seamless integration with existing BI tools (Jaspersoft, MicroStrategy, Pentaho, Tableau, BusinessObjects, Cognos, and more!)
• No new languages to learn
• 3 simple steps (sketched in code below)
• Load up your cluster with data
• Connect your favorite query tool
• Query away!
• There’s an API too!
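A minimal sketch of those three steps in Python, using the psycopg2 driver (Redshift speaks the PostgreSQL wire protocol). The endpoint, credentials, bucket, and IAM role below are hypothetical placeholders:

    import psycopg2

    # Step 2 first: connect with any PostgreSQL-compatible tool or driver.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
        port=5439,            # Redshift's default port
        dbname="analytics",
        user="admin",
        password="...",
    )
    cur = conn.cursor()

    # Step 1: load the cluster with data straight from S3
    # (assuming an "events" table already exists on the cluster).
    cur.execute("""
        COPY events FROM 's3://my-bucket/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        CSV;
    """)
    conn.commit()

    # Step 3: query away!
    cur.execute("SELECT COUNT(*) FROM events;")
    print(cur.fetchone()[0])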
Wow – what steps does it really involve?
• Simple Management
• Just pop into the AWS Management Console
• Pick a node with pre-allocated storage
• Start off with a few hundred GBs and scale up to a terabyte
• < $1,000 / TB / year!
• Ready to accept data, so load it up!
• Key point on scaling
• Zero downtime!
• Storage is added automatically, and performance increases dynamically as more nodes are added
• No separate tuning required! Redshift handles this for us (a provisioning/resize sketch follows below)
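As a sketch of that management story, here is what provisioning and later resizing a cluster looks like through the API, using boto3 (the AWS SDK for Python); the identifiers, node type, and credentials are hypothetical:

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Pick a node type with pre-allocated storage, then launch.
    redshift.create_cluster(
        ClusterIdentifier="demo-warehouse",
        NodeType="dc2.large",
        MasterUsername="admin",
        MasterUserPassword="...",
        NumberOfNodes=2,
    )

    # Scaling later is a single call; Redshift redistributes the data
    # across the new nodes for us -- no separate tuning pass.
    redshift.modify_cluster(
        ClusterIdentifier="demo-warehouse",
        NumberOfNodes=4,
    )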
How does it work?
• Uses columnar storage
• Parallel processing architecture
• Queries are spread across the nodes of the cluster; the architecture is built to scale horizontally (see the table-design sketch after this list)
• Built-in cluster monitoring
• Automatic backups and manual snapshots
• Easy integration with other AWS services (S3, DynamoDB,
Data Pipeline, EMR, RDS)
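To see how the columnar, parallel design surfaces to the user, consider a table definition with distribution and sort keys. The table and its columns are hypothetical, and cur/conn are the psycopg2 handles from the earlier sketch:

    # DISTKEY hashes rows across the nodes so a query's work is spread
    # over the whole cluster; SORTKEY keeps blocks ordered so columnar
    # scans can skip data that can't match the query's range.
    cur.execute("""
        CREATE TABLE page_views (
            user_id    BIGINT,
            page_url   VARCHAR(2048),
            viewed_at  TIMESTAMP
        )
        DISTKEY (user_id)
        SORTKEY (viewed_at);
    """)
    conn.commit()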
Worried about security?
• Encryption easily flipped on
• Backups and data are encrypted
• Compatible with SSL, and works with Amazon VPC (Virtual Private Cloud); the encryption and SSL knobs are sketched below
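A hedged sketch of both knobs: encryption at rest is a flag at cluster-creation time, and SSL is a connection option (identifiers, node type, and credentials are hypothetical):

    import boto3
    import psycopg2

    # Flip encryption on when the cluster is created; data blocks and
    # backups are then encrypted at rest.
    boto3.client("redshift", region_name="us-east-1").create_cluster(
        ClusterIdentifier="secure-warehouse",
        NodeType="dc2.large",
        MasterUsername="admin",
        MasterUserPassword="...",
        NumberOfNodes=2,
        Encrypted=True,
    )

    # Require SSL on the wire when connecting.
    conn = psycopg2.connect(
        host="secure-warehouse.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="admin",
        password="...",
        sslmode="require",
    )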
This is great – but this must cost some serious $$$!
• Wrong!
• Cost-effective: as low as 1/10th the cost of traditional data warehouse systems
• Zero upfront fees
• Flexible payment options
• Pay as you go
• 70% discount if you commit to a reserved instance
• A 2TB Amazon Redshift data warehouse cluster costs < $1 / hour! (back-of-envelope math below)
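A back-of-envelope check of those figures, using only the numbers quoted above; actual pricing depends on node type and reservation term:

    hourly_rate = 1.00            # "< $1 / hour" for a 2TB cluster
    hours_per_year = 24 * 365     # 8,760 hours

    on_demand_per_tb = hourly_rate * hours_per_year / 2
    print(on_demand_per_tb)       # ~$4,380 / TB / year on demand

    reserved_per_tb = on_demand_per_tb * (1 - 0.70)
    print(reserved_per_tb)        # ~$1,314 / TB / year at the 70% reserved
                                  # discount -- in the ballpark of the
                                  # "< $1,000 / TB / year" quoted earlier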
Amazon.com’s Retail Biz Test*
• On-premises data warehouse
• 32 nodes
• 4.2TB of RAM
• 1.6PB of disk
• Cost: several million USD
• Amazon Redshift
• 2 nodes (128GB RAM each)
• 16TB of disk per node
• $32,000 / year (or $3.65 / hour – a quick arithmetic check follows this list)
• The test
• 2 billion rows of data and their 6 most complex queries
• At least 10x faster!
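The two quoted cost figures are self-consistent; a one-line check:

    print(3.65 * 24 * 365)   # 31,974.0 -- i.e. the ~$32,000 / year figure above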
HOWEVER… beware Big Data Myths
• Technology is only part of the solution
• Besides the tooling
• Devising hypotheses
• Determining metrics/parameters to look at
• Asking the right kinds of questions when the data is ready to
work for you
• If you’re not sure how you intend to use the data or the data warehouse, focus first on making sure your questions are in place before implementing any technology or DW (NX - HMO Principle)
All great, but how have YOU guys used Redshift?
• Telecom
• Caller data
• Improving call routing (reducing costs)
• Identifying carrier issues in near real time
• Identifying customer trends, which led to the development of new system features and, in turn, to more subscriptions
• Better infrastructure preparedness
• Performance Management & Analysis
• Logging granular server & network statistics data
• Process, server, cluster, I/O level metrics
• 2.5TB of data per 24-hour window
• Correlating resource trends against traffic trends (example query below)
• Determine bottlenecks descriptively
• Determine high-risk components under projected high traffic
• Proactive improvements before any issues hit us
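A hedged sketch of the kind of correlation query described above; the tables and columns are hypothetical stand-ins for the logged metrics, and cur is a psycopg2 cursor as in the first sketch:

    # Line up hourly resource usage against hourly traffic so bottlenecks
    # show up as hours where CPU climbs faster than request volume.
    cur.execute("""
        SELECT date_trunc('hour', m.recorded_at) AS hour,
               AVG(m.cpu_utilization)            AS avg_cpu,
               SUM(t.request_count)              AS requests
        FROM   server_metrics m
        JOIN   traffic_stats t
          ON   date_trunc('hour', t.recorded_at) = date_trunc('hour', m.recorded_at)
        GROUP  BY 1
        ORDER  BY 1;
    """)
    for hour, avg_cpu, requests in cur.fetchall():
        print(hour, avg_cpu, requests)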
What does this mean for business?
• Data warehouses are far more affordable now, especially for small-to-medium-sized companies
• Provide an edge to smaller businesses and entrepreneurs – potentially serving as a catalyst for small business
• Enterprises shouldn’t feel left out
• Test quantitative hypotheses faster based on actual data
• Teams at companies get quality computing in a fraction of the time
• Growth in data-driven/centered businesses
• Encourage competition
• Bottom Line:
• Redshift will do the grunt work, allowing companies to focus more on strategic use of the technology and let their data do the work for them