<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=unicode">
<meta name=Generator content="Microsoft Word 15 (filtered)">
<style>
<!--
 /* Font Definitions */
 @font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
	{font-family:Chalkboard;
	panose-1:3 5 6 2 4 2 2 2 2 5;}
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0in;
	font-size:12.0pt;
	font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
	{color:#0563C1;
	text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
	{margin-top:0in;
	margin-right:0in;
	margin-bottom:0in;
	margin-left:.5in;
	font-size:12.0pt;
	font-family:"Calibri",sans-serif;}
.MsoChpDefault
	{font-size:10.0pt;
	font-family:"Calibri",sans-serif;}
@page WordSection1
	{size:8.5in 11.0in;
	margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
	{page:WordSection1;}
-->
</style>

</head>

<body lang=EN-US link="#0563C1" vlink="#954F72" style='word-wrap:break-word'>

<div class=WordSection1>

<p class=MsoNormal align=center style='text-align:center'><b><span
style='font-size:20.0pt;font-family:Chalkboard'>Db2, Lakehouse, and
watsonx.data</span></b></p>

<p class=MsoNormal align=center style='text-align:center'><b><span
style='font-family:"Times New Roman",serif'>27 June 2023</span></b></p>

<p class=MsoNormal align=center style='text-align:center'><b><span
style='font-family:"Times New Roman",serif'>Dean Compher</span></b></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>


<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>Db2 is
participating in two large trends, namely the ability to put various data sets where
they get the best cost/performance tradeoff and the ability to separate compute
from storage.&nbsp; In this article I will discuss these trends and then talk
about how Db2 is supporting them.&nbsp; </span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><b><span style='font-family:"Times New Roman",serif'>Put
Data where you get the best cost/performance tradeoff</span></b></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>The idea
here is that not all data has the same value per byte.&nbsp; For example,
transactions captured in your systems in the last 6 months are probably more
valuable than transaction data that is 2 years old, and that data is more
valuable than 10-year-old transaction data.&nbsp; This is because the
higher-value applications that run your business tend to use newer data rather
than old data.&nbsp; Another example is where data for your core business
systems is more valuable than data for systems that you could do without in a
pinch.&nbsp; Because different sets of data have different values per byte, it
makes sense to put each data set on the least expensive storage that provides
adequate performance for the way you want to use it, especially when you have
huge volumes of lesser-value data.&nbsp; In the past, the lesser-value data
would likely have been deleted or archived off-line, but with newer
technologies, you can afford to keep old data for things like historical
analysis and training machine learning models.&nbsp; </span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>One of the
things enabling this is the advent of extremely low-cost cloud object storage
that also has pretty good performance.&nbsp; This type of storage is provided
by Amazon Web Services, IBM Cloud and others.&nbsp; The most widely used way to
interact with this storage is the S3 API, which has become a de facto
standard.&nbsp; AWS created it, and its S3 object storage service is probably
the best known.&nbsp; However, other cloud providers support this API, and you
can even use it to connect to a number of on-premises storage systems using
open source software such as Ceph or MinIO.&nbsp; In addition to cloud object
storage, there are other ways to get less expensive storage, such as Hadoop
clusters.&nbsp; </span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>There are
various cases driving this trend, but they all revolve around storing enormous
amounts of data as cheaply as possible while getting easy and relatively fast
access to it.&nbsp; Here are a couple of examples:</span></p>

<p class=MsoListParagraph style='text-indent:-.25in'><span style='font-family:
Symbol'>·</span><span style='font-size:7.0pt;font-family:"Times New Roman",serif'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span><span style='font-family:"Times New Roman",serif'>Artificial
Intelligence – In general terms, the more data you can keep for training your
AI models, the more accurate and precise your predictions will be.&nbsp; This is
true for large language models like the one used by ChatGPT as well as for more
mundane machine learning models that do things like score the probability that
your internet user is who they say they are.&nbsp; </span></p>

<p class=MsoListParagraph style='text-indent:-.25in'><span style='font-family:
Symbol'>·</span><span style='font-size:7.0pt;font-family:"Times New Roman",serif'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span><span style='font-family:"Times New Roman",serif'>I might need that –
Many databases are clogged with old data that the owners won’t give permission
to delete (not yours, of course).&nbsp; It would be great to continue to allow
access but keep that old data in inexpensive storage.&nbsp; </span></p>
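
<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>To make
the tradeoff concrete, here is a minimal Python sketch that places each data
set on the cheapest storage tier that still meets its latency need.&nbsp; The
tier names, prices and latencies are made-up illustration values (not vendor
quotes), and the function names are hypothetical.&nbsp; </span></p>

<pre>
```python
# Hypothetical storage tiers: (name, dollars per TB per month, typical latency in ms).
# The prices and latencies are made-up illustration values, not vendor quotes.
TIERS = [
    ("block_ssd", 100.0, 1),
    ("block_hdd", 40.0, 10),
    ("object_store", 20.0, 100),
]


def cheapest_adequate_tier(max_latency_ms):
    """Return the cheapest tier whose latency does not exceed the requirement."""
    adequate = [t for t in TIERS if max_latency_ms >= t[2]]
    if not adequate:
        raise ValueError("no tier meets the latency requirement")
    return min(adequate, key=lambda t: t[1])


def monthly_cost(placements):
    """Sum the monthly cost of (size_tb, max_latency_ms) data sets."""
    total = 0.0
    for size_tb, max_latency_ms in placements:
        tier = cheapest_adequate_tier(max_latency_ms)
        total += size_tb * tier[1]
    return total


# 5 TB of recent transactions needing fast access, 200 TB of old history.
data_sets = [(5, 2), (200, 500)]
tiered = monthly_cost(data_sets)
all_on_ssd = sum(size for size, _ in data_sets) * TIERS[0][1]
print(f"tiered: ${tiered:,.0f}/month, all on SSD: ${all_on_ssd:,.0f}/month")
```
</pre>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>With these
invented numbers most of the bill comes from the large, rarely used history,
which is exactly the data that tolerates slower storage.&nbsp; </span></p>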

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><b><span style='font-family:"Times New Roman",serif'>Separation
of Compute and Storage</span></b></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>This trend
is primarily being driven by cloud computing but is also applicable to
on-premises and hybrid systems.&nbsp; I see three different aspects of this
trend:</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoListParagraph style='text-indent:-.25in'><span style='font-family:
"Times New Roman",serif'>1.</span><span style='font-size:7.0pt;font-family:
"Times New Roman",serif'>&nbsp;&nbsp;&nbsp;&nbsp; </span><span
style='font-family:"Times New Roman",serif'>Cloud computing is like the power
connected to your home – each additional unit you consume adds to your bill at
the end of the month.&nbsp; With databases the amount of storage you consume
tends to be somewhat stable and growing at some rate.&nbsp; However, the amount
of CPU and RAM you need at any given moment can vary wildly depending on the
query load, and you may not even need any at night if no users are
connected.&nbsp; So being able to scale down the compute resources you are
using, or even turn them off, can save money while still providing adequate
performance for peak loads.&nbsp; </span></p>

<p class=MsoListParagraph style='text-indent:-.25in'><span style='font-family:
"Times New Roman",serif'>2.</span><span style='font-size:7.0pt;font-family:
"Times New Roman",serif'>&nbsp;&nbsp;&nbsp;&nbsp; </span><span
style='font-family:"Times New Roman",serif'>Within the same database it is desirable
to store different classes of data on different classes of storage to get the
best cost/performance for a class of data as discussed above.&nbsp; In the
cloud world this trade-off is typically between more expensive and better
performing block storage vs. object storage. Previously, databases tended to
have only one type of storage for the engine, but with the separation, the
engine can use any type of storage.</span></p>

<p class=MsoListParagraph style='text-indent:-.25in'><span style='font-family:
"Times New Roman",serif'>3.</span><span style='font-size:7.0pt;font-family:
"Times New Roman",serif'>&nbsp;&nbsp;&nbsp;&nbsp; </span><span
style='font-family:"Times New Roman",serif'>Store data in some open format such
as Parquet or CSV, and allow any query engine, whether SQL based or not, to
access the data files directly.&nbsp; For example, if you have Parquet files in
Amazon S3 you may want to see them as a table in your Db2 database while also
reading the files from a Spark application or querying them from an open
source </span><a href="https://ahana.io/"><span style='font-family:"Times New Roman",serif'>Presto
engine</span></a><span style='font-family:"Times New Roman",serif'>.&nbsp; In
this case you have one copy of the data, but you can use a variety of different
compute engines.&nbsp; </span></p>
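
<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>The third
aspect, one copy of the data with several compute engines, can be sketched in a
few lines of Python.&nbsp; This is only an illustration under stated
assumptions: CSV stands in for Parquet so that the example needs nothing beyond
the standard library, a temporary local file stands in for an object store
bucket, and SQLite stands in for a SQL engine that scans the file at query
time.&nbsp; The file and function names are hypothetical.&nbsp; </span></p>

<pre>
```python
import csv
import os
import sqlite3
import tempfile

# One copy of the data, written once in an open format (CSV here for simplicity).
path = os.path.join(tempfile.mkdtemp(), "sales.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "amount"])
    writer.writerows([["east", 10], ["west", 25], ["east", 5]])


def total_by_region_plain(file_path):
    """Engine 1: a non-SQL program that reads the file directly."""
    totals = {}
    with open(file_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])
    return totals


def total_by_region_sql(file_path):
    """Engine 2: a SQL engine (SQLite as a stand-in) that scans the same file."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    with open(file_path, newline="") as f:
        rows = [(r["region"], int(r["amount"])) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
    conn.close()
    return result


print(total_by_region_plain(path))
print(total_by_region_sql(path))
```
</pre>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>Both
functions return the same totals because they read the same file; neither one
owns the data.&nbsp; </span></p>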

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><b><span style='font-family:"Times New Roman",serif'>How
Does Db2 Play?</span></b></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>There are
two different ways that Db2 Warehouse and Db2 Warehouse on Cloud will
participate.&nbsp; One way that I will explore in a later article is by
creating tablespaces on object storage. In this article I will explore Db2
Warehouse’s interaction with the IBM watsonx.data Lakehouse.&nbsp; For now, Db2
Warehouse and Db2 Warehouse on Cloud are the only Db2 offerings that will have
this ability.&nbsp; Also, some of the features I mention here will not be
available in the initial release.&nbsp; </span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>At the Think
Conference in May, IBM announced the IBM watsonx.data Lakehouse.&nbsp; Db2
Warehouse and Netezza are getting new features where you can create Db2 tables
on files/tables in the Lakehouse.&nbsp; The Lakehouse can also connect to Db2
and Netezza and query data in the traditionally stored tables in those
databases.&nbsp; This means that you can use either the Db2 Warehouse or
watsonx.data engines to query any tables owned by either engine.&nbsp; The
capabilities I discuss in the rest of this article for Db2 Warehouse also apply
to Netezza.</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>What is
the watsonx.data Lakehouse?&nbsp; That’s a pretty big topic, but I will attempt
to summarize it here.&nbsp; This description is only meant to give you a very
high-level understanding of what it is.&nbsp; There is much more to it that you
can read about on the </span><a href="https://www.ibm.com/products/watsonx-data"><span
style='font-family:"Times New Roman",serif'>watsonx.data</span></a><span
style='font-family:"Times New Roman",serif'> site.&nbsp; It is made
up of a number of open-source tools, combined in such a way that they provide
an easy way to store data in open formats and query it, along with some added
tooling from IBM to make them easy to administer.&nbsp; The first is a
combination of </span><a href="https://iceberg.apache.org/"><span
style='font-family:"Times New Roman",serif'>Iceberg</span></a><span
style='font-family:"Times New Roman",serif'> and </span><a
href="https://hive.apache.org/"><span style='font-family:"Times New Roman",serif'>Hive</span></a><span
style='font-family:"Times New Roman",serif'> to provide a catalog, also known as
a metadata store.&nbsp; Just as the Db2 catalog tables enable queries by
keeping information about tables, columns and data types, underlying storage
for the tables and tablespaces, etc., the watsonx.data catalog does this for
files stored in S3 storage in formats like Parquet.&nbsp; This catalog
(metadata store) is also a service that can be queried and updated by other
applications.&nbsp; It also keeps connection information for other databases
like Postgres and allows you to catalog tables that physically exist in another
database and not in files on S3 storage.&nbsp; </span></p>
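
<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>To picture
what such a catalog keeps, here is a toy Python sketch.&nbsp; It is not the
Iceberg or Hive API; the class names, fields, bucket and connection name are
all hypothetical.&nbsp; It only shows the idea that a catalog entry tells an
engine either which files to read and in what format, or which remote database
table to query.&nbsp; </span></p>

<pre>
```python
from dataclasses import dataclass, field


@dataclass
class FileTable:
    """A table whose rows live in files on object storage."""
    columns: dict          # column name to data type
    file_format: str       # for example "parquet"
    location: str          # bucket/prefix holding the data files


@dataclass
class RemoteTable:
    """A table that physically lives in another database."""
    columns: dict
    connection: str        # name of a stored connection, not the credentials
    remote_name: str       # schema.table in the remote database


@dataclass
class Catalog:
    tables: dict = field(default_factory=dict)

    def register(self, name, entry):
        self.tables[name] = entry

    def describe(self, name):
        """Tell a query engine where a table lives and how to read it."""
        entry = self.tables[name]
        if isinstance(entry, FileTable):
            return f"read {entry.file_format} files under {entry.location}"
        return f"query {entry.remote_name} via connection {entry.connection}"


catalog = Catalog()
catalog.register("sales_history", FileTable(
    {"id": "bigint", "amount": "decimal(10,2)"}, "parquet", "s3://example-bucket/sales_history/"))
catalog.register("customers", RemoteTable(
    {"id": "bigint", "name": "varchar(100)"}, "postgres_prod", "public.customers"))

print(catalog.describe("sales_history"))
print(catalog.describe("customers"))
```
</pre>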

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>The next
major component is the </span><a href="https://ahana.io/"><span
style='font-family:"Times New Roman",serif'>Presto</span></a><span
style='font-family:"Times New Roman",serif'> engine.&nbsp; IBM is using the
Ahana distribution of the open-source Presto engine.&nbsp; You can think of
Presto as performing a function similar to that of any other database engine
like Db2, Oracle, Postgres, etc.&nbsp; It is a process running on a server
somewhere, and applications connect to it using a JDBC driver to submit SQL
queries and get back result sets.&nbsp; Like other databases, it also provides
administrative tools to create tables, test queries and administer the
engine.&nbsp; When you create a table using Presto, it stores the table
definition, including storage information like file name and file format (say
Parquet), in the catalog (metadata service).&nbsp; When Presto executes a
query, it gets the table definition and location information from the catalog
and then reads the file(s) and presents the results to the client.&nbsp; You
can SELECT from and INSERT into tables using a connection to Presto.&nbsp;
Since the data is in files that can’t be changed, rows inserted into an
existing table are written to additional files in the same object store bucket
(a.k.a. directory).&nbsp; This allows you to query data as of various points in
time.&nbsp; When deploying watsonx.data (either as a cloud service or on your
own infrastructure) you can optionally also install a Spark engine.&nbsp; </span></p>
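
<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>The link
between unchangeable files and point-in-time queries can be shown with a toy
Python model.&nbsp; This is a deliberate simplification and not how Presto or
Iceberg is implemented; the class and method names are hypothetical.&nbsp; Each
insert adds a new immutable “file” and records a snapshot, so reading an older
snapshot just means ignoring the files added after it.&nbsp; </span></p>

<pre>
```python
class AppendOnlyTable:
    """Toy table: each insert adds a new immutable 'file' and a new snapshot."""

    def __init__(self):
        self.files = []       # each entry is a tuple of rows, never modified
        self.snapshots = []   # snapshot i records how many files were visible

    def insert(self, rows):
        self.files.append(tuple(rows))
        self.snapshots.append(len(self.files))
        return len(self.snapshots) - 1   # id of the snapshot just created

    def scan(self, snapshot_id=None):
        """Return all rows visible in a snapshot (latest if none is given)."""
        if not self.snapshots:
            return []
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        visible = self.files[: self.snapshots[snapshot_id]]
        return [row for data_file in visible for row in data_file]


table = AppendOnlyTable()
first = table.insert([("a", 1), ("b", 2)])
second = table.insert([("c", 3)])

print(table.scan(first))    # rows as of the first insert
print(table.scan())         # rows as of the latest snapshot
```
</pre>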

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>Since the
files are in open-source formats like Parquet, any application that has the
credentials to connect to the object store can read and process those files
directly.&nbsp; Further, since the catalog is in open-source Iceberg/Hive, any
query engine, including Db2 Warehouse, can query the metadata service to
determine table format and location, and then read the files to provide query
results.&nbsp; This means that you aren’t locked into watsonx.data.&nbsp; If
you ever decide that it isn’t for you, all your data is still available without
changes because it’s in an open format in object storage.&nbsp; Plus, if your
new query tool can talk to Iceberg and Hive, then it can see the files as
tables to be queried by SQL.&nbsp; </span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>Db2
Warehouse is getting features that allow you to synchronize the watsonx.data
catalog with its own catalog.&nbsp; This means that when you configure it to
talk to watsonx.data it can see the tables defined to Iceberg, and those tables
can be made available to the users of the Db2 Warehouse database.&nbsp; When
this is done, applications connected to Db2 Warehouse can see the local tables
as well as the watsonx.data tables.&nbsp; Local tables can be joined with
watsonx.data tables.&nbsp; The applications will not be able to tell the
difference.&nbsp; This provides you with the ability to put your data where you
get the best cost/performance tradeoff for the particular set of data while
accessing everything through your awesome Db2 Warehouse data warehouse.&nbsp; </span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>An example
of how you might use that is archiving old data in your Db2 Warehouse while
still having easy access.&nbsp; You could create a local table in Db2 to hold
rows less than two years old, which are quite valuable and need the fastest
access.&nbsp; Older rows, while still useful for historical analysis, don’t
need the fastest access.&nbsp; For those you could create another table that
looks just like the original but is created on watsonx.data.&nbsp; Now your
applications can query just the local table, just the remote table or combine
the two using a UNION ALL query.&nbsp; Further, applications that need just the
older data can read it without adding any load to Db2 Warehouse because those
applications can query the data through the Presto engine or even read the
files directly.&nbsp; </span></p>
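
<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>Here is a
rough sketch of the UNION ALL pattern.&nbsp; It uses Python’s built-in SQLite
only so that it can run anywhere; it is not Db2 Warehouse or watsonx.data
syntax, and the table, column and view names are hypothetical.&nbsp; The view
is my own addition to show one way of hiding the split from applications.&nbsp; </span></p>

<pre>
```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-ins: in the article's scenario the first table would be local to
# Db2 Warehouse and the second would live in watsonx.data.
conn.execute("CREATE TABLE orders_recent (id INTEGER, order_date TEXT, amount REAL)")
conn.execute("CREATE TABLE orders_archive (id INTEGER, order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders_recent VALUES (?, ?, ?)",
                 [(3, "2023-05-01", 30.0), (4, "2023-06-15", 40.0)])
conn.executemany("INSERT INTO orders_archive VALUES (?, ?, ?)",
                 [(1, "2015-01-10", 10.0), (2, "2019-07-04", 20.0)])

# A view hides the split so applications see one logical table.
conn.execute("""
    CREATE VIEW orders_all AS
    SELECT id, order_date, amount FROM orders_recent
    UNION ALL
    SELECT id, order_date, amount FROM orders_archive
""")

row_count = conn.execute("SELECT COUNT(*) FROM orders_all").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM orders_all").fetchone()[0]
oldest = conn.execute("SELECT MIN(order_date) FROM orders_all").fetchone()[0]
print(row_count, total, oldest)
```
</pre>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>An
application that only needs recent rows can keep querying the recent table
directly and never touch the archive.&nbsp; </span></p>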

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>As I noted
earlier, you can create tables in watsonx.data that are actually tables in
other databases.&nbsp; You just define the credentials for the Db2 database to
the watsonx.data metadata service.&nbsp; These table definitions are placed in
the watsonx.data catalog, but instead of pointing to files in object storage,
they point to tables in Db2.&nbsp; This is very similar to creating federated
tables in Db2 or using Fluid Query in Netezza to point to tables outside of
those databases.&nbsp; Now you can query tables from any query engine, whether
they are stored in files in object storage or in Db2, Netezza or other
databases.&nbsp; </span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>So with
these capabilities, Db2 Warehouse allows you both to store data where it is
most efficient and to separate compute from storage, either by picking the
engine you prefer for accessing the data or by keeping the data active while
scaling the Db2 Warehouse engine up or down. </span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>This
diagram by Herb Pereyra shows how the components fit together.&nbsp; </span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal style='margin-left:.5in'><span style='font-family:"Times New Roman",serif'><img
border=0 width=640 height=359 src="Db2Lakehouse.fld/image001.png"
alt="Diagram by Herb Pereyra showing how the components fit together"></span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman",serif'>***</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>As of the
date of this article, these features are not yet available, but they will be
released starting in July and throughout the year.&nbsp; Please feel free to
share any thoughts you have on this on my </span><a
href="http://www.facebook.com/db2Dean"><span style='font-family:"Times New Roman",serif'>Facebook
Page</span></a><span style='font-family:"Times New Roman",serif'>. &nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal align=center style='text-align:center;text-autospace:none'><a
href="https://www.db2dean.com/"><b><span style='font-family:"Times New Roman",serif'>HOME</span></b></a><b><span
style='font-family:"Times New Roman",serif'> | </span></b><a
href="http://www.db2dean.com/Search.html"><b><span style='font-family:"Times New Roman",serif'>Search</span></b></a></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-family:"Times New Roman",serif'>&nbsp;</span></p>

</div>

</body>

</html>