-
Notifications
You must be signed in to change notification settings - Fork 0
/
sec-cover.tex
587 lines (523 loc) · 22.7 KB
/
sec-cover.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
%\vspace{2ex}
\vspace{-0.3ex}
%%%%%%%% Section 5 %%%%%%%%
\section{Computing Optimal Schema Cover}
\label{sec-cover}
\vspace{-0.4ex}
\revise{In this section, we} develop an algorithm to construct a universe set
$\kb{\A}$ of \bss for $\Q$, as part of the input for algorithm
\opts above. %in Section~\ref{sec-select}.
\stitle{Algorithm}. The algorithm, denoted by \usc, takes as input a
query load $\Q$
over a database schema $\R$, and returns a set $\kb{\A}$ of \bss
for $\Q$.
While there exist up to exponentially many distinct \bss for
$\Q$, \usc computes $\kb{\A}$ with the following properties:
\mbi
\item[(a)] for each K-FK join \SPC query $Q\in \Q$, any rank aggregate function $f$,
and any \bds $\kb{\R}_{Q}$ that is in the normal form for $Q$,
if $\kb{\R}_{Q}$ is optimal under $f$, then
$\kb{\R}_{Q}\subseteq \kb{\A}$; and
\item[(b)] the time for constructing $\kb{\A}$ is polynomial in
$|\Q|$ and $|\R|$.
%\footnote{We assume that the number joins in each $Q\in \Q$ is bounded by a constant.}
\mei
\vspace{-0.4ex}
By (b), the size of $\kb{\A}$ is also polynomial in
$|\Q|$ and $|\R|$.
\vspace{0.36ex}
To do this, \revise{\usc builds} %constructs
$\kb{\A}$ for $\Q$ by
constructing a universe set $\kb{\A}_{Q}$ of \bss for each
$Q$ in $\Q$,
such that under any aggregate rank function $f$, the optimal
normalized \bds $\kb{R}_{Q}$ for $Q$ is %must be
subsumed by $\kb{\A}_{Q}$.
It returns the union of $\kb{\A}_{Q}$ for all $Q$ in $\Q$ as
the universe set $\kb{\A}$ of \bss for $Q$.
%
When $Q$ is \revise{in \SQL}, we decompose it into its max-\SPC
sub-queries, and $\kw{\A}$ is the union of the universe sets of
all the sub-queries. \revise{As remarked earlier},
this does not miss any \revise{useful \bss} since values only propagate
within \SPC queries; hence we
only need to capture attribute correlations within
max-\SPC sub-queries of \SQL.
% \warn{This does not miss any \bss useful for
%$\Q$ since values only propagate within \SPC queries, which
%means we only need to capture attribute correlations within
% max-\SPC sub-queries of \SQL.}
Hence, below we assume that $\Q$ consists of \SPC queries.
\eat{This does not miss any \bss useful for
$\Q$ since values only propagate within \SPC queries, which
means we only need to capture attribute correlations within
max-\SPC sub-queries of \SQL.}
%\warn{Justify the consideration of \SPC queries only}
More specifically, \usc generates $\kb{\A}_{Q}$ for each $Q$ as follows.
\eetitle{Step (1)}. It first generates two \bss for each $Q$:
$\kb{R}_{Q}^{1} = \bschema{P_{Q}}{Y_{Q}}$ and $\kb{R}_{Q}^{2} =
\bschema{P_{Q}Y_{Q}}{\emptyset}$. Here $\{\kb{R}_{Q}^{1}\}$ is
an optimal \bds for $Q$ when the scan-free evaluability (measure
(1))
and the size (measure (3)) are considered; similarly,
$\{\kb{R}_{Q}^{2}\}$ is optimal when degrees (measure (2)) is
concerned.
One can verify that $\kb{R}_{Q}^{1}$ and $\kb{R}_{Q}^{2}$ taken
together can subsume optimal \bdss for $Q$ under any ranking
function $f$ with $c_{4} = 0$.
\looseness = -1
\vspace{0.36ex}
Here we include $\kb{R}_{Q}^{2}$ %only
to guarantee property (a)
of $\kb{\A}$, %claimed,
\ie $\kb{\A}_{Q}$ %can
covers all optimal
\bdss for $Q$, even when only
measure (2) is present.
%concerned. %alone.
% \warn{The following is removed, since it creates confusion as
% witnessed by SIGMOD reviewers.
% ``In
% practice, we omit $\kb{R}_{Q}^{2}$ since measure (2) is typically
% used along with other measures, \eg measure (1), to quantify
% query evaluation cost over $\kb{\R}$.''}
\vspace{0.36ex}
\eetitle{Step (2)}. It then decomposes $\kb{R}_{Q}^{1}$ over relations of
$Q$, yielding \bss that do not span across multiple relations. The
process starts with the relation $R$ that contains the maximum
number of parameters of $Q$ (\ie $P_{Q}$). It generates
$\kb{R}_{Q} = \bschema{P_{Q}^{R}}{Y_{Q}^{R}}$ from $R$, where
$P_{Q}^{R}$ consists of attributes in $P_{Q}$ that are also in
$R$; similarly for $Y_{Q}^{R}$; it marks $R$ as {\em processed}.
It then identifies the relation $R'$ in $Q$ such that
it contains the maximum
number of attributes that are in \bss $\kb{R}_{Q}$ with $R$
marked processed, and handles $R'$ along the same lines. The
process continues until all relations in $Q$ are marked
processed. This derives one \bs $\kb{R}_{Q}$ for each $R$ of $Q$
from $\kb{R}_{Q}^{1}$ generated in step (1). \looseness = -1
\vspace{0.36ex}
\eetitle{Step (3)}. This final step is to generate cross-relation \bss
for $Q$. For each set $\Sigma$ consisting of two relations in $Q$
that can be joined together with K-FK joins, \usc replaces both
relations with the universal relation $R_{\Sigma}$
(see~\cite{AbHuVi1995}) of $\Sigma$, and re-does step (2) over
$R_{\Sigma}$ and those relations that are not in $\Sigma$. It
generates again one \bs per each relation of $Q$ that is not yet
in $\Sigma$ and one \bs $\kb{R}_{Q}^{\Sigma}$ for $R_{\Sigma}$. Note
that $\kb{R}_{Q}^{\Sigma}$ may span across two relations in $\Sigma$.
The step is done %finishes
when all sets $\Sigma$ for $Q$ are processed
and the new \bss are all added to $\kb{\A}_{Q}$. \looseness = -1
\begin{myfloat}[t]
\vspace{1.2ex}
\begin{minipage}{0.50\textwidth}
\removelatexerror
%%%%%%%%%%%%%%%%%%%%% Algorithm: USC
{\scriptsize
\setlength{\floatsep}{0cm} % set blank space above the figure
\setlength{\textfloatsep}{-2cm}% set blank space below the figure
\IncMargin{1em}
\vspace{-0.7ex}
\begin{algorithm}[H]
\setstretch{1.1}
%\SetAlgoNoLine
\Indentp{-2ex}
%\DontPrintSemicolon
%\SetAlgoLined
\KwIn{A database schema $\R$ and a workload $\Q$ over $\R$.}
\KwOut{A set $\kb{\A}$ of \bss for $\Q$.}
\Indentp{1em}
\BlankLine
\ForEach{$Q\in \Q$\label{usc-l1}}{$\kb{\A}_{Q} \la
\{\kb{R}_{Q}^{1} = \bschema{P_{Q}}{Y_{Q}}), \kb{R}^{2}_{Q} =
\bschema{P_{Q}Y_{Q}}{\emptyset})\}$\;\label{usc-l2}
\ForEach{$\Sigma\subseteq \Sigma_{Q}$ consisting no more than
2 relations\label{usc-l3}}{
\tcc{$\Sigma_{Q}$: the set of relation atoms appeared in $Q$}
% \tcc{relations in $\Sigma$ can be K-FK joined into a universal relation $R_{\Sigma}$}
$\kb{\A}' \la \decompose(\Sigma_{Q}\setminus \Sigma \cup
\{R_{\Sigma}\}, Q)$\label{usc-l4}\;
$\kb{\A}_{Q} \la \kb{\A}_{Q}\cup \kb{\A}'$\label{usc-l5}\;
}
}
$\kb{\A} \la \bigcup_{Q\in \Q} \kb{\A}_{Q}$\label{usc-l6}\;
\Return $\kb{\A}$\label{usc-l7}\;
\nonl\SetKwFunction{proc}{\pdecompose}
\nonl\SetKwProg{myproc}{Procedure}{}{}
\vspace{0.3cm}
\nonl\SetKwFunction{pdecompose}{\decompose}
% \setcounter{AlgoLine}{0}
\nonl \myproc{\pdecompose{$\R_{Q}, Q$}}{
$\kb{\A} \la \emptyset$\;\label{usc-l8}
\While{$\R_{Q}\neq \emptyset$}{
\tcc{$\atset(R)$ (resp. $\atset(\kb{\A})$): the set of
attributes in $R$ (resp. $\kb{\A}$)}
$R \la \argmax_{R\in \R_{Q}}|\atset(R)\cap (P_{Q}\cup
\atset(\kb{\A}))|$\;
$X_{R}\la \atset(R)\cap (P_{Q}\cup \atset(\kb{\A}))$\;
$Y_{R}\la \atset(R)\cap (Y_{Q}\cup \atset(\kb{\A}))$\;
add \bs $\bschema{X_{R}}{Y_{R}}$ to $\kb{\A}$\;
remove $R$ from $\R_{Q}$\;
}
\Return $\kb{\A}$;\label{usc-l15}
}
\caption{Algorithm \usc\label{alg-usc}}
\end{algorithm}
\DecMargin{1em}
}
%%%%%%%%%%%%%%%%%%%%%
\end{minipage}
\vspace{-2.4ex}
\end{myfloat}
\vspace{0.36ex}
\stitle{Implementing USC}.
Algorithm \usc is outlined as Algorithm~\ref{alg-usc}.
Step (1) is carried out by lines~\ref{usc-l1}-\ref{usc-l2}.
followed by
steps (2) and (3) (lines~\ref{usc-l3}-\ref{usc-l5}).
In particular, when $\Sigma$ \revise{consists of}
%contains
only a single relation of $Q$, a single execution
of lines~\ref{usc-l4}-\ref{usc-l5} conducts step (2);
the other iterations of lines~\ref{usc-l3}-\ref{usc-l5} are for
step (3). The key part of each iteration is the decomposition of
$\kb{R}_{Q}^{1}$ computed in step (1) (line~\ref{usc-l2}) over
relations of $Q$, yielding ``projected'' \bss of
$\kb{R}_{Q}^{1}$, in various sizes. This is implemented via
procedure \decompose, also shown
in Algorithm~\ref{alg-usc} (lines~\ref{usc-l8}-\ref{usc-l15}), which
largely follows the description of steps (2) and (3) of above.
It takes the set of all \bss generated in steps (1)-(3) for
each $Q$ as $\kb{\A}_{Q}$ (lines~\ref{usc-l1}-\ref{usc-l5}) and
returns the union of $\kb{\A}_{Q}$ for all queries in $\Q$ as the
final universe set $\kb{\A}$ (lines~\ref{usc-l6}-\ref{usc-l7}).
%\warn{Check the highlighted part}
\begin{example}\label{exa-usc}
Recall query $Q_{1}$ from Example~\ref{exa-baav}.
Step (1) of algorithm \usc generates $\kb{R}_{Q_{1}}^{1} \ak =\ak
\bschema{(\at{date}, \ak \at{payee})}{(\at{cid}, \ak \at{bid},
\ak \at{city})}$ and
$\kb{R}_{Q_{1}}^{2} = \bschema{(\at{date}, \ak \at{payee},
\at{cid}, \ak \at{bid}, \ak \at{city})}{\emptyset}$.
Step (2) %of \usc
populates $\kb{R}_{Q_{1}}^{1}$ into
$\kb{\at{B}}_{1}$, $\kb{\at{T}}_{1}$ and $\kb{\at{C}}_{1}$ of
$\kb{\R}_{1}$ in Example~\ref{exa-mapping}.
Step (3) further generates $\kb{\at{TC}}_{2}\bschema{(\at{data},\ak
\at{payee})}{(\at{cid},\ak \at{bid})}$ and
$\kb{\at{CB}}\bschema{\at{cid}}{(\at{bid},\ak \at{city})}$.
Finally, \opts returns a set $\kb{\A}$ consisting of
$\kb{\at{B}}_{1}$, $\kb{\at{C}}_{1}$, $\kb{\at{T}}_{1}$,
$\kb{\at{TC}}_{2}$ and $\kb{\at{CB}}$.
Note that $\kb{\A}$ subsumes the optimal \bds $\kb{\R}_{1}$ and
$\kb{\R}_{2}$ given in Example~\ref{exa-ILP-I}.
\end{example}
\vspace{-0.7ex}
\stitle{Analysis}.
Step (1) of \usc warrants property (a) of $\kb{\A}$ claimed at
the beginning of the section. Indeed, for each $Q$, step (1)
%of \usc
guarantees that $\kb{\A}_{Q}$ subsumes optimal \bdss under
any aggregate ranking function $f$.
%with \warn{$c_{4} = 0$}.
The key observation
here is that $\kb{R}_{Q}^{1}$ constructed in step (1) already forms an
optimal \bds when scan-free evaluability and size are considered
(recall the three criteria used by $f$).
After expanding $\kb{\A}_{Q}$ with \bss over individual relations
in $Q$ generated by step (2), by the definition of the degree
criterion in $f$, \usc further guarantees that all optimal \bdss for $Q$ are subsumed by
$\kb{\A}_{Q}$, even all measures are taken into account in $f$.
The analysis is based on the assumption that a \bds $\kb{\R}$ is
in the normal form for $Q$ if and only if for each \qcs of $Q$, there
exists \bs $\kb{R}$ in $\kb{\R}$ that supports $W$ (recall
Section~\ref{sec-select}); this holds when $\Q$
consists of K-FK join \SPC queries.
\looseness = -1
%\warn{There is no $c_{4} = 0$ in the revised $f$ or in a given $f$}
%\warn{For an arbitrary $f$, what are measures (1) and (3)?}
\vspace{0.36ex}
To see that property (b) also holds,
% for $\kb{\A}$,
observe that \usc computes $\kb{\A}$ with
no more than $2\cdd{\Q} + \frac{(\kappa+1)(\kappa+2)}{2}$
many \bss in $O(\kappa^{2}\cd{\Q})$-time, where $\kappa$ is the
maximum number of K-FK joins that a single query in $\Q$
contains. Indeed, step (1) of \usc generates $2\cdd{\Q}$ \bss in
$O(\cd{\Q})$-time; steps (2) and (3) generate no more than
$\frac{(\kappa+1)(\kappa+2)}{2}$ \bss in
$O(\frac{\cd{\Q}(\kappa+1)(\kappa+2)}{2})$-time, which is bounded
by $O(\kappa^{2}\cd{\Q})$.
\eat{%EAT
\begin{myfloat}[t]
\vspace{1.2ex}
\begin{minipage}{0.50\textwidth}
\removelatexerror
%%%%%%%%%%%%%%%%%%%%% Algorithm: USC
{\scriptsize
\setlength{\floatsep}{0cm} % set blank space above the figure
\setlength{\textfloatsep}{-2cm}% set blank space below the figure
\IncMargin{1em}
\vspace{-0.7ex}
\begin{algorithm}[H]
\setstretch{1.1}
%\SetAlgoNoLine
\Indentp{-2ex}
%\DontPrintSemicolon
%\SetAlgoLined
\KwIn{A database schema $\R$ and a workload $\Q$ over $\R$.}
\KwOut{A set $\kb{\A}$ of \bss for $\Q$.}
\Indentp{1em}
\BlankLine
\ForEach{$Q\in \Q$\label{usc-l1}}{$\kb{\A}_{Q} \la
\{\kb{R}_{Q}^{1} = \bschema{P_{Q}}{Y_{Q}}), \kb{R}^{2}_{Q} =
\bschema{P_{Q}Y_{Q}}{\emptyset})\}$\;\label{usc-l2}
\ForEach{$\Sigma\subseteq \Sigma_{Q}$ consisting no more than
2 relations\label{usc-l3}}{
\tcc{$\Sigma_{Q}$: the set of relation atoms appeared in $Q$}
% \tcc{relations in $\Sigma$ can be K-FK joined into a universal relation $R_{\Sigma}$}
$\kb{\A}' \la \decompose(\Sigma_{Q}\setminus \Sigma \cup
\{R_{\Sigma}\}, Q)$\label{usc-l4}\;
$\kb{\A}_{Q} \la \kb{\A}_{Q}\cup \kb{\A}'$\label{usc-l5}\;
}
}
$\kb{\A} \la \bigcup_{Q\in \Q} \kb{\A}_{Q}$\label{usc-l6}\;
\Return $\kb{\A}$\label{usc-l7}\;
\nonl\SetKwFunction{proc}{\pdecompose}
\nonl\SetKwProg{myproc}{Procedure}{}{}
\vspace{0.3cm}
\nonl\SetKwFunction{pdecompose}{\decompose}
% \setcounter{AlgoLine}{0}
\nonl \myproc{\pdecompose{$\R_{Q}, Q$}}{
$\kb{\A} \la \emptyset$\;\label{usc-l8}
\While{$\R_{Q}\neq \emptyset$}{
\tcc{$\atset(R)$ (resp. $\atset(\kb{\A})$): the set of
attributes in $R$ (resp. $\kb{\A}$)}
$R \la \argmax_{R\in \R_{Q}}|\atset(R)\cap (P_{Q}\cup
\atset(\kb{\A}))|$\;
$X_{R}\la \atset(R)\cap (P_{Q}\cup \atset(\kb{\A}))$\;
$Y_{R}\la \atset(R)\cap (Y_{Q}\cup \atset(\kb{\A}))$\;
add \bs $\bschema{X_{R}}{Y_{R}}$ to $\kb{\A}$\;
remove $R$ from $\R_{Q}$\;
}
\Return $\kb{\A}$;\label{usc-l15}
}
\caption{Algorithm \usc\label{alg-usc}}
\end{algorithm}
\DecMargin{1em}
%%%%%%%%%%%%%%%%%%%%%
\end{minipage}
\vspace{-2.4ex}
\end{myfloat}
}}%EAT
\eetitle{Remark}.
Note that step (3) of \usc only generates \bss over a single
relation or across two relations connected via K-FK join. While
this already guarantees property (a) of \usc, we can further
improve the coverage of the universe set $\kb{\A}$ returned by
\usc by covering all \bss across more than two relations for each
$Q$. To do this, we only need to lift the size restriction on set
$\Sigma\subseteq \Sigma_{Q}$ in line~\ref{usc-l3} of \usc, so
that it iterates over all subsets of $\Sigma_{Q}$. Note that
this only increases
the complexity of \usc to
$O(2^{\kappa}\cd{\Q})$-time; where $\kappa$ %$2^{\kappa}$
is typically a
small constant, \eg on average $2$ %$2^{2}$
for \warn{\tpch and \imdb} queries.
%\warn{revise the benchmarks}
\eat{%EAT
\eat{%EAT
\begin{myfloat}[t]
\vspace{1.2ex}
\begin{minipage}{0.50\textwidth}
\removelatexerror
%%%%%%%%%%%%%%%%%%%%% Algorithm: USC
{\scriptsize
\setlength{\floatsep}{0cm} % set blank space above the figure
\setlength{\textfloatsep}{-2cm}% set blank space below the figure
\IncMargin{1em}
\vspace{-0.7ex}
\begin{algorithm}[H]
\setstretch{1.1}
%\SetAlgoNoLine
\Indentp{-2ex}
%\DontPrintSemicolon
%\SetAlgoLined
\KwIn{Database schema $\R$ and workload $\Q$ over $\R$.}
\KwOut{A set $\kb{\A}$ of \bss for $\Q$.}
\Indentp{1em}
\BlankLine
\ForEach{$Q\in \Q$\label{usc-l1}}{$\kb{\A}_{Q} \la
\{\kb{R}_{Q}^{1} = \bschema{P_{Q}}{Y_{Q}}), \kb{R}^{2}_{Q} =
\bschema{P_{Q}Y_{Q}}{\emptyset})\}$\;\label{usc-l2}
\ForEach{$\Sigma\subseteq \Sigma_{Q}$ consisting no more than
2 relations\label{usc-l3}}{
\tcc{$\Sigma_{Q}$: the set of relation atoms appeared in $Q$}
% \tcc{relations in $\Sigma$ can be K-FK joined into a universal relation $R_{\Sigma}$}
$\kb{\A}' \la \decompose(\Sigma_{Q}\setminus \Sigma \cup
\{R_{\Sigma}\}, Q)$\label{usc-l4}\;
$\kb{\A}_{Q} \la \kb{\A}_{Q}\cup \kb{\A}'$\label{usc-l5}\;
}
}
$\kb{\A} \la \bigcup_{Q\in \Q} \kb{\A}_{Q}$\label{usc-l6}\;
\Return $\kb{\A}$\label{usc-l7}\;
\nonl\SetKwFunction{proc}{\pdecompose}
\nonl\SetKwProg{myproc}{Procedure}{}{}
\vspace{0.3cm}
\nonl\SetKwFunction{pdecompose}{\decompose}
% \setcounter{AlgoLine}{0}
\nonl \myproc{\pdecompose{$\R_{Q}$, $Q$}}{
$\kb{\A} \la \emptyset$\;\label{usc-l8}
\While{$\R_{Q}\neq \emptyset$}{
\tcc{$\atset(R)$ (resp. $\atset(\kb{\A})$): the set of
attributes in $R$ (resp. $\kb{\A}$)}
$R \la \argmax_{R\in \R_{Q}}|\atset(R)\cap (P_{Q}\cup
\atset(\kb{\A}))|$\;
$X_{R}\la \atset(R)\cap (P_{Q}\cup \atset(\kb{\A}))$\;
$Y_{R}\la \atset(R)\cap (Y_{Q}\cup \cup \atset(\kb{\A}))$\;
add \bs $\bschema{X_{R}}{Y_{R}}$ to $\kb{\A}$\;
remove $R$ from $\R_{Q}$\;
}
\Return $\kb{\A}$;\label{usc-l15}
}
\caption{Algorithm \usc\label{alg-usc}}
\end{algorithm}
\DecMargin{1em}
}
%%%%%%%%%%%%%%%%%%%%%
\end{minipage}
\vspace{-2.4ex}
\end{myfloat}
}%EAT
\section{Computing Optimal Schema Cover}
\label{sec-cover}
We next develop an algorithm to construct a universe set
$\kb{\A}$ of \bss for $\Q$, as part of the input for \opts in
Section~\ref{sec-select}.
\stitle{Algorithm}. The algorithm, denoted by \usc, takes as input a workload $\Q$
over database schema $\R$, and returns a set $\kb{\A}$ of \bss
for $\Q$.
While there exist up to exponentially many distinct \bss for
$\Q$, \usc computes $\kb{\A}$ with the following properties:
\bi
\item[(a)] for each K-FK join \SPC query $Q\in \Q$, any rank aggregate function $f$,
and any \bds $\kb{\R}_{Q}$ that is in the normal form for $Q$,
if $\kb{\R}_{Q}$ is optimal under $f$, then
$\kb{\R}_{Q}\subseteq \kb{\A}$; and
\item[(b)] the time for constructing $\kb{\A}$ is polynomial in
$|\Q|$ and $|\R|$.
%\footnote{We assume that the number joins in each $Q\in \Q$ is bounded by a constant.}
\ei
By (b), the size of $\kb{\A}$ is also polynomial in
$|\Q|$ and $|\R|$.
To do this, \usc constructs $\kb{\A}$ for $\Q$ by
constructing a universe set $\kb{\A}_{Q}$ of \bss for each $Q$ in $\Q$
such that, under any aggregate rank function $f$, the optimal
normalized \bds $\kb{R}_{Q}$ for $Q$ must be subsumed by $\kb{\A}_{Q}$.
It returns the union of $\kb{\A}_{Q}$ for all $Q$ in $\Q$ as
the universe set $\kb{\A}$ of \bss for $Q$. When $\Q$ contains
generic \SQL queries, it decomposes them into their \SPC
sub-queries. Hence, below we assume that $\Q$ consists of \SPC queries.
More specifically, \usc generates $\kb{\A}_{Q}$ for each $Q$ as follows.
\etitle{Step (1)}. It first generates two \bss for each $Q$:
$\kb{R}_{Q}^{1} = \bschema{P_{Q}}{Y_{Q}}$ and $\kb{R}_{Q}^{2} =
\bschema{P_{Q}Y_{Q}}{\emptyset}$. Indeed, $\{\kb{R}_{Q}^{1}\}$ is
an optimal \bds for $Q$ when scan-free evaluability (measure (1))
and size (measure (3)) are considered; similarly,
$\{\kb{R}_{Q}^{2}\}$ is optimal when data access (measure (2)) is
concerned.
One can verify that, $\kb{R}_{Q}^{1}$ and $\kb{R}_{Q}^{2}$ taken
together can subsume optimal \bdss for $Q$ under any ranking
function $f$ with $c_{4} = 0$.
Here we include $\kb{R}_{Q}^{2}$ only to guarantee property (a)
claimed of $\kb{\A}$, \ie $\kb{\A}_{Q}$ can cover all optimal
\bdss for $Q$, even when only measure (2) is considered alone. In
practice, we omit $\kb{R}_{Q}^{2}$ as measure (2) is typically
used along with other measures, \eg measure (1), to quantify
query evaluation cost over $\kb{\R}$.
\etitle{Step (2)}. It then decomposes $\kb{R}_{Q}^{1}$ over relations of
$Q$, yielding \bss that do not span multiple relations. The
process starts with the relation $R$ containing the maximum
number of parameters of $Q$ (\ie $P_{Q}$). It generates
$\kb{R}_{Q} = \bschema{P_{Q}^{R}}{Y_{Q}^{R}}$ from $R$, where
$P_{Q}^{R}$ consists of attributes in $P_{Q}$ that are also in
$R$; similarly for $Y_{Q}^{R}$; it marks $R$ as processed.
It then identifies the relation $R'$ in $Q$ that contains the maximum
number of attributes that are in \bss $\kb{R}_{Q}$ with $R$
marked processed, and process $R'$ along the same lines. The
process continues until all relations in $Q$ are marked
processed. This derives one \bs $\kb{R}_{Q}$ for each $R$ of $Q$
from $\kb{R}_{Q}^{1}$ generated in step (1).
\etitle{Step (3)}. Step (3) is to generate cross-relation \bss
for $Q$. For each set $\Sigma$ consisting of two relations in $Q$
that can be joined together with K-FK joins, \usc replaces both
relations with the universal relation $R_{\Sigma}$
(cf.~\cite{AbHuVi1995}) of $\Sigma$, and re-does step (2) over
$R_{\Sigma}$ and those relations that are not in $\Sigma$. It
generates again one \bs per each relation of $Q$ that is not in
$\Sigma$ and one \bs $\kb{R}_{Q}^{\Sigma}$ for $R_{\Sigma}$. Note
that $\kb{R}_{Q}^{\Sigma}$ may cross two relations in $\Sigma$.
The step finishes when all sets $\Sigma$ for $Q$ are processed
and the new \bss are all added to $\kb{\A}_{Q}$.
\stitle{Implementing \usc}.
We next describe an implementation of algorithm \usc, shown as
Algorithm~\ref{alg-usc}. More specifically, it implements step
(1) via lines~\ref{usc-l1}-\ref{usc-l2}.
It then implements steps (2) and (3) via
lines~\ref{usc-l3}-\ref{usc-l5}. In particular, when $\Sigma$
contains only a single relation of $Q$, this implements step (2);
the other iterations of lines~\ref{usc-l3}-\ref{usc-l5} implement
step (3). The key part of each iteration is the decomposition of
$\kb{R}_{Q}^{1}$ computed in step (1) (line~\ref{usc-l2}) over
relations of $Q$, yielding ``projected'' \bss of
$\kb{R}_{Q}^{1}$, in various sizes. This is implemented via
procedure \decompose also in Algorithm~\ref{alg-usc} (lines~\ref{usc-l8}-\ref{usc-l15}), which
largely follows the description of steps (2) and (3) of above.
It takes the set of all \bss generated in steps (1)-(3) for
each $Q$ as $\kb{\A}_{Q}$ (lines~\ref{usc-l1}-\ref{usc-l5}) and
returns the union of $\kb{\A}_{Q}$ for all queries in $\Q$ as the
final universe set $\kb{\A}$ (lines~\ref{usc-l6}-\ref{usc-l7}).
\begin{example}\label{exa-usc}
\end{example}
\stitle{Analysis}.
Step (1) of \usc warrants property (a) of $\kb{\A}$ claimed at
the beginning of the section. Indeed, for each $Q$, step (1) of
\usc guarantees that $\kb{\A}_{Q}$ subsumes optimal \bdss under
any ranking function $f$ with $c_{4} = 0$. The key observation
here is that $\kb{R}_{Q}^{1}$ constructed in step (1) already forms an
optimal \bds under both measures (1) and (3).
After expanding $\kb{\A}_{Q}$ with \bss over individual relations
in $Q$ generated by step (2), by the definition of measure (4),
\usc further guarantees that all optimal \bdss for $Q$ are subsumed by
$\kb{\A}_{Q}$, even all measures are taken into account in $f$.
The analysis is based on the assumption that a \bds $\kb{\R}$ is
in the normal form for $Q$ if and only if for each \qcs of $Q$, there
is \bs $\kb{R}$ in $\kb{\R}$ that supports $W$ (recall
Section~\ref{sec-select}); this holds when $\Q$
consists of K-FK join \SPC queries.
\looseness = -1
To see that property (b) also holds
for $\kb{\A}$. Observe that \usc computes $\kb{\A}$ consisting of
no more than $2\cdd{\Q} + \frac{(\kappa+1)(\kappa+2)}{2}$
many \bss in $O(\kappa^{2}\cd{\Q})$-time, where $\kappa$ is the
maximum number of K-FK joins that a single query in $\Q$
contains. Indeed, step (1) of \usc generates $2\cdd{\Q}$ \bss in
$O(\cd{\Q})$-time; steps (2) and (3) generate no more than
$\frac{(\kappa+1)(\kappa+2)}{2}$ many \bss in
$O(\frac{\cd{\Q}(\kappa+1)(\kappa+2)}{2})$-time, which is bounded
by $O(\kappa^{2}\cd{\Q})$.
\stitle{Remark}.
Note that step (3) of \usc only generates \bss over a single
relation or cross two relations connecting via K-FK join. While
this already guarantees property (a) of \usc, we can further
improve the coverage of the universe set $\kb{\A}$ returned by
\usc by covering all \bss cross more than two relations for each
$Q$. To do this, we only need to lift the size restriction on set
$\Sigma\subseteq \Sigma_{Q}$ in line~\ref{usc-l3} of \usc, so
that it iterates over all subsets of $\Sigma_{Q}$. Note that,
this only improves the complexity of \usc to
$O(2^{\kappa}\cd{\Q})$-time; where $2^{\kappa}$ is typically a
small constant, \eg as small as $2^{3}$ for \tpch and \tpcds
query workload.}%EAT