Problem 5: Query Evaluation Consider the join between relations R and S, where t
ID: 3712394 • Letter: P
Question
Problem 5: Query Evaluation
Consider the join between relations R and S, where the join condition is R.a = S.b.
We are given the following information about the two relations. The cost metric is the number of page I/Os, and the cost of writing out the result should be uniformly ignored.
Relation R contains 10,000 tuples and has 10 tuples per page.
Relation S contains 2,000 tuples and also has 10 tuples per page.
Attribute b of relation S is the primary key for S.
Both relations are stored as simple heap files.
Neither relation has any indexes built on it. 52 buffer pages are available.
Questions:1. What is the cost of joining R and S using a simple nested loops join?
2. What is the cost of joining R and S using a block nested loops join?
3.What is the cost of joining R and S using a sort-merge join?
4.What is the cost of joining R and S using a hash join?
5. How many tuples will the join of R and S produce, at most, and how many pages would be required to store the result of the join back on disk?
6. If secondary B+ indexes existed on R.a and S.b, would either provide a cheaper alternative for performing the join (using an index nested loops join) than a block nested loops join?
7. Explain. If primary B+ indexes existed on R.a and S.b, would either provide a cheaper alternative for performing the join (using the index nested loops algorithm) than a block nested loops join? Explain.
Explanation / Answer
In nested-loop join page-by-page scanning of the outer relation is done, and, for each outer page, do a page-by-page scan of the inner relation.The cost for joining R and S is minimized when the smaller relation S is used as the outer relation.
Cost = bS + bS * bR = 200 + 200 * 1000 = 200,200.
2. In block nested-loop join read the outer relation in blocks (groups of pages that will fit into whatever buffer pages are available), and, for each block, do a page-by-page scan of the inner relation.The outer relation is still scanned once, but the inner relation is scanned only once for each outer block. If we have B buffers, then the number of blocks is ceil(bouter / (B-2)). As above, the total cost will be minimized when the smaller relation is used as the outer relation.
Cost = bS + ceil(bS/B-2) * bR = 200 + ceil(200/50) * 1000 = 200 + 4 * 1000 = 4,200.
3. The idea with sort-merge join is to sort both relations and then perform a single scan across each, merging into a single joined relation.Each relation has to first be sorted (assuming that it's not already sorted on the join attributes). This requires an initial pass, where chunks of the file of size B are read into memory and sorted. After this, a number of passes are done, where each pass does a (B-1)-way merge of the file. Each pass reads and writes the file, which requires 2b block reads/writes. The total number of passes is 1+ceil(logB-1ceil(b/B))).Once the relations are sorted, the merge phase requires one pass over each relation, assuming that there are enough buffers to hold the longest run in either relation
Cost = 2bS(1+ceil(logB-1ceil(bS/B))) + 2bR(1+ceil(logB-1ceil(bR/B))) + bS + bR
= 2×200×(1+ceil(log51ceil(200/52))) + 2×1000×(1+ceil(log51ceil(1000/52))) + 200 + 1000
= 2.200.(1+ceil(log514)) + 2.1000.(1+ceil(log5120)) + 200 + 1000
= 2.200.2 + 2.1000.2 + 200 + 1000 = 800 + 4000 + 200 + 1000 = 6,000.
4. The basic idea with hash join is that we partition each relation and then perform the join by "matching" elements from the partitions. We need to assume that we have at least sqrt(bR) buffers or sqrt(bS) buffers and that the hash function gives a uniform distribution. Given that both sqrt(200) and sqrt(1000) are less than the number of buffers, the first condition is definitely satisfied.
Cost = 3.(bS + bR) = 3.(200 + 1000) = 3,600.
5. Any tuple in R can match at most one tuple in S because S.b is a primary key. So the maximum number of tuples in the result is equal to the number of tuples in R, which is 10,000.The size of a tuple in the result could be as large as the size of an R tuple plus the size of an S tuple (minus the size of one copy of the join attribute). This size allows only 5 tuples to be stored per page, which means that 10,000/5 = 2,000 pages are required.
6. The idea is that we probe an index on the inner relation for each tuple from the outer relation. The cost of each probe is the cost of traversing the B-tree to a leaf node, plus the cost of retrieving any matching records. In the worst case for an unclustered index, the cost of reading data records could be one page read for each record. Assume that traversing the B-tree for the relation R takes 3 node accesses, while the cost for B-tree traversal for S is 2 node access. Since S.b is a primary key, assume that every tuple in S matches 5 tuples in R. If R is the outer relation, the cost will be the cost of reading R plus, for each tuple in R, the cost of retrieving the data
Cost = bR + rR * (2 + 1) = 1,000 + 10,000*3 = 31,000.
If S is the outer relation, the cost will be the cost of reading S plus, for each tuple in S, the cost of retrieving the data
Cost = bS + rS * (3 + 5) = 200 + 2,000*8 = 16,200
Neither of these is cheaper than the block nested-loops join which required 4,200 page I/Os.
7.With a clustered index, the cost of accessing data records becomes one page I/O for every 10 records. Based on this, we assume that for each type of index nested-loop join, that the data retrieval cost for each index probe is 1 page read.Thus, with R as the outer relation
Cost = bR + rR * (2 + 1) = 1000 + 10000 * 3 = 31,000.
With S as the outer relation
Cost = bS + rS * (3 + 1) = 200 + 2000 * 4 = 8,200.
Neither of these solutions is cheaper than block nested-loop join.