Talk:B-tree
This level-5 vital article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects:
Untitled
This article needs a complete rewrite from an expert. It's completely useless in many ways. I will now erase the Access concurrency section; the reference links to a non-public site! WTF?!
In response to the user Wolfkeeper: Thanks for explaining the Wikipedia policy about external links; however, I still think this section is wrong, because in order to understand what it means, you must check the link. If it is not removed, at least I expect a more serious explanation. —Preceding unsigned comment added by 190.18.107.147 (talk) 04:39, 17 May 2009 (UTC)
—Preceding unsigned comment added by 190.18.107.147 (talk) 10:58, 15 May 2009 (UTC)
Shouldn't the majority of this be under (a,b)-tree? B-trees are a variant of (a,b) trees, so the (a,b) tree article should contain pretty much everything except stuff like the fact that a = B/4 and b = B/2 in a B-tree.
Moved this text to the discussion section from the main definition. Chadloder 18:34 Jan 22, 2003 (UTC)
The original author used B+-tree and B-tree interchangeably. If I remember correctly, the only difference is that one stores values only in leaf nodes.
Also, should this article be titled B-tree instead of B tree? And does anyone know what the B stands for? I think it is balanced.
- German Wikipedia states that it is unclear what the B stands for and that the inventors didn't give an explanation for it. It could be Balanced, Bayer, Broad, Bushy, or Boeing. (One of the inventors of the B-Tree is Rudolf Bayer, who worked for the Boeing Scientific Research Labs.)
- Bayer tree is a common name for them, at least in German. This is the name by which I was taught about them as a computer science student. A balanced tree is a tree that is balanced in general. I'm not familiar with any of the terms broad tree, bushy tree, or Boeing tree. So my vote is for Bayer. Rp (talk) 17:51, 8 July 2008 (UTC)
- I moved the article from B tree to B-tree. The old title might have been that way because the author was trying to write about B+trees and the '+' got gobbled up. --Mrwojo 15:32 Feb 15, 2003 (UTC)
- Sometimes B+Trees are also named B*Trees.
There is some disagreement on what the B stands for. Some people think it stands for 'balanced', 'bushy', or even 'Bayer' after one of the authors. Chadloder 18:35 Jan 22, 2003 (UTC)
- also Boeing, where it was developed.
132.160.49.90 (talk) 02:30, 26 March 2016 (UTC)
Deletion from an internal node is NOT clear
In the first case, joining will lead to disaster if the left and right subtrees are not leaves. Example: for a B-tree with a maximum size of 2, insert the values 1 to 11 and then delete 4. —Preceding unsigned comment added by 190.18.107.147 (talk) 04:52, 15 May 2009 (UTC)
Intro to Algorithms
I deleted the sentence about the pseudocode for deletion being in Introduction to Algorithms because I have the newest edition of the book and there is no pseudocode in the deletion section. --Anon —Preceding unsigned comment added by 70.226.19.144 (talk) 03:44, 21 July 2008 (UTC)
duplicates
B-trees can have duplicates, right? This could be added to the article. Meonkeys 19:16, May 16, 2005 (UTC)
Yes. The B-tree representing the "name index" in a phone book database probably has lots of identical duplicate "John Smith" names.
Also, is a B-tree the same as an n-ary tree? Meonkeys 19:16, May 16, 2005 (UTC)
A B-tree can be seen as a very specialized form of n-ary tree, one that adds a pile of special restrictions. The restrictions make insertion and deletion a little more complicated -- but, in return, the B-tree can guarantee very fast searches.
- Some nodes in some n-ary trees have only 1 or 2 children. B-trees are constructed so every internal node has at least L children.
- Some n-ary trees do not have all the leaves on the same level. (For example, it's impossible for a full binary tree with 5 leaf nodes to put them all on the same level.) B-trees are constructed so that every leaf is at the same level.
--DavidCary 20:52, 30 July 2005 (UTC)
- REALLY?
- Then, how can the left subtree of a key be "less than" and the right "greater than" if there can be duplicates?
- 68.183.92.186 (talk) — Preceding undated comment added 22:56, 13 May 2014 (UTC)
- Really. It is just a matter of choosing less-than-or-equal in place of less-than relation. That of course complicates the search algorithm, because when we find an item 'equal' to the given key, we need to check also its siblings for equality, and corresponding subtrees as well. In a degenerate case of a tree of N 'equal' nodes we need to find and return all nodes as a result of a single search. That's difficult, so we usually would rather keep a single instance of every key in the tree, and link duplicates in some additional piece of storage, say a kind of linked list. --CiaPan (talk) 19:47, 14 May 2014 (UTC)
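To make that "single key instance plus separate storage for duplicates" idea concrete, here is a minimal, purely illustrative Python sketch (the names and layout are invented, and node splitting is omitted): keys stay unique in the tree, and each key carries a side list of all records that share it, so one search position yields every duplicate at once.

    from bisect import bisect_left

    class LeafNode:
        def __init__(self):
            self.keys = []     # sorted, unique keys
            self.records = []  # records[i] is the list of records sharing keys[i]

    def leaf_insert(leaf, key, record):
        # A duplicate key only grows the side list attached to the existing key.
        i = bisect_left(leaf.keys, key)
        if i < len(leaf.keys) and leaf.keys[i] == key:
            leaf.records[i].append(record)
        else:
            leaf.keys.insert(i, key)
            leaf.records.insert(i, [record])

    def leaf_lookup(leaf, key):
        # A single search position yields every duplicate at once.
        i = bisect_left(leaf.keys, key)
        if i < len(leaf.keys) and leaf.keys[i] == key:
            return leaf.records[i]
        return []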
Nodes structure
I'm wondering whether the U (maximum number of elements) in the article shouldn't be 2L (twice the minimum number of elements).
- There may be a terminology problem here. The usage of L and U appears inconsistent below. For a simple definition, the number of elements (aka keys) should vary by a factor of 2 to allow splitting and merging to work. Comer uses d as the minimum number of keys in a node and 2d as the maximum number of keys; the number of children will then vary from d+1 to 2d+1. Some use the order m as the maximum branching factor (max children) and others use it as the maximum number of keys in a node. Glrx (talk) 20:53, 18 August 2009 (UTC)
Other sources from Google say U = 2L (minimization factor): http://www.bluerwhite.org/btree/ (the structure of B-tree), http://www.semaphorecorp.com/btp/algo.html (3rd paragraph). --Loading 07:29, 18 Jun 2005 (UTC)
Certainly a programmer *could* pick U=2L. That type of B-tree is the simplest kind for a teacher to try to explain. But often programmers pick a maximum number of elements U that is an odd number, such as 2-3 trees, with L=2, U=3. Donald Knuth seems to like U = L * 3/2, in the form of a B*-tree. --DavidCary 20:52, 30 July 2005 (UTC)
I am teaching B-trees at San Jose State University. I have three sources (other than the Wikipedia article) that pairwise disagree on this L-U issue. Here are what they say:
(1) The well-known algorithms textbook, Introduction to Algorithms, by Cormen, Leiserson, Rivest, and Stein (CLRS), defines B-trees in terms of one parameter t, the "minimal degree" (p. 439). Thus L = t and U = 2t. It is required that t be an integer, so in particular U has to be even.
- Do L and U refer to limits on keys or the limits on branching? If they refer to keys, then setting t=1 would create a 1-key/2-key B-tree, which would be the ordinary 2-3 B-tree. It depends on the definition of "minimal degree". CLRS may be using t in the same sense as Comer's d. Glrx (talk) 20:53, 18 August 2009 (UTC)
(2) The B-tree animation applet of slady (linked from the article) also uses a single parameter, which he calls the "order". He doesn't explain it, but experiments with the applet give, for order 1, L=2 and U = 3, and for order 2, L=3 and U = 5. In particular you always get U odd, so the animation applet can't duplicate any of the trees discussed in CLRS. (Do you suppose this confuses my students?!)
- This would fit if Slady's "order" of a B-tree is Comer's d -- the minimum number of keys in a node. Then L=d+1 and U=2d+1. U will always be odd. Here L and U are referring to the number of children/subnodes and not the number of keys. Glrx (talk) 20:53, 18 August 2009 (UTC)
(3) Knuth's "Art of Computer Programming", vol. 3, p. 473, defines B-trees in terms of the "order" m. Comparing it to the Wikipedia article, we see U=m and L = ceiling(m/2); we need ceiling since m is not required to be even. So CLRS covers the case when m is even, m = 2t; but Knuth also allows the case L=t and U = 2t-1; this seems to be the case the applet covers, where the "order" you enter in the applet is t-1.
Some restriction on the relation between L and U has to be made; when we split a full node (containing U-1 elements) upon inserting a new key, we will get two new nodes containing U/2 elements each (for even U) or one with floor(U/2) and one with ceiling(U/2) elements, for odd U. We must require that L be at most floor(U/2) so that these nodes are legal. Moreover, L can't be too small either, or deletion won't work right. On 1 April 2006 I edited the article to correct this point, and also to correct the description of the deletion algorithm. --User and date unknown
- The above description is incorrect because a key must be promoted to the parent level during the split. We split a full node (with U-1 elements) only when inserting a new element (for a total of U elements). The result of the split will keep U-1 elements at the current level and promote one element to the parent level. If U is even, then one sibling has (U-2)/2 and the other U/2 elements. If U is odd, then both siblings have (U-1)/2 elements. Glrx (talk) 20:53, 18 August 2009 (UTC)
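For illustration only, the arithmetic of that split in a few lines of Python (U counts the keys of the temporarily over-full node, exactly as in the paragraph above; the function name is made up):

    def split_overfull(keys):
        # keys: a sorted list of U keys, one more than a node may hold.
        # The median is promoted to the parent; the rest form two siblings.
        mid = len(keys) // 2
        return keys[:mid], keys[mid], keys[mid + 1:]

    # U = 5 (odd): both siblings end up with (U-1)/2 = 2 keys.
    left, promoted, right = split_overfull([10, 20, 30, 40, 50])
    # left == [10, 20], promoted == 30, right == [40, 50]

    # U = 6 (even): one sibling gets U/2 = 3 keys, the other (U-2)/2 = 2.
    left, promoted, right = split_overfull([10, 20, 30, 40, 50, 60])
    # left == [10, 20, 30], promoted == 40, right == [50, 60]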
In the following I'll try to explain the definition taught at the Technical University of Munich, where Bayer used to work. This definition seems to make sense and is consistent with the 3 definitions above; only the terms used differ slightly. As above, I'll use the letters L and U for the minimum and maximum number of children of a node, though the letters a and b are commonly used at TUM.
Any search tree that works the way explained in the article needs to comply with U >= 2L-1; otherwise splitting up a node upon insertion will not work. A balanced search tree complying with this rule is called an (a,b)-tree. An (a,b)-tree where U = 2L-1 is called a B-tree.
These definitions have the following implications:
- The above definitions 2 and 3 comply to this definition of a B-tree. (In the second edition of Knuth's "Art of Computer Programming", vol. 3 it's on page 483, and it's a bit vague: does m/2 mean integer division here?)
- The above definition 1 is not valid for a B-tree in this strict sense, but for another specialization of an (a,b)-tree.
- The values U = L * 3/2 for B*-trees do not comply with this definition of a B-tree or an (a,b)-tree. Therefore a B*-tree is not a specialization of a B-tree, but a different though related data structure.
- The term (a,b)-tree gives another possible explanation for the meaning of the B in B-tree. For a B-tree, only the b needs to be known, and it's always odd. The a can be calculated as ceiling(b/2).
I don't know if the above nomenclature is common outside of TUM. I'm not an expert on the topic. The only references I can provide are two lecture notes in German: ead_ws9697.ps.gz and ead_ws9798.ps.gz.
Maybe someone should check the original publications! However, this node structure issue needs to be clarified in the article. In particular the rule U >= 2L-1 should be mentioned, even if it turns out that all (a,b)-trees are commonly called B-trees.
Michael R. 22:55, 16 August 2006 (UTC)
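A quick arithmetic check of the U >= 2L-1 rule, worked in the children-counting convention used above (purely illustrative):

    Take L = 3, so U = 2L − 1 = 5: a node may have 3..5 children, i.e. 2..4 keys.
    An insertion can momentarily give a node 2L = 6 children and 5 keys; the split
    promotes one key and leaves two nodes of exactly L = 3 children (2 keys) each:
        keys:      5 = 2 + 1 + 2
        children:  6 = 3 + 3
    Both halves just meet the minimum, so any smaller U would leave one half illegal.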
Order
I removed "order 2" from the caption because the citation there does not define order, but only calls it parameter k. Knuth does define order, while describing Two-Three B Trees. Some later textbook authors and web resources have since published different numbers, typically 2. To be less ambiguous, the value could perhaps be referred to as "Knuth's order" (see Knuth, TAOCP)
--132.160.49.90 (talk), 26 March 2016
variants
Should *this* article B-tree have a section that explains variants such as B*-tree, B+ tree, etc.? Then I want a section that explains pros and cons -- why would someone use one variant instead of another, or instead of a simple binary tree? Some people find that preferable to having a bunch of very short stub articles that say "just like a B-tree except that ..." -- or, worse, a bunch of fully-fleshed-out articles that repeat the preliminary explanation over and over. --DavidCary 20:52, 30 July 2005 (UTC)
Are 2-3-4 trees a kind of B-tree with L=2, U=4?
Error in illustration of B+ tree
It appears that there is an error in the illustration. According to the text, all values in the leftmost subtree should be less than a1. In the illustration, however, the leftmost subtree contains the value 3, which is equal to a1.
I agree, from "Database Systems: The Complete Book"
-- if j pointers are used then there will be j - 1 keys, i.e. K_1, K_2, ..., K_(j-1), so the first pointer points to a part of the B-tree where some of the records with values < K_1 will be found; the second pointer points to where records with values >= K_1 and < K_2 will be found, and so on.
This ordering is also used in 'Silberschatz, Korth, Sudarshan: Database System Concepts, 4th Edition'. Schmid 12:39, 13 August 2006 (UTC)
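For what it's worth, here is a small Python sketch (hypothetical, not taken from either book) of the descent step under exactly that convention, where pointer i+1 leads to records with values >= K_i and < K_(i+1):

    from bisect import bisect_right

    def child_index(keys, search_key):
        # keys are the node's separators K_1 .. K_(j-1) in sorted order.
        # bisect_right counts the separators <= search_key, which is the
        # 0-based index of the pointer to follow: index 0 for values < K_1,
        # index i for K_i <= value < K_(i+1), and the last index otherwise.
        return bisect_right(keys, search_key)

    assert child_index([7, 16], 3) == 0    # 3 < 7
    assert child_index([7, 16], 7) == 1    # 7 goes to the right of separator 7
    assert child_index([7, 16], 20) == 2   # 20 >= 16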
Concurrent access
The paragraph at the end of the insertion section suggests an improvement that splits, on the way down the tree, any node that might fill up when a leaf below it splits. This buys nothing; it only reduces the effective node size by one.
Unless the nodes are locked, there is no guarantee the nodes will remain non-full on the way back up during a split. To allow maximal concurrency, you don't want to lock nodes on the insert descent.
--Karl Malbrain 20:52, 18 May 2006 (UTC)
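For readers trying to follow the debate, here is a rough, self-contained Python sketch of the technique in question: single-pass insertion that splits every full node met on the way down, in the style popularized by CLRS. The names and the minimum degree are arbitrary, and no locking is shown; whether the approach pays off is exactly what the comments above dispute.

    T = 2  # minimum degree: each node holds between T-1 and 2*T-1 keys

    class Node:
        def __init__(self, leaf=True):
            self.keys = []
            self.children = []
            self.leaf = leaf

    def split_child(parent, i):
        # Split parent's full child at index i, promoting its median key,
        # so the split never has to propagate back up the tree.
        full = parent.children[i]
        mid = T - 1
        right = Node(full.leaf)
        right.keys = full.keys[mid + 1:]
        if not full.leaf:
            right.children = full.children[mid + 1:]
            full.children = full.children[:mid + 1]
        parent.keys.insert(i, full.keys[mid])
        parent.children.insert(i + 1, right)
        full.keys = full.keys[:mid]

    def insert(root, key):
        # Returns the (possibly new) root after a single downward pass.
        if len(root.keys) == 2 * T - 1:          # full root: grow a new level
            new_root = Node(leaf=False)
            new_root.children.append(root)
            split_child(new_root, 0)
            root = new_root
        node = root
        while not node.leaf:
            i = sum(k < key for k in node.keys)  # child to descend into
            if len(node.children[i].keys) == 2 * T - 1:
                split_child(node, i)             # pre-emptive split
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        node.keys.insert(sum(k < key for k in node.keys), key)
        return root

    root = Node()
    for k in [8, 3, 10, 1, 6, 14, 4, 7, 13]:
        root = insert(root, k)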
Second that; please stop restoring that section because it is completely off. —Preceding unsigned comment added by 190.18.107.147 (talk) 21:26, 15 May 2009 (UTC)
What is "completely off" about this section? I have a copy of the reference material and summarized it for this section. N.b. there is no requirement that reference materials be available on the web for free.
--karl m {{User electrical engineer}} (talk) 21:58, 15 May 2009 (UTC)
If I publish an online paper (free or not) about "B-Tree Programming", that by itself is not enough to make it worth referencing, unless we call this advertising. If this is not the case, the least I expect from a summary is: what the alternatives are, in which cases it is better (not the -why-, I can check the reference if I am interested enough) and, optionally, which vendor implements it. —Preceding unsigned comment added by 201.231.39.110 (talk) 14:59, 20 May 2009 (UTC)
The paper was published by the Association for Computing Machinery, it is not an on-line paper. Are you challenging this source in specific? I added the section in an attempt to flesh out the material from the perspective of a high-school understanding of computer science. I don't know which vendors use the forward pointers in their B-tree structures. Again, what is "completely off" about this section?
-- karl m {{User electrical engineer}} (talk) 15:58, 20 May 2009 (UTC) --
Thanks for the response. The "off" I meant is that it sounds like summarizing JFK as "a president who was assassinated", with a reference to a book about his life. What are the different strategies? (The current summary sounds like there are no other alternatives. If this is the case, then at least it must say that there are none.) In which cases is this specific strategy convenient? Are there any known engines that implement this? (MyISAM maybe?) —Preceding unsigned comment added by 201.231.39.110 (talk) 00:52, 21 May 2009 (UTC)
Repeated info
A lot of information is repeated redundantly throughout the article. Please consider rewriting it to make it a bit more concise. --221.134.26.57 17:37, 21 May 2006 (UTC)
- I like the use of "repeated redundancy" - Yogi Berra would be proud. The article is basically a joke, full of too many ambiguities and self-contradictions. One problem that seems to occur in every Wikipedia article about trees is the failure to distinguish between "keys" and "values" - or to state that the two terms are synonymous. Here, that's just one factor that makes this article useless. — Preceding unsigned comment added by 68.183.37.170 (talk) 18:32, 21 May 2014 (UTC)
Picture
I think the picture should be removed. Why would you put a picture of a B+tree on the B-tree page? It would be like looking up the US capital and finding a picture of the Eiffel Tower. Not to mention the B+tree picture is incorrect: 3 should be in the middle node and 5 in the last. The pictures are great, but people look at the picture instead of reading, and the info in the picture is wrong. I would recommend removing the pictures unless they can be corrected. —The preceding unsigned comment was added by 216.239.10.103 (talk • contribs).
weight-balancing scheme
Could someone explain the weight-balancing scheme invented by Arge and Vitter? It seems to have better performance than the 2-3 B-trees.
— Preceding unsigned comment added by 62.195.5.235 (talk • contribs) 15 September 2006 (UTC)
Relaxed Balance
There has been quite a bit of work by Kim S. Larsen and Rolf Fagerberg (e.g. http://citeseer.ist.psu.edu/51205.html ) showing how the balance of a B-tree can be maintained lazily, with several benefits. For the purposes of elucidation, relaxed balance is particularly good because you can describe insertion or deletion while separating the basic steps of the algorithm from the steps for the maintenance of balance. Another benefit is that all modifications to the structure of the tree are local, so concurrency is more easily exploited.
It'd be nice for the text to include a description of this work. Dr Tom Conway (talk) 01:20, 3 January 2008 (UTC)
Minor Quibble
I could be missing something obvious, but I don't see why tuning a B-tree to disk block sizes would result in a 257-513 B-tree. Shouldn't it be a 257-512 B-tree? —Preceding unsigned comment added by 128.221.197.21 (talk) 12:47, 31 July 2008 (UTC)
hop?
This depth will increase slowly as elements are added to the tree, but an increase in the overall depth is infrequent, and results in all leaf nodes being one more hop further removed from the root.
Is that "hop" just the writer's voice, or do people frequently use the word "hop" to describe that? I know there are a lot of terms used in computer science that don't sound nearly as technical as they are, but on the other hand, "hop" only occurs once in the entire page. (Not saying it's a problem and needs to be fixed, just wondering.) --146.187.184.231 (talk) 19:45, 6 November 2008 (UTC)
- It's a term borrowed from networking, apparently. I would use "one node," and have modified it accordingly. Dcoetzee 21:16, 6 November 2008 (UTC)
Boeing citation
There is a "citation needed" marked beside the word Boeing.
The citation is in the external link mentioned below, named NIST's Dictionary of Algorithms and Data Structures: B-tree, at http://www.nist.gov/dads/HTML/btree.html, which reads as follows:
Note:
- The origin of "B-tree" has never been explained by the authors. ... "balanced," "broad," or "bushy" might apply. Others suggest that the "B" stands for Boeing. [Bayer and McCreight were at Boeing Scientific Research Labs in 1972.] Because of his contributions, however, it seems appropriate to think of B-trees as "Bayer"-trees.
- - Douglas Comer, The Ubiquitous B-Tree, Computing Surveys, 11(2):123, June 1979.
I am just not sure, how to edit links on the page...
~ —Preceding unsigned comment added by 195.146.109.168 (talk) 15:46, 13 November 2008 (UTC)
Leaves "carry no information"?
All leaves appear in the same level, and carry no information.
That's not how the pictures look.
Big Oh notation
[edit]"can be done in O(log2N) comparisons. If the table had 1,000,000 records, then a specific record could be located with about 20 comparisons: log21,000,000 = 19.931". It's a detail, but that's not the meaning of the Big-Oh notation. Depending on the implementation it may need a million comparisons or even more. The big Oh notation only says something about how the performance scales as N increases. I understand, though, that this may be harder to explain in Layman's terms and that it might confuse readers who don't know about complexity analysis, but what is written now is just plain wrong. 84.194.88.199 (talk) 15:44, 8 May 2010 (UTC)
I fixed some things; more improvements needed
I'll get right down to it.
Problem with definition
First, in the list of five Definition rules, the second rule is:
2. Every node (except root) has at least m⁄2 children.
But, leaf nodes have no children. In defining the number of elements by the number of children, the article later leaves ambiguous the number of elements in leaf nodes.
Furthermore, the section describing bulkloading uses a different set of rules, where the leaf nodes and internal nodes have different maximum sizes, and seems to imply that the number of elements in leaf nodes is disconnected from the number of elements in internal nodes. However, the rest of the article above this section assumes leaf nodes have the same maximum number of elements as internal nodes (as I understand it).
Near the top of the article, some inconsistencies in terminology in the literature about B-trees are raised. Perhaps this is at least part of the origin of some of these problems. The article would be better if, after laying out the variations in terminology, it chose and asserted definitions for a system of terms to be used within the remainder of the article. That would make the article clearer both in the minds of editors and to the reader, and would provide a framework on which judgements of correctness (by editors) and building of understanding (by interested end-user readers) could be based.
B-trees in filesystems and FAT filesystem
I have corrected an incorrect application of the terms "logical block" and "physical block" in the section on B-trees in file systems. Almost nothing in modern file systems refers to physical blocks, as disks present an interface in terms of Logical Block Addresses (LBAs) or logical, not physical, CHS addresses. (Floppy disks still use physical CHS addresses, but are becoming rarer and rarer.) Furthermore, even DOS file systems like FAT12 and FAT16, considered primitive by most today, use logical addresses in the FAT which are relative to the data area of the file system volume and are not the physical addresses of the disk nor addresses relative to the start of the partition.
Therefore, I have changed this section of the article to use the terms "file block" and "disk block". I think it is clearly enough implied that a file block is a block addressed relative to a file, and it is obvious that such a block address is logical, not physical, so I have not used the word "logical" with "file block", and I have not explicitly defined what a file block address is. As for disk blocks, it is not really pertinent to the B-tree discussion whether they are physical or logical, so long as it is clear that disk block addresses address actual blocks on the disk, not abstract blocks in an abstract file, using an address scheme that is global over at least the part of the disk used by the file system. Again, I think implication suffices to expose this; anyone who has a basic understanding of the purpose, nature, and rudimentary function of a file system should understand this, and anyone who does not will not understand, or probably be interested in, the discussion of B-trees in file systems anyway. The point is that file systems map file blocks--parts of a file--to disk blocks--units of storage on a disk--and that this is what B-trees (and File Allocation Tables, and other alternative concepts and structures) are used for in file systems. The physical/logical dichotomy here is both incorrect and overly narrow, so it confuses or at least complicates the essential concepts about B-trees in file systems.
- I think that the term "Disk Block" is still very problematic for a number of reasons, starting with the fact that in most OS environments and jargon it has a very specific meaning that is different from how it is used here. "File Block" is confusing because it is unfamiliar and unclear what it means. The term actually used by almost all implementers, practitioners and analysts of extant B-Tree indexing is "Bucket". This is the term that I would suggest be used in this article. RBarryYoung (talk) 15:24, 28 February 2013 (UTC)
Algorithms - deletion of a node
I had made more improvements in this section, but all my work was unceremoniously reverted. I've reinstated the parts I think actually weren't in dispute. The other parts, in this section, really need improvements, as I think others above have said. Anyway, my suggested edits are in the history, if someone wants to revive them.
–Stephen 141.158.233.114 (talk) 12:15, 23 July 2010 (UTC)
Comment
I have many issues with these edits, and that is why I reverted all of them.
While there are some improvements (such as FAT12 being 4080-2), there are many inaccuracies. In the B-tree, entries are rotated. Entries are not duplicated. There are unwarranted claims about the size of a file. There are extraneous comments about disk utilities.
I have reverted one of the reinstatements. I am considering reverting the FAT edits again.
Glrx (talk) 16:08, 23 July 2010 (UTC)
Where is the complexity?
There is a nice table at the top for complexity reference in big O notation, but... nothing there. 66.14.187.7 (talk) 18:10, 1 October 2010 (UTC)
English
[edit]"The seek time may be 0 to 20 or more milliseconds..." Everything takes 0 to 20 or more milliseconds. Leegee23 (talk) 11:33, 12 December 2011 (UTC)
Recent edit to Knuth's description of a B-tree
I'm tempted to revert/modify this edit. The edit uses another reference to overrule what is ostensibly Knuth's definition. The literature is troubled with differing terminology. See B-tree#Terminology. Knuth's terminology has an unusual bottom level of the tree -- leaves are below the lowest keys. "Carry information" is also troubling; in some variations, the associated information is present above the leaves; in others, only key information is above the leaf; in still others, key information is copied above but retained below. Ordinary leaves, however, would have either the associated data or a pointer to that data, and so carry information. In Knuth's definition, the leaves appear to be the records -- and therefore carry information. Glrx (talk) 18:54, 28 December 2011 (UTC)
Your analysis is correct and the edit should be reverted. karl m {{User electrical engineer}} (talk) 19:58, 28 December 2011 (UTC)
Request revision on the Rebalancing after deletion
Under section 5.3.3 Rebalancing after deletion there are
3. Append the first child of the right sibling as the last child of the deficient node
under
1. If the right sibling has more than the minimum number of elements
and
3. Insert the last child of the left sibling as the first child of the deficient node
under
2. Otherwise, if the left sibling has more than the minimum number of elements
From my understanding of B-trees, these two steps make no sense: after you execute the first two steps described in that section, the separator and the first/last child of the sibling have already been rotated. There should be no extra step to append or insert. I have no idea where this step comes from. But maybe I am wrong. I hope someone could further explain this.
Thank you. — Preceding unsigned comment added by Johnmave126 (talk • contribs) 10:47, 6 May 2013 (UTC)
- The section appears to describe the operation for a B+-tree rather than a B-tree. In the former, the parent only has a copy of the separator; there is not a full rotation in the B-tree sense. I've changed the section to use the B-tree rotation (and suggest that a left or right sibling may not exist). Glrx (talk) 16:20, 6 May 2013 (UTC)
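As an illustration of the B-tree rotation referred to above (not the article's exact pseudocode; the names are invented), borrowing through the parent from a right sibling looks roughly like this; borrowing from a left sibling is the mirror image.

    class Node:
        def __init__(self, leaf=True):
            self.keys, self.children, self.leaf = [], [], leaf

    def rotate_from_right(parent, i):
        # parent.children[i] is deficient; parent.children[i+1] has a spare key.
        node, sibling = parent.children[i], parent.children[i + 1]
        node.keys.append(parent.keys[i])        # separator rotates down
        parent.keys[i] = sibling.keys.pop(0)    # sibling's first key rotates up
        if not node.leaf:                       # internal nodes also take over
            node.children.append(sibling.children.pop(0))  # the sibling's first child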
Is 2-3 tree a B-tree?
By the definition in the book Introduction to Algorithms, a B-tree of order t is such that every non-root node contains at least t children, and every node contains at most 2t children. The reason for this is that splitting a node into two nodes is possible only if it contains at least 2t nodes. This is a fundamental requirement for the data-structure algorithms on B-trees to work. In particular then, a 2-3 tree is not a B-tree; its algorithms are different from those of B-trees. The text currently uses 2-3 trees as a recurring example; those examples should be changed to refer to the 2-3-4 tree, the smallest example of a B-tree. What do you think? --Kaba3 (talk) 18:15, 26 September 2013 (UTC)
- A 2–3 tree is a B-tree; it fits Knuth's definition. There are many definitions of order and many variations of the B-tree, but the basic algorithms are similar. If I take your definition literally, a 2–3 tree is a B-tree of order 2, because non-root nodes contain at least 2 children and every node contains at most 3 children (which is less than 2t but does not violate 2t). Your second sentence is not the driving force behind balancing the tree; if a node may have nmax children, then there is no reason to split that node. The node is happy as it stands. There is also a subtle issue about keys/separators and children; a node has one more child than it has keys. A 2–3 tree has either one or two keys in each internal node. If an internal node of a 2–3 tree has 2 keys (3 children) already and I need to add a third key (a 4th child), then I can split the node into two nodes with one key each (total of 4 children) and promote the remaining key to the parent. Glrx (talk) 22:48, 27 September 2013 (UTC)
Animation
Hi. I am trying to develop a B-tree toy at [1]. This article already has a link to an animation applet. Since applets are becoming unfashionable, I hope to replace this applet some day. I would appreciate it a lot if you would try out the toy and tell me how it can be improved. Maybe we could replace the applet linked then. --Ysangkok (talk) 13:41, 3 March 2014 (UTC)
key - copies contradiction
[edit]"In the B+-tree, copies of the keys are stored in the internal nodes; the keys and records are stored in leaves" vs. " The keys act as separation values which divide its subtrees"
Since a B+-tree is a B-tree, one must be incorrect. The "copies" can't be in the internal nodes when the original keys are. — Preceding unsigned comment added by 68.183.92.186 (talk) 22:38, 13 May 2014 (UTC)
The B-tree uses all those
[edit]"The B-tree uses all of the ideas described above." (What ideas?) "In particular, a B-tree:" . . . "keeps the index balanced with an elegant recursive algorithm" (where is balancing mentioned above?)
-- the above unsigned note was added on 14 May 2014 by 68.183.92.186 (talk)
- Balancing is not mentioned because there is no balancing. The tree is balanced when it gets created and it remains balanced during insertion and during deletion of data – the tree is inherently balanced. Just as it is said: the algorithm 'keeps it balanced', not '(re)balances it after the operation destroys the balance' (like in other balanced trees, e.g. AVL trees). --CiaPan (talk) 17:06, 4 January 2016 (UTC)
2014-11-21 BS Comments
[edit]"Overview":
"In B-trees, internal (non-leaf)" - 'non-leaf' is not mentioned in the referred to article. Could you please choose a different term than 'non-leaf', or define 'non-leaf' in the referred to article. i.e. In referring to something to explain something else, that something does not exist to be able to follow what is trying to be explained.
"within some pre-defined range." - range of what?
There is a basic problem here of a lack of definition of what a node is. There is also a problem that "A node is a structure which may contain a value or condition, or represent a separate data structure (which could be a tree of its own)." is too vague or intangible to be meaningful. In particular, 'condition' above is gibberish (to the reader at this point in their reading). Better, a node [which has an address, i.e. a path (address) by which it can be found <- this 'address' is the missing piece or attribute of the explanation), contains a piece of data. That data could be a value (such as the number 4, or the word 'red'), or a pointer (address) of (i.e. to) another node. [Missing, or, unless, a btree node can (must?) contain at least a value, AND zero or more pointers - at most one to the left (by definition, decreasing?), AND (at most) one to the right (by definition, increasing?)]. So, more succinctly, a node (with/having an address) contains a piece of data. Which is to say, a node is an address/(data and/or left and right pointing) pair.
- now this (above) paragraph may be completely wrong, and that's fine, but in being wrong demonstrates that my (the reader's) purpose in coming to this article is not getting satisfied. Either the above is correct and the article does not so explain 'simplistically' (too conceptual / intangible too fast), or does not well explain soon and simplistically enough what is actually meant (is true / actually is).
"When data is inserted or removed from a node, its number of child nodes changes." is baffling. Probably because it is self-referencing. (It is becoming apparent to me that I don't understand the lingo, or at least the established/recognized lingo, and if I don't, then that says the lingo is not being well explained.) So this bit ("When data is inserted or removed from a node, its number of child nodes changes.") says to me that what should be said is a node is a address/data+pointer 'pair', and nodes (and this is the problem with the use of the same term 'node' twice) contain SETS of nodes (or node sets). Thus the referred to diagram does not make sense (node-wise) as written. However, if one says nodes contain sets of nodes, it starts to come together. (That one does not understand how a node could contain a set of nodes, being a circular description, is ok - one can suspend understanding, expecting such to be explained in subsequent text.) The basic problem with the diagram is that it does not show lines between the .#. pairs. Saying that differently - the 'problem' is either that the diagram is 'wrong' OR the explanation thus far is, since I don't understand / haven't followed.
- by the diagram, the above node = address/data+pointer pairs is wrong. By the diagram, it would appear more correct to say that nodes contain (ordered) sequences of data, WITH POINTERS BETWEEN EACH PAIR OF DATA TO NODES THAT DESCRIBE (COVER) the data/value space between those adjacent values.
So, said differently, one assumes, or soon comes to assume b-tree = binary-tree, which is to say, a node could have at most left and right pointers. By this diagram, a node could have many pointers, and this is not made clear enough early on. So, perhaps earlier in the article it can say a b-tree is not (necessarily) a tree of binary nodes, but a tree of nodes, period. Binary or otherwise. Saying (noting), although possibly considered as redundant, noting that being a tree, nodes have descendants, and by its address knows its ascendant (if the latter is true - does a node's address tell the node where its parent is?).
Summarizing all of the above - the article, within only a few sentences, is as clear as mud. Since the purpose of writing an article is to clear up the mud, perhaps some reworking could be done? [I came here to learn / read / understand, so I'm not the guy to do the rework. i.e. If I understood the material I could contribute to the material, but because I came here to learn the material I can't. I can say, having come here to learn the material - I don't understand. [Therefore purpose in being written hasn't been satisfied. (?)]
"but may waste some space, since nodes are not entirely full" - does not yet follow / make sense. One would assume that a node size is dynamic, therefore only as big as the data contained, and so can waste no space. For the statement to be true, nodes are of a pre-defined size - something that should be stated beforehand. That this is true in the diagram is only now apparent to me. Adding to the diagram, or it's description, '4-element nodes' would make grasping what's being said much more clear much more readily. (Assuming 'element' is the correct term, there.)
The '2–3_tree' diagram illustrates what I say above - there is a diagram depicting a node (to my mind). The diagram depicted within this (b-tree) article shows something (node) that I can't understand as being a node (as I can in the 2-3 article's diagram) - they appear different. Thus I have a mental image of two different things being called 'nodes', and I don't understand (it hasn't been explained in this article) how the term 'node' is accurate for both diagrams. ('One of these things is not like the other' - so, i.e., understanding is not present [same name for two obviously different things]).
Bs27975 (talk) 18:39, 21 November 2014 (UTC)
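Since several of the questions above boil down to "what exactly is a node?", here is a bare-bones, purely illustrative Python sketch (field names and values invented): a B-tree node holds a sorted sequence of keys and, unless it is a leaf, one child pointer on each side of every key, so that the subtree between two adjacent keys covers exactly the range of values between them.

    class BTreeNode:
        def __init__(self, keys, children=None):
            self.keys = keys                 # sorted values stored in this node
            self.children = children or []   # for internal nodes:
                                             # len(children) == len(keys) + 1

    # A tiny tree: one internal (root) node and three leaf nodes.
    root = BTreeNode([7, 16], [
        BTreeNode([1, 2, 5, 6]),   # every key here is < 7
        BTreeNode([9, 12]),        # keys between 7 and 16
        BTreeNode([18, 21]),       # keys > 16
    ])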
- Will be glad if someone takes care of these problems. 120.59.185.182 (talk) 07:32, 3 January 2016 (UTC)
The "The Database Problem" section needs cleanup
It currently reads very informally and feels out of place with the rest of the article. I'd suggest it be condensed and then split into a pro/con type section. Possibly moved down the page. I'm going to try to formalize the language where I can, but I'm not familiar enough with B-trees to do the rest myself. TheWisestOfFools (talk) 17:03, 29 October 2015 (UTC)
Undefined terms used in Overview
Some examples --
pre-defined range,
From the image it looks as if a single node can host a lot of data, a concept which the article assumes that the reader knows about.
Joining or splitting of nodes is not defined. 120.59.185.182 (talk) 07:31, 3 January 2016 (UTC)
Generalization?
I think it is not correct to call the B-tree a generalization of a binary search tree, as the B-tree requires all the leaves to be at the same depth, and a BST does not. So in this way the BST is the generalization of the B-tree???
Skrew (talk) 07:38, 24 May 2016 (UTC)
Nodes can have multiple keys, not just >2 children
The introduction mentions that, unlike a binary search tree, a B-tree's nodes can have multiple children. But surely it should also mention that each node can have multiple keys. The B-tree chapter in CLRS doesn't mention that until its third paragraph, but it seems to be important. The Wikipedia page for 2-3 trees does mention this. I wonder if the 2-3 tree page should be combined with the B-tree page, or if they should at least reference each other more. — Preceding unsigned comment added by Murraycu (talk • contribs) 12:02, 7 July 2016 (UTC)
Disadvantages entry is humbug, remove?
I don't have the time to verify this, and only got one person's confirmation that I'm correct on this one -
"maximum key length cannot be changed without completely rebuilding the database. This led to many database systems truncating full human names to 70 characters.", that is humbug!? Remove= — Preceding unsigned comment added by 14.199.34.215 (talk) 09:04, 7 April 2017 (UTC)
Initial construction by bulkloading
(Referring to the version oldid=817371125, 28 December 2017.)
The whole section #Initial construction, originating from revisions Special:Diff/174287958 in November 2007 and Special:Diff/260712701 in December 2008, is unsourced. Even worse, it is useless: even though it is a subsection to Algorithms, it actually presents no algorithm! It just describes a specific case, but it does not explain why the case is handled this way or how other cases would be handled.
For an example, let's take twenty-five items instead of 24. Then we CAN'T add the key 25 to the fifth leaf node in the initial step, because the fifth key from each leaf is later detached to be used as a separating key in some ancestor node – and there is no sixth leaf node to be separated from the fifth one, so the key 25 would remain unused. We also CAN'T avoid adding the 25 key to the fifth leaf node and make a sixth node of it instead, because then the fifth node has no fifth key to be used as a separator between leaves No 5 and 6. Possibly we could detach the key 24 from the fifth node for this purpose, but the 'algorithm' does not provide such an option. Actually, it doesn't provide any options, just one, pre-determined case...
Anyway, the sixth node would break the requirements, because a single key 25 in a 4-slot node would be using only 1/4 of the node's capacity, while it is required that each node be at least half-filled! An actual algorithm should calculate an optimum distribution of keys between leaf nodes while keeping enough spare keys to fulfill the filling requirements in internal nodes, too. And keeping internal nodes and leaf nodes of different sizes complicates things even further.
Should we rescue that bulksh*t somehow or just delete it from the article? --CiaPan (talk) 18:08, 30 December 2017 (UTC)
- Yes, the section has severe problems. It does suggest an algorithm, but the suggestion is poor/flawed. The text before the revision had a better top-level viewpoint.
- From an algorithmic standpoint, doing single record insertions is sufficient to load a database. Sequentially inserting sorted data will leave a B-tree sparsely populated because the rightmost node is continually split leaving lots of half-filled left siblings behind.
- Before BTrees, it was common practice to reload/reorg a database to make it more efficient (get rid of overflow blocks).
- Bulkloading often does not want to maximally fill each page. If each page is full at loading, then most subsequent insertion will cause a split. Usually some compromise is chosen so an insertion won't cause a split and a deletion won't cause a merge. The issues are complicated.
- I'd keep the section and hope it gets improved.
- Glrx (talk) 18:52, 28 January 2018 (UTC)
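Purely as a discussion aid (unsourced, so not article-ready), here is one way the bottom-up construction could be sketched in Python: sort the keys, cut them into leaves, then build each upper level from separator keys. To sidestep the "detached fifth key" problem described above, this sketch builds a B+-tree-shaped index in which internal nodes hold copies of the smallest key of each right-hand subtree; a plain B-tree loader would instead move one key per leaf boundary upward. All names, the fill handling, and the minimum-occupancy convention are arbitrary choices.

    def chunk(items, max_size, min_size):
        # Cut items into consecutive chunks of at most max_size elements,
        # rebalancing the last two chunks so none falls below min_size.
        chunks = [items[i:i + max_size] for i in range(0, len(items), max_size)]
        if len(chunks) > 1 and len(chunks[-1]) < min_size:
            merged = chunks[-2] + chunks[-1]
            half = len(merged) // 2
            chunks[-2], chunks[-1] = merged[:half], merged[half:]
        return chunks

    def smallest_key(node):
        while node['children']:
            node = node['children'][0]
        return node['keys'][0]

    def bulk_load(sorted_keys, max_keys=4):
        # Assumes sorted_keys is non-empty. Knuth order m = max_keys + 1;
        # a non-root node gets ceil(m/2)..m children, i.e. ceil(m/2)-1..m-1 keys.
        max_children = max_keys + 1
        min_children = (max_children + 1) // 2
        min_keys = min_children - 1
        level = [{'keys': list(c), 'children': []}
                 for c in chunk(sorted_keys, max_keys, min_keys)]
        while len(level) > 1:
            level = [{'keys': [smallest_key(child) for child in group[1:]],
                      'children': group}
                     for group in chunk(level, max_children, min_children)]
        return level[0]

    # The troublesome 25-key case above loads without any deficient node
    # (the last two leaves are rebalanced to 2 and 3 keys):
    tree = bulk_load(list(range(1, 26)), max_keys=4)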
- StackOverflow has several threads about B-trees, and some of them about bulkloading. The most complete I found is this answer by didierc to the question Is there any algorithm for bulk loading in B-Tree? by nbbk. However, it contains detailed hints only, not a fully specified algorithm. It also does not reference any outside sources, so from Wikipedia's point of view it should be considered original research, not appropriate as a source for us. --CiaPan (talk) 20:44, 29 January 2018 (UTC)
- I looked at the cited Q&A, and it seems wrong. WP:NOR is good policy. There's little difference between the B-tree versions, and, with respect to loading, one promotes the key (and deletes it from the leaf) if it is an ordinary B-tree. Glrx (talk) 19:59, 26 February 2018 (UTC)
- @CiaPan and Glrx: I rewrote the section de novo. It still needs references (my bad), but it's shorter, simpler, and correct now. Is it at least an improvement? 23.83.37.241 (talk) 07:15, 4 April 2018 (UTC)
Etymology
User:EEng has cut down a long-standing section on Etymology. I'll agree that the original section buried the lead (the DB world has long wondered what the "B" stands for; I wondered, but I never had the courage to ask McCreight in the summer of 1979). But EEng's version cuts it down to nothing and is contradictory. The first sentence says Bayer and McCreight "did not explain what, if anything, the B stands for"; the second sentence then states, "McCreight has said that ..." (referring to a 2013 quotation that describes Bayer and McCreight debating B for Boeing, Bayer, and balanced). The original should be restored and eventually improved; it is better than the contradiction. Glrx (talk) 02:02, 4 May 2018 (UTC)
- Here's what the section says:
Rudolf Bayer and Ed McCreight invented the B-tree while working at Boeing Research Labs in 1971 (Bayer & McCreight 1972), but did not explain what, if anything, the B stands for: Boeing, balanced, broad, bushy, and Bayer have been suggested. (Comer 1979, p. 123 footnote 1)[1][2] McCreight has said that "the more you think about what the B in B-trees means, the better you understand B-trees."[1]
- There's nothing contradictory about this, and it faithfully reflects the sources. For contrast, here's the prior, bloated version:
Rudolf Bayer and Ed McCreight invented the B-tree while working at Boeing Research Labs in 1971 (Bayer & McCreight 1972), but they did not explain what, if anything, the B stands for. Douglas Comer explains:
The origin of "B-tree" has never been explained by the authors. As we shall see, "balanced," "broad," or "bushy" might apply. Others suggest that the "B" stands for Boeing. Because of his contributions, however, it seems appropriate to think of B-trees as "Bayer"-trees. (Comer 1979, p. 123 footnote 1)
Donald Knuth speculates on the etymology of B-trees in his May, 1980 lecture on the topic "CS144C classroom lecture about disk storage and B-trees", suggesting the "B" may have originated from Boeing or from Bayer's name.[4]
Ed McCreight answered a question on B-tree's name in 2013:
Bayer and I were in a lunchtime where we get to think [of] a name. And ... B is, you know ... We were working for Boeing at the time, we couldn't use the name without talking to lawyers. So, there is a B. [The B-tree] has to do with balance, another B. Bayer was the senior author, who [was] several years older than I am and had many more publications than I did. So there is another B. And so, at the lunch table we never did resolve whether there was one of those that made more sense than the rest. What really lives to say is: the more you think about what the B in B-trees means, the better you understand B-trees."[5]
- Now then... what information enlightening to the reader is present in that rambling trip to nowhere that isn't in the pithy version? EEng 02:27, 4 May 2018 (UTC)
Floor and ceiling symbols
The article uses what seem to be floor and ceiling symbols/brackets, e.g. ⌈ ⌉ and ⌊ ⌋. But I don't see how you can have a floor or ceiling when there is only one value between these symbols. Do these symbols have another meaning? Or have I missed something? FreeFlow99 (talk) 11:08, 27 January 2024 (UTC)
- @FreeFlow99: But why would you need more than one argument to Floor and ceiling functions??? --CiaPan (talk) 18:22, 27 January 2024 (UTC)
A Lacuna in Knuth's definition
While the existing definition accurately summarizes the properties of a B-tree, it omits an explicit statement about the minimum and maximum number of keys that can be stored in the leaves. This can potentially cause confusion, as the key count constraint for internal nodes applies equally to leaf nodes in B-trees. I propose appending the following clarification:
Each leaf node contains at least ⌈m/2⌉-1 and at most m−1 keys.
This addition would align the definition more closely with Knuth's original description and ensure consistency with the maximum key constraint already stated for internal nodes. Explicitly including this detail makes the definition complete and avoids ambiguity for readers who may be new to the concept of B-trees. MorningTwilight (talk) 14:11, 30 November 2024 (UTC)
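A quick worked instance of the proposed bounds, just to show how they read for small orders (illustrative only):

    m = 5:  at least ⌈5/2⌉ − 1 = 2 keys and at most 5 − 1 = 4 keys per leaf
    m = 6:  at least ⌈6/2⌉ − 1 = 2 keys and at most 6 − 1 = 5 keys per leaf
    m = 7:  at least ⌈7/2⌉ − 1 = 3 keys and at most 7 − 1 = 6 keys per leaf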