
Tree

WIP

Tree is an abstract mathematical structure, adopted and frequently used as a data type and data structure in programming, which in simplified terms consists of nodes that form a loopless graph resembling an upside-down tree when drawn on paper. Slightly more precisely a tree can be defined as a set of nodes of which each has exactly one of the other nodes assigned as its parent, except for the root node that has no parent, in such a way that there are no cycles (i.e. between any two nodes there always exists exactly one path). The definitions may vary slightly, for example in mathematics it's defined as an undirected graph whereas in computer science it may be seen as directed (because parents "point" to their children), but generally always the same idea underlies the definition: that of a hierarchical structure of nodes branching out from a single origin like a tree. A set of several disconnected trees is called a forest; additionally there also exist generalized notions and structures based on trees such as B and B+ trees whose leaf nodes may also be connected into a linked list, directed acyclic graphs where nodes may have more than one parent etc.

Tree is also a kind of very big plant that has a trunk and branches and this kind of stuff. It is no coincidence the programming structure is also called a tree -- it's so because the structure is similar to the physical, real life tree and we conveniently borrow more terms with real life analogies (root, branches, leaves, pruning, forest, ...).

It's also possible to give a beautiful, recursive definition of a tree: a tree is a node N0 that has a number (even zero) of children, each of which is itself a tree, such that no two of these subtrees share any node and none of them contains N0. I.e. a tree is a node whose children are themselves also trees, just like a real life tree is kind of composed of smaller and smaller versions of the big tree (which we call branches; see also fractal). In fact recursion is something inherently associated with trees: for example algorithms for traversing trees are typically recursive in nature.
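
To make the recursive nature a bit more tangible, here is a minimal C sketch (the type and function names are just made up for illustration): a node holds a value plus pointers to its child subtrees, and visiting the whole tree is most naturally written as a recursive function.

```
#include <stdio.h>

typedef struct Node
{
  int value;
  struct Node *left, *right; /* each child is itself a (sub)tree, NULL if missing */
} Node;

/* preorder traversal: process the node itself, then recursively its subtrees */
void printPreorder(const Node *node)
{
  if (node == NULL)
    return;

  printf("%d ",node->value);
  printPreorder(node->left);
  printPreorder(node->right);
}
```

The function mirrors the recursive definition: the base case is the empty (sub)tree, otherwise we handle the node and then each of its children, which are themselves trees.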

Insofar as programming goes, the key characteristic of trees is their hierarchical structure, i.e. the fact they consist of "levels": the first level is the root node, the second its children, the third their children etc. A close to real life example of a tree might be the taxonomy tree used in biology to classify living organisms by dividing them into big groups and subsequent subgroups such as kingdom, family and species. As for their significance, trees are among the most essential structures in both programming and mathematics; they belong more or less to intermediate programming. The importance of trees can hardly be overstated, they see frequent use for example as an indexing structure that greatly accelerates searching bigger amounts of data; this is well exemplified for instance by octrees (N-ary trees with N = 8) in physics engines subdividing a big cubic portion of the 3D world into 8 smaller cubes, each of which is subsequently split in a similar way and so on, down to the level of small spatial cells -- this representation helps quickly decide which objects are in the proximity of a given point and so resolve collisions quickly and efficiently. The same idea is used in 3D graphics to decide what's in the camera's view (it's a form of collision detection too), and is especially fitting for voxel-based games where each voxel is a final leaf node of the octree.
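
Just as an illustration of the idea (all names and fields here are made up, real engines differ in details), an octree node in C might look something like this:

```
/* illustrative octree node: an axis-aligned cube of 3D space that is either
   a leaf or is subdivided into 8 smaller child cubes */
typedef struct OctreeNode
{
  float centerX, centerY, centerZ;  /* center of this node's cube */
  float halfSize;                   /* half of the cube's edge length */
  struct OctreeNode *children[8];   /* the 8 child octants, all NULL in a leaf */
  int firstObject;                  /* e.g. index into a list of objects held by a leaf */
} OctreeNode;
```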

       666
       / \
      /   \
     96   99
     /    /\
    /    /  \
   69   66  71
   /\       /\
  /  \     /  \
 6    9   7    1

Example of a binary tree with 4 levels of nodes, i.e. of height 3 if height counts edges on the longest root-to-leaf path (as defined below). It's also a heap as each parent is greater in value than any of its children. It might be serialized as: (((6)69(9))96())666((66)99((7)71(1))).
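
This serialization can be produced by a simple recursive function. The following self-contained C program is a sketch of one way to do it (names are made up): it builds the pictured tree statically and prints exactly the string above -- a leaf prints just its value, an internal node prints (left)value(right), and a missing child of an internal node prints as an empty pair of parentheses.

```
#include <stdio.h>

typedef struct Node
{
  int value;
  struct Node *left, *right;
} Node;

void serialize(const Node *n)
{
  if (n == NULL)
    return;

  if (n->left == NULL && n->right == NULL)
  {
    printf("%d",n->value); /* leaf: just the value */
    return;
  }

  putchar('('); serialize(n->left); putchar(')');
  printf("%d",n->value);
  putchar('('); serialize(n->right); putchar(')');
}

int main(void)
{
  /* the tree from the picture, built statically */
  Node n6   = { 6,   NULL, NULL }, n9  = { 9,  NULL, NULL },
       n7   = { 7,   NULL, NULL }, n1  = { 1,  NULL, NULL },
       n69  = { 69,  &n6,  &n9  }, n66 = { 66, NULL, NULL },
       n71  = { 71,  &n7,  &n1  },
       n96  = { 96,  &n69, NULL },
       n99  = { 99,  &n66, &n71 },
       root = { 666, &n96, &n99 };

  serialize(&root); /* prints (((6)69(9))96())666((66)99((7)71(1))) */
  putchar('\n');

  return 0;
}
```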

Terminology: the first, topmost node without any parent is called the root node. Nodes that have no children are called leaf nodes; nodes that are neither the root nor a leaf are usually called internal nodes. We may also encounter terms such as subtrees and branches. Relationships between nodes are described by the same nouns used for family relationships, i.e.: parent node, child node, sibling node, ancestor node, descendant node etc., although some relationships are NOT in common use, e.g. "grandfather node", "cousin node" or "uncle node" (:D). Then we name properties such as the node depth (length of the path from the root to the node), tree height (maximum of all leaves' depths), tree size (total node count), tree breadth (leaf count) etc.
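
For example the height and the size are naturally computed recursively; a small C sketch, assuming the same kind of simple binary node as in the sketch above:

```
#include <stddef.h> /* for NULL */

typedef struct Node { int value; struct Node *left, *right; } Node;

/* height: maximum depth over all leaves; an empty tree gets -1 here so that
   a single node has height 0 (i.e. we count edges on the longest path) */
int treeHeight(const Node *node)
{
  if (node == NULL)
    return -1;

  int hL = treeHeight(node->left), hR = treeHeight(node->right);

  return 1 + (hL > hR ? hL : hR);
}

/* size: total number of nodes in the tree */
int treeSize(const Node *node)
{
  return node == NULL ? 0 : 1 + treeSize(node->left) + treeSize(node->right);
}
```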

We classify trees by various properties they may have, for example their height, "density", purpose ("decision tree", "search tree" ...), constraints they satisfy ("heap", ...), what kind of value the nodes store and where (in all nodes, just leaves, ...) or attributes such as being "balanced". Arguably the most important kinds of trees to introduce are N-ary trees in which any node is allowed to have no more than N children. N-ary trees, and especially binary trees (N = 2), are frequently encountered in programming because (for simplicity and performance) nodes in computer memory often have a fixed number of pointers to their child nodes preallocated, which imposes a limit on the maximum number of children. Knowing that a tree is N-ary has additional advantages too, for instance it's possible to easily compute the maximum size a tree of a given height will require in memory and so on. In the case of N = 1 the tree degenerates into a linked list.
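
For instance a full N-ary tree of height h has at most 1 + N + N^2 + ... + N^h nodes, which for N > 1 equals (N^(h+1) - 1) / (N - 1); a tiny C sketch of this count:

```
/* maximum number of nodes an N-ary tree of given height can have
   (height counting edges, i.e. a single node has height 0) */
unsigned long maxNodes(unsigned long n, unsigned long height)
{
  unsigned long total = 0, levelNodes = 1;

  for (unsigned long level = 0; level <= height; ++level)
  {
    total += levelNodes; /* add a completely full level */
    levelNodes *= n;     /* the next level can hold N times more nodes */
  }

  return total;
}
```

E.g. maxNodes(2,3) = 15: the pictured binary tree of height 3 has 10 nodes but could have at most 15, so knowing the per-node size in bytes immediately bounds the memory such a tree can ever need.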

TODO: more more more

Programming

Let's begin this section with a little practical note: trees are a very popular subject of programming classes and they've become a kind of exercise playground for students who are made by their teachers to write all the common algorithms like adding nodes, deleting them, searching, traversing the tree in various ways and so on. It's really quite a convenient structure, neither too simple nor overly complicated, and one on which many concepts can be demonstrated. However the catch is that the presented and required "textbook" implementation of trees in these classes is almost always very impractical and quite bad, it's an educational way of implementing trees that's never used in practice. It normally goes like this: each tree node is a data type (or object or something) with a value (whatever's stored in the node, for example a number) and a list/array of pointers to children nodes, and the process of adding/removing nodes involves dynamic memory allocation per every node, i.e. when adding a node we allocate memory of the exact size of the node (with malloc etc.), then we store the new node there and connect it to the parent. During node removal we disconnect the node and free the allocated memory. From the practical point of view, and even more so from the LRS point of view, this is a cookbook recipe that can be followed and understood by average "coders", but it's almost always a very suboptimal solution -- either overcomplicated and/or inefficient and slow (memory allocation per every operation is a performance killer and may also result in a more cache unfriendly memory layout) and/or buggy (memory leaks, segfaults, ...) etc. It can practically always be done much better, but the exact way depends on the specific case at hand.
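
For illustration, the criticized "textbook" approach looks roughly like the following (a sketch of the general pattern, not any specific textbook's code):

```
#include <stdlib.h>

typedef struct Node
{
  int value;
  struct Node *left, *right;
} Node;

/* textbook style: one malloc per added node */
Node *newNode(int value)
{
  Node *node = malloc(sizeof(Node)); /* may fail and return NULL! */

  if (node != NULL)
  {
    node->value = value;
    node->left = NULL;
    node->right = NULL;
  }

  return node;
}

/* textbook style: one free per removed node, here for a whole subtree */
void freeTree(Node *node)
{
  if (node == NULL)
    return;

  freeTree(node->left);
  freeTree(node->right);
  free(node);
}
```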

NOTE: from the storage point of view it would actually be better if children kept pointers to their parents rather than the other way around, as nodes (except for the root of course) always have exactly one parent, so this would avoid all the mess with lists/arrays of pointers. But the process of tree traversal goes from parents to children, so in code the pointers usually point in this direction. Nevertheless a file format for storing trees may still consider going with the former option.

For example the memory allocation issue -- if we really DO need dynamic allocation, which is almost never the case -- can be improved by allocating in bigger blocks; let's say we'll always allocate space for 128 nodes and then, once we run out of this space, we allocate places for another 128 nodes and so on. This will result in less frequent malloc/free calls, i.e. faster code, and will also guarantee nodes will be closer together in memory, which is better for cache (if we additionally use realloc, we'll be keeping ALL the nodes in a continuous array in memory, which is ideal). However even this will sometimes be too complicated and we can make do just with static allocation, or maybe in some cases our tree is a static, precomputed structure (happens a lot with game levels and so on) that won't change at runtime and so we don't have to bother with allocation and adding nodes at all. Sometimes we can even get away with literally representing the whole tree as a single serialized ASCII string in memory without having to create a whole module with special Node and Tree types and objects and methods and whatnot, there are scenarios where keeping it simple just works the best. Sometimes indices may be better than pointers etc. All of this is to say simply that we must consider the specific scenario we have at hand and choose the best implementation based on it.
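
As an example of the simpler approach, the following sketch keeps all nodes in one statically allocated array and uses indices instead of pointers (-1 meaning "no child"); everything stays in one continuous, cache friendly block and malloc is never called:

```
#define MAX_NODES 1024 /* compile-time limit instead of dynamic allocation */

typedef struct
{
  int value;
  int left, right;     /* indices into the nodes array, -1 means no child */
} Node;

Node nodes[MAX_NODES]; /* all nodes live in this one static array */
int nodeCount = 0;

/* "allocate" a node by simply taking the next unused slot, return its index
   or -1 if we've run out of space */
int addNode(int value)
{
  if (nodeCount >= MAX_NODES)
    return -1;

  nodes[nodeCount].value = value;
  nodes[nodeCount].left = -1;
  nodes[nodeCount].right = -1;

  return nodeCount++;
}
```

Building e.g. the pictured tree then goes like int root = addNode(666); nodes[root].left = addNode(96); and so on; deleting nodes can be handled e.g. by keeping a list of free slots, or not at all if the tree only ever grows -- again it depends on the specific case.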

TODO

See Also