Jump to content

Persistent data structure

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Ethan McCue (talk | contribs) at 20:43, 30 November 2018 (Remove unneeded specification on pure languages). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computing, a persistent data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in-place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjans' 1986 article.[1]

A data structure is partially persistent if all versions can be accessed but only the newest version can be modified. The data structure is fully persistent if every version can be both accessed and modified. If there is also a meld or merge operation that can create a new version from two previous versions, the data structure is called confluently persistent. Structures that are not persistent are called ephemeral.[2]

These types of data structures are particularly common in logical and functional programming.[2]

Partial versus Full Persistence

In the partial persistence model, a programmer may query any previous version of a data structure, but may only update the latest version. This implies a linear ordering among each version of the data structure.[3] In fully persistent model, both updates and queries are allowed on any version of the data structure. In some cases the performance characteristics of querying or updating older versions of a data structure may be allowed to degrade, as is true with the Rope data structure.[4] In addition, a data structure can be referred to as confluently persistent if, in addition to being fully persistent, two versions of the same data structure can be combined to form a new version which is still fully persistent.[5]

Techniques for Preserving Previous Versions

Copy on Write

One method for creating a persistent data structure is to use a platform provided ephemeral data structure such as an array to store the data in the data structure and copy the entirety of that data structure using copy on write semantics for any updates to the data structure. This is a inefficient technique because the entire backing data structure must be copied for each write, leading to exponential performance characteristics if many writes are performed.[6]

Fat node

Fat node method is to record all changes made to node fields in the nodes themselves, without erasing old values of the fields. This requires that nodes be allowed to become arbitrarily “fat”. In other words, each fat node contains the same information and pointer fields as an ephemeral node, along with space for an arbitrary number of extra field values. Each extra field value has an associated field name and a version stamp which indicates the version in which the named field was changed to have the specified value. Besides, each fat node has its own version stamp, indicating the version in which the node was created. The only purpose of nodes having version stamps is to make sure that each node only contains one value per field name per version. In order to navigate through the structure, each original field value in a node has a version stamp of zero.

Complexity of fat node

With using fat node method, it requires O(1) space for every modification: just store the new data. Each modification takes O(1) additional time to store the modification at the end of the modification history. This is an amortized time bound, assuming modification history is stored in a growable array. For access time, right version at each node must be found as the structure is traversed. If "m" modifications were to be made, then each access operation would have O(log m) slowdown resulting from the cost of finding the nearest modification in the array.

Path copying

With the path copying method a copy of all nodes is made on the path to any node which is about to be modified. These changes must then be cascaded back through the data structure: all nodes that pointed to the old node must be modified to point to the new node instead. These modifications cause more cascading changes, and so on, until the root node is reached.

Complexity of path copying

With m modifications, this costs O(log m) additive lookup time. Modification time and space are bounded by the size of the structure, since a single modification may cause the entire structure to be copied. That is O(m) for one update, and thus O(n²) preprocessing time.

A combination

Sleator, Tarjan et al. came up with a way to combine the techniques of fat nodes and path copying, achieving O(1) access slowdown and O(1) modification space and time complexity.

In each node, we store one modification box. This box can hold one modification to the node—either a modification to one of the pointers, or to the node’s key, or to some other piece of node-specific data—and a timestamp for when that modification was applied. Initially, every node’s modification box is empty.

Whenever we access a node, we check the modification box, and compare its timestamp against the access time. (The access time specifies the version of the data structure that we care about.) If the modification box is empty, or the access time is before the modification time, then we ignore the modification box and just deal with the normal part of the node. On the other hand, if the access time is after the modification time, then we use the value in the modification box, overriding that value in the node. (Say the modification box has a new left pointer. Then we’ll use it instead of the normal left pointer, but we’ll still use the normal right pointer.)

Modifying a node works like this. (We assume that each modification touches one pointer or similar field.) If the node’s modification box is empty, then we fill it with the modification. Otherwise, the modification box is full. We make a copy of the node, but using only the latest values.(That is, we overwrite one of the node’s fields with the value that was stored in the modification box.) Then we perform the modification directly on the new node, without using the modification box. (We overwrite one of the new node’s fields, and its modification box stays empty.) Finally, we cascade this change to the node’s parent, just like path copying. (This may involve filling the parent’s modification box, or making a copy of the parent recursively. If the node has no parent—it’s the root—we add the new root to a sorted array of roots.)

With this algorithm, given any time t, at most one modification box exists in the data structure with time t. Thus, a modification at time t splits the tree into three parts: one part contains the data from before time t, one part contains the data from after time t, and one part was unaffected by the modification.

Complexity of the combination

Time and space for modifications require amortized analysis. A modification takes O(1) amortized space, and O(1) amortized time. To see why, use a potential function ϕ, where ϕ(T) is the number of full live nodes in T . The live nodes of T are just the nodes that are reachable from the current root at the current time (that is, after the last modification). The full live nodes are the live nodes whose modification boxes are full.

Each modification involves some number of copies, say k, followed by 1 change to a modification box. (Well, not quite—you could add a new root—but that doesn’t change the argument.) Consider each of the k copies. Each costs O(1) space and time, but decreases the potential function by one. (First, the node we copy must be full and live, so it contributes to the potential function. The potential function will only drop, however, if the old node isn’t reachable in the new tree. But we know it isn’t reachable in the new tree—the next step in the algorithm will be to modify the node’s parent to point at the copy. Finally, we know the copy’s modification box is empty. Thus, we’ve replaced a full live node with an empty live node, and ϕ goes down by one.) The final step fills a modification box, which costs O(1) time and increases ϕ by one.

Putting it all together, the change in ϕ is Δϕ =1− k.Thus, we’ve paid O(k +Δϕ)= O(1) space and O(k +Δϕ +1) = O(1) time

Examples of persistent data structures

Perhaps the simplest persistent data structure is the singly linked list or cons-based list, a simple list of objects formed by each carrying a reference to the next in the list. This is persistent because we can take a tail of the list, meaning the last k items for some k, and add new nodes on to the front of it. The tail will not be duplicated, instead becoming shared between both the old list and the new list. So long as the contents of the tail are immutable, this sharing will be invisible to the program.

Many common reference-based data structures, such as red–black trees,[7] stacks,[8] and treaps,[9] can easily be adapted to create a persistent version. Some others need slightly more effort, for example: queues, dequeues, and extensions including min-deques (which have an additional O(1) operation min returning the minimal element) and random access deques (which have an additional operation of random access with sub-linear, most often logarithmic, complexity).

There also exist persistent data structures which use destructive[clarification needed] operations, making them impossible to implement efficiently in purely functional languages (like Haskell outside specialized monads like state or IO), but possible in languages like C or Java. These types of data structures can often be avoided with a different design. One primary advantage to using purely persistent data structures is that they often behave better in multi-threaded environments.

Linked lists

Singly linked lists are the bread-and-butter data structure in functional languages.[10] In ML-derived languages and Haskell, they are purely functional because once a node in the list has been allocated, it cannot be modified, only copied or destroyed. Note that ML itself is not purely functional.

Consider the two lists:

xs = [0, 1, 2]
ys = [3, 4, 5]

These would be represented in memory by:

where a circle indicates a node in the list (the arrow out representing the second element of the node which is a pointer to another node).

Now concatenating the two lists:

zs = xs ++ ys

results in the following memory structure:

Notice that the nodes in list xs have been copied, but the nodes in ys are shared. As a result, the original lists (xs and ys) persist and have not been modified.

The reason for the copy is that the last node in xs (the node containing the original value 2) cannot be modified to point to the start of ys, because that would change the value of xs.

Trees

Consider a binary search tree,[10] where every node in the tree has the recursive invariant that all subnodes contained in the left subtree have a value that is less than or equal to the value stored in the node, and subnodes contained in the right subtree have a value that is greater than the value stored in the node.

For instance, the set of data

xs = [a, b, c, d, f, g, h]

might be represented by the following binary search tree:

A function which inserts data into the binary tree and maintains the invariant is:

 fun insert (x, E) = T (E, x, E)
   | insert (x, s as T (a, y, b)) =
        if x < y then T (insert (x, a), y, b)
        else if x > y then T (a, y, insert (x, b))
        else s

After executing

ys = insert ("e", xs)

we end up with the following:

Notice two points: first, the original tree (xs) persists. Second, many common nodes are shared between the old tree and the new tree. Such persistence and sharing is difficult to manage without some form of garbage collection (GC) to automatically free up nodes which have no live references, and this is why GC is a feature commonly found in functional programming languages.

Persistent hash array mapped trie

A persistent hash array mapped trie is a specialized variant of a hash array mapped trie that will preserve previous versions of itself on any updates. It is often used to implement a general purpose persistent map data structure.[11]

Hash array mapped tries were originally described in a 2001 paper by Phil Bagwell entitled "Ideal Hash Trees". This paper presented a mutable Hash table where "Insert, search and delete times are small and constant, independent of key set size, operations are O(1). Small worst-case times for insert, search and removal operations can be guaranteed and misses cost less than successful searches".[12] This data structure was then modified by Rich Hickey to be fully persistent for use in the Clojure programming language.[13]

Conceptually, hash array mapped tries work similar to any generic tree in that they store nodes hierarchically and retrieve them by following a path down to a particular element. The key difference is that Hash Array Mapped Tries first use a hash function to transform their lookup key into a (usually 32 or 64 bit) integer. The path down the tree is then determined by using slices of the binary representation of that integer to index into a sparse array at each level of the tree. The leaf nodes of the tree behave similar to the buckets used to construct hash tables and may or may not contain multiple candidates depending on hash collisions.[11]

Most implementations of persistent hash array mapped tries use a branching factor of 32 in their implementation. This means that in practice while insertions, deletions, and lookups into a persistent hash array mapped trie have a computational complexity of O(log n), for most applications they are effectively constant time, as it would require an extremely large number of entries to make any operation take more than a dozen steps.[14]

Usage in programming languages

Haskell

Haskell is a pure functional language and therefore does not allow for mutation. Therefore all data structures in the language are persistent, as it is impossible to not preserve the previous state of a data structure with functional semantics.[15] This is because any change to a data structure that would render previous versions of a data structure invalid would violate referential transparency.

In its standard library Haskell has efficient persistent implementations for linked lists[16], Maps (implemented as size balanced trees)[17], and Sets[18] among others.[19]

Clojure

Like many programming languages in the Lisp family, Clojure contains an implementation of a linked list, but unlike other dialects its implementation of a Linked List has enforced persistence instead of being persistent by convention.[20] Clojure also has syntax literals for efficient implementations of persistent vectors, maps, and sets based on persistent hash array mapped tries. These data structures implement the mandatory read-only parts of the Java collections framework.[21]

The designers of the Clojure language advocate the use of persistent data structures over mutable data structures because they have value semantics which gives the benefit of making them freely shareable between threads with cheap aliases, easy to fabricate, and language independent.[22]

These data structures form the basis of Clojure's support for parallel computing since they allow for easy retries of operations to sidestep data races and atomic compare and swap semantics.[23]

Elm

The Elm programming language is purely functional like Haskell, which makes all of its data structures persistent by necessity. It contains persistent implementations of linked lists as well as persistent arrays, dictionaries, and sets.[24]

Elm uses a custom virtual DOM implementation that takes advantage of the persistent nature of Elm data. As of 2016 it was reported by the developers of Elm that this virtual DOM allows the Elm language to render HTML faster than the popular JavaScript frameworks React, Ember, and Angular.[25]

JavaScript

The popular JavaScript frontend framework React is frequently used along with a state management system that implements the Flux architecture[26][27], a popular implementation of which is the JavaScript library Redux. The Redux library is inspired by the state management pattern used in the Elm programming language, meaning that it mandates that users treat all data as persistent.[28] As a result, the Redux project recommends that in certain cases users make use of libraries for enforced and efficient persistent data structures. This reportedly allows for greater performance than when comparing or making copies of regular JavaScript objects.[29]

One such library of persistent data structures Immutable.js is based on the data structures made available and popularized by Clojure and Scala.[30] It is mentioned by the documentation of Redux as being one of the possible libraries that can provide enforced immutability.[29]

Scala

The Scala programming language promotes the use of persistent data structures for implementing programs using "Object-Functional Style".[31] Scala contains implementations of many Persistent data structures including Linked Lists, Red–black trees, as well as persistent hash array mapped tries as introduced in Clojure.[32]

Garbage Collection

Because persistent data structures are often implemented in such a way that successive versions of a data structure share underlying memory[33] ergonomic use of such data structures generally requires some form of automatic garbage collection system such as reference counting or mark and sweep.[34] In some platforms where persistent data structures are used it is an option to not use garbage collection which, while doing so can lead to memory leaks, can in some cases have a positive impact on the overall performance of an application.[35]

See also

References

  1. ^ Driscoll JR, Sarnak N, Sleator DD, Tarjan RE (1986). Making data structures persistent. pp. 109–121. CiteSeerX 10.1.1.133.4630. doi:10.1145/12130.12142. ISBN 978-0-89791-193-1. {{cite book}}: |journal= ignored (help)
  2. ^ a b Kaplan, Haim (2001). "Persistent data structures". Handbook on Data Structures and Applications.
  3. ^ Conchon, Sylvain; Filliâtre, Jean-Christophe, "Semi-persistent Data Structures", Programming Languages and Systems, Springer Berlin Heidelberg, pp. 322–336, ISBN 9783540787389, retrieved 2018-11-30
  4. ^ Tiark, Bagwell, Philip Rompf, (2011). RRB-Trees: Efficient Immutable Vectors. OCLC 820379112.{{cite book}}: CS1 maint: extra punctuation (link) CS1 maint: multiple names: authors list (link)
  5. ^ Brodal, Gerth Stølting; Makris, Christos; Tsichlas, Kostas (2006), "Purely Functional Worst Case Constant Time Catenable Sorted Lists", Lecture Notes in Computer Science, Springer Berlin Heidelberg, pp. 172–183, ISBN 9783540388753, retrieved 2018-11-30
  6. ^ "Copy-on-write - Semantic Scholar". www.semanticscholar.org. Retrieved 2018-11-30.
  7. ^ Neil Sarnak; Robert E. Tarjan (1986). "Planar Point Location Using Persistent Search Trees" (PDF). Communications of the ACM. 29 (7): 669–679. Bibcode:1985CACM...28...22S. doi:10.1145/6138.6151.
  8. ^ Chris Okasaki. "Purely Functional Data Structures (thesis)" (PDF). {{cite journal}}: Cite journal requires |journal= (help)
  9. ^ Liljenzin, Olle (2013). "Confluently Persistent Sets and Maps". arXiv:1301.3388. Bibcode:2013arXiv1301.3388L. {{cite journal}}: Cite journal requires |journal= (help)
  10. ^ a b This example is taken from Okasaki. See the bibliography.
  11. ^ a b BoostCon (2017-06-13), C++Now 2017: Phil Nash "The Holy Grail!? A Persistent Hash-Array-Mapped Trie for C++", retrieved 2018-10-22
  12. ^ Phil, Bagwell (2001). "Ideal Hash Trees". {{cite journal}}: Cite journal requires |journal= (help)
  13. ^ "Are We There Yet?". InfoQ. Retrieved 2018-10-22.
  14. ^ Steindorfer, Michael J.; Vinju, Jurgen J. (2015-10-23). "Optimizing hash-array mapped tries for fast and lean immutable JVM collections". ACM SIGPLAN Notices. 50 (10): 783–800. doi:10.1145/2814270.2814312. ISSN 0362-1340.
  15. ^ "Haskell Language". www.haskell.org. Retrieved 2018-10-22.
  16. ^ "Data.List". hackage.haskell.org. Retrieved 2018-10-23.
  17. ^ "Data.Map.Strict". hackage.haskell.org. Retrieved 2018-10-23.
  18. ^ "Data.Set". hackage.haskell.org. Retrieved 2018-10-23.
  19. ^ "Performance/Arrays - HaskellWiki". wiki.haskell.org. Retrieved 2018-10-23.
  20. ^ "Clojure - Differences with other Lisps". clojure.org. Retrieved 2018-10-23.
  21. ^ "Clojure - Data Structures". clojure.org. Retrieved 2018-10-23.
  22. ^ "Keynote: The Value of Values". InfoQ. Retrieved 2018-10-23.
  23. ^ "Clojure - Atoms". clojure.org. Retrieved 2018-11-30.
  24. ^ "core 1.0.0". package.elm-lang.org. Retrieved 2018-10-23.
  25. ^ "blog/blazing-fast-html-round-two". elm-lang.org. Retrieved 2018-10-23.
  26. ^ "Flux | Application Architecture for Building User Interfaces". facebook.github.io. Retrieved 2018-10-23.
  27. ^ Mora, Osmel (2016-07-18). "How to handle state in React". React Ecosystem. Retrieved 2018-10-23.
  28. ^ "Read Me - Redux". redux.js.org. Retrieved 2018-10-23.
  29. ^ a b "Immutable Data - Redux". redux.js.org. Retrieved 2018-10-23.
  30. ^ "Immutable.js". facebook.github.io. Retrieved 2018-10-23.
  31. ^ "The Essence of Object-Functional Programming and the Practical Potential of Scala - codecentric AG Blog". codecentric AG Blog. 2015-08-31. Retrieved 2018-10-23.
  32. ^ ClojureTV (2013-01-07), Extreme Cleverness: Functional Data Structures in Scala - Daniel Spiewak, retrieved 2018-10-23
  33. ^ "Vladimir Kostyukov - Posts / Slides". kostyukov.net. Retrieved 2018-11-30.
  34. ^ "http://wiki.c2.com/?ImmutableObjectsAndGarbageCollection". wiki.c2.com. Retrieved 2018-11-30. {{cite web}}: External link in |title= (help)
  35. ^ "The Last Frontier in Java Performance: Remove the Garbage Collector". InfoQ. Retrieved 2018-11-30.

Further reading