Reference (computer science)

In computer science, a reference is an object containing information which refers to data stored elsewhere, as opposed to containing the data itself. Accessing the value referred to by a reference is called dereferencing it. References are fundamental to constructing many data structures and in exchanging information between different parts of a program.

Address analogy

A reference may be compared to the address of a house. It is a small identifier from which it is possible to find a potentially much larger object. Finding a house based on its address is analogous to dereferencing a reference.

In a more complicated example, suppose you leave a forwarding address in your old house each time you move. A person could visit your first house, then follow the forwarding address to the next house, and so on until they finally find your current house. This is analogous to how references are used in singly linked lists.

Another benefit of house addresses is that they're much easier to deal with than actual houses. Say you want to be able to easily locate people on your street based on their last name. One way to do this is to use a large crane to physically pick up and rearrange all the houses based on the last names of the residents. A much easier solution is to make a list of addresses of people on your street and sort it by their last names. References have the same benefit: it is possible to manipulate references to data without actually having to modify the data itself, which in some cases can be much more efficient.

Benefits of references

References increase flexibility in where objects can be stored, how they are allocated, and how they are passed between areas of code. As long as we can access a reference to the data, we can access the data through it, and the data itself need not be moved. They also make sharing of data between different code areas easier; each keeps a reference to it.

The mechanism of references, if varying in implementation, is a fundamental programming language feature common to nearly all modern programming languages. Even some languages that support no direct use of references have some internal or implicit use. For example, the call by reference calling convention can be implemented with either explicit or implicit use of references.

Pointers are the most primitive and error-prone but also one of the most powerful and efficient types of references, storing only the address of an object in memory. Smart pointers are opaque data structures that act like pointers but can only be accessed through particular methods.

A file handle, or handle is a type of reference used to abstract file content. It usually represents both the file itself, as when requesting a lock on the file, and a specific position within the file's content, as when reading a file.

Formal representation

More generally, a reference can be considered as a piece of data that allows unique retrieval of another piece of data. This includes primary keys in databases and keys in an associative array. If we have a set of data D, any well-defined (single-valued) function from D onto D ∪ {null} defines a type of reference, where null is the image of a piece of data not referring to anything meaningful.

An alternative representation of such a function is a directed graph called a reachability graph. Here, each datum is represented by a vertex and there is an edge from u to v if the datum in u refers to the datum in v. The maximum out-degree is one. These graphs are valuable in garbage collection, where they can be used to separate accessible from inaccessible objects.

External and internal storage

In many data structures, large, complex objects are composed of smaller objects. These objects are typically stored in one of two ways:

With internal storage, the contents of the smaller object are stored inside the larger object.
With external storage, the smaller objects are allocated in their own location, and the larger object only stores references to them.

Internal storage is usually more efficient, because there is a space cost for the references and dynamic allocation metadata, and a time cost associated with dereferencing a reference and with allocating the memory for the smaller objects. Internal storage also enhances locality of reference by keeping different parts of the same large object close together in memory. However, there are a variety of situations in which external storage is preferred:

If the data structure is recursive, meaning it may contain itself. This cannot be represented in the internal way.
If the larger object is being stored in an area with limited space, such as the stack, then we can prevent running out of storage by storing large component objects in another memory region and referring to them using references.
If the smaller objects may vary in size, it's often inconvenient or expensive to resize the larger object so that it can still contain them.
References are often easier to work with and adapt better to new requirements.

Some languages, such as Java and Scheme, do not support internal storage. In these languages, all objects are uniformly accessed through references.

Equality

In languages where all objects are accessed through references (like Java and Scheme), there becomes a need to test for two different types of equality:

whether two references reference the same object. In Java, this is done using the equality operator (==). In Scheme, this is done using the procedure eq?.
whether the objects referenced by two references are equal in some sense (e.g. their contents are the same). In Java, this is done using the .equals() method, which all reference types possess as it is inherited from Object. In Scheme, this is done with the procedure equal?.

Obviously the first type of equality implies the second, but the converse is not necessarily true.

External links

Pointer Fun With Binky Introduction to pointers in a 3 minute educational video - Stanford Computer Science Education Library