User:Jonathans/clique

From Wikipedia, the free encyclopedia

this is an algorithm that finds every clique in an undirected graph; pretty much an exhaustive search of the nebulous realms of NP-completeness. when it completes, it has found not just every clique but also the size of the largest clique (of course), which gives a lower bound on the number of colors necessary for an optimal minimum colorization of the graph -- a clique of k nodes clearly forces at least k colors, as is easy to see if you just think of how to color a complete graph of any size. (the two numbers coincide for perfect graphs, though not in general; both problems were shown to be NP-complete back in the day.)

synopsis

i believe that, since this algorithm reduces the problem to set operations, if set operations could be O(1) -- which is to say, if we had primitives for sets and set operations in our programming languages that our hardware could actually make "primitive" in a very real sense -- then this problem could be analyzed for both best- and worst-case running times. that'd be nice, huh.
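to make the machine-word idea concrete, here is a tiny sketch (python, with plain integers standing in for word-sized masks; the helper name is mine, not part of the algorithm). node i is in the set iff bit i is 1, and each set operation is a single bitwise instruction:

```python
def to_set(bits):
    """decode a bitmask back into a python set of node indices."""
    return {i for i in range(bits.bit_length()) if bits >> i & 1}

a = (1 << 1) | (1 << 2)          # the set {1, 2}
b = (1 << 2) | (1 << 3)          # the set {2, 3}

union        = a | b             # one OR instruction
intersection = a & b             # one AND instruction
difference   = a & ~b            # one AND-NOT

print(to_set(union), to_set(intersection), to_set(difference))
```

python integers are arbitrary-precision, so this sketch quietly sidesteps the 32/64-node ceiling discussed below -- at the cost of the operations no longer being truly O(1) once the masks outgrow a register.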

this algorithm is effectively public domain, sittin' here all waiting to be coded up and stuff. it is probably only capable of handling graphs up to 32, 64, maybe 128 nodes in size (guess why). poor little guy. were it to be of any merit or value, wikipedia has the timestamp right here for all interested parties to see exactly who wrote it and when. obviously i don't believe this is going to get me a nobel peace prize, but if someone else uses this to prove once and for all either P=NP or P≠NP, then my bases are covered. besides, the last time anyone looked at cliques with a gleam in their eye was 1972.

pseudocode

you may find it easier to read the next section first, while referring back to the code in this section intermittently, before reading the code straight through.

int get_minimum_colorization(graph g)
{
  k = 2;

  // set of original neighbor sets for each node
  //
  neighbors = build_neighbors(g);

  // set containing the set of nodes in a clique (clique::nodes),
  // the set of neighbor sets (clique::neighbors)
  //
  cliques = build_cliques(g, neighbors);

  while(!empty(cliques))
  {
    c = cliques.pop();

    n = intersect_all(c.neighbors);

    if (empty(n))
      continue;

    // the popped clique c is about to grow by one node
    //
    k = max(k, size(c.nodes) + 1);

    foreach(node v in n)
    {
      // find the union of the set containing the new node and the nodes already in this clique
      //
      tmp = union(c.nodes, set(v));

      // look to see if we have already found this clique with a checksum in N
      //
      if (find_clique(cliques, tmp))
        continue;

      // duplicate the clique in order to mutate it.
      //
      d = copy_clique(c);

      // remove the new node from the existing neighbor sets of this clique.
      // note that this does not change the original neighbors table.
      //
      for(i = 0; i < d.neighbors.size; i++)
        d.neighbors[i] = difference(d.neighbors[i], set(v));

      // find the "original" neighbors of the new node, remove the clique's existing nodes,
      // and add the neighbor set to the set-of-sets of this clique's neighbors
      //
      d.neighbors = union(d.neighbors, set(difference(neighbors[v], c.nodes)));

      // update the clique's nodes to represent the new clique.
      //
      d.nodes = tmp;

      // put it "back" into the cliques table.
      //
      add_clique(cliques, d);
    }
  }

  return k;
}

theory

so how the crap does this work? well, when you are given a graph, you are given all the cliques of size two in it (the edges). so, think dynamic programming. if you could inspect your existing cliques to see whether each belongs to any larger clique, you could eventually find the biggest one -- especially if you can also say that, once a given clique is not part of any larger clique, you can throw everything you know about it away, even while larger cliques may still exist elsewhere.

the algorithm exhaustively searches the list of size-n cliques and tries to build size-(n+1) cliques out of each. when it can, it checks whether the new cliques have already been discovered, then memoizes the unique ones. when no more can be found, the last clique popped is among the largest (by then the queue itself is empty, but c still references that clique). there may be several of the same size as c, but who cares; we were only looking for the size, not the sets themselves.
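here is a runnable sketch of that search (python; frozensets stand in for the machine-word sets, and the function name is mine). it queues the size-2 cliques, grows each by its common neighbors, memoizes everything it has seen, and reports the largest size found:

```python
from collections import deque

def largest_clique_size(edges):
    """breadth-first search over cliques, as sketched above: queue the
    size-2 cliques, grow each by its common neighbors, memoize what
    has been seen. a sketch, not a tuned implementation."""
    neighbors = {}                       # original neighbor sets per node
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    queue = deque(frozenset(e) for e in edges)
    seen = set(queue)                    # memoized cliques
    best = 2 if edges else 0

    while queue:
        c = queue.popleft()
        # nodes adjacent to every node already in the clique
        n = set.intersection(*(neighbors[u] for u in c)) - c
        for v in n:
            grown = c | {v}
            if grown in seen:            # same clique, different order
                continue
            seen.add(grown)
            best = max(best, len(grown))
            queue.append(grown)
    return best

# triangle {1, 2, 3} with a pendant node 4 hanging off node 3
print(largest_clique_size([(1, 2), (1, 3), (2, 3), (3, 4)]))
```

frozensets already hash order-independently, so the `grown in seen` test plays the role of the checksum lookup in the pseudocode.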

how can you make sure you didn't already find a clique? presumably, the reason you can perform O(1) set operations is the same reason you can do this swiftly: it has something to do with being able to represent a set as a single number which fits into a register in your machine. just see if the number is already in the memoized list (which you can do swiftly with a neato hash like bob jenkins'). that beats a sorted-list comparison algorithm, don't it?
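a minimal sketch of that check (python; the encoding is mine): a node set becomes one integer, so {1 2 4} and {2 4 1} collapse to the same key, and the keys live in a hash set whose membership test is effectively constant time for word-sized keys:

```python
def mask(nodes):
    """encode a set of node indices as a single integer bitmask."""
    m = 0
    for v in nodes:
        m |= 1 << v
    return m

found = set()                     # hash table of already-seen cliques
found.add(mask({1, 2, 4}))

print(mask({2, 4, 1}) in found)   # same clique, different order
```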

note that you'll operate on all the size-2 cliques before you can get to the size-3 ones, because they are queued (FIFO). this is a breadth-first search. this is an important point: if this algorithm does not find a size-3 clique, there cannot be a size-4 clique. as an aside, i propose that a depth-first search could be more effective for finding the largest clique if it could be given hints about where to look; for example, which size-2 cliques have the greatest number of neighbors? and when intersecting those neighbor sets, which clique provided the largest intersected neighbor set? stuff like that.

philosophy, ech

to wax all philosophic and GEB:EGB on you, consider the plausibility of the following statement: a (number|symbol) is stored in (a register|spontaneous temporary ganglia) in your (machine|brain). okay. whatever. we can handle 32 bits at a time, which means sets up to size 32: no clique can be greater than 32 nodes in size, no node can have more than 32 neighbors, and therefore no graph with more than 32 vertices can be operated on. whoops. the point is, though, that i think hofstadter is more-or-less right about chunking. i feel that P behaves like a proper subset of NP -- we can solve NP problems deterministically so long as they are limited in size -- so long as hardware is as limited as it is. i think that probably even our brains are physically limited and we merely approximate the infinite. he suggests -- very, very roughly -- we can conceive of it because we can't experience our own limitations (e.g., count our own neurons a priori). okay. so create a machine that knows sets, but not that its sets are stored as integers. or something. whatever. yikes.

example trace

an implementation detail: empty() is used loosely against both cliques and sets, but a clique can itself be represented as a set if you consider the first element to always be the set of nodes in the clique and the second to be a set containing all of the neighbor sets. here is a two-node clique example where node 1 has neighbor set {2 3} and node 2 has neighbor set {1}.

{nodes neighbors} ::= {{1 2} {{2 3} {1}}}

note that before iterating it does not matter that a neighbor set contains a node from the clique; the first set intersection will take care of that (because the initial cliques are size 2 and undirected graphs tend not to have self-loops, the intersection rather casually removes the clique's own nodes from the shared neighbor set).

assume now that we have a four node graph, and the first clique looked like this:

{nodes neighbors} ::= {{1 2} {{2 3 4} {1 4}}}

the first iteration finds that n = {4}. now there must be a size-3 clique, {1 2 4}, but we can't just go creating them all willy-nilly. what if the set {2 4 1} was already found? it is the same clique!

tmp = union( {1 2}, {4} );
tmp = {1 2} ∪ {4}
tmp = {1 2 4}

find_clique(cliques, {1 2 4}) will take care of our problem. it can even know where in the list to start searching, if the list is sorted/indexed/hashed somehow. there is of course a function that maps sets into ℕ, but when the sets get bigger than 32 or 64 items there simply aren't any machines whose registers can hold numbers that big. yet. when there are, however, this is fundamentally an O(1) check, too (there is probably a lot of theory in some grad-level algorithms course about what O(1) really means. once you slide around between RISC and CISC enough you find yourself in a gray area. again, i think it's solely dependent on hardware capabilities).

continuing on, we now mutate a copy of the original clique. this is in case |n| > 1, as each new tmp must of course be unique. the first thing to do is remove v from all of the neighbor sets in d:

d.neighbors = {{2 3 4} {1 4}}

{2 3 4}\{4} = {2 3}
{1 4}\{4} = {1}

so d.neighbors becomes: {{2 3} {1}}

now we have some of the neighbors of our new clique, but we're missing node 4's neighbors, and our clique d doesn't actually have node 4 in it yet. i haven't yet said what neighbors node 4 has, but let us suppose it is a neighbor to every other node we have mentioned thus far. finishing out -- and keep an eye out as i slip from code into set notation:

d.neighbors = union({{2 3} {1}}, set(neighbors[4] \ {1 2}));
d.neighbors = union({{2 3} {1}}, set({1 2 3} \ {1 2}));
d.neighbors = {{2 3} {1}} ∪ {{1 2 3} \ {1 2}}
d.neighbors = { {2 3} {1} {3} }
d.nodes = tmp
d.nodes = {1 2 4}

we have now found our size-3 clique, incremented k appropriately, and reinserted d into the clique list. i will leave it as an exercise to the reader to divine and draw this graph and try out the next edge in the clique list.
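the step traced above can be replayed in a few lines (python; the neighbor sets are taken straight from the trace, with node 4 adjacent to everyone as assumed):

```python
# the four-node graph from the trace: N(1)={2 3 4}, N(2)={1 4},
# N(3)={1 4}, N(4)={1 2 3}
neighbors = {1: {2, 3, 4}, 2: {1, 4}, 3: {1, 4}, 4: {1, 2, 3}}

c = {1, 2}                                  # the starting clique
n = (neighbors[1] & neighbors[2]) - c       # the trace finds n = {4}

for v in n:
    new_nodes = c | {v}
    # v is trimmed out of the old neighbor sets, then v's own
    # neighbors (minus the clique's existing nodes) are appended
    new_neighbor_sets = [neighbors[u] - {v} for u in sorted(c)]
    new_neighbor_sets.append(neighbors[v] - c)

print(n, new_nodes, new_neighbor_sets)
```

the final line reproduces the trace's result: nodes {1 2 4} with neighbor sets {{2 3} {1} {3}}.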

time complexity

so first of all, i want to point out a few obvious facts.

  • the "first pass" on the outer while() loop will always run in time |E| * f(m) where m is the number of size-3 cliques found, and the timing of the innermost loop is hidden away by f.
  • even in a complete graph there must be fewer size-n+1 cliques than size-n cliques, for any reasonable value of n.

inside the loop there are a number of fairly complex steps, most of which i have asserted before to be O(1). the inner for() loop, however, performs its body once per neighbor set -- that is, once per node already in the clique. this is followed by one more very similar operation on the same set of neighbor sets, so i suggest calling the whole thing a bundle of k time.

this is an interesting prospect because, as the number of cliques at each level dwindles toward zero, k is increasing from two to its final value, which (loosely) has an upper bound of |V|.

furthermore, given just |V| and |E|, we can know an even tighter ceiling on the clique size -- there cannot be any clique of size three in a graph of three nodes if there are only two edges, and so on. surely there is a formula to calculate it.
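one such formula falls out of edge counting alone: a clique of k nodes contains k(k-1)/2 edges, so k(k-1)/2 ≤ |E| bounds k from above. a sketch (the function name is mine):

```python
import math

def clique_size_ceiling(num_nodes, num_edges):
    """upper bound on the largest clique from |V| and |E| alone:
    a size-k clique needs k*(k-1)/2 edges, so k*(k-1) <= 2*|E|
    gives k <= (1 + sqrt(1 + 8*|E|)) / 2; k also cannot exceed |V|."""
    return min(num_nodes, (1 + math.isqrt(1 + 8 * num_edges)) // 2)

# three nodes but only two edges: no triangle possible
print(clique_size_ceiling(3, 2))
```

this only says which clique sizes are arithmetically possible; the graph may of course fall well short of the bound.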