# Graph Codes

## Introduction

Many computational problems use data types that are variations on an abstract graph. In its simplest form, an abstract graph can be thought of as a collection of objects called nodes, any distinct pair of which may or may not be connected.

In order to keep track of things, we usually label the nodes, that is, number them from 1 to N. Generally, the labels are unimportant. The interesting features of the graph (such as whether it has a circuit, or is connected) exist independent of the node labels.

Since we store a graph with node numberings, it is possible to create two representations of the same graph which look quite different. And, in a more realistic setting, if we have a particular graph, and we are searching through a collection of other graphs to see if any match it, we might not realize we have a match, because the node labels differ - although the underlying graphs are the same.

Suppose our job, then, is to determine whether two graphs represent the same physical object, ignoring the artificial differences that might come about simply from node labeling. It's hardly possible to draw the graphs and compare them; even in real life, that's not going to be a reliable technique! But how can we store two graphs in a computer, and have it determine whether they are the same?

A useful technique is to come up with some kind of identification number which reflects only the important information in a graph; we might try to do this in such a way so that two graphs have the same ID if and only if they represent the same physical object. Thus, we can convert the problem of geometric comparison to one of numeric comparison, something the computer will have no problem with.

We really do need to think about this problem, because of certain biomedical applications we will be dealing with.

We will be consider a collection of objects called "chemical subsystems". Each such object will be a set of atoms with some chemical bonds between certain pairs of atoms. We will start with a standard set of such objects, and then we will "encounter" new objects. Many of these new objects will be essentially identical to one of the standard objects. Our task is then, for each new object, to determine if it is the same as one of the standard ones.

To do this in an automatic way, we will try to reformulate our problem as a kind of abstract graph problem, and then come up with a numeric code that allows us to compare two graphs.

To carry out this task, we need to generalize the notion of a graph to that of a directed color multigraph. Then we consider the notion of how two objects can be "essentially identical" or "isomorphic". We then discuss how to set up an "identification number", here called a code, that can settle the question of when two objects are the same.

## Concepts from Graph Theory

Only some of the properties of a chemical subsystem need to be considered to carry out a comparison. These qualities can be abstracted and conveniently represented in terms of an abstract data structure called a directed color multigraph. The properties of this data structure can be defined in terms of the following sequence of definitions:

• A graph is a set of nodes N, indexed by the integers, and a symmetric adjacency relationship EDGE(*,*) defined for pairs of distinct nodes and returning the value 0 or 1. When EDGE(N1,N2) is 1, we say there is an edge between nodes N1 and N2.
• A multigraph is a set of nodes N and a symmetric adjacency relationship EDGES(*,*) defined for pairs of distinct nodes and returning a nonnegative integer value. EDGES(N1,N2) is the number of edges between the nodes N1 and N2.
• A directed graph is a set of nodes N and a directed adjacency relationship ARC(*,*) defined for pairs of distinct nodes and returning the value 0 or 1. ARC(N1,N2) is 1 if there is an arc from node N1 to N2.
• A directed multigraph is a set of nodes N and a directed adjacency relationship ARCS(*,*) defined for pairs of distinct nodes and returning a nonnegative integer value. ARCS(N1,N2) is the number of arcs from node N1 to N2.
• A color function is a function COLOR(*) associated with a set of nodes; for each node N, COLOR(N) is the color of that node. For convenience, it may be assumed that COLOR(N) is actually an integer, the index of the color of the node in some list of colors. The actual values of COLOR are not important. It is only important that it can always be determined if the colors of two nodes are the same or distinct.
• A directed color multigraph is a directed multigraph with the addition of a color function defined on the nodes. As long as we are not allowing a node to be connected to itself, the value of the color can be stored in the diagonal entry of the ARCS function.

## Representing a Chemical Subsystem

A directed color multigraph can be used to represent a chemical subsystem. The representation will have the property that two chemical subsystems are "the same" exactly when the corresponding directed color multigraphs are "the same".

The relevant properties of a chemical subsystem are

• the chemical identity of each component;
• the number of bonds between each pair of components.
• the direction of each bond between components.
To represent these properties in a directed color multigraph, we naturally use nodes for the atomic components, giving each node a color corresponding to its chemical identity; an arc is drawn between any two nodes for each chemical bond between the corresponding atoms; and the arc is directed from the electron-donating toward the electron-accepting atom.

## Isomorphism

The numbering of the nodes in a directed color multigraph is considered to be arbitrary, and not a fundamental feature of the object. Indeed, if we were to draw a directed color multigraph on a sheet of paper, and ask two different people to create the corresponding abstract arrays, the form of these quantities would differ, depending on the order in which the nodes and arcs were numbered. Thus, it is possible that two directed color multigraphs G1 and G2, which are "different" when considered as data objects, may represent identical physical objects; in this case, the differences between the data objects are of no physical significance. This is precisely the relationship that needs to be determined. Such a relationship is called an isomorphism.

Consider two directed color multigraphs G1 and G2, with associated sets of nodes N1 and N2, color functions COLOR1 and COLOR2, and directed adjacency relationships ARCS1 and ARCS2. We say that G1 and G2 are isomorphic if there is a renumbering of the nodes of G1 which makes all the properties of the two objects correspond. In particular, there must be a renumbering of the nodes, or more precisely, a permutation P that can be applied to the indexes of the nodes of G1, so that

• the nodes of G1 and G2 may be put into a correspondence:
N1(P(I)) <=> N2(I)
• the colors of corresponding nodes are equal:
COLOR1(N1(P(I)) = COLOR2(N2(I)) for all I;
• the number of directed arcs from one node to another correspond:
ARCS1(N1(P(I)),N1(P(J)) = ARCS2(N2(I),N2(J)) for every I and J.

The question of determining whether two objects are isomorphic usually requires a great deal of computation and checking. On the other hand, it is often possible to quickly recognize certain cases in which isomorphism is impossible, in which case the isomorphism question can be answered very quickly. This concept will now be formalized.

A partial signature function for a set of objects is a function f(G) which has the property that if G1 and G2 are isomorphic, then f(G1) = f(G2).

It is possible for objects that are not isomorphic to have equal values of f, but if two objects have distinct values of f, they cannot be isomorphic. Simple examples of partial signature functions for our objects include:

• the number of nodes;
• the number of edges;
• the number of colors;
• the maximum node degree;
• the ordered sequence of node degrees.

In some cases, a partial signature function may in fact exactly distinguish between isomorphic and nonisomorphic objects. In such a case, we can remove the qualifer "partial" from the name.

A signature function for a set of objects is a function f(G) which has the property that f(G1) = f(G2) if and only if G1 and G2 are isomorphic.

## Isomorphism Determination

Is it possible to determine whether two objects, G1 and G2, are isomorphic? There is surely one way, known as the brute force method. If G1 and G2 are isomorphic, then the only thing disguising this fact is that the nodes of G1 have been permuted. It is possible to consider every possible permutation of the nodes of G1, checking whether any such permutation makes the relevant properties of the two graphs match up.

The brute force method is not considered an effective procedure, because the amount of work required to solve the problem increases explosively as the problem size increases. To estimate the amount of work, measure the problem size in terms of the number of nodes N. Then the number of permutations of those nodes is the factorial of N, represented as "N!". If we make the reasonable assumption that there is some constant quantity C of work to examine whether any particular permutation makes the graphs coincide, then the total amount of work required for a single isomorphism determination is of the order of C * N!.

For the general graph isomorphism problem, no algorithm is known that has a significantly better performance than the brute force method, and in fact, the problem is considered to be NP hard, that is, inherently unsolvable for sufficiently large problems.

To determine whether two directed color multigraphs are isomorphic will require, as part of the computation, that we verify that the underlying graphs are isomorphic. Hence, this problem is in some sense at least as hard as the graph isomorphism problem, and we should only expect to be able to solve problems with a limited number of nodes, or with other properties that allow the determination to be made more rapidly.

The adjacency matrix for a graph of N nodes is an N by N matrix A whose offdiagonal element A(I,J) is 1 if there is an undirected arc between the nodes I and J, and whose diagonal element A(I,I) is zero. For this simplest of examples of a graph, the adjacency matrix is symmetric, has a zero diagonal, and the offdiagonal elements are either 0 or 1.

The adjacency matrix completely summarizes the information in the data structure of the directed color multigraph. Hence, we can determine the question of isomorphism by comparing adjacency matrices.

If two graphs have the same adjacency matrix, they are isomorphic. But if they have different adjacency matrices, they may or may not be isomorphic. That is because any relabeling of the nodes of the graph will result in a new adjacency matrix, which almost surely looks different than the original one. So, if we suspect that two adjacency matrices represent the same graph, we need to somehow factor out the effect of node relabelings.

The adjacency matrix for a directed color multigraph is an extension of the idea for a standard graph; the offdiagonal element A(I,J) is the number of arcs directed from node I to node J; each diagonal elements A(I,I) is the index of the color of node I. And the adjacency matrix thus summarizes all the information about the directed color multigraph; two DCM's with the same adjacency matrix must be isomorphic; two isomorphic graphs may have different adjacency matrices, but one can be transformed into the other by finding a suitable node relabeling.

## Defining a Code

We first need to define an order relation on N by N integer matrices. We need to choose an ordering so that we can compare two matrices. The particular ordering we choose here will greatly simplify our backtracking procedure later.

When comparing two matrix indices, we will say that

(I1,J1) < (I2,J2)
if either:
• max(I1,J1) < max(I2,J2), or
• max(I1,J1) = max(I2,J2), but I1 < I2.
This defines a particular ordering of the entries of a matrix, which is suggested by the following diagram for the 5 by 5 case:
```         1  2  5 10 17
3  4  7 12 19
6  8  9 14 21
11 13 15 16 23
18 20 22 24 25
```

We now say that matrix A is less than or equal to matrix B if, under the given index ordering, either every entry of the two matrices is equal, or else, at the first entry (I,J) for which the matrices differ, it is the case that A(I,J) < B(I,J).

For a given directed color multigraph, we now consider each possible ordering O of the nodes. That is, we consider the entire class of node labelings, and the corresponding adjacency matrices. For each ordering, the corresponding adjacency matrix A(O). All of these adjacency matrices define the same graph, that is, they are all isomorphic. We now define the directed color multigraph code to be the maximum element A* of all the adjacency matrices defined in this way. You should be able to convince yourself that this procedure defines a unique element A*.

Now we note that this directed color multigraph code is a signature function. That is, two directed color multigraphs are isomorphic if and only if their codes are equal. Why is this, exactly?

It should be obvious that if the directed color multigraph G1 has code A*, then G1 is isomorphic to G*, the directed color multigraph whose adjacency matrix is A*. And so if G2 also has code A*, then G2 is isomorphic to G, and hence to G2. So equal codes imply isomorphic graphs.

If G1 has code A* (the adjacency matrix of some graph G*), and G2 has code A** (the adjacency matrix of some graph G**), and these codes are not equal, then it must be the case that there is no node relabeling which transforms G1 to G2. Why is this? If there were such a relabeling, then we already know we can relabel G1 to get to G*, and we can relabel G2 to get to G**, therefore we would be able, in the obvious way, to relabel G* to get to G1, then to G2, and then to G**. And if that were the case, then A* and A** belong to the same equivalence class of adjacency matrices, and they are not equal, and yet they are both the "maximum" values. This contradicts the construction of A* and A**. Therefore, distinct codes imply distinct (nonisomorphic) graphs.

Thus, by considering the equivalence class of all adjacency matrices corresponding to permutations of a graph, and taking the "maximum" one as the code for the graph, we have "factored out" the effect of node labeling. Isomorphism determination can be done by simply computing the codes of the two graphs.

Of course, in general, for large graphs, the computation of the graph code can be horrendously expensive. The only thing we want to say here is that this is a procedure that is guaranteed to produce a determination of the isomorphism question in an automatic way.

## Code Computation

The problem of isomorphism determination has now been recast into the problem of computing the signature function, that is, the directed color multigraph code. One way to do this, in turn, would be to generate every possible permutation of the nodes of the graph, in order to find the maximal adjacency matrix. It has been noted that such an approach becomes computationally infeasible, since the associated work load increases like the factorial function, as the number of nodes increases.

While a brute force approach will still work for small problems, it is necessary to come up with another approach for problems with a moderate number of nodes, and to identify any features of our problem that make a factorial work load less likely.

Rather than a brute force computation, we will use backtracking. In other words, we will start with the identity permutation of the nodes, and compute the associated adjacency matrix, which will be our initial estimate for the directed color multigraph code. Then we will consider all permutations in which the first node is swapped with the second. It is possible to compute a portion of the adjacency matrix that will result, and in some cases, it is possible to determine if this adjacency matrix is going to be smaller than our current best. If so, we immediately abandon the current search (which actually represents an enormous number of possibilities), and move to the next possibility.

In most cases, a backtracking approach is extremely efficient when compared to a brute force method. The cases where backtracking cannot offer much of an improvement will occur typically when the underlying graph is close to maximally connected, that is, almost every pair of nodes is connected. However, this will not happen in our problems, because the individual chemical components have a fixed and low valence. The total valence of any component (and correspondingly, the maximum degree of a node) will be 4 or 5.

Other features that will help us avoid the factorial explosion include the "color" of the nodes, the differing valences of the chemical constituents, and the fact that the typical total number of components in an object will be less than 40.

There are two advantages to the use of the code. First, it shows an organized and systematic way to approach the isomorphism question using a convenient data structure. Secondly, it allows us to precompute codes for any special or preferred objects.

The precomputation of codes is extremely useful in the case at hand. We expect to classify a set of standard objects, and then analyze, one at a time, many newly encountered objects. We expect each of the new objects to be equal to one of the standards. If we compute the code for the new object, we can inexpensively compare it to the precomputed codes of each of the standards, and we have our answer. Thus, a new object can be classified at the cost of "half" of a single isomorphism comparison, rather than the potential cost of carrying out a full isomorphism comparison between the new object and every one of the standards.

## Reference:

1. Frank Harary,
Graph Theory,
ISBN: 0201410338,
LC: QA166.H37.
2. Stephen Locke,
Graph Theory web pages,
http://www.math.fau.edu/locke/graphthe.html
3. Steven Skiena,
The Algorithm Design Manual,
Springer, 1997,
ISBN: 0-387-94860-0,
LC: QA76.9.A43S55.

FORTRAN software for computing signature codes for graphs, multigraphs, directed graphs, directed multigraphs, color graphs, color digraphs, and color directed multigraphs is available in the CODEPACK library at ../f_src/codepack/codepack.html .