Dakota State University
Madison, SD
57024-1799
(605) 256-5166
rolfe@alpha.dsu.edu
http://www.dsu.edu/~rolfe/index.html
Current information
Timothy J. Rolfe
Professor, Computer Science
Eastern Washington University
Cheney, WA 99004-2412
(509) 358-2065
Electronic mail: TRolfe@ewu.edu
WWW home page: http://penguin.ewu.edu/~trolfe/
* Copyright 1995 by the Small College Computing Symposium
Permission to make printed or digital copies of all or part of this material for educational or personal use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies include this notice on the first page.
Appears in SCCS: Proceedings of the 28th Annual Small College Computing Symposium (SCCS, 1995), pp. 201-10.
This paper is also available as a Word file
ABSTRACT
Placing queens on a chessboard is a classic use of backtracking to speed up a worse than exponential-time algorithm. After the discussion of the basic problem and its solution, two algorithm optimizations are presented (both optimizations together increase the processing speed by an order of magnitude for sufficiently large boards), along with a symmetry constraint on acceptable board configurations.
The fully optimized algorithm is then used to show three separate approaches to using parallel processing to further speed the solution: (1) using fork() on a UNIX multiprocessor, (2) using a shared-memory multiprocessor (Silicon Graphics 4D/380), and (3) programming in a message-passing distributed-memory environment (PVM on RS/6000 computers).
The Queens Problem is that of positioning queens on an otherwise empty chessboard so that no queen is under attack by another. It is a problem that has been around since before Gauss [1], and is used by Horowitz and Sahni [2] as their specimen problem for backtracking algorithms. Niklaus Wirth also uses it [3], with an optimization on Horowitz and Sahni's solution. The problem can be generalized to that of positioning N queens on an NxN grid [4]. This paper will introduce a new optimization for the sequential solution, propose an additional constraint on acceptable solutions, and introduce parallel implementations. (Should anyone be interested in examining the code itself, contact may be made through the above addresses, or through the author's World-Wide Web homepage also shown above.)
Potentially the worst implementation of the Queens problem is to consider all potential ways of placing N queens into N^2 board positions --- that is, the combination of N^2 positions taken N at a time. Still, by the very nature of a queen's movements in attacking other positions, we know that there can only be one queen on a row, reducing the search tree to N^N leaves, a considerable improvement. Standard solution algorithms add back- tracking at this point: starting from the first row, position a queen in each column position and then recursively call the algorithm to position the queens on the next row down. Within each row, check each column position for attack by a queen in a higher row on the board and only proceed with the recursion if this column position is not itself under attack. As a result, at each level of recursion one knows that there is a valid board above the current row.
As implemented by Horowitz and Sahni [2], the process of checking for the validity of a position within a given row explicitly checks each previous row. The representation of a board configuration is given by a vector Board in which each cell gives the column position of the queen on that row. The check is thus an O(N) operation and is itself quite short. Row is gives the row being checked, Idx, the rows above:
for (Idx = 0; Idx < Row; Idx++)
if ( Board[Idx] == Board[Row] ||
abs(Board[Row]-Board[Idx]) == (Row-Idx) )
return 0;
return 1;
The implementation by Niklaus Wirth [3] introduces a significant
optimization: the validity check can be an O(1) operation, provided one
maintains three additional vectors. There can be only one queen in each column,
so that a length-N Boolean vector can in a single check indicate whether a
column already contains a queen. One can similarly check for positions on the
diagonals (-45 deg, Row-Col = constant) and on the anti-diagonals (+45 deg,
Row+Col = constant). On an NxN board, there are 2N-1 diagonals and 2N-1
antidiagonals.
There is, however, yet a further optimization: by symmetry, if there is only one queen per row, there also can be only one per column. Consequently, each successful solution to the problem will be a permutation vector of the column positions, and the number of candidate solutions (before backtracking) changes to N! from N^N. Hence the solutions of the Queens Problem can be obtained by generating permutation vectors recursively [5] and at each step checking for diagonal and antidiagonal attack from above. As soon as a partial permutation is found invalid, the algorithm can proceed with the next partial permutation at the present recursive level. The effect is to remove successively more column positions for checking as the recursive depth increases, ones that would be immediately rejected if checked.
To the problem of generating all valid solutions to the Queens Problem we can add a further constraint: that of generating all unique solutions, retaining only one solution from each set of solutions that are equivalent under rotations of the board or reflections of the board in a mirror --- potentially eight equivalent solutions (if the solution is not symmetric under rotation by 90 deg or by 180 deg; by the nature of the queens movements, there can be no solution symmetric under reflection in a mirror since that would require queens to be within the same columns, the same rows, the same diagonals, or the same antidiagonals).
In Figures 1 and 2, queens that are indistinguishable from each other on rotation of the board are given the same numeric index. For example, the queens along the edges of the 5x5 chessboard in Figure 1 are indistinguishable on rotation by 90 deg and so are all labeled with a "1". On the other hand, in the solution shown in Figure 2 none of the queens are equivalent on rotation; hence, all have different numeric indices.
. 1 . . . . . 1 . . .
. . . 2 . . . . . . 1
. . . . . 3 . . 2 . .
3 . . . . . 1 . . . .
. . 2 . . . . . . 1 .
. . . . 1 .
Symmetric on 90 deg rotation
Symmetric on 180 deg rotation
Original Vertical mirror
1 . . . . . . . . 1
. . 2 . . . . 2 . .
. . . . 3 3 . . . .
. 4 . . . . . . 4 .
. . . 5 . . 5 . . .
90 degree rotation Anti-diagonal mirror
. . . . 1 . . 3 . .
. 4 . . . 5 . . . .
. . . 2 . . . . 2 .
5 . . . . . 4 . . .
. . 3 . . . . . . 1
180 degree rotation Horizontal mirror
. 5 . . . . . . 5 .
. . . 4 . . 4 . . .
3 . . . . . . . . 3
. . 2 . . . . 2 . .
. . . . 1 1 . . . .
270 degree rotation Diagonal mirror
. . 3 . . 1 . . . .
. . . . 5 . . . 4 .
. 2 . . . . 2 . . .
. . . 4 . . . . . 5
1 . . . . . . 3 . .
If one is interested in just the number of unique solutions, there is
no need to retain the already-derived solutions. Given a candidate solution, one
must generate its equivalent solutions under the possible rotations and
reflections and verify that this equivalent set has not already been tallied.
While it might appear necessary to check the equivalent solutions with those
already derived, the very representation of the solutions gives an implicit
check: each candidate solution is represented by a permutation vector, and one
can easily find an "alphabetic" order in permutation vectors. If any equivalent
solution checked is "alphabetically" less than the candidate solution, then that
candidate can be discarded. The first implementation of the Queens Problem was used to check the effects of the two optimizations mentioned above: the O(N) validity check versus the O(1) validity check (Optimization 1 in the table), and the N^N implementation versus the N! implementation (Optimization 2 in the table). The table below shows the time required for running various board sizes on an notebook PC computer with a 25 MHz Cyrix 486 microprocessor under a program compiled using Borland C++. The data are also shown in Figure 3. Note that Figure 3 uses a logarithmic vertical axis, and that the horizontal axis begins at 9 rather than at the origin.
Optimization 1's speed-up approaches 2:1 for large N; optimization 2's approaches 5:1. Together they give a speed-up approaching an order of magnitude for large N. Figure 4 shows these ratios. The performance remains, unfortunately, worse than exponential in N. (There is a small symmetry artifact in these data: because of the rejection of mirror images, one needs to check only the left half of the first row; so when N is odd one checks (N+1)/2 positions , but when N is even one checks N/2. For example, both for N=11 and N=12 one must check six queen positions in the first row.)
N No Opt. Opt. Opt.s
Queens Opt.s 1 2 1 & 2
9 0.8240 0.7690 0.2750 0.2200
10 4.2310 3.4070 1.1540 0.9340
11 23.4070 17.1430 6.4290 4.8350
12 141.6480 94.1760 32.6920 22.9670
13 881.2640 537.9120 205.7690 135.7690
14 5780.5490 3268.5170 1180.2200 730.3300
15 39886.2070 21345.3301 8370.5498 4886.5933
Parallel Implementations
The Queens Problem lends itself well to parallel solution: the generation of all solutions with a queen in the first column of the first row does not require any information from the solutions with the queen in another column of the first row --- the inherent ordering of permutations frees us from needing to check complete solutions obtained from different starting configurations.
None of the parallel execution environments to which the author has access is
massively parallel. Consequently the parallel implementations all proceed
sequentially in rows below the first row. In a massively parallel environment,
every recursive step of the algorithm can generate parallel processes.
UNIX fork() on a Multiprocessor
The easiest parallel solution comes from using the UNIX fork() function on a multiple processor machine. If a process has an open file before the fork(), both the parent and child processes share that file. After the fork(), the value returned from the fork() function allows each process to identify itself as either child process (zero) or parent process (non- zero; specifically, the child Process ID) . It is then easy to have the parent process immediately begin processing with the present permutation vector while the child process remains in the loop to generate the next permutation vector --- which will be its to evaluate after it has itself forked.
In the following code segment, the parent process breaks from the loop while the child process generates the next initial configuration and returns to the top of the loop to engender its own child process. The last child generated ends up processing the configuration with the top row queen in the middle of the board --- any solution with the top row queen on the right will have a mirror image with the queen on the left and consequently will be rejected.
/* By symmetry, we only need to check half of the 1st row */
Lim = (Size-1) / 2;
for (Idx = 0; Idx < Lim; )
{
ChildID = fork();
if (ChildID != 0) break;
Idx++;
Temp = Board[0]; /* Swap Board[0] */
Board[0] = Board[Idx]; /* with */
Board[Idx] = Temp; /* Board[Idx] */
}
/* The last-generated process will handle Board[0] == Lim */
The shared output file can then accumulate the results for each process.
Upon completion, all processes but the original (for which Board[0] = 0)
terminates and the original process performs the final processing on the partial
results collected in the shared file.
The beauty of the fork() implementation is that all of the interprocess communication comes through a shared file and the actual parallel execution on the multiple processors of the machine is handled by the UNIX multitasking. One need not concern oneself with whether the multiprocessor is a shared memory machine or a distributed memory machine. In addition, it is easy to develop and debug the code on a uniprocessor, should one have only limited access to a multiprocessor.
The multiprocessor used for specimen runs was an eight-processor Silicon Graphics 4D/380 computer at the University of Chicago; of the eight, six are 25 MHz processors and two are 33 MHz processors. Active processes on the system are assigned to processors in a chance fashion; consequently CPU times for multiple runs of exactly the same task will vary. At the time when the following runs were performed (a Sunday afternoon), the system was showing a load average between 4.2 and 4.5. The speed-up column shows the ratio of the uniprocessor clock time with the multiprocessor clock time.
N processor Clock Execution
Queens time time environment Speed-up
10 0.2617 0.3125 uniprocessor
10 0.2942 0.2812 5 processes 1.111
11 1.2000 1.1875 uniprocessor
11 1.4575 0.5000 6 processes 2.375
12 7.0283 7.0312 uniprocessor
12 6.9433 1.4375 6 processes 4.891
13 38.2325 38.3125 uniprocessor
13 39.4925 6.3125 7 processes 6.069
14 222.3358 221.7500 uniprocessor
14 217.5893 34.5938 7 processes 6.410
14 218.5983 35.0938 7 processes 6.319
For the multiple-process runs, the processor time reported is the total
across all the processes. The clock time function differs from the CPU time
function in granularity and, apparently, in precision. The final run with 14
queens was repeated to give some indication of reproducibility of the times. (It
would appear that the 14-queen uniprocessor run had very few cycles on the 33MHz
processors so that the 14-queen speed-up reported is artificially high.) For the
shorter runs, it is obvious that the overhead of the forked execution swamps any
benefit from the parallel execution; the longer runs, however, show significant
speed-ups.
Given the initial load average of about 4.5, one might expect approximately
3.5 idle processors; the speed-ups of 6 are initially surprising, until one
realizes that by running 7 compute-bound processes in the multitasking mix one
is grabbing a disproportionate amount of computing recourses compared with the
other (uniprocessor) users. (On this system another factor enters in: the load
comes predominantly from long-running compute-bound processes, which users are
encouraged to run at lower priorities, while the above results come from
normal-priority runs.)
The shared memory multiprocessor used is the same Silicon Graphic computer at the University of Chicago. Under the Silicon Graphics implementation of m_fork(), multiple threads are generated all executing the same subprogram. The programmer has the responsibility for identifying segments of the code that must be executed by only one thread at a time --- the critical sections.
In the implementation of the Queens Problem there are two critical sections: (1) identification of the column position for the queen in the first row, and (2) the final processing of each complete solution identified. The Silicon Graphics functions m_lock() and m_unlock() make the implementation of critical sections quite simple. The first thread to execute m_lock() is not blocked but continues with the next statement; subsequent threads executing m_lock() are blocked until the thread currently in the critical section executes an m_unlock(). Thus the scheduling of the first-row queen's column position gives a very short critical section:
m_lock();
Top = NextUp++;
m_unlock();
where NextUp is a shared variable (declared static in the subprogram) and
identifies the next column position to be used for a queen. Given that, it is a
simple matter to generate the initial permutation vector. The final processing
of a completed solution is also quite brief: m_lock();
Nunique++;
Ntotal = Ntotal + Nequiv;
m_unlock();
where Nequiv is the number of separate equivalent solutions (2, 4, or 8),
and Nunique and Ntotal are shared global variables. The one thing that the
author discovered (to his great frustration) is that in a shared memory
implementation one must be extremely careful of declaring local function
variables as static within a function used by the parallel threads. Consequently
the scratch vectors required for checking equivalent solutions must be passed
within each thread to avoid the overhead of multiple calls to malloc() and
free().
The function m_synch() insures that all threads have finished their processing before the originating process begins the final handling of the generated solutions.
The above implementation allows a range of parallel threads, with each thread potentially calculating from more than one starting configuration: if there are still starting configurations to be processed, each thread as it finishes loops back to re-enter the scheduling critical section. For the Queens Problem, this provides a coarse-grained parallelism --- there are only N/2 starting configurations. Synchronization means that the observed clock time will be that of the last thread completing.
The Silicon Graphics computer used for the shared memory runs is the same system as that described above for the UNIX fork() runs. They were done at a diffrent time than the fork() runs, and the load average was in the neighborhood of 4.5 to 5 (on the eight-processor computer).
14 Queens, 7 tasks
CPU Clock Ratio Individual CPU times by tread . . .
Total Time T(1)/T(n) 1 2 3 4 5 6 7
37.34 38.250 1.000 37.25
42.48 22.000 1.739 17.07 21.16
48.23 18.500 2.068 13.67 13.32 17.16
47.64 13.375 2.860 6.43 11.59 12.21 11.27
49.66 12.500 3.060 11.92 6.65 11.80 7.19 11.63
55.02 12.250 3.122 6.98 6.22 10.73 7.50 7.64 11.28
49.40 9.375 4.080 6.34 5.81 6.59 7.16 7.46 7.16 7.43
These runs show well the expected granularity. There are seven tasks
(starting configurations) to be processed. With two threads, the slow thread
must process four tasks; with three threads, the slow thread must process three;
with four, five, and six threads, the slow thread must process two; with seven
threads, the slow thread is still only processing one starting configuration.
Given the system load, we find that the required clock time somewhat exceeds the
CPU time for the slowest thread. A thread waiting at an m_lock() does expend CPU
time --- which, with the speed discrepancies among processors, may help explain
the variability in CPU times reported. Distributed Memory with Message Passing --- PVM
PVM (Parallel Virtual Machine) [6, 7, 8] is a software package developed at Oak Ridge National Laboratories to combine networked heterogeneous computers into a single distributed computing system through message passing --- the "Virtual Machine" of its name.
The programming model in use here is that of a host (or master) process providing tasks to node (or slave) processes. In the context of the Queens Problem implementation, the host process initializes the node processes and then sends each node an initial board configuration to be processed. If there are fewer node processes than initial configurations, the node processes are sent new configurations as they report their results. Once all initial configurations have been processed, the host sends a shut-down message to the node processes and then finishes processing the accumulated information.
In the following code segment, passing an initial configuration to a node is embedded in the Launch() subprogram. On the other end, the Finished() subprogram blocks until some node returns results. It then processes those partial results and returns the identity of the node that is available for another configuration.
Active = 0; /* Number of node processes active */
NextUp = 0; /* Column position for row-0 queen */
Node = 0; /* Node process receiving a configuration */
Lim = (Size+1)/2; /* We only need do the left half */
while (1)
{
while ( NextUp < Lim && Active < Nnodes )
{
if ( NextUp > 0)
{
Temp = Board[0]; /* Swap Board[0] */
Board[0] = Board[NextUp]; /* with */
Board[NextUp] = Temp; /* Board[NextUp] */
}
/* Launch sends the problem to the indicated Node */
Launch (Board, Size, Node);
Node++ ; NextUp++ ; Active++ ;
}
if (Active == 0) /* If after launch attempt, there */
break; /* are no active nodes, we're done */
/* Finished() processes one node's returned information */
Node = Finished(); /* So Node is ready for Launch */
Active--; /* if NextUp < Lim, while loop launches one */
}
Several runs were performed on a set of RS/6000 computers in William
Reinhardt's research group at the University of Washington. Three computers
(ringb, ringc, and ringd) were RS/6000 model 550 computers; five computers
(valkyrie4 through valkyrie8) were RS/6000 model 350 computer. All have the same
processor and clock speed. The rings have 64KB data caches; the valkyries, 32KB
data caches. With the programs used here, the valkyries are approximately 1%
faster than the ring computers when all machines are completely unloaded. When
the results below were obtained, however, the interactive load and local network
traffic generated more significant variations in performance. The individual
process CPU times clearly show some nodes executing more configurations than
other nodes. The clock time required is tied tightly to the slowest node. For
convenience, the single-node run was used as the basis for a speed-up estimate.
Again we see the same granularity effects as we did in the shared memory runs.
14 Queens, 7 tasks
CPU Clock Ratio Individual CPU times by process . . .
Total Time T(1)/T(n) 1 2 3 4 5 6 7
79.30 79.500 1.000 79.30
78.81 45.000 1.767 44.98 33.80
78.38 33.000 2.409 32.82 22.44 23.11
79.20 23.250 3.419 21.72 22.69 11.62 23.14
78.55 22.500 3.533 21.27 22.42 11.50 11.55 11.80
78.56 21.625 3.676 21.58 10.77 11.50 11.48 11.59 11.62
78.56 12.125 6.557 9.78 10.85 11.49 11.50 11.64 11.61 11.68
In addition, there was a set of runs (with 11 through 16 queens), first
with only one node and then with the maximum number of nodes required for the
problem size. Again, the environment was the set of RS/6000 computers described
above (three model 550s and five model 350s). Again, a single-node run is used
to estimate the time required for the optimal agorithm for speed-up purposes.
No CPU Clock Individual CPU times by process . . .
Queens Total Time 1 2 3 4 5 6 7 8
11 0.5 0.1 0.1 0.1 0.1 0.1 0.1 0.1
12 2.4 0.5 0.4 0.4 0.4 0.4 0.4 0.4
13 14.3 2.3 1.8 2.0 2.1 2.1 2.1 2.1 2.1
14 78.5 11.8 9.8 10.9 11.5 11.5 11.6 11.6 11.6
15 539.9 71.6 59.4 63.3 68.0 68.5 69.6 70.3 70.4 70.4
16 3378.0 451.3 349.0 391.4 423.8 429.2 439.6 444.4 450.6 450.1
No CPU Clock Ratio Computing
Queens Total Time T(1)/T(n) Environment
11 0.51 0.625 1 node
11 0.52 0.125 5.000 6 nodes
12 2.37 2.500 1 node
12 2.41 0.500 5.000 6 nodes
13 14.23 14.375 1 node
13 14.31 2.250 6.389 7 nodes
14 78.35 78.500 1 node
14 78.54 11.750 6.681 7 nodes
15 534.05 537.75 1 node
15 539.89 71.625 7.508 8 nodes
16 3371.3 3379.0 1 node
16 3378.0 451.25 7.488 8 nodes
Even with only one task on a processor, there is variability in the CPU
time required based on the initial first-row queen position. The clock time
required for a run is, of course, tied to the execution time of the slowest node
process. For instance, the speed-up estimate for the 16-queens run is 7.5 rather
than 8. The Queens Problem is one whose solution can easily be set into a variety of
parallel processing environments. It displays many of the performance features
one would expect in a more applicable problem except that there is no
appreciable communication overhead --- the very characteristic that makes it
easy to parallelize.
When I began working on the fork() implementation of the Queens Problem, the
first multiprocessor runs were performed on an Encore Multimax computer at the
University of Minnesota; I thank William Murray for providing access to
umn-cs.cs.UMn.edu over a holiday week-end in the Fall of 1990. I wish to thank
R. Michael Townsend, Director of Advanced Research Systems at the University of
Chicago for access to rainbow.UChicago.edu, the eight- processor Silicon
Graphics computer used for the fork() and the shared memory runs reported above.
Thanks also go to William Reinhardt of the University of Washington for access
to Valkyrie1.chem.Washington.edu and her siblings, the RS/6000 computers on
which the PVM runs were made and which are supported by NSF grant CHE91-20206 to
the University of Washington. Finally I wish to thank Brian Carlson, my friend
and colleague at Dakota State University, for many fruitful discussions on the
Queens Problem and for his comments on this paper while it was in preparation.
Note, however, that the runs reported in this paper were performed with version 2.4.2, the version installed on the RS/6000 computers used.
Figure 1 and Figure 2 are
embedded in the text.
Figure 3: Time required on a
25 MHz 486SLC laptop computer.
Figure 4: Speed-up ratios
compared with unoptimized version.
In this version of the paper, two figures are added from the presentation at SCCS-95 in Sioux Falls, displaying the "worse than exponential" behavior of the solutions to this problem. If the problem is exponential --- O(k^N) --- then the ratio of T(N)/T(N-c) will be a constant; if the ratio is an increasing function, then the behavior is worse than exponential. Because only half of the column positions in the first row are checked, the number of starting configurations changes on the odd values of N.
Figure 5: Adjacent time
ratios [T(N) / T(N-1)] for each algorithm.
Figure 6: "Near-adjacent"
time ratios [T(N) / T(N-2)] for each algorithm.
To examine actual C code for implementing these Queens Problem solutions, click here.