Dakota State University

Madison, SD
57024-1799

(605) 256-5166

rolfe@alpha.dsu.edu

http://www.dsu.edu/~rolfe/index.html

__Current information__

Timothy J. Rolfe

Professor, Computer Science

Eastern Washington University

Cheney, WA 99004-2412

(509) 358-2065

Electronic mail: `TRolfe@ewu.edu`

WWW home page: `http://penguin.ewu.edu/~trolfe/`

* Copyright 1995 by the Small College Computing Symposium

Permission to make printed or digital copies of all or part of this material for educational or personal use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies include this notice on the first page.

Appears in **SCCS: Proceedings of the 28th Annual Small College Computing Symposium**
(SCCS, 1995), pp. 201-10.

This paper is also available as a Word file

- Abstract
- Introduction
- Solution Generation
- Optimization Results
- Parallel Implementations
- UNIX fork() on a Multiprocessor
- Shared Memory
- Distributed Memory with Message Passing --- PVM
- Conclusions
- Acknowledgements
- References
- Figures
- Code

Placing queens on a chessboard is a classic use of backtracking to speed up a worse than exponential-time algorithm. After the discussion of the basic problem and its solution, two algorithm optimizations are presented (both optimizations together increase the processing speed by an order of magnitude for sufficiently large boards), along with a symmetry constraint on acceptable board configurations.

The fully optimized algorithm is then used to show three separate approaches to using parallel processing to further speed the solution: (1) using fork() on a UNIX multiprocessor, (2) using a shared-memory multiprocessor (Silicon Graphics 4D/380), and (3) programming in a message-passing distributed-memory environment (PVM on RS/6000 computers).

The Queens Problem is that of positioning queens on an otherwise empty chessboard so that no queen is under attack by another. It is a problem that has been around since before Gauss [1], and is used by Horowitz and Sahni [2] as their specimen problem for backtracking algorithms. Niklaus Wirth also uses it [3], with an optimization on Horowitz and Sahni's solution. The problem can be generalized to that of positioning N queens on an NxN grid [4]. This paper will introduce a new optimization for the sequential solution, propose an additional constraint on acceptable solutions, and introduce parallel implementations. (Should anyone be interested in examining the code itself, contact may be made through the above addresses, or through the author's World-Wide Web homepage also shown above.)

Potentially the worst implementation of the Queens problem is to consider all potential ways of placing N queens into N^2 board positions --- that is, the combination of N^2 positions taken N at a time. Still, by the very nature of a queen's movements in attacking other positions, we know that there can only be one queen on a row, reducing the search tree to N^N leaves, a considerable improvement. Standard solution algorithms add back- tracking at this point: starting from the first row, position a queen in each column position and then recursively call the algorithm to position the queens on the next row down. Within each row, check each column position for attack by a queen in a higher row on the board and only proceed with the recursion if this column position is not itself under attack. As a result, at each level of recursion one knows that there is a valid board above the current row.

As implemented by Horowitz and Sahni [2], the process of checking for the validity of a position within a given row explicitly checks each previous row. The representation of a board configuration is given by a vector Board in which each cell gives the column position of the queen on that row. The check is thus an O(N) operation and is itself quite short. Row is gives the row being checked, Idx, the rows above:

for (Idx = 0; Idx < Row; Idx++) if ( Board[Idx] == Board[Row] || abs(Board[Row]-Board[Idx]) == (Row-Idx) ) return 0; return 1;The implementation by Niklaus Wirth [3] introduces a significant optimization: the validity check can be an O(1) operation, provided one maintains three additional vectors. There can be only one queen in each column, so that a length-N Boolean vector can in a single check indicate whether a column already contains a queen. One can similarly check for positions on the diagonals (-45 deg, Row-Col = constant) and on the anti-diagonals (+45 deg, Row+Col = constant). On an NxN board, there are 2N-1 diagonals and 2N-1 antidiagonals.

There is, however, yet a further optimization: by symmetry, if there is only one queen per row, there also can be only one per column. Consequently, each successful solution to the problem will be a permutation vector of the column positions, and the number of candidate solutions (before backtracking) changes to N! from N^N. Hence the solutions of the Queens Problem can be obtained by generating permutation vectors recursively [5] and at each step checking for diagonal and antidiagonal attack from above. As soon as a partial permutation is found invalid, the algorithm can proceed with the next partial permutation at the present recursive level. The effect is to remove successively more column positions for checking as the recursive depth increases, ones that would be immediately rejected if checked.

To the problem of generating all valid solutions to the Queens Problem we can add a further constraint: that of generating all unique solutions, retaining only one solution from each set of solutions that are equivalent under rotations of the board or reflections of the board in a mirror --- potentially eight equivalent solutions (if the solution is not symmetric under rotation by 90 deg or by 180 deg; by the nature of the queens movements, there can be no solution symmetric under reflection in a mirror since that would require queens to be within the same columns, the same rows, the same diagonals, or the same antidiagonals).

In Figures 1 and 2, queens that are indistinguishable from each other on rotation of the board are given the same numeric index. For example, the queens along the edges of the 5x5 chessboard in Figure 1 are indistinguishable on rotation by 90 deg and so are all labeled with a "1". On the other hand, in the solution shown in Figure 2 none of the queens are equivalent on rotation; hence, all have different numeric indices.

. 1 . . . . . 1 . . . . . . 2 . . . . . . 1 . . . . . 3 . . 2 . . 3 . . . . . 1 . . . . . . 2 . . . . . . 1 . . . . . 1 . Symmetric on 90 deg rotation Symmetric on 180 deg rotation

Original Vertical mirror 1 . . . . . . . . 1 . . 2 . . . . 2 . . . . . . 3 3 . . . . . 4 . . . . . . 4 . . . . 5 . . 5 . . . 90 degree rotation Anti-diagonal mirror . . . . 1 . . 3 . . . 4 . . . 5 . . . . . . . 2 . . . . 2 . 5 . . . . . 4 . . . . . 3 . . . . . . 1 180 degree rotation Horizontal mirror . 5 . . . . . . 5 . . . . 4 . . 4 . . . 3 . . . . . . . . 3 . . 2 . . . . 2 . . . . . . 1 1 . . . . 270 degree rotation Diagonal mirror . . 3 . . 1 . . . . . . . . 5 . . . 4 . . 2 . . . . 2 . . . . . . 4 . . . . . 5 1 . . . . . . 3 . .If one is interested in just the number of unique solutions, there is no need to retain the already-derived solutions. Given a candidate solution, one must generate its equivalent solutions under the possible rotations and reflections and verify that this equivalent set has not already been tallied. While it might appear necessary to check the equivalent solutions with those already derived, the very representation of the solutions gives an implicit check: each candidate solution is represented by a permutation vector, and one can easily find an "alphabetic" order in permutation vectors. If any equivalent solution checked is "alphabetically" less than the candidate solution, then that candidate can be discarded.

The first implementation of the Queens Problem was used to check the effects of the two optimizations mentioned above: the O(N) validity check versus the O(1) validity check (Optimization 1 in the table), and the N^N implementation versus the N! implementation (Optimization 2 in the table). The table below shows the time required for running various board sizes on an notebook PC computer with a 25 MHz Cyrix 486 microprocessor under a program compiled using Borland C++. The data are also shown in Figure 3. Note that Figure 3 uses a logarithmic vertical axis, and that the horizontal axis begins at 9 rather than at the origin.

Optimization 1's speed-up approaches 2:1 for large N; optimization 2's approaches 5:1. Together they give a speed-up approaching an order of magnitude for large N. Figure 4 shows these ratios. The performance remains, unfortunately, worse than exponential in N. (There is a small symmetry artifact in these data: because of the rejection of mirror images, one needs to check only the left half of the first row; so when N is odd one checks (N+1)/2 positions , but when N is even one checks N/2. For example, both for N=11 and N=12 one must check six queen positions in the first row.)

N No Opt. Opt. Opt.s Queens Opt.s 1 2 1 & 2 9 0.8240 0.7690 0.2750 0.2200 10 4.2310 3.4070 1.1540 0.9340 11 23.4070 17.1430 6.4290 4.8350 12 141.6480 94.1760 32.6920 22.9670 13 881.2640 537.9120 205.7690 135.7690 14 5780.5490 3268.5170 1180.2200 730.3300 15 39886.2070 21345.3301 8370.5498 4886.5933

The Queens Problem lends itself well to parallel solution: the generation of all solutions with a queen in the first column of the first row does not require any information from the solutions with the queen in another column of the first row --- the inherent ordering of permutations frees us from needing to check complete solutions obtained from different starting configurations.

None of the parallel execution environments to which the author has access is
massively parallel. Consequently the parallel implementations all proceed
sequentially in rows below the first row. In a massively parallel environment,
every recursive step of the algorithm can generate parallel processes.

*UNIX fork() on a Multiprocessor *

The easiest parallel solution comes from using the UNIX fork() function on a multiple processor machine. If a process has an open file before the fork(), both the parent and child processes share that file. After the fork(), the value returned from the fork() function allows each process to identify itself as either child process (zero) or parent process (non- zero; specifically, the child Process ID) . It is then easy to have the parent process immediately begin processing with the present permutation vector while the child process remains in the loop to generate the next permutation vector --- which will be its to evaluate after it has itself forked.

In the following code segment, the parent process breaks from the loop while the child process generates the next initial configuration and returns to the top of the loop to engender its own child process. The last child generated ends up processing the configuration with the top row queen in the middle of the board --- any solution with the top row queen on the right will have a mirror image with the queen on the left and consequently will be rejected.

/* By symmetry, we only need to check half of the 1st row */ Lim = (Size-1) / 2; for (Idx = 0; Idx < Lim; ) { ChildID = fork(); if (ChildID != 0) break; Idx++; Temp = Board[0]; /* Swap Board[0] */ Board[0] = Board[Idx]; /* with */ Board[Idx] = Temp; /* Board[Idx] */ } /* The last-generated process will handle Board[0] == Lim */The shared output file can then accumulate the results for each process. Upon completion, all processes but the original (for which Board[0] = 0) terminates and the original process performs the final processing on the partial results collected in the shared file.

The beauty of the fork() implementation is that all of the interprocess communication comes through a shared file and the actual parallel execution on the multiple processors of the machine is handled by the UNIX multitasking. One need not concern oneself with whether the multiprocessor is a shared memory machine or a distributed memory machine. In addition, it is easy to develop and debug the code on a uniprocessor, should one have only limited access to a multiprocessor.

The multiprocessor used for specimen runs was an eight-processor Silicon Graphics 4D/380 computer at the University of Chicago; of the eight, six are 25 MHz processors and two are 33 MHz processors. Active processes on the system are assigned to processors in a chance fashion; consequently CPU times for multiple runs of exactly the same task will vary. At the time when the following runs were performed (a Sunday afternoon), the system was showing a load average between 4.2 and 4.5. The speed-up column shows the ratio of the uniprocessor clock time with the multiprocessor clock time.

N processor Clock Execution Queens time time environment Speed-up 10 0.2617 0.3125 uniprocessor 10 0.2942 0.2812 5 processes 1.111 11 1.2000 1.1875 uniprocessor 11 1.4575 0.5000 6 processes 2.375 12 7.0283 7.0312 uniprocessor 12 6.9433 1.4375 6 processes 4.891 13 38.2325 38.3125 uniprocessor 13 39.4925 6.3125 7 processes 6.069 14 222.3358 221.7500 uniprocessor 14 217.5893 34.5938 7 processes 6.410 14 218.5983 35.0938 7 processes 6.319For the multiple-process runs, the processor time reported is the total across all the processes. The clock time function differs from the CPU time function in granularity and, apparently, in precision. The final run with 14 queens was repeated to give some indication of reproducibility of the times. (It would appear that the 14-queen uniprocessor run had very few cycles on the 33MHz processors so that the 14-queen speed-up reported is artificially high.) For the shorter runs, it is obvious that the overhead of the forked execution swamps any benefit from the parallel execution; the longer runs, however, show significant speed-ups.

Given the initial load average of about 4.5, one might expect approximately
3.5 idle processors; the speed-ups of 6 are initially surprising, until one
realizes that by running 7 compute-bound processes in the multitasking mix one
is grabbing a disproportionate amount of computing recourses compared with the
other (uniprocessor) users. (On this system another factor enters in: the load
comes predominantly from long-running compute-bound processes, which users are
encouraged to run at lower priorities, while the above results come from
normal-priority runs.)

The shared memory multiprocessor used is the same Silicon Graphic computer at the University of Chicago. Under the Silicon Graphics implementation of m_fork(), multiple threads are generated all executing the same subprogram. The programmer has the responsibility for identifying segments of the code that must be executed by only one thread at a time --- the critical sections.

In the implementation of the Queens Problem there are two critical sections: (1) identification of the column position for the queen in the first row, and (2) the final processing of each complete solution identified. The Silicon Graphics functions m_lock() and m_unlock() make the implementation of critical sections quite simple. The first thread to execute m_lock() is not blocked but continues with the next statement; subsequent threads executing m_lock() are blocked until the thread currently in the critical section executes an m_unlock(). Thus the scheduling of the first-row queen's column position gives a very short critical section:

m_lock(); Top = NextUp++; m_unlock();where NextUp is a shared variable (declared static in the subprogram) and identifies the next column position to be used for a queen. Given that, it is a simple matter to generate the initial permutation vector. The final processing of a completed solution is also quite brief:

m_lock(); Nunique++; Ntotal = Ntotal + Nequiv; m_unlock();where Nequiv is the number of separate equivalent solutions (2, 4, or 8), and Nunique and Ntotal are shared global variables. The one thing that the author discovered (to his great frustration) is that in a shared memory implementation one must be extremely careful of declaring local function variables as static within a function used by the parallel threads. Consequently the scratch vectors required for checking equivalent solutions must be passed within each thread to avoid the overhead of multiple calls to malloc() and free().

The function m_synch() insures that all threads have finished their processing before the originating process begins the final handling of the generated solutions.

The above implementation allows a range of parallel threads, with each thread potentially calculating from more than one starting configuration: if there are still starting configurations to be processed, each thread as it finishes loops back to re-enter the scheduling critical section. For the Queens Problem, this provides a coarse-grained parallelism --- there are only N/2 starting configurations. Synchronization means that the observed clock time will be that of the last thread completing.

The Silicon Graphics computer used for the shared memory runs is the same system as that described above for the UNIX fork() runs. They were done at a diffrent time than the fork() runs, and the load average was in the neighborhood of 4.5 to 5 (on the eight-processor computer).

14 Queens, 7 tasks

CPU Clock Ratio Individual CPU times by tread . . . Total Time T(1)/T(n) 1 2 3 4 5 6 7 37.34 38.250 1.000 37.25 42.48 22.000 1.739 17.07 21.16 48.23 18.500 2.068 13.67 13.32 17.16 47.64 13.375 2.860 6.43 11.59 12.21 11.27 49.66 12.500 3.060 11.92 6.65 11.80 7.19 11.63 55.02 12.250 3.122 6.98 6.22 10.73 7.50 7.64 11.28 49.40 9.375 4.080 6.34 5.81 6.59 7.16 7.46 7.16 7.43These runs show well the expected granularity. There are seven tasks (starting configurations) to be processed. With two threads, the slow thread must process four tasks; with three threads, the slow thread must process three; with four, five, and six threads, the slow thread must process two; with seven threads, the slow thread is still only processing one starting configuration. Given the system load, we find that the required clock time somewhat exceeds the CPU time for the slowest thread. A thread waiting at an m_lock() does expend CPU time --- which, with the speed discrepancies among processors, may help explain the variability in CPU times reported.

*Distributed Memory with Message Passing --- PVM *

PVM (Parallel Virtual Machine) [6, 7, 8] is a software package developed at Oak Ridge National Laboratories to combine networked heterogeneous computers into a single distributed computing system through message passing --- the "Virtual Machine" of its name.

The programming model in use here is that of a host (or master) process providing tasks to node (or slave) processes. In the context of the Queens Problem implementation, the host process initializes the node processes and then sends each node an initial board configuration to be processed. If there are fewer node processes than initial configurations, the node processes are sent new configurations as they report their results. Once all initial configurations have been processed, the host sends a shut-down message to the node processes and then finishes processing the accumulated information.

In the following code segment, passing an initial configuration to a node is embedded in the Launch() subprogram. On the other end, the Finished() subprogram blocks until some node returns results. It then processes those partial results and returns the identity of the node that is available for another configuration.

Active = 0; /* Number of node processes active */ NextUp = 0; /* Column position for row-0 queen */ Node = 0; /* Node process receiving a configuration */ Lim = (Size+1)/2; /* We only need do the left half */ while (1) { while ( NextUp < Lim && Active < Nnodes ) { if ( NextUp > 0) { Temp = Board[0]; /* Swap Board[0] */ Board[0] = Board[NextUp]; /* with */ Board[NextUp] = Temp; /* Board[NextUp] */ } /* Launch sends the problem to the indicated Node */ Launch (Board, Size, Node); Node++ ; NextUp++ ; Active++ ; } if (Active == 0) /* If after launch attempt, there */ break; /* are no active nodes, we're done */ /* Finished() processes one node's returned information */ Node = Finished(); /* So Node is ready for Launch */ Active--; /* if NextUp < Lim, while loop launches one */ }Several runs were performed on a set of RS/6000 computers in William Reinhardt's research group at the University of Washington. Three computers (ringb, ringc, and ringd) were RS/6000 model 550 computers; five computers (valkyrie4 through valkyrie8) were RS/6000 model 350 computer. All have the same processor and clock speed. The rings have 64KB data caches; the valkyries, 32KB data caches. With the programs used here, the valkyries are approximately 1% faster than the ring computers when all machines are completely unloaded. When the results below were obtained, however, the interactive load and local network traffic generated more significant variations in performance. The individual process CPU times clearly show some nodes executing more configurations than other nodes. The clock time required is tied tightly to the slowest node. For convenience, the single-node run was used as the basis for a speed-up estimate. Again we see the same granularity effects as we did in the shared memory runs.

14 Queens, 7 tasks

CPU Clock Ratio Individual CPU times by process . . . Total Time T(1)/T(n) 1 2 3 4 5 6 7 79.30 79.500 1.000 79.30 78.81 45.000 1.767 44.98 33.80 78.38 33.000 2.409 32.82 22.44 23.11 79.20 23.250 3.419 21.72 22.69 11.62 23.14 78.55 22.500 3.533 21.27 22.42 11.50 11.55 11.80 78.56 21.625 3.676 21.58 10.77 11.50 11.48 11.59 11.62 78.56 12.125 6.557 9.78 10.85 11.49 11.50 11.64 11.61 11.68In addition, there was a set of runs (with 11 through 16 queens), first with only one node and then with the maximum number of nodes required for the problem size. Again, the environment was the set of RS/6000 computers described above (three model 550s and five model 350s). Again, a single-node run is used to estimate the time required for the optimal agorithm for speed-up purposes.

No CPU Clock Individual CPU times by process . . . Queens Total Time 1 2 3 4 5 6 7 8 11 0.5 0.1 0.1 0.1 0.1 0.1 0.1 0.1 12 2.4 0.5 0.4 0.4 0.4 0.4 0.4 0.4 13 14.3 2.3 1.8 2.0 2.1 2.1 2.1 2.1 2.1 14 78.5 11.8 9.8 10.9 11.5 11.5 11.6 11.6 11.6 15 539.9 71.6 59.4 63.3 68.0 68.5 69.6 70.3 70.4 70.4 16 3378.0 451.3 349.0 391.4 423.8 429.2 439.6 444.4 450.6 450.1 No CPU Clock Ratio Computing Queens Total Time T(1)/T(n) Environment 11 0.51 0.625 1 node 11 0.52 0.125 5.000 6 nodes 12 2.37 2.500 1 node 12 2.41 0.500 5.000 6 nodes 13 14.23 14.375 1 node 13 14.31 2.250 6.389 7 nodes 14 78.35 78.500 1 node 14 78.54 11.750 6.681 7 nodes 15 534.05 537.75 1 node 15 539.89 71.625 7.508 8 nodes 16 3371.3 3379.0 1 node 16 3378.0 451.25 7.488 8 nodesEven with only one task on a processor, there is variability in the CPU time required based on the initial first-row queen position. The clock time required for a run is, of course, tied to the execution time of the slowest node process. For instance, the speed-up estimate for the 16-queens run is 7.5 rather than 8.

The Queens Problem is one whose solution can easily be set into a variety of
parallel processing environments. It displays many of the performance features
one would expect in a more applicable problem except that there is no
appreciable communication overhead --- the very characteristic that makes it
easy to parallelize.

When I began working on the fork() implementation of the Queens Problem, the
first multiprocessor runs were performed on an Encore Multimax computer at the
University of Minnesota; I thank William Murray for providing access to
umn-cs.cs.UMn.edu over a holiday week-end in the Fall of 1990. I wish to thank
R. Michael Townsend, Director of Advanced Research Systems at the University of
Chicago for access to rainbow.UChicago.edu, the eight- processor Silicon
Graphics computer used for the fork() and the shared memory runs reported above.
Thanks also go to William Reinhardt of the University of Washington for access
to Valkyrie1.chem.Washington.edu and her siblings, the RS/6000 computers on
which the PVM runs were made and which are supported by NSF grant CHE91-20206 to
the University of Washington. Finally I wish to thank Brian Carlson, my friend
and colleague at Dakota State University, for many fruitful discussions on the
Queens Problem and for his comments on this paper while it was in preparation.

- Doug Cooper and Michael Clancy,
**Oh! Pascal!**(2nd edition; W. W. Norton, 1985), pp. 367-72. - Ellis Horowitz and Sartaj Sahni,
**Fundamentals of Computer Algorithms**(Computer Science Press, 1978), pp. 324-39. - Niklaus Wirth,
**Algorithms and Data Structures**(Prentice-Hall, 1986), pp. 153-57. Here Wirth also notes his paper Program development by stepwise refinement, Comm. ACM, Vol. 14 (1971), 221-27. - Thomas L. Naps,
**Introduction to Data Structures and Algorithm Analysis**(2nd edition; West Publishing Company, 1992), pp. 460-61. The extension from eight queens to N queens is a natural one, however, and the bulk of the implementations were done before I read Naps' treatment. - On the generation of permutation vectors in "alphabetic" order, see
Timothy J. Rolfe,
*Generation of permutations with non-unique elements*, SIGNUM Newsletter, Vol. 23, No. 2 (Spring 1988), pp. 24-28. - Timothy J. Rolfe, "PVM: an affordable parallel processing environment",
**SCCS Proceedings: 27th Annual Small College Computing Symposium**(1994), pp. 118-25. - A. Beuelin, J. J. Dongara, G. A. Geist, W. Jiang, R. Manchek, K. Moore,
and V. S. Sunderam,
*The PVM project*(Oak Ridge National Laboratories: 15 Feb 1993). This is available by anonymous ftp as /netlib/pvm3/writeup.ps.Z from netlib.att.com.Note, however, that the runs reported in this paper were performed with version 2.4.2, the version installed on the RS/6000 computers used.

- G. A. Geist, A. Beuelin, J. J. Dongara, W. Jiang, R. Manchek, K. Moore,
and V. S. Sunderam,
**PVM 3 User's Guide and Reference Manual**(Oak Ridge National Laboratories: 21 Sep 1994). This is available by anonymous ftp as /netlib/pvm3/ug.ps.Z from netlib.att.com.

Figure 3: Time required on a 25 MHz 486SLC laptop computer.

Figure 4: Speed-up ratios compared with unoptimized version.

In this version of the paper, two figures are added from the presentation at SCCS-95 in Sioux Falls, displaying the "worse than exponential" behavior of the solutions to this problem. If the problem is exponential --- O(k^N) --- then the ratio of T(N)/T(N-c) will be a constant; if the ratio is an increasing function, then the behavior is worse than exponential. Because only half of the column positions in the first row are checked, the number of starting configurations changes on the odd values of N.

Figure 5: Adjacent time ratios [T(N) / T(N-1)] for each algorithm.

Figure 6: "Near-adjacent" time ratios [T(N) / T(N-2)] for each algorithm.

To examine actual C code for implementing these Queens Problem solutions, click here.