ant colony optimization (ACO) meta-heuristic [18], which defines a particular class of ant A laboratory colony of Argentine ants (Iridomyrmex humilis) is given. Ant Colony Optimization (ACO) [31, 32] is a recently proposed metaheuristic ap- The first example of such an algorithm is Ant System (AS) [29, 36, 37, 38]. colony optimization is the foraging behavior of real ant colonies. An example of a Gaussian kernel PDF consisting of five separate Gaussian functions.


Ant Colony Optimization. Marco Dorigo. Thomas Stützle. A Bradford Book. The MIT Press. Cambridge, Massachusetts. London, England. Ant Colony Optimization (ACO) is a paradigm for designing metaheuristic algo- . be performed by single ants, such as the invocation of a local optimization. real ant colonies are solving shortest path problems. Ant Colony Optimization takes elements from real ant behavior to solve more complex problems than real .

At the solution element level, the main operations that are considered for parallelization are the state transition rule and solution evaluation. On the other hand, Delisle et al. Message passing vs. The ant's movement is based on 4-connected pixels or 8-connected pixels. Journal of Systems Architecture, Gambardella et M.

For each execution, the reported computation time is that of the last colony to finish its search, and the reported tour length is that of the colony that found the best solution. We first notice that this implementation scales well: speedups are relatively close to the number of cores in all configurations. Also, because each core performs the computations associated with a whole ant colony, the workload in the parallel region is considerably large.

The ratio between parallelism costs and total execution time per core is then greatly reduced. Table 3 provides results obtained with multiple cooperating colonies. Every 10 iterations, the global best solution is used for the global pheromone update. For the remaining iterations, each colony uses its own best known solution to update its pheromone structure. We first note that the exchange strategy does not significantly hurt the execution time as speedups are still excellent with up to 8 processors.
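The exchange schedule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_update_tours`, the `(tour_length, tour)` pair layout, and counting iterations from 1 are assumptions made for the example.

```python
def select_update_tours(colony_bests, iteration, exchange_period=10):
    """Pick the solution each colony uses for its pheromone update.

    colony_bests: list of (tour_length, tour) pairs, one per colony
                  (hypothetical layout chosen for this sketch).
    iteration:    iteration counter, assumed to start at 1.
    """
    if iteration % exchange_period == 0:
        # Exchange step: every colony reinforces the global best solution.
        global_best = min(colony_bests, key=lambda pair: pair[0])
        return [global_best for _ in colony_bests]
    # Otherwise each colony keeps using its own best known solution.
    return list(colony_bests)
```

With three colonies, a call at iteration 10 returns the same shortest tour three times, while a call at iteration 7 returns each colony's own best unchanged.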

Still, when 4 and 8 processors are used, most efficiency measures are slightly inferior to the ones obtained with independent colonies. Algorithmic Models and Hardware Implementations 11 http: Multiple independent colonies: Multiple cooperating colonies - Global best exchange each 10 cycles: Concerning solution quality, the reader may observe that in all cases, the average tour length obtained with multiple cooperating colonies is closer to the optimal solution than with independent colonies or sequential execution.

In most cases, the minimum solution found is also better. This shows that the information exchange scheme, while simple, is beneficial to global solution quality. Ants are associated with blocks and solution elements are associated with threads. As shown below, ants may communicate through the relatively slow device memory of the GPU, and solution elements may do so through the faster shared memory of a multiprocessor.

As the ACO is not parallelized at the colony and iteration levels, their execution remains sequential and their memory structure is not specified.

This implementation is global local. Device memory is relatively large in size but slow in access time. The global and local memory spaces are specific regions of the device memory that can be accessed in read and write modes. On the other hand, local memory stores automatic data structures that consume more registers than available.

It is composed of a constant memory cache, a texture memory cache, a shared memory and registers. Constant and texture caches are linked to the constant and texture memories that are physically located in the device memory. Consequently, they are accessible in read-only mode by the SPs and faster in access time than the rest of the device memory. The constant memory is very limited in size whereas texture memory size can be adjusted in order to occupy the available device memory.

All SPs can read and write in their local shared memory, which is fast in access time but small in size. It is divided into memory banks of bits words that can be accessed simultaneously. This implies that parallel requests for memory addresses that fall into the same memory bank cause the serialization of accesses [33]. Registers are the fastest memories available on a GPU but involve the use of slow local memory when too many are used.
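The serialization caused by bank conflicts can be estimated with a small model. The sketch below is an assumption-laden illustration: the default of 32 banks, word-granular addressing, and the treatment of identical addresses as a single (broadcast) request are choices made for the example and are not taken from the text.

```python
from collections import Counter


def conflict_degree(word_addresses, num_banks=32):
    """Return the largest number of distinct words requested from one bank.

    word_addresses: word indices requested by the threads of one group.
    num_banks:      assumed number of shared memory banks.
    A result of 1 means conflict-free; k means roughly k serialized accesses.
    """
    per_bank = Counter()
    # Identical addresses are counted once (assumed to be served together).
    for address in set(word_addresses):
        per_bank[address % num_banks] += 1
    return max(per_bank.values(), default=0)
```

Under these assumptions, 32 threads reading consecutive words give a degree of 1, a stride of 2 gives 2, and a stride of 32 sends every request to the same bank.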

Moreover, accesses may be delayed due to register read-after-write dependencies and register memory bank conflicts. The CUDA programming model is based on the concept of kernels, which are functions written in C and executed in parallel by a given number of CUDA threads.

These threads are grouped together into blocks that are distributed on the GPU SMs to be executed independently of each other. However, the number of blocks that an SM can process at the same time (active blocks) is restricted and depends on the quantity of registers and shared memory used by the threads of each block.

Threads within a block can cooperate by sharing data through the shared memory and by synchronizing their execution to coordinate memory accesses. In a block, the system groups threads (typically 32) into warps, which are executed simultaneously on successive clock cycles. The number of threads per block should be a multiple of the warp size to maximize efficiency. Much of the global memory latency can then be hidden by the thread scheduler if there are sufficient independent arithmetic instructions that can be issued while waiting for the global memory access to complete.
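The advice on block sizes follows from simple arithmetic on warps. The helper names below are hypothetical, and the warp size of 32 is the typical value quoted in the text rather than a guaranteed constant.

```python
def warps_per_block(threads_per_block, warp_size=32):
    """Number of warps needed to hold a block (last warp may be partly empty)."""
    return (threads_per_block + warp_size - 1) // warp_size


def lane_utilization(threads_per_block, warp_size=32):
    """Fraction of warp lanes that actually carry a thread of the block."""
    lanes = warps_per_block(threads_per_block, warp_size) * warp_size
    return threads_per_block / lanes
```

A block of 64 threads fills two warps completely, whereas a block of 33 threads also occupies two warps but leaves almost half of the lanes idle.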

Consequently, the more active blocks (and therefore active warps) there are per SM, the more the latency can be hidden. It is important to note that in the context of GPU execution, flow control instructions (if, switch, do, for, while) can affect the efficiency of an algorithm.

In fact, depending on the provided data, these instructions may force threads of the same warp to diverge, in other words, to take different paths in the program. In that case, execution paths must be serialized, increasing the total number of instructions executed by this warp. In the parallel ants general strategy, ants of a single colony are distributed to processing elements in order to execute tour constructions in parallel.

On a conventional CPU architecture, the concept of processing element is usually associated with a single-core processor or with one of the cores of a multi-core processor. A dedicated thread of a given block is then in charge of managing the tour construction of an ant, but an additional level of parallelism, the solution element level, may be exploited in the computation of the state transition rule.
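The cost of divergence can be illustrated with a deliberately simplified model in which each distinct path taken inside a warp costs one serialized pass. The function name, the path labels, and the one-pass-per-path cost are assumptions of this sketch, not a description of actual hardware scheduling.

```python
def serialized_passes(paths, warp_size=32):
    """Count execution passes when each warp runs its distinct paths in turn.

    paths: one path label per thread, in thread order (hypothetical labels).
    """
    total = 0
    for start in range(0, len(paths), warp_size):
        warp = paths[start:start + warp_size]
        # Threads on the same path run together; different paths are serialized.
        total += len(set(warp))
    return total
```

For 64 threads, branching on the parity of the thread index makes both warps diverge (4 passes), while branching on the warp index keeps each warp on a single path (2 passes).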

In fact, an ant evaluates several candidates before selecting the one to add to its current solution. As these evaluations can be done in parallel, they are assigned to the remaining threads of the block. However, as only one ant is assigned to a block and so to an SM, taking advantage of the shared-memory is possible. Data needed to compute the ant state transition rule is then stored in this memory that is faster and accessible by all threads that participate in the computation.
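The candidate evaluations that are spread over the threads of a block can be sketched sequentially as below. The text does not spell out the rule, so this example assumes the usual random proportional rule of Ant System, with weight pheromone^alpha * (1/distance)^beta; the function names and the parameters `alpha` and `beta` are assumptions. Each list element in `candidate_weights` corresponds to the work one thread would do.

```python
import random


def candidate_weights(current, candidates, pheromone, distance, alpha=1.0, beta=2.0):
    """Weight of each candidate; every entry is independent of the others."""
    return [
        (pheromone[current][j] ** alpha) * ((1.0 / distance[current][j]) ** beta)
        for j in candidates
    ]


def choose_next(current, candidates, pheromone, distance, rng, alpha=1.0, beta=2.0):
    """Draw the next solution element with probability proportional to its weight."""
    weights = candidate_weights(current, candidates, pheromone, distance, alpha, beta)
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Only the final selection needs the weights of all candidates at once, which is why the per-candidate computation parallelizes well while the draw itself requires the threads to synchronize.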

Most remaining issues encountered in the GPU implementation of the parallel ants general strategy are related to memory management. As mentioned before, these accesses may be reduced by storing the related data structures in shared memory. However, in the case of ACO, the three central data structures are the pheromone matrix, the penalty matrix (typically the transition cost between all pairs of solution elements) and the candidate lists, which are needed by all ants of the colony while being too large (typically ranging from O(n) to O(n^2) in size) to fit in shared memory.

They are then kept in global memory. Minimums and averages are computed from 25 trials for problems with less than cities and from 10 trials for larger instances.

An effort is made to keep the algorithm and parameters as close as possible to the original MMAS. Following the guidelines of Barr and Hickman [36] and Alba [37], the relative speedup metric is computed on mean execution times to evaluate the performance of the proposed implementation.
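A speedup computed on mean execution times can be sketched as below. The helper names are hypothetical, and the choice of baseline runs (here simply called `baseline_times`) is an assumption, since the text does not detail how the reference times were obtained.

```python
from statistics import mean


def relative_speedup(baseline_times, parallel_times):
    """Ratio of mean baseline time to mean parallel time over repeated trials."""
    return mean(baseline_times) / mean(parallel_times)


def efficiency(speedup, processing_elements):
    """Speedup normalized by the number of processing elements used."""
    return speedup / processing_elements
```

Averaging the times before taking the ratio, rather than averaging per-trial ratios, keeps a single unusually fast or slow trial from dominating the reported figure.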

The implementation uses a number of blocks equal to the number of ants, each one of them being composed of a number of threads equal to the size of candidate lists, in that case Also, the number of iterations is set with the intent of globally keeping the same global number of tour constructions for each experiment. A first step in our experiments is to compare solution quality obtained by sequential and parallel versions of the algorithm. Table 4 presents average tour length, best tour length and closeness to the optimal solution for each problem.

The reader may note the similarity between the results obtained by our sequential implementation and the ones provided by the authors of the original MMAS [35] , as well as their significant closeness to optimal solutions. A second step is to evaluate and compare the reduction of execution time that is obtained with the GPU parallelization strategy.

Table 4 shows the speedups obtained for each problem. The reader may notice that speedups are ranging from 6. This shows that distributing ants to blocks and sharing the computation of the state transition rule between several threads of a block is efficient.

Also, speedup generally increases with problem size, indicating the good scalability of the strategy. However, a slight decrease is encountered with the cities problem. In that case, the large workload and data structures imply memory access latencies and bank conflict costs that grow faster than the benefits of parallelizing available work.

Combined with the increasing number of blocks required to perform computations and the limited number of active blocks per SM, this makes performance gains less significant.

Conclusion

The main objective of this chapter was to provide a new algorithmic model to formalize the implementation of Ant Colony Optimization on high performance computing platforms.

The proposed taxonomy managed to capture important features related to both the algorithmic structure of ACO and the architecture of parallel computers. Case studies were also presented in order to illustrate how this classification translates into real applications. Finally, with its synthesized literature review and experimental study, this chapter served as an overview of current works on parallel ACO.

However, as is the case in the field of parallel metaheuristics in general, much can still be done for the effective use of state-of-the-art parallel computing platforms. For example, maximal exploitation of computing resources often requires algorithmic configurations that do not let ACO perform an effective exploration and exploitation of the search space.

On the other hand, parallel performance is strongly influenced by the combined effects of parameters related to the metaheuristic, the hardware technical architecture and the granularity of the parallelization.

As it becomes clear that the future of computers no longer relies on increasing the performance on a single computing core but on using many of them in a hybrid system, it becomes desirable to adapt optimization tools for parallel execution on many kinds of architectures.

We believe that the global acceptance of parallel computing in optimization systems requires algorithms and software that are not only effective, but also usable by a wide range of academicians and practitioners.

References

[1] M. Dorigo and T. Bullnheimer, G. Kotsis, and C. Parallelization strategies for the ant system. De Leone, A.

Murli, P. Pardalos, and G. Kluwer, Dordrecht, Parallelisation strategies for ant colony optimization. Eiben, T. Schwefel, and M. Springer-Verlag, New York, Pedemonte, S. Nesmachnow, and H. A survey on parallel ant colony optimization. Applied Soft Computing, Delisle, M. Krajecki, M. Gravel, and C. Parallel implementation of an ant colony optimization metaheuristic with openmp. Gravel, M. Krajecki, C.

A shared memory parallel implementation of ant colony optimization. Comparing parallelization of an aco: Message passing vs. Blesa, C. Blum, A. Roli, and M. Springer-Verlag Berlin Heidelberg, Gravel, and M. Multi-colony parallel ant colony optimization on smp and multi-core computers. IEEE, Parallel ant colony optimization on graphics processing units. Arabnia, S. Chiu, G.

Gravvanis, M. Ito, K. Joe, H. Nishikawa, and A. Journal of Parallel and Distributed Computing, page doi: Talbi, O. Roux, C. Fonlupt, and D. Parallel ant colonies for the quadratic assignment problem. Future Generation Computer Systems, 17 4: Randall and A. A parallel implementation of ant colony optimization. Journal of Parallel and Distributed Computing, 62 9: Islam, P.

Thulasiraman, and R. A parallel ant colony optimization algorithm for all-pair routing in manets. Craus and L. Parallel framework for ant-like algorithms. Doerner, R. Hartl, S. Benker, and M. Parallel cooperative savings based ant colony optimization - multiple search and decomposition approaches. Parallel Processing Letters, 16 3: Middendorf, F. Reischle, and H.

Multi colony ant algorithms. Journal of Heuristics, 8 3: Chu and A. Parallel ant colony optimization for 3d protein structure prediction using the hp lattice model.

Nedjah, L. Alba, editors, Parallel Evolutionary Computations, volume 22 of Studies in Computational Intelligence, chapter 9, pages — Springer, Manfrin, M. Birattari, T. Parallel ant colony optimization for the traveling salesman problem. Ellabib, P. Calamai, and O. Exchange strategies for multiple ant colony system. Information Sciences, 5: Alba, G. Leguizamon, and G. Two models of parallel aco algorithms for the minimum tardy task problem. Scheuermann, K.

So, M. Guntsch, M. Middendorf, O. Diessel, H. ElGindy, and H. Fpga implementation of population-based ant colony optimization. Applied Soft Computing, 4: Scheuermann, S. Janson, and M. Hardware-oriented ant colony optimization. Journal of Systems Architecture, Catala, J. Jaen, and J. Strategies for accelerating ant colony optimization algorithms on graphical processing units.

IEEE Press, Wang, J. Dong, and C. Implementation of ant colony algorithm based on gpu. Banissi, M. Sarfraz, J. Zhang, A. Ursyn, W. Jeng, M. Bannatyne, J. Zhang, L. San, and M. New Advances and Trends, pages 50— Parallel ant system for traveling salesman problem on gpus. Zhu and J. Parallel ant colony for nonlinear function optimization with graphics hardware acceleration. Li, X. Hu, Z. Pang, and K. A parallel ant colony optimization algorithm based on fine-grained model with gpu-acceleration.

Cecilia, J. Garcia, A. Nisbet, M. Amos, and M. Weis and A. Using xmpp for ad-hoc grid computing - an application example using parallel ant colony optimisation. A grid ant colony algorithm for the orienteering problem. From Design to Implementation.

Wiley Publishing, Foster and C. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, Dorigo and L. Ant colony system: Max-min ant system. Future Generation Computer Systems, 16 8: Barr and B. Reporting computational experiments with parallel algorithms: Parallel evolutionary algorithms can achieve super-linear performance. Information Processing Letters, 82 1: Angelo, Jaqueline S. Augusto and Douglas A.

Barbosa Helio J. Barbosa

Additional information is available at the end of the chapter

Introduction

Ant colony optimization (ACO) is a population-based metaheuristic inspired by the collective behavior of ants which is used for solving optimization problems in general and, in particular, those that can be reduced to finding good paths through graphs.

In ACO a set of agents (artificial ants) cooperate in trying to find good solutions to the problem at hand [1]. Ant colony algorithms are known to have a significant ability to find high-quality solutions in a reasonable time [2].

In this line, a significant amount of research has been done in order to reduce computation time and improve the solution quality of ACO algorithms by using parallel computing. Due to the independence of the artificial ants, which are guided by indirect communication via their environment (pheromone trail and heuristic information), ACO algorithms are naturally suitable for parallel implementation. Parallel computing has become attractive during the last decade as an instrument to improve the efficiency of population-based methods.

One can highlight different reasons to parallelize an algorithm: In the literature one can find many possibilities on how to exploit parallelism, and the final performance strongly depends on both the problem the algorithm is applied to and the hardware available [3]. In recent years, several works were devoted to the implementation of parallel ACO algorithms [4]. Most of these use clusters of PCs, where the workload is distributed to multiple computers [5].

More recently, the emergence of parallel architectures such as multi-core processors and graphics processing units (GPUs) allowed new implementations of parallel ACO algorithms in order to speed up computation.

The massively parallel architecture of GPUs makes them more efficient than general-purpose CPUs when large amounts of independent data need to be processed in parallel.

The main type of parallelism in ACO algorithms is the parallel ant approach, which is the parallelism at the level of individual ants. Other steps of the ACO algorithms are also considered for speeding up their performance, such as the tour construction, evaluation of the solution and the pheromone update procedure.

The purpose of this chapter is to present a survey of the recent developments for parallel ant colony algorithms on GPU devices, highlighting and detailing parallelism strategies for each step of an ACO algorithm. Ant colonies, like other insects that live in colonies, present interesting characteristics from the viewpoint of collective behavior. Some characteristics of social groups in swarm intelligence are widely discussed in [6].

Among them, ant colonies in particular present a highly structured social organization, making them capable of self-organizing, without a centralized controller, in order to accomplish complex tasks for the survival of the entire colony [2]. Those capabilities, such as division of labor, foraging behavior, brood sorting and cooperative transportation, inspired different kinds of ant colony algorithms.

The first ACO algorithm was inspired by the capability of ants to find the shortest path between a food source and their nest. In all those examples ants coordinate their activities via stigmergy [7], which is an indirect communication mediated by modifications of the environment.

While moving, ants deposit pheromone (a chemical substance) on the ground to mark paths that may be followed by other members of the colony, which then reinforce the pheromone on that path. This behavior leads to a self-reinforcing process that results in paths marked by a high concentration of pheromone, while less used paths tend to have a decreasing pheromone level due to evaporation.

Combinatorial problems

In combinatorial optimization problems one wants to find discrete values for solution variables that lead to the optimal solution with respect to a given objective function.

An interesting characteristic of combinatorial problems is that they are easy to understand but very difficult to be solved [2]. The first developed ACO algorithm, called Ant System [1, 9], was initially applied to the TSP, then later improved and applied to many kinds of optimization problems [10]. The TSP can be symmetric or asymmetric. Using distances associated with each arc as cost values, in the symmetric TSP the distance between cities i and j is the same as between j and i, i.
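The distinction between symmetric and asymmetric instances can be checked directly on a distance matrix. The sketch below is illustrative only: the function names and the nested-list matrix layout are assumptions, and a tour is taken to be a list of city indices that returns to its starting city.

```python
def is_symmetric(distance):
    """True when the cost from i to j equals the cost from j to i for all pairs."""
    n = len(distance)
    return all(distance[i][j] == distance[j][i] for i in range(n) for j in range(n))


def tour_length(tour, distance):
    """Total cost of visiting the cities in order and returning to the start."""
    closed = zip(tour, tour[1:] + tour[:1])  # consecutive pairs, wrapping around
    return sum(distance[a][b] for a, b in closed)
```

In a symmetric instance a tour and its reverse have the same length, which is not guaranteed in the asymmetric case.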

Graphics Processing Unit

Until recently the only viable choice as a platform for parallel programming was the conventional CPU processor, be it single- or multi-core. Usually many of them were arranged either tightly as multiprocessors, sharing a single memory space, or loosely as multicomputers, with the communication among them done indirectly due to the isolated memory spaces. The parallelism provided by the CPU is reasonably efficient and still very attractive, particularly for tasks with a low degree of parallelism, but a new platform for parallel computing has emerged in the past few years: the graphics processing unit, or simply the GPU architecture.

The beginning of the GPU architecture dates back to a couple of decades ago when some primitive devices were developed to offload certain basic graphics operations from the CPU.

Graphics operations, which end up being essentially the task to determine the right color of each individual pixel per frame, are in general both independent and specialized, allowing a high degree of parallelism to be explored.

However, doing such operations on conventional CPU processors, which are general-purpose and back then were exclusively sequential, is slow and inefficient. The advantage of parallel devices designed for such particular purpose was then becoming progressively evident, enabling and inviting a new world of graphics applications.

One of those applications was the computer game, which played an important role on the entire development history of the GPU. As with other graphics applications, games involve computing and displaying—possibly in parallel—numerous pixels at a time. But differently from other graphics applications, computer games were always popular among all range of computer users, and thus very attractive from a business perspective. Better and visually appealing games sell more, but they require more computational power.

This demand, as a consequence, has been pushing forward the GPU development since the early days, which in turn has been enabling the creation of more and more complex games.

Of course, in the meantime the CPU development had also been advancing, with the processors becoming progressively more complex, particularly due to the addition of cache memory hierarchies and many specific-purpose control units such as branch prediction, speculative and out-of-order execution, and so on [11]. The response from the industry to continually raise the computational power was to migrate from the sequential single-core to the parallel multi-core design.

Although today's multi-core CPUs perform fairly well, decades of cumulative architectural optimizations toward sequential tasks have led to big and complex CPU cores, which restricts the number that can be packed on a single processor to no more than a few cores. As a consequence, the current CPU design cannot take advantage of workloads having a high degree of parallelism; in other words, it is inefficient for massive parallelism.

Contrary to the development philosophy of the CPU, the GPU has taken massive parallelism as a design goal since its infancy, because of the requirements of graphics operations.

Filling the processor with numerous ALUs means that there is not much die area left for anything else, such as cache memory and control units. The benefit of this design choice is two-fold: As one can expect, the GPU reaches its peak of efficiency when the device is fully occupied, that is, when there are enough parallel tasks to utilize each one of the thousands of ALUs, as commonly found on a modern GPU.

Although the GPU is highly parallel, this feature alone would not be enough to establish the GPU architecture as a compelling platform for mainstream high-performance computation.

In the early days, the graphics operations were mainly primitive and thus could be more easily and efficiently implemented in hardware through fixed, i. But again, such operations became so complex, particularly in visually-rich computer games, that the GPU was forced to switch to a programmable architecture, where it was possible to execute not only strict graphics operations, but also arbitrary instructions.

The union of an efficient massively parallel architecture with general-purpose capability has created one of the most exciting processors, the modern GPU architecture, outstanding in performance with respect to power consumption, price and space occupied. The following section introduces the increasingly adopted open standard for heterogeneous programming, including of course the GPU, known as OpenCL.

Open Computing Language (OpenCL)

An interesting fact about the CPU and GPU architectures is that while the CPU started as a general-purpose processor and gained more and more parallelism through the multi-core design, the GPU took the opposite path: it started as a highly specialized parallel processor and was increasingly endowed with general-purpose capabilities.

In other words, these architectures have been slowly converging into a common design, although each one still has (and, due to fundamental architectural differences, probably always will have) divergent strengths: These days, most processors are, to some extent, both parallel and general purpose; therefore, it should be possible to come up with a uniform programming interface to target such different but fundamentally related architectures.

This is the main idea behind OpenCL, a platform for uniform parallel programming of heterogeneous systems [14]. OpenCL is an open standard managed by a non-profit organization, the Khronos Group [14], that is architecture- and vendor-independent, so it is designed to work across multiple devices from different manufacturers.

The two main goals of OpenCL are portability and efficiency. Portability is achieved by the guarantee that every supported device conforms with a common set of functionality defined by the OpenCL specification [15].

Host code

The tasks performed by the host portion usually involve:

Kernel code

Since it implements the parallel decomposition of a given problem (a parallel strategy), the kernel is usually the most critical aspect of an OpenCL program, so care should be taken in its design. The OpenCL kernel is similar to the concept of a procedure in a programming language, which takes a set of input arguments, performs computation on them, and writes back the result. The main difference is that when an OpenCL kernel is launched, multiple instances of it are spawned simultaneously, each one assigned to an individual execution unit of a parallel device.

The total number of work-items is referred to as the global size, and defines the level of decomposition of the problem: Figure 1 illustrates the concept of a mapping between the compute and data domains. So, how could one connect the compute and data domains?

Figure 1. Example of a mapping between the compute and data domains.

A pseudo-OpenCL kernel implementing such strategy is presented in Algorithm 1.

Algorithm 1:

For instance, in a 2-D domain a work-item would have two identifiers, globalid 0 and globalid 1, where the first could be mapped to index the row and the second the column of a matrix.
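The mapping from a 2-D compute domain to the rows and columns of a matrix can be emulated sequentially. This sketch is plain Python, not OpenCL: `run_2d` and `scale_kernel` are hypothetical names, and the nested loops stand in for work-items that a real device would execute in parallel.

```python
def run_2d(global_size, kernel, *args):
    """Call the kernel once per point of a 2-D domain, as a launch would."""
    rows, cols = global_size
    for gid0 in range(rows):
        for gid1 in range(cols):
            # Each call plays the role of one work-item with its two identifiers.
            kernel((gid0, gid1), *args)


def scale_kernel(global_id, matrix, factor, out):
    """One work-item: the first identifier indexes the row, the second the column."""
    row, col = global_id
    out[row][col] = factor * matrix[row][col]
```

Because every work-item writes a distinct element of `out`, the instances are independent and the order of execution does not matter.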

The reasoning is analogous for a 3-D domain range. Communication and Synchronization There are situations in which it is desirable or required to allow work-items to communicate and synchronize among them.

For efficiency reasons, such operations are not arbitrarily allowed among work-items across the whole N-D domain. All the work-items within a work-group are free to communicate and synchronize with each other.

The number of work-items per work-group is given by the parameter local size, which in practice determines how the global domain is partitioned. Again, the OpenCL runtime provides means that allow each work-group and work-item to identify themselves.

A work-group is identified with respect to the global N-D domain through groupid , and a work-item is identified locally within its work-group via localid. Compute Device Abstraction In order to provide a uniform programming interface, OpenCL abstracts the architecture of a parallel compute device, as shown in Figure 2.
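The relation between the three identifiers can be written down for a 1-D domain. The sketch assumes no global offset and a global size that is a multiple of the local size; the function names are hypothetical.

```python
def decompose(global_id, local_size):
    """Split a 1-D global identifier into (group identifier, local identifier)."""
    return divmod(global_id, local_size)


def compose(group_id, local_id, local_size):
    """Inverse mapping: rebuild the global identifier of a work-item."""
    return group_id * local_size + local_id


def num_groups(global_size, local_size):
    """Number of work-groups, assuming the domain divides evenly."""
    if global_size % local_size != 0:
        raise ValueError("global size is assumed to be a multiple of local size")
    return global_size // local_size
```

With a local size of 4, the work-item with global identifier 13 is the second item (local identifier 1) of the fourth work-group (group identifier 3).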

There are two fundamental concepts in this abstraction, the compute and memory hierarchies.

Figure 2. Abstraction of a parallel compute device architecture [16].

Not coincidentally, this partitioning matches the software abstraction of work-groups and work-items. In fact, OpenCL guarantees that a work-group is entirely executed on a single compute unit whereas work-items are executed by processing elements. Nowadays GPUs usually have thousands of processing elements clustered in about a dozen compute units. Therefore, to fully utilize such devices, at least as many work-items as processing elements must be in flight; however, the optimal number of work-items in execution should be substantially larger, so that the device has enough room to hide latencies [17, 18].

As for the memories, OpenCL exposes three memory spaces; from the more general to the more specific:

Review of the literature

In the last few years, many works have been devoted to parallel implementations of ACO algorithms on GPU devices, motivated by the powerful massively parallel architecture provided by the GPU.

The strategies applied to the GPU were based on the intrinsic data parallelism provided by the vertex processor and the fragment processor. Both strategies performed similarly with respect to the quality of the obtained solutions.

In [20], the authors implemented a parallel MMAS using multiple colonies, where each colony is associated with a work-group and ants are associated with work-items within each work-group.

In the parallel implementation the CPU initializes the pheromone trails and parameters, and also controls the iteration process, while the GPU is responsible for running the main steps of the algorithm: The parallel GPU version was 2 to 32 times faster than the sequential version, whereas the solution quality of the parallel version outperformed all three MMAS serial versions. This technique requires the use of barrier synchronization in order to ensure consistency of memory.

In the work described in [21] the authors implemented a parallel ACO algorithm with a pattern search procedure to solve continuous functions with bound constraints. The parallel method was compared with a serial CPU implementation.

The computational experiments showed acceleration values between and almost in the parallel GPU implementation. On the other hand, both the parallel and serial versions obtained satisfactory results. However, regarding the solution quality under a time limit of one second, the parallel version outperformed the sequential one in most of the test problems. As a side note, the results could have been even better if the authors had generated the random numbers directly on the GPU instead of precomputing them on the CPU.

The authors proposed an algorithm implementation which arranges the data into large scale matrices, taking advantage of the fact that the integration of MATLAB with the Jacket accelerator handles matrices on the GPU more naturally and efficiently than it could do with other data types.

The speedup values had been growing with the number of TSP nodes, but when the number of nodes reached the growth could not be sustained and slowed down drastically due to the frequent data-transfer operations between the CPU and GPU.

In [23], the authors use the parallel computing power of the GPU to solve pathfinding in games. The proposed ACO algorithm was implemented on a GPU device, with a parallelization strategy similar to the one presented in [19]. In this strategy, ants work in parallel to obtain a solution to the problem.

The author intended to study how the algorithm scales when large problems are solved, compared with a corresponding implementation on a CPU. Details of the hardware architecture were not available, but the computational experiments showed that the GPU version was 15 times faster than its corresponding CPU implementation.

In [24], an ACO algorithm was proposed for epistasis analysis. In order to tackle large-scale problems, the authors proposed a multi-GPU parallel implementation consisting of one, three and six devices.

The experiments show that the results generated by the GPU implementation outperformed two other sequential versions in almost all trials and that, as the dataset grew, the GPU ran faster than the other implementations. Apart from the initialization process, all the algorithm steps are performed on the GPU, and all data (pheromone matrix, set of solutions, etc.) are kept in GPU memory. Therefore, no data needed to be transferred between the CPU and the GPU, except for the best-so-far solution, which is used to check whether the termination condition is satisfied.

