Geographical distance: Difference between revisions

Revision as of 16:37, 1 February 2014

Parallel-TEBD is a version of the TEBD algorithm adapted to run on multiple hosts. TEBD can be parallelized in several ways.

As a first option, one could use the OpenMP API (probably the simplest approach), marking with compiler directives (pragmas) the portions of the code that should be parallelized. The drawback is that one is confined to symmetric multiprocessing (SMP) architectures and has limited control over how the code is parallelized. An Intel extension of OpenMP, called Cluster OpenMP [2], is a socket-based implementation of OpenMP that can use a whole cluster of SMP machines; it spares the user from explicitly writing messaging code while giving access to multiple hosts via a distributed shared-memory system. The OpenMP paradigm (and hence its extension Cluster OpenMP) allows a straightforward parallelization of serial code by embedding a set of directives in it.

The second option is to use the Message Passing Interface (MPI) API. MPI can treat each core of a multi-core machine as a separate execution host, so a cluster of, say, 10 compute nodes with dual-core processors appears as 20 compute nodes over which the MPI application can be distributed. MPI offers the user more control over the way the program is parallelized. The drawback of MPI is that it is not very easy to implement, and the programmer needs a certain understanding of parallel simulation systems.

For the determined programmer the third option would probably be the most appropriate: to write one's own routines, using a combination of threads and TCP/IP sockets. The threads are needed to make the socket-based communication between the programs non-blocking: the communication takes place in separate threads, so the main thread does not have to wait for it to finish and can execute other parts of the code. This option offers the programmer complete control over the code and avoids any overhead that might come from the Cluster OpenMP or MPI libraries.
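The thread-plus-socket idea can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the names `recv_exact` and `overlap_demo` are hypothetical, a local `socket.socketpair()` stands in for a real TCP connection between two hosts, and the "neighbour" is simulated by a second thread in the same process.

```python
import socket
import threading


def recv_exact(sock, nbytes):
    """Read exactly nbytes from sock (recv may return fewer bytes per call)."""
    buf = bytearray()
    while len(buf) < nbytes:
        chunk = sock.recv(nbytes - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def overlap_demo(payload=b"boundary data"):
    """Receive data in a helper thread while the main thread keeps working."""
    # Assumption: a connected local socket pair replaces a real TCP link.
    ours, theirs = socket.socketpair()
    received = {}

    def receive():
        # Runs in a helper thread, so the blocking recv does not stall main.
        received["data"] = recv_exact(ours, len(payload))

    def neighbour():
        # Stand-in for the neighbouring program sending its boundary data.
        theirs.sendall(payload)

    receiver = threading.Thread(target=receive)
    peer = threading.Thread(target=neighbour)
    receiver.start()
    peer.start()

    # The main thread does unrelated local work while the transfer is pending.
    local_result = sum(i * i for i in range(1000))

    # Only now do we wait for the communication to finish.
    receiver.join()
    peer.join()
    ours.close()
    theirs.close()
    return local_result, received["data"]


if __name__ == "__main__":
    print(overlap_demo())
```

The main thread blocks only at `receiver.join()`, which plays the role that a wait call plays in a message-passing library.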

This article introduces the conceptual basis of the implementation, using MPI-based pseudo-code for illustration without restricting itself to MPI; the same basic scheme could be implemented with home-grown messaging routines.

Introduction

The TEBD algorithm is a good candidate for parallel computing because the exponential operators used to calculate the time-evolution factorize under the Suzuki-Trotter expansion. A detailed presentation of the way TEBD works is given in the main article. Here we concern ourselves only with its parallel implementation.

Implementation

For our purposes, we will use the canonical form of the matrix product state (MPS) as introduced by Guifré Vidal in his original papers. Hence, we will write the state $|\Psi \rangle$ as:

$|\Psi \rangle =\sum \limits _{i_{1},..,i_{N}=1}^{M}\sum \limits _{\alpha _{1},..,\alpha _{N-1}=0}^{\chi }\Gamma _{\alpha _{1}}^{[1]i_{1}}\lambda _{\alpha _{1}}^{[1]}\Gamma _{\alpha _{1}\alpha _{2}}^{[2]i_{2}}\lambda _{\alpha _{2}}^{[2]}\Gamma _{\alpha _{2}\alpha _{3}}^{[3]i_{3}}\lambda _{\alpha _{3}}^{[3]}\cdot ..\cdot \Gamma _{\alpha _{N-2}\alpha _{N-1}}^{[{N-1}]i_{N-1}}\lambda _{\alpha _{N-1}}^{[N-1]}\Gamma _{\alpha _{N-1}}^{[N]i_{N}}|{i_{1},i_{2},..,i_{N-1},i_{N}}\rangle$

This state describes an N-point lattice, which we would like to simulate on P different compute nodes. Let us suppose, for simplicity, that N = 2k*P, where k is an integer. If we distribute the lattice points evenly among the compute nodes (the easiest scenario), each compute node is assigned an even number of lattice points, 2k. Indexing the lattice points from 0 to N-1 (note that the usual indexing runs from 1 to N) and the compute nodes from 0 to P-1, the lattice points are distributed among the nodes as follows:

 NODE 0: 0, 1, ..., 2k-1
 NODE 1: 2k, 2k+1, ..., 4k-1
 ...
 NODE m: m*2k, ..., (m+1)*2k - 1
 ...
 NODE P-1: (P-1)*2k, ..., N-1
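The distribution above amounts to simple index arithmetic. The following Python sketch reproduces it; the function names `sites_on_node` and `node_of_site` are illustrative choices and do not come from any existing TEBD code.

```python
def sites_on_node(m, k):
    """Lattice sites (0-based) held by compute node m when every node holds 2k sites."""
    return list(range(m * 2 * k, (m + 1) * 2 * k))


def node_of_site(l, k):
    """Index of the compute node that holds lattice site l."""
    return l // (2 * k)


if __name__ == "__main__":
    # Example values (assumed for illustration): N = 12 sites on P = 3 nodes.
    N, P = 12, 3
    k = N // (2 * P)
    for m in range(P):
        print(f"NODE {m}:", sites_on_node(m, k))
```

With N = 12 and P = 3 this gives k = 2, so each node holds four consecutive sites.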

Using the canonical form of the MPS, we define $\lambda _{\alpha _{l}}^{[l]}$ as "belonging" to node m if m*2k ≤ l ≤ (m+1)*2k - 1. Similarly, we use the index l to assign each $\Gamma$ tensor to a lattice point. This means that $\Gamma _{\alpha _{0}}^{[0]i_{0}}$ and $\Gamma _{\alpha _{l-1}\alpha _{l}}^{[l]i_{l}}$ for $l=1,\ldots ,2k-1$ belong to NODE 0, as well as $\lambda _{\alpha _{l}}^{[l]}$ for $l=0,\ldots ,2k-2$. A parallel version of TEBD requires the compute nodes to exchange information with one another. The information exchanged consists of the MPS matrices and singular values lying at the border between neighbouring compute nodes. How this is done is explained below.

The TEBD algorithm divides the exponential operator performing the time-evolution into a sequence of two-qubit gates of the form:

$e^{{\frac {-i\delta }{\hbar }}H_{k,k+1}}.$

Setting the Planck constant to 1, the time-evolution is expressed as:

$|\Psi (t+\delta )\rangle =e^{{-i\delta }{\frac {F}{2}}}e^{{-i\delta }G}e^{{-i\delta }{\frac {F}{2}}}|\Psi (t)\rangle ,$

where H = F + G,

$F\equiv \sum _{k=0}^{{\frac {N}{2}}-1}(H_{2k,2k+1})=\sum _{k=0}^{{\frac {N}{2}}-1}(F_{2k}),$

$G\equiv \sum _{k=0}^{{\frac {N}{2}}-2}(H_{2k+1,2k+2})=\sum _{k=0}^{{\frac {N}{2}}-2}(G_{2k+1}).$
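As a brief sketch of why this splitting leads to independent two-site gates, note that the expression for $|\Psi (t+\delta )\rangle$ is the second-order Suzuki-Trotter approximation, and that the terms inside F (and likewise inside G) act on disjoint pairs of sites and therefore commute with one another. Here j is used as the summation index to avoid confusion with the k that counts lattice points per node:

$e^{-i\delta (F+G)}=e^{-i\delta {\frac {F}{2}}}e^{-i\delta G}e^{-i\delta {\frac {F}{2}}}+O(\delta ^{3}),$

$[F_{2j},F_{2j'}]=0\;\Rightarrow \;e^{-i\delta {\frac {F}{2}}}=\prod _{j=0}^{{\frac {N}{2}}-1}e^{-i{\frac {\delta }{2}}F_{2j}},$

$[G_{2j+1},G_{2j'+1}]=0\;\Rightarrow \;e^{-i\delta G}=\prod _{j=0}^{{\frac {N}{2}}-2}e^{-i\delta G_{2j+1}}.$

The only approximation is in the first line; the two product formulas are exact, which is why the gates within one sweep can be applied in any order, and hence concurrently.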

What we can explicitly compute in parallel is the sequence of gates $e^{{-i}{\frac {\delta }{2}}F_{2k}},e^{{-i\delta }{G_{2k+1}}}.$ Each compute node can apply most of its two-qubit gates without needing information from its neighbours. The compute nodes need to exchange information only at the borders, either where a two-qubit gate crosses the border or where a gate needs data held on the other side. We will now consider all three sweeps, two even and one odd, and see what information has to be exchanged. Let us see what happens on node m during the sweeps.
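To make the notion of gates crossing a border concrete, the short Python sketch below lists the gates of each sweep as pairs of sites and picks out those whose two sites are held by different nodes. The helper names are hypothetical, and the layout assumed is the even distribution of 2k sites per node described earlier.

```python
def node_of_site(l, k):
    """Index of the compute node that holds lattice site l (2k sites per node)."""
    return l // (2 * k)


def gates_by_sweep(N):
    """Return (even, odd) lists of two-site gates as (l, l+1) site pairs."""
    even = [(l, l + 1) for l in range(0, N - 1, 2)]  # F_0, F_2, ..., F_{N-2}
    odd = [(l, l + 1) for l in range(1, N - 1, 2)]   # G_1, G_3, ..., G_{N-3}
    return even, odd


def crossing_gates(gates, k):
    """Gates whose two sites are held by different compute nodes."""
    return [(a, b) for (a, b) in gates if node_of_site(a, k) != node_of_site(b, k)]


if __name__ == "__main__":
    # Example values (assumed for illustration): N = 8 sites, P = 2 nodes, k = 2.
    even, odd = gates_by_sweep(8)
    print("even sweep, crossing:", crossing_gates(even, 2))
    print("odd sweep, crossing:", crossing_gates(odd, 2))
```

Because every node holds an even number of sites, no even-sweep gate straddles a border in this layout; only odd-sweep gates do. An even-sweep gate at the edge of a node can still depend on data held by the neighbouring node, which is the second kind of exchange mentioned above.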

First (even) sweep

The sequence of gates that has to be applied in this sweep is:

$e^{{-i}{\frac {\delta }{2}}F_{m*2k}},e^{{-i}{\frac {\delta }{2}}F_{m*2k+2}},...,e^{{-i}{\frac {\delta }{2}}F_{(m+1)*2k-2}}$

Already for computing the first gate, process m needs information from its lower neighbour, m-1. On the other hand, m doesn't need anything from its "higher" neighbour, m+1, because it has all the information it needs to apply the last gate. So the best strategy for m is to send a request to m-1, postpone the calculation of the first gate, and continue with the calculation of the other gates. This is called non-blocking communication. Let's look at this in detail. The tensors involved in the calculation of the first gate are:[1]

[1] Guifré Vidal, Efficient Classical Simulation of Slightly Entangled Quantum Computations, Phys. Rev. Lett. 91, 147902 (2003)

@@ Line 1: / Line 1: @@
+The '''parallel-TEBD''' is a version of the [[TEBD]] algorithm adapted to run on multiple hosts. The task of parallelizing ''TEBD'' could be achieved in various ways.
+* As a first option, one could use the '''[[OpenMP]]''' [[API]] (probably the simplest approach), marking with compiler directives (pragmas) the portions of the code that should be parallelized. The drawback is that one is confined to [[Symmetric multiprocessing]] (SMP) architectures and has limited control over how the code is parallelized. An Intel extension of ''OpenMP'', called '''Cluster OpenMP''' [http://software.intel.com/en-us/articles/cluster-openmp-for-intel-compilers], is a socket-based implementation of ''OpenMP'' that can use a whole cluster of ''SMP'' machines; it spares the user from explicitly writing messaging code while giving access to multiple hosts via a [[Distributed shared memory|distributed shared-memory]] system. The OpenMP paradigm (and hence its extension Cluster OpenMP) allows a straightforward parallelization of serial code by embedding a set of directives in it.
+* The second option is to use the [[Message Passing Interface]] ('''MPI''') API. MPI can treat each core of a multi-core machine as a separate execution host, so a cluster of, say, 10 compute nodes with dual-core processors appears as 20 compute nodes over which the MPI application can be distributed. MPI offers the user more control over the way the program is parallelized. The drawback of MPI is that it is not very easy to implement, and the programmer needs a certain understanding of parallel simulation systems.
+* For the determined programmer the third option would probably be the most appropriate: to write one's own routines, using a combination of '''[[Thread (computer science)|threads]]''' and '''[[Internet socket|TCP/IP sockets]]'''. The threads are needed to make the socket-based communication between the programs non-blocking: the communication takes place in separate threads, so the main thread does not have to wait for it to finish and can execute other parts of the code. This option offers the programmer complete control over the code and avoids any overhead that might come from the Cluster OpenMP or MPI libraries.
+This article introduces the conceptual basis of the implementation, using ''MPI''-based pseudo-code for illustration without restricting itself to MPI; the same basic scheme could be implemented with home-grown messaging routines.
+==Introduction==
+The TEBD algorithm is a good candidate for [[parallel computing]] because the exponential operators used to calculate the time-evolution factorize under the Suzuki-Trotter expansion. A detailed presentation of the way TEBD works is given in the [[TEBD|main article]]. Here we concern ourselves only with its parallel implementation.
+==Implementation==
+For our purposes, we will use the canonical form of the matrix product state (MPS) as introduced by [[Guifré Vidal]] in his original papers. Hence, we will write the state <math>| \Psi \rangle </math> as:
+<math>| \Psi \rangle=\sum\limits_{i_1,..,i_N=1}^{M}\sum\limits_{\alpha_1,..,\alpha_{N-1}=0}^{\chi}\Gamma^{[1]i_1}_{\alpha_1}\lambda^{[1]}_{\alpha_1}\Gamma^{[2]i_2}_{\alpha_1\alpha_2}\lambda^{[2]}_{\alpha_2}\Gamma^{[3]i_3}_{\alpha_2\alpha_3}\lambda^{[3]}_{\alpha_3}\cdot..\cdot\Gamma^{[{N-1}]i_{N-1}}_{\alpha_{N-2}\alpha_{N-1}}\lambda^{[N-1]}_{\alpha_{N-1}}\Gamma^{[N]i_N}_{\alpha_{N-1}} | {i_1,i_2,..,i_{N-1},i_N} \rangle</math>
+This state describes an '''N'''-point lattice, which we would like to simulate on '''P''' different compute nodes. Let us suppose, for simplicity, that N = 2k*P, where k is an integer. If we distribute the lattice points evenly among the compute nodes (the easiest scenario), each compute node is assigned an even number of lattice points, 2k. Indexing the lattice points from 0 to N-1 (note that the usual indexing runs from 1 to N) and the compute nodes from 0 to P-1, the lattice points are distributed among the nodes as follows:
+  NODE 0: 0, 1, ..., 2k-1
+  NODE 1: 2k, 2k+1, ..., 4k-1
+  ...
+  NODE m: m*2k, ..., (m+1)*2k - 1
+  ...
+  NODE P-1: (P-1)*2k, ..., N-1
+Using the canonical form of the MPS, we define <math>\lambda^{[l]}_{\alpha_l}</math> as "belonging" to node m if m*2k ≤ l ≤ (m+1)*2k - 1. Similarly, we use the index l to assign each <math>\Gamma</math> tensor to a lattice point. This means that
+<math>\Gamma^{[0]i_0}_{\alpha_{0}}</math> and <math>\Gamma^{[l]i_l}_{\alpha_{l-1}\alpha_{l}}</math> for <math>l=1,\ldots,2k-1</math> belong to NODE 0, as well as <math>\lambda^{[l]}_{\alpha_l}</math> for <math>l=0,\ldots,2k-2</math>. A parallel version of TEBD requires the compute nodes to exchange information with one another. The information exchanged consists of the MPS matrices and singular values lying at the border between neighbouring compute nodes. How this is done is explained below.
+The TEBD algorithm divides the exponential operator performing the time-evolution into a sequence of two-qubit gates of the form:
+<math> e^{\frac{-i\delta}{\hbar}H_{k,k+1}}.</math>
+Setting the Planck constant to 1, the time-evolution is expressed as:
+<math>| \Psi(t+\delta) \rangle = e^{{-i\delta}\frac{F}{2}}e^{{-i\delta}G}e^{{-i\delta}\frac{F}{2}}|\Psi(t) \rangle,</math>
+where H = F + G,
+<math>F \equiv \sum_{k=0}^{\frac{N}{2}-1}(H_{2k,2k+1}) = \sum_{k=0}^{\frac{N}{2}-1}(F_{2k}),</math>
+<math>G \equiv \sum_{k=0}^{\frac{N}{2}-2}(H_{2k+1,2k+2}) = \sum_{k=0}^{\frac{N}{2}-2}(G_{2k+1}).</math>
+What we can explicitly compute in parallel is the sequence of gates <math> e^{{-i}\frac{\delta}{2}F_{2k}}, e^{{-i\delta}{G_{2k+1}}}.</math>
+Each compute node can apply most of its two-qubit gates without needing information from its neighbours. The compute nodes need to exchange information only at the borders, either where a two-qubit gate crosses the border or where a gate needs data held on the other side. We will now consider all three sweeps, two even and one odd, and see what information has to be exchanged. Let us see what happens on node ''m'' during the sweeps.
+=== First (even) sweep ===
+The sequence of gates that has to be applied in this sweep is:
+<math>e^{{-i}\frac{\delta}{2}F_{m*2k}}, e^{{-i}\frac{\delta}{2}F_{m*2k + 2}},...,e^{{-i}\frac{\delta}{2}F_{(m+1)*2k - 2}}</math>
+Already for computing the first gate, process ''m'' needs information from its lower neighbour, ''m-1''. On the other hand, ''m'' doesn't need anything from its "higher" neighbour, ''m+1'', because it has all the information it needs to apply the last gate. So the best strategy for ''m'' is to send a request to ''m-1'', postpone the calculation of the first gate, and continue with the calculation of the other gates. This is called [[Non-blocking I/O|non-blocking communication]]. Let's look at this in detail. The tensors involved in the calculation of the first gate are:<ref name=vidal>[[Guifré Vidal]], ''Efficient Classical Simulation of Slightly Entangled Quantum Computations'', PRL 91, 147902 (2003)[http://www.citebase.org/cgi-bin/citations?id=oai:arXiv.org:quant-ph/0301063]</ref>
+<references/>
+[[Category:Computational physics]]
+[[Category:Distributed algorithms]]
