/************************************************************************ MODULE: BasicThreadPool SUMMARY: A simple thread pool class BasicThreadPool, as well as some higher-level macros which facilitite simple parallel for loops. ***************************************************************************/ // ********************** Simple parallel for loops ************************** // // We begin with a description of the higher-level macros for writing simple // parallel for loops. These facilitaties are activated only when NTL is // configured with NTL_THREAD_BOOST=on (which implies NTL_THREADS=on). // However, code that uses these facilties should still compile and run // correctly even when NTL_THREAD_BOOST=off, or even when NTL_THREADS=off, so // this is the simplest way to write parallel for loops across a range of // compile-time and run-time environments. Note that if NTL_THREADS=on, C++11 // features are reqired, but when NTL_THREADS=off, these features are not // required, so the code should compile on older C++ compilers. // // Here is a simple recipe for writing parallel for loop. // // At the start of program execution, your program should execute SetNumThreads(nt); // You can choose nt to be any positive integer, but for best results, it // should correspond to the number of available cores on your machine. // [NOTE: if NTL_THREAD_BOOST=off, this function is still defined, but does // nothing.] // // Now consider the following routine: void mul(ZZ *x, const ZZ *a, const ZZ *b, long n) { for (long i = 0; i < n; i++) mul(x[i], a[i], b[i]); } // We can parallelize it as follows: void mul(ZZ *x, const ZZ *a, const ZZ *b, long n) { NTL_EXEC_RANGE(n, first, last) for (long i = first; i < last; i++) mul(x[i], a[i], b[i]); NTL_EXEC_RANGE_END } // NTL_EXEC_RANGE and NTL_EXEC_RANGE_END are macros that just "do the right // thing". If there are nt threads available, the interval [0..n) will be // partitioned into (up to) nt subintervals, and a different thread will be // used to process each subinterval. You still have to write the for loop // yourself: the macro just declares and initializes variables "first" and // "last" (or whatever you want to call them) of type long that represent the // subinterval [first..last) to be processed by one thread. // // Note that the current thread participates as one of the nt available // threads, and that the current thread will wait for all other participating threads // to finish their task before proceeding. The current thread can be identified // as the one with first == 0. // // Withing the "body" of this construct, you can freely reference any variables // that are visible at this point. This is implemented using the C++ lambda // feature (capturing all variables by reference). // // This construct will still work even if threads are disabled, in which case // it runs single-threaded with first=0 and last=n. // // Note that the code within the EXEC_RANGE body could call other routines that // themselves attempt to execute an EXEC_RANGE: if this happens, the latter // EXEC_RANGE will detect this and run single-threaded. // // You may wish to do other things within the EXEC_RANGE body than just execute // a loop. One thing you may want to do is to declare variables. Another // thing you may want to do is setup a local context for a ZZ_p modulus (or // other type of modulus). Here is an example of doing this: void mul(ZZ_p *x, const ZZ_p *a, const ZZ_p *b, long n) { ZZ_pContext context; context.save(); NTL_EXEC_RANGE(n, first, last) context.restore(); for (long i = first; i < last; i++) mul(x[i], a[i], b[i]); NTL_EXEC_RANGE_END } // Another useful function is AvailableThreads(), which will return the number // of available threads. If threads or thread boosting is not enabled, this // will return 1. Even if thread boosting is enabled, this may return 1 if for // whatever reason, the thread pool is not available for use (for example, // SetNumThreads was never called, or the thread pool is already active). // // A lower-level set of tools is available, which allow you to simply run a // specified number of threads. Assuming nt <= AvailableThreads(), the code NTL_EXEC_INDEX(nt, index) ... code ... NTL_EXEC_INDEX_END // will execute the body on nt different threads, each with a unique index in // the range [0..nt). A variable named "index" (or whatever name you specify) // of type long will hold the given index. Just as with EXEC_RANGE, the current // thread will participate as one of the nt threads, and will always be // assigned an index of 0. // // This tool is useful if you need to manage memory a bit more carefully. For // example, the following code will compute an inner product using all // available threads: ZZ InnerProd(const ZZ *a, const ZZ *b, long n) { PartitionInfo pinfo(n); long cnt = pinfo.NumIntervals(); Vec<ZZ> acc; acc.SetLength(cnt); NTL_EXEC_INDEX(cnt, index) long first, last; pinfo.interval(first, last, index); ZZ& sum = acc[index]; sum = 0; for (long i = first; i < last; i++) MulAddTo(sum, a[i], b[i]); NTL_EXEC_INDEX_END ZZ sum; sum = 0; for (long i = 0; i < cnt; i++) sum += acc[i]; return sum; } // This example also illustrates the class PartitionInfo, which is useful for // partitioning a large interval into smaller intervals (it is used internally // by EXEC_RANGE). The constructor takes a single argument (in this example n) // and computes a partition of [0..n) into nearly equally sized subintervals. // The method NumIntervals() returns the number of subintervals, and the method // interval(first, last, index) sets first and last according to the endpoints // of the subinterval [first..last) with the given index. // // So in this example, cnt threads will run, each accumulating a sum into a // corresponding element of the vector acc, and afterwords, these elements are // summed. // // Note that if threads are not enabled or otherwise unavailable, the above // code will compile and run correctly (just using one thread). // // Finally, there is a "guarded" version of NTL_EXEC_RANGE called // NTL_GEXEC_RANGE. This allows one to dynamically "guard" against parallel // execution. For example, on very small problems the runtime overhead of a // parallel for loop may not be worthwhile, or in other situations parallel // execution could cause incorrect behavior. See below for details. // ************************** Thread Pools ****************************** // // The above facilities are built on top of a more general thread pool class, // which you may use for your own purposes. // // You create a thread pool by constructing a BasicThreadPool object. For // example: long nthreads = 4; BasicThreadPool pool(nthreads); // creates a thread pool of 4 threads. These threads will exist until the // destructor for pool is called. // // The simplest way to use a thread pools is as follows. Suppose you have a // task that consists of sz subtasks, indexed 0..sz-1. Then you can write: pool.exec_range(sz, [&](long first, long last) { for (long i = first; i < last; i++) { ... code to process subtask i ... } } ); // The second argument to exec_range is a C++11 "lambda". The "[&]" indicates // that all local variables in the calling context are captured by reference, // so the lambda body can reference all visible local variables directly. // C++11 provides other methods for capturing local variables. The interval // [0..sz) is partitioned into subintervals of the form [first..last), which // are processed by the code in the supplied lambda. // // A lower-level interface is also provided. One can write: pool.exec_index(cnt, [&](long index) { ... code to process index i ... } ); // This will activate exactly cnt threads with indices 0..cnt-1, and execute // the given code on each index. The parameter cnt must not exceed nthreads, // otherwise an error is raised. // ==================================================================== // // NOTES: // // When one activates a thread pool with nthreads threads, the *current* thread // (the one activating the pool) will also participate in the computation. // This means that the thread pool only contains nthreads-1 other threads. // // If, during an activation, any thread throws an exception, it will be caught // and rethrown in the activating thread when all the threads complete. If // more than one thread throws an exception, the first one that is caught is // the one that is rethrown. // // Methods are also provided for adding, deleting, and moving threads in and // among thread pools. // // If NTL_THREADS=off, the corresponding header file may be included, but the // BasicThreadPool class is not defined. // // Unlike most classes in NTL, the BasicThreadPool is not relocatable and hence // cannot be used in a Vec. One should first wrap it in a pointer class, such // as UniquePtr. // class BasicThreadPool: provided basic functionality for thread pools class BasicThreadPool { private: BasicThreadPool(const BasicThreadPool&); // disabled void operator=(const BasicThreadPool&); // disabled public: explicit BasicThreadPool(long nthreads); // creates a pool with nthreads threads, including the current thread // (so nthreads-1 other threads get created) template<class Fct> void exec_range(long sz, const Fct& fct); // activate by range (see example usage above) template<class Fct> void exec_index(long cnt, const Fct& fct); // activate by index (see example usage above) void add(long n = 1); // add n threads to the pool long NumThreads() const; // return number of threads (including current thread) void remove(long n = 1); // remove n threads from the pool void move(BasicThreadPool& other, long n = 1) // move n threads from other pool to this pool bool active() const; // indicates an activation is in process: invoking any of the methods // exec_index, exec_range, add, remove, move, or the destructor // whie active will raise an error template<class Fct> static void relaxed_exec_range(BasicThreadPool *pool, long sz, const Fct& fct); // similar to pool->exec_range(sz, fct), but will still work even // if !pool or pool->active(), using just the current thread template<class Fct> static void relaxed_exec_index(BasicThreadPool *pool, long cnt, const Fct& fct); // similar to pool->exec_index(cnt, fct), but will still work even // if !pool or pool->active(), provided cnt <= 1, using just the current thread }; // THREAD BOOSTING FEATURES: void SetNumThreads(long nt); // convenience routine to set NTL's thread pool. // If called more than once, the old thread pool is destroyed and // replaced by a new one. // If NTL_THREAD_BOOST=off, then this is still defined, but does nothing. long AvailableThreads(); // Number of threads currently availble to use in NTL's thread pool. This is // always at least 1 (for the current thread). // If NTL_THREAD_BOOST=off, then this is still defined, and always returns 1. BasicThreadPool *GetThreadPool(); void ResetThreadPool(BasicThreadPool *pool = 0); BasicThreadPool *ReleaseThreadPool(); // Routines to get and set NTL's thread pool. The interfaces parallel NTL's // UniquePtr class, and indeed, behind the scenes, NTL's thread pool is stored // as a UniquePtr<BasicThreadPool>. // These are only declared when NTL_THREAD_BOOST=on. #define NTL_EXEC_RANGE(sz, first, last) ... #define NTL_EXEC_RANGE_END ... #define NTL_EXEC_INDEX(cnt, index) ... #define NTL_EXEC_INDEX_END ... // convenience macros to implement "parallel for loops" using NTL's thread // pool. See examples above for usage. If NTL_THREAD_BOOST=off, then these // are still defined, and code will run on a single thread #define NTL_GEXEC_RANGE(seq, sz, first, last) ... #define NTL_GEXEC_RANGE_END ... // "guarded" version of NTL_EXEC_RANGE: if seq evaluates to true, the code runs // on a single thread. This is useful in avoiding situations where the // overhead of a parallel loop is too high. If seq evaluates to the constant // true, a good compiler will optimize code to run on a single thread, with no // overhead. #define NTL_IMPORT(x) // To be used in conjunction with NTL_EXEC_RANGE and friends. When // NTL_THREAD_BOOST=on, this will copy the variable named x from the enclosing // scope to a local copy. This should only be used for types with cheap // copies, such as scalars and pointers. In some situations, this allows the // compiler to optimize a bit more aggressively. One or more of these may be // placed right after an NTL_EXEC_RANGE. // When NTL_THREAD_BOOST=off, this is still defined, and does nothing. // class PartitionInfo: A helper class to facilitate partitioning an interval // into subintervals. NOTE: this class is available, even when // NTL_THREAD_BOOST=off. class PartitionInfo { public: explicit PartitionInfo(long sz, long nt = AvailableThreads()); // partitions [0..sz) into at most nt subintervals. sz may be 0 or // negative, in which case the number of subintervals is 0. long NumIntervals() const; // return the number of subintervals void interval(long& first, long& last, long i) const; // [first..last) is the ith interval, where i in [0..NumInvervals()). No // range checking is performed. };