Lai–Robbins lower bound

The Lai–Robbins lower bound gives an asymptotic lower bound on the regret that any uniformly good algorithm must incur in the stochastic multi-armed bandit problem. The original result was proved by Tze Leung Lai and Herbert Robbins in 1985 for parametric exponential families. Later work extended the statement to more general classes of distributions.

== Multi-armed bandit problem ==
The multi-armed bandit problem (MAB) is a sequential game in which the player must trade off exploration (to learn) and exploitation (to earn). The player chooses among $K$ actions (arms) with unknown distributions $\nu = (\nu_1, \dots, \nu_K)$. The player is assumed to know a class of distributions $\mathcal{D}$ such that $\nu_k \in \mathcal{D}$ for every $k$ (for example, $\mathcal{D}$ may be the family of Gaussian or Bernoulli distributions). At each round $t = 1, \dots, T$ the player selects (pulls) an arm $a_t$ and observes a reward $X_t \sim \nu_{a_t}$.
We denote:
$N_a(t) := \sum_{s=1}^{t} \mathbf{1}_{\{a_s = a\}}$, the number of times arm $a$ has been pulled in the first $t$ rounds;
$\mu(\nu) := (\mu_1, \dots, \mu_K)$, the vector of arm means, where $\mu_k = \mathbb{E}_{X \sim \nu_k}[X]$;
$\mu^* := \max_a \mu_a$, the highest mean;
$\Delta_a := \mu^* - \mu_a \geq 0$, the gap of arm $a$.
An arm $a$ with $\mu_a = \mu^*$ is called an optimal arm; otherwise it is a suboptimal arm.
The goal is to minimize the regret at horizon $T$, defined by
$$R_T := \sum_{a=1}^{K} \Delta_a \, \mathbb{E}[N_a(T)].$$
Intuitively, the regret is the (expected) total loss compared to always playing an optimal arm:
$$\text{regret} = \sum_a (\text{cost of playing } a) \times (\text{times } a \text{ is played}).$$
An MAB algorithm is a (possibly randomized) policy that, at each round $t$, chooses an arm $a_t$ using the observations received in previous rounds.

=== Intuitive example ===
Suppose a farmer must choose, each year, one of $K$ seed varieties to plant. Each variety $k$ has an unknown average yield $\mu_k$. If the farmer knew the best variety (with mean $\mu^*$), he would plant it every year; in reality he must try varieties to learn which is best. The cumulative regret after $T$ years measures the total expected loss in yield due to imperfect knowledge.
Remarks
The model above is the stochastic MAB; there also exist adversarial variants.
One may consider a fixed-horizon setting (known $T$) or an anytime setting (unknown $T$).

== Lai–Robbins lower bound ==
The theorem quantifies how many times a suboptimal arm $k$ must be pulled in order to distinguish the instance with $\nu_k$ from an instance with $\tilde{\nu}_k$, where $\tilde{\nu}_k$ is such that $\tilde{\mu}_k > \mu^*$. Knowing a lower bound on the number of pulls of every suboptimal arm gives a lower bound on the regret, since only suboptimal arms contribute to the regret. Before stating the formal theorem, we need to define what a consistent algorithm is.
=== Consistency (uniformly good algorithms) ===
Let $\mathcal{D}$ be a class of probability distributions and consider $K$ arms with reward distributions $\nu = (\nu_1, \dots, \nu_K) \in \mathcal{D}^K$. An algorithm is said to be consistent (also called uniformly good) on $\mathcal{D}^K$ if, for every instance $\nu \in \mathcal{D}^K$, the expected regret $R_T(\nu)$ grows subpolynomially:
$$\forall \alpha > 0, \qquad R_T(\nu) = o(T^{\alpha}) \quad \text{as } T \to \infty.$$
This assumption excludes algorithms that perform well on some instances but incur linear regret on others.

=== Formal lower bound ===
Consider any suboptimal arm $a$. For a distribution $\nu_a \in \mathcal{D}$ and a threshold $x$, define
$$\mathcal{K}_{\inf}(\nu_a, x, \mathcal{D}) := \inf \bigl\{ \operatorname{KL}(\nu_a, \nu') : \nu' \in \mathcal{D},\ \mu' > x \bigr\},$$
where $\mu'$ is the mean of $\nu'$ and $\operatorname{KL}(\cdot, \cdot)$ denotes the Kullback–Leibler divergence.
Then, for any algorithm consistent on $\mathcal{D}^K$ and for every instance $\nu \in \mathcal{D}^K$, every suboptimal arm $a$ satisfies
$$\mathbb{E}_{\nu}[N_a(T)] \geq \frac{\ln T}{\mathcal{K}_{\inf}(\nu_a, \mu^*, \mathcal{D})} + o(\ln T).$$
Consequently, the regret satisfies
$$R_T(\nu) \geq \left( \sum_{a:\, \mu_a < \mu^*} \frac{\Delta_a}{\mathcal{K}_{\inf}(\nu_a, \mu^*, \mathcal{D})} \right) \ln T + o(\ln T).$$
The original 1985 paper established this result for exponential families; later work showed that the bound holds under much weaker assumptions on $\mathcal{D}$.

=== Intuition ===
Consistency imposes that, for every $\nu$, the number of pulls of an optimal arm must be large. This means that $\mu^*$ is estimated very accurately. The goal is to determine, for a suboptimal arm $k$, how many samples are needed to be confident, with the appropriate level of confidence, that $\mu_k < \mu^*$. To do so, we use what is called the most confusing instance: an instance close to $\nu$ in which arm $k$ is optimal. We define it as $\tilde{\nu}$ such that $\tilde{\nu}_a = \nu_a$ for all $a \neq k$, and $\tilde{\nu}_k$ is chosen so that $\tilde{\mu}_k > \mu^*$. The objective is to determine how many samples of arm $k$ are required to distinguish the instance with $\nu_k$ from the instance with $\tilde{\nu}_k$, with the difficulty measured by the $\operatorname{KL}$ divergence between the two.
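As a rough numerical illustration of the bound, the sketch below assumes Bernoulli arms with $\mu^* < 1$. In that case $\mathcal{K}_{\inf}(\nu_a, \mu^*, \mathcal{D})$ reduces to the Bernoulli KL divergence between the means $\mu_a$ and $\mu^*$. The function names `bernoulli_kl` and `lai_robbins_constant` and the example means are hypothetical, chosen only for this illustration.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    eps = 1e-12  # clip away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_constant(means: list[float]) -> float:
    """Coefficient of ln T in the regret lower bound, assuming Bernoulli arms."""
    best = max(means)
    # Only suboptimal arms (mean strictly below the best) contribute.
    return sum((best - m) / bernoulli_kl(m, best) for m in means if m < best)


if __name__ == "__main__":
    means = [0.5, 0.4, 0.3]  # hypothetical instance
    c = lai_robbins_constant(means)
    for horizon in (10**3, 10**5):
        # Leading term of the lower bound; the o(ln T) remainder is ignored.
        print(horizon, c * math.log(horizon))
```

Arms whose means are close to $\mu^*$ have a small KL divergence and therefore a large contribution to the constant, which matches the intuition that they are hard to tell apart from an optimal arm.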

== Algorithms achieving the Lai–Robbins lower bound ==
Several algorithms are known to achieve the Lai–Robbins asymptotic lower bound under specific assumptions on the reward distribution class $\mathcal{D}$. The following is a non-exhaustive list of algorithms matching the lower bound.

== Extension to other problems ==
=== Structured bandit ===
A more complex setting is the structured bandit, in which the arm means are known to lie in a restricted set. In this case one can prove a smaller lower bound that uses the knowledge of this set.

=== Best arm identification (BAI) ===
A similar result has been proved for best arm identification, which is the same game except that, instead of minimizing the regret, the goal is to identify the best arm with probability $1 - \delta$ using as few rounds as possible.

=== Reinforcement Learning (RL) ===
Similar results have been proved for regret minimization in average-reward reinforcement learning. The order is also $\ln T$, with a constant that depends on the problem.

List of COBOL software and tools

This is a list of software and programming tools for the COBOL programming language, which includes compilers, IDEs, build tools, testing, frameworks, and related projects.

== Compilers and runtimes ==
Fujitsu NetCOBOL – COBOL compiler for Windows, Linux, and mainframes
GnuCOBOL – open-source COBOL compiler translating COBOL to C and then compiling with GCC
IBM COBOL – mainframe COBOL compiler for IBM z/OS and IBM i platforms
Micro Focus COBOL – commercial COBOL compiler and runtime for enterprise systems
FairCom RTG – commercial real-time database and runtime solution developed by FairCom Corporation. It provides integration with COBOL applications for transaction processing and modernization projects, and is used in enterprise environments requiring high-performance data management.

== Integrated development environments ==
Eclipse IDE – with COBOL plugin support, Micro Focus or Bitlang extensions
IBM Developer for z/OS – IDE for COBOL and PL/I mainframe development
Micro Focus Visual COBOL – IDE integration for Visual Studio, Visual Studio Code, and Eclipse
OpenCOBOLIDE – open-source lightweight IDE for GnuCOBOL
Visual Studio Code – with COBOL extensions via Bitlang COBOL and GnuCOBOL Language Server

== Frameworks, libraries, and APIs ==
ACUCOBOL-GT – runtime and API library suite from Micro Focus
CICS – IBM middleware for transaction processing in COBOL applications
DB2 and IMS APIs – database access libraries commonly used with COBOL applications

== Build tools and package managers ==
Apache Ant – scripting and build automation for COBOL/Java hybrid systems
GNU Make – common build tool for compiling COBOL via GnuCOBOL
Jenkins – used for CI/CD automation with COBOL builds

== Testing and quality assurance ==
COBOL Check – open-source unit testing framework for COBOL
IBM Rational Performance Tester – automated performance testing of web and server-based applications from the Rational Software division of IBM
Micro Focus Unit Testing Framework – integrated COBOL unit testing tool

== Debugging and profiling tools ==
GnuCOBOL debug mode – command-line debugging integrated in GnuCOBOL compiler
IBM Debug Tool for z/OS – mainframe debugging for COBOL and PL/I
Micro Focus Animator – step-through debugger for COBOL code

Spell checker

In software, a spell checker (or spelling checker or spell check) is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic dictionary, or search engine. == Design == A basic spell checker carries out the following processes: It scans the text and extracts the words contained in it. It then compares each word with a known list of correctly spelled words (i.e. a dictionary). This might contain just a list of words, or it might also contain additional information, such as hyphenation points or lexical and grammatical attributes. An additional step is a language-dependent algorithm for handling morphology. Even for a lightly inflected language like English, the spell checker will need to consider different forms of the same word, such as plurals, verbal forms, contractions, and possessives. For many other languages, such as those featuring agglutination and more complex declension and conjugation, this part of the process is more complicated. It is unclear whether morphological analysis—allowing for many forms of a word depending on its grammatical role—provides a significant benefit for English, though its benefits for highly synthetic languages such as German, Hungarian, or Turkish are clear. As an adjunct to these components, the program's user interface allows users to approve or reject replacements and modify the program's operation. Spell checkers can use approximate string matching algorithms such as Levenshtein distance to find correct spellings of misspelled words. An alternative type of spell checker uses solely statistical information, such as n-grams, to recognize errors instead of correctly-spelled words. This approach usually requires a lot of effort to obtain sufficient statistical information. Key advantages include needing less runtime storage and the ability to correct errors in words that are not included in a dictionary. 
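The dictionary lookup and approximate-matching steps described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not how any particular spell checker is implemented: it uses a plain set of words as the dictionary, ignores morphology, and ranks suggestions only by Levenshtein distance. The names `levenshtein` and `check` and the `max_distance` parameter are hypothetical.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def check(text: str, dictionary: set[str], max_distance: int = 2) -> dict[str, list[str]]:
    """Map each unknown word to dictionary words within max_distance edits."""
    report: dict[str, list[str]] = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in dictionary or word in report:
            continue
        # Rank every dictionary word by distance; fine for a small word list.
        ranked = sorted((levenshtein(word, w), w) for w in dictionary)
        report[word] = [w for d, w in ranked if d <= max_distance]
    return report


if __name__ == "__main__":
    words = {"the", "quick", "brown", "fox"}
    print(check("The quick brwn fox", words))
```

Comparing each unknown word against the whole dictionary is quadratic in practice and only suitable for small word lists; production checkers typically index the dictionary to narrow the candidate set first.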
In some cases, spell checkers use a fixed list of misspellings and suggestions for those misspellings; this less flexible approach is often used in paper-based correction methods, such as the see also entries of encyclopedias. Clustering algorithms have also been used for spell checking combined with phonetic information. == History == === Pre-PC === In 1961, Les Earnest, who headed the research on this budding technology, saw it necessary to include the first spell checker that accessed a list of 10,000 acceptable words. Ralph Gorin, a graduate student under Earnest at the time, created the first true spelling checker program written as an applications program (rather than research) for general English text: SPELL for the DEC PDP-10 at Stanford University's Artificial Intelligence Laboratory, in February 1971. Gorin wrote SPELL in assembly language, for faster action; he made the first spelling corrector by searching the word list for plausible correct spellings that differ by a single letter or adjacent letter transpositions and presenting them to the user. Gorin made SPELL publicly accessible, as was done with most SAIL (Stanford Artificial Intelligence Laboratory) programs, and it soon spread around the world via the new ARPAnet, about ten years before personal computers came into general use. SPELL, its algorithms and data structures inspired the Unix ispell program. The first spell checkers were widely available on mainframe computers in the late 1970s. A group of six linguists from Georgetown University developed the first spell-check system for the IBM corporation. Henry Kučera invented one for the VAX machines of Digital Equipment Corp in 1981. === Unix === The International Ispell program commonly used in Unix is based on R. E. Gorin's SPELL. It was converted to C by Pace Willisson at MIT. The GNU project has its spell checker GNU Aspell. Aspell's main improvement is that it can more accurately suggest correct alternatives for misspelled English words. 
Due to the inability of traditional spell checkers to check words in complex inflected languages, Hungarian László Németh developed Hunspell, a spell checker that supports agglutinative languages and complex compound words. Hunspell also uses Unicode in its dictionaries. Hunspell replaced the previous MySpell in OpenOffice.org in version 2.0.2. Enchant is another general spell checker, derived from AbiWord. Its goal is to combine programs supporting different languages such as Aspell, Hunspell, Nuspell, Hspell (Hebrew), Voikko (Finnish), Zemberek (Turkish) and AppleSpell under one interface. === PCs === The first spell checkers for personal computers appeared in 1980, such as "WordCheck" for Commodore systems which was released in late 1980 in time for advertisements to go to print in January 1981. Developers such as Maria Mariani and Random House rushed OEM packages or end-user products into the rapidly expanding software market. On the pre-Windows PCs, these spell checkers were standalone programs, many of which could be run in terminate-and-stay-resident mode from within word-processing packages on PCs with sufficient memory. However, the market for standalone packages was short-lived, as by the mid-1980s developers of popular word-processing packages like WordStar and WordPerfect had incorporated spell checkers in their packages, mostly licensed from the above companies, who quickly expanded support from just English to many European and eventually even Asian languages. However, this required increasing sophistication in the morphology routines of the software, particularly with regard to heavily-agglutinative languages like Hungarian and Finnish. Although the size of the word-processing market in a country like Iceland might not have justified the investment of implementing a spell checker, companies like WordPerfect nonetheless strove to localize their software for as many national markets as possible as part of their global marketing strategy. 
When Apple developed "a system-wide spelling checker" for Mac OS X so that "the operating system took over spelling fixes," it was a first: one "didn't have to maintain a separate spelling checker for each" program. Mac OS X's spellcheck coverage includes virtually all bundled and third party applications. Visual Tools' VT Speller, introduced in 1994, was "designed for developers of applications that support Windows." It came with a dictionary but had the ability to build and incorporate use of secondary dictionaries. === Browsers === Web browsers such as Firefox and Google Chrome offer spell checking support, using Hunspell. Prior to using Hunspell, Firefox and Chrome used MySpell and GNU Aspell, respectively. === Specialties === Some spell checkers have separate support for medical dictionaries to help prevent medical errors. == Functionality == The first spell checkers were "verifiers" instead of "correctors." They offered no suggestions for incorrectly spelled words. This was helpful for typos but it was not so helpful for logical or phonetic errors. The challenge the developers faced was the difficulty in offering useful suggestions for misspelled words. This requires reducing words to a skeletal form and applying pattern-matching algorithms. It might seem logical that where spell-checking dictionaries are concerned, "the bigger, the better," so that correct words are not marked as incorrect. In practice, however, an optimal size for English appears to be around 90,000 entries. If there are more than this, incorrectly spelled words may be skipped because they are mistaken for others. For example, a linguist might determine on the basis of corpus linguistics that the word baht is more frequently a misspelling of bath or bat than a reference to the Thai currency. Hence, it would typically be more useful if a few people who write about Thai currency were slightly inconvenienced than if the spelling errors of the many more people who discuss baths were overlooked. 
The first MS-DOS spell checkers were mostly used in proofing mode from within word processing packages. After preparing a document, a user scanned the text looking for misspellings. Later, however, batch processing was offered in such packages as Oracle's short-lived CoAuthor, which allowed a user to view the results after a document was processed and correct only the words that were known to be wrong. When memory and processing power became abundant, spell checking was performed in the background in an interactive way, as in the Spellbound program produced by Sector Software and released in 1987, and in Microsoft Word since Word 95. Spell checkers have become increasingly sophisticated and are now capable of recognizing grammatical errors. However, even at their best, they rarely catch all the errors in a text (such as homophone errors) and will flag neologisms and foreign words as misspellings. Nonetheless, spell checkers can be considered as a type of foreign language writing aid that non-native language lea

Weight initialization

In deep learning, weight initialization or parameter initialization describes the initial step in creating a neural network. A neural network contains trainable parameters that are modified during training: weight initialization is the pre-training step of assigning initial values to these parameters. The choice of weight initialization method affects the speed of convergence, the scale of neural activation within the network, the scale of gradient signals during backpropagation, and the quality of the final model. Proper initialization is necessary for avoiding issues such as vanishing and exploding gradients and activation function saturation.
Note that even though this article is titled "weight initialization", both weights and biases are used in a neural network as trainable parameters, so this article describes how both of these are initialized. Similarly, trainable parameters in convolutional neural networks (CNNs) are called kernels and biases, and this article also describes these.

== Constant initialization ==
We discuss the main methods of initialization in the context of a multilayer perceptron (MLP). Specific strategies for initializing other network architectures are discussed in later sections. For an MLP, there are only two kinds of trainable parameters, called weights and biases. Each layer $l$ contains a weight matrix $W^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l}$ and a bias vector $b^{(l)} \in \mathbb{R}^{n_l}$, where $n_l$ is the number of neurons in that layer. A weight initialization method is an algorithm for setting the initial values for $W^{(l)}, b^{(l)}$ for each layer $l$.
The simplest form is zero initialization:
$$W^{(l)} = 0, \quad b^{(l)} = 0$$
Zero initialization is usually used for initializing biases, but it is not used for initializing weights, as it leads to symmetry in the network, causing all neurons to learn the same features. In this page, we assume $b = 0$ unless otherwise stated.
Recurrent neural networks typically use activation functions with bounded range, such as sigmoid and tanh, since unbounded activation may cause exploding values. Le, Jaitly, and Hinton (2015) suggested initializing weights in the recurrent parts of the network to identity and zero bias, similar to the idea of residual connections and LSTM with no forget gate.
In most cases, the biases are initialized to zero, though some situations can use a nonzero initialization. For example, in multiplicative units, such as the forget gate of LSTM, the bias can be initialized to 1 to allow good gradient signal through the gate. For neurons with ReLU activation, one can initialize the bias to a small positive value like 0.1, so that the gradient is likely nonzero at initialization, avoiding the dying ReLU problem.

== Random initialization ==
Random initialization means sampling the weights from a normal distribution or a uniform distribution, usually independently.

=== LeCun initialization ===
LeCun initialization, popularized in (LeCun et al., 1998), is designed to preserve the variance of neural activations during the forward pass. It samples each entry in $W^{(l)}$ independently from a distribution with mean 0 and variance $1/n_{l-1}$. For example, if the distribution is a continuous uniform distribution, then the distribution is $\mathcal{U}(\pm\sqrt{3/n_{l-1}})$.

=== Glorot initialization ===
Glorot initialization (or Xavier initialization) was proposed by Xavier Glorot and Yoshua Bengio.
It was designed as a compromise between two goals: to preserve activation variance during the forward pass and to preserve gradient variance during the backward pass. For uniform initialization, it samples each entry in $W^{(l)}$ independently and identically from $\mathcal{U}(\pm\sqrt{6/(n_{l-1} + n_l)})$. In this context, $n_{l-1}$ (the number of inputs to the layer) is also called the "fan-in", and $n_l$ (the number of outputs) the "fan-out", consistent with $W^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l}$. When the fan-in and fan-out are equal, Glorot initialization is the same as LeCun initialization.

=== He initialization ===
As Glorot initialization performs poorly for ReLU activation, He initialization (or Kaiming initialization) was proposed by Kaiming He et al. for networks with ReLU activation. It samples each entry in $W^{(l)}$ from $\mathcal{N}(0, 2/n_{l-1})$.

=== Orthogonal initialization ===
Saxe et al. (2013) proposed orthogonal initialization: initializing weight matrices as uniformly random (according to the Haar measure) semi-orthogonal matrices, multiplied by a factor that depends on the activation function of the layer. It was designed so that if one initializes a deep linear network this way, then its training time until convergence is independent of depth.
Sampling a uniformly random semi-orthogonal matrix can be done by initializing $X$ by IID sampling its entries from a standard normal distribution, then calculating $\left(XX^{\top}\right)^{-1/2}X$ or its transpose, depending on whether $X$ is tall or wide.
For CNN kernels with odd widths and heights, orthogonal initialization is done this way: initialize the central point by a semi-orthogonal matrix, and fill the other entries with zero.
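The variance-scaling schemes described above (LeCun, Glorot, and He) differ only in the scale of the sampling distribution, which the following sketch makes explicit. It is a minimal pure-Python illustration, not the code of any deep learning library; the function names, the list-of-lists matrix representation, and the `rng` argument are assumptions made for this example. Each matrix has shape fan-in by fan-out, matching the shape of $W^{(l)}$ given earlier.

```python
import math
import random


def lecun_uniform(fan_in: int, fan_out: int, rng: random.Random) -> list[list[float]]:
    """U(-b, b) with b = sqrt(3 / fan_in), so each entry has variance 1 / fan_in."""
    bound = math.sqrt(3.0 / fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)] for _ in range(fan_in)]


def glorot_uniform(fan_in: int, fan_out: int, rng: random.Random) -> list[list[float]]:
    """U(-b, b) with b = sqrt(6 / (fan_in + fan_out)), variance 2 / (fan_in + fan_out)."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)] for _ in range(fan_in)]


def he_normal(fan_in: int, fan_out: int, rng: random.Random) -> list[list[float]]:
    """N(0, 2 / fan_in), the scale proposed for ReLU networks."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def empirical_variance(matrix: list[list[float]]) -> float:
    """Population variance of all entries, used here as a sanity check."""
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    for init in (lecun_uniform, glorot_uniform, he_normal):
        w = init(200, 300, rng)
        print(init.__name__, empirical_variance(w))
```

The printed variances should be close to 1/200, 2/500, and 2/200 respectively; the factor 3 in the uniform bounds comes from the variance $b^2/3$ of $\mathcal{U}(-b, b)$.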
As an illustration, a kernel $K$ of shape $3 \times 3 \times c \times c'$ is initialized by filling $K[2,2,:,:]$ with the entries of a random semi-orthogonal matrix of shape $c \times c'$, and the other entries with zero. Balduzzi et al. (2017) used it with stride 1 and zero-padding. This is sometimes called the Orthogonal Delta initialization.
Related to this approach, unitary initialization proposes to parameterize the weight matrices to be unitary matrices, with the result that at initialization they are random unitary matrices (and throughout training, they remain unitary). This is found to improve long-sequence modelling in LSTM.
Orthogonal initialization has been generalized to layer-sequential unit-variance (LSUV) initialization. It is a data-dependent initialization method, and can be used in convolutional neural networks. It first initializes weights of each convolution or fully connected layer with orthonormal matrices. Then, proceeding from the first to the last layer, it runs a forward pass on a random minibatch, and divides the layer's weights by the standard deviation of its output, so that its output has variance approximately 1.

=== Fixup initialization ===
In 2015, the introduction of residual connections allowed very deep neural networks to be trained, much deeper than the ~20 layers of the previous state of the art (such as the VGG-19). Residual connections gave rise to their own weight initialization problems and strategies. These are sometimes called "normalization-free" methods, since using residual connection could stabilize the training of a deep neural network so much that normalizations become unnecessary.
Fixup initialization is designed specifically for networks with residual connections and without batch normalization, as follows:
Initialize the classification layer and the last layer of each residual branch to 0.
Initialize every other layer using a standard method (such as He initialization), and scale only the weight layers inside residual branches by $L^{-\frac{1}{2m-2}}$, where $L$ is the number of residual branches and $m$ is the number of layers in each branch.
Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each convolution, linear, and element-wise activation layer.
Similarly, T-Fixup initialization is designed for Transformers without layer normalization.

=== Others ===
Instead of initializing all weights with random values on the order of $O(1/\sqrt{n})$, sparse initialization initializes only a small subset of the weights with larger random values, and sets the other weights to zero, so that the total variance is still on the order of $O(1)$.
Random walk initialization was designed for MLP so that during backpropagation, the L2 norm of gradient at each layer performs an unbiased random walk as one moves from the last layer to the first.
Looks linear initialization was designed to allow the neural network to behave like a deep linear network at initialization, since $W\,\mathrm{ReLU}(x) - W\,\mathrm{ReLU}(-x) = Wx$. It initializes a matrix $W$ of shape $\mathbb{R}^{\frac{n}{2} \times m}$ by any method, such as orthogonal initialization, t

Ultra Hal

Ultra Hal is a chatbot intended to function as a virtual assistant. It was developed by Zabaware, Inc. Ultra Hal uses a natural language interface with animated characters using speech synthesis. Users can communicate with the chatterbot via typing or via a speech recognition engine. It utilizes the WordNet lexical dictionary. Its name is an allusion to HAL 9000, the artificial intelligence from the movie 2001: A Space Odyssey. Ultra Hal won the 2007 Loebner Prize for "most human" chatterbot.

The Future of Work and Death

The Future of Work and Death is a 2016 documentary by Sean Blacknell and Wayne Walsh about the exponential growth of technology. The film showed at several film festivals including Raindance Film Festival, International Film Festival Rotterdam, Academia Film Olomouc and CPH:DOX. In May 2017 it received an official screening at the European Commission. It was distributed by First Run Features and Journeyman Pictures and was released on iTunes, Amazon Prime and On-demand on 9 May 2017. The film was made available on Sundance Now on 27 November 2017. A companion piece to the film, The Cost of Living, a documentary concerning universal basic income in Britain, was released on Amazon Prime on 8 October 2020.

== Synopsis ==
World experts in the fields of futurology, anthropology, neuroscience, and philosophy consider the impact of technological advances on the two 'certainties' of human life: work and death. Charting human developments from Homo habilis, past the Industrial Revolution, to the digital age and beyond, the film looks at the shocking exponential rate at which mankind has managed to create technologies to ease the process of living. As we embark on the next phase of our adaptation, with automation and artificial intelligence signifying the complete move from man to machine, the film asks what the implications are for human fulfilment in an approaching era of job obsolescence and extreme longevity.

== Cast ==
Dudley Sutton – Narrator
Aubrey de Grey – Biomedical gerontologist and CSO of the SENS Research Foundation
Will Self – Writer, journalist, political commentator and Professor of Contemporary Thought at Brunel University
Rudolph E. Tanzi – Professor of Neurology at Harvard University and Director of the Genetics and Aging Research Unit at Massachusetts General Hospital (MGH)
Martin Ford – Futurist and author
Steve Fuller – Auguste Comte Chair in Social Epistemology at the Department of Sociology at the University of Warwick
Murray Shanahan – Professor of Cognitive Robotics at Imperial College London
Gray Scott – Futurist, executive producer of this production
Vivek Wadhwa – Entrepreneur, academic and Director of Research at the Center for Entrepreneurship and Research Commercialization at the Pratt School of Engineering, Duke University
Zoltan Istvan – Transhumanist and journalist
Joanna Cook – Anthropologist, University College London
Nicholas Kamara – Physician, Kable Hospital
David Pearce – Transhumanist philosopher and co-founder of Humanity+
Peter Cochrane – Futurist and entrepreneur
John Harris – Bioethicist, philosopher and Director of the Institute for Science, Ethics and Innovation at the University of Manchester
Riva Melissa-Tez – Entrepreneur and transhumanist
Ian Pearson – Futurologist
Stuart Armstrong – Artificial intelligence researcher at Future of Humanity Institute

Image destriping

Image destriping is the process of removing stripes or streaks from images and videos without disrupting the original image or video. These artifacts plague a range of fields in scientific imaging, including atomic force microscopy, light sheet fluorescence microscopy, and planetary satellite imaging. The most common image processing technique for reducing stripe artifacts is Fourier filtering. Unfortunately, filtering methods risk altering or suppressing useful image data. Methods developed for multiple-sensor imaging systems in planetary satellites use statistics-based approaches to match the signal distribution across multiple sensors. More recently, a new class of approaches leverages compressed sensing to regularize an optimization problem and recover stripe-free images. In many cases, these destriped images have little to no artifacts, even at low signal-to-noise ratios.
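The statistics-based idea of matching signal distributions across sensors can be sketched with simple moment matching. The example below is a minimal illustration under strong assumptions, not a method taken from any specific instrument pipeline: each image column is assumed to come from one sensor, all sensors are assumed to view statistically similar scenes, and only the mean and standard deviation are matched. The function name `destripe_moment_matching` is hypothetical.

```python
from statistics import fmean, pstdev


def destripe_moment_matching(image: list[list[float]]) -> list[list[float]]:
    """Rescale each column so its mean and standard deviation match the whole image.

    `image` is a list of rows; each column is treated as the output of one sensor.
    """
    columns = [list(col) for col in zip(*image)]
    pixels = [v for row in image for v in row]
    ref_mean, ref_std = fmean(pixels), pstdev(pixels)

    corrected_columns = []
    for col in columns:
        col_mean, col_std = fmean(col), pstdev(col)
        # A constant column has no spread to rescale, so leave its gain at 1.
        gain = ref_std / col_std if col_std > 0 else 1.0
        corrected_columns.append([(v - col_mean) * gain + ref_mean for v in col])

    # Transpose back to a list of rows.
    return [list(row) for row in zip(*corrected_columns)]


if __name__ == "__main__":
    # Synthetic ramp image with a different offset added to each column.
    offsets = (0.0, 5.0, -3.0)
    striped = [[float(r) + off for off in offsets] for r in range(10)]
    cleaned = destripe_moment_matching(striped)
    print([round(fmean(col), 6) for col in zip(*cleaned)])
```

Because the correction forces every column toward the same statistics, it can also remove genuine column-to-column variation in the scene, which is one instance of the risk of suppressing useful image data noted above.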