Recurrent Neural Networks (RNNs) are artificial neural networks that contain at least one loop that allows the network’s internal states to be dynamically influenced by its own previous internal states, in addition to any new external input. Formally, containing loops makes RNNs cyclic graphs; in this way, RNNs contrast with feedforward networks, which are acyclic graphs without loops. While RNNs are strongly associated with sequence processing, feedback networks that process static inputs (like images) iteratively over time are also RNNs. Crucially, RNNs can model temporal dependencies, where current states or inputs depend on previous ones. RNNs have advanced theoretical understanding in cognitive science in domains including learning, memory, recognition, movement, and language.
History
In their early study of mathematical models of neural networks, McCulloch and Pitts (1943) laid out logical and computational differences between what they called networks with and without circles (i.e., cycles or loops) and emphasized how recurrence could enable sustained pattern activation—allowing a system to retain a pattern over time for future processing. While scientists experimented with adding recurrent connections to perceptrons (single-layer feedforward networks developed by Rosenblatt, 1958), Hopfield (1982) made a fundamental advance when he extended physics concepts (spin glasses, disordered magnetic systems studied in statistical mechanics, and key aspects of dynamical systems) to simple learning networks. Hopfield networks—single-layer systems in which typically all units are connected—store information in a way that allows the system to recover full patterns from partial input. This behavior includes association, error correction, and constraint satisfaction and relies on a learning rule where connections are strengthened between units that are active at the same time.
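The pattern-completion behavior described above can be illustrated with a minimal sketch. It assumes units coded as +1/-1, a Hebbian outer-product learning rule, and asynchronous updates; the function names and the two toy patterns are hypothetical and are not taken from Hopfield (1982).

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian rule: strengthen connections between units that are co-active."""
    patterns = np.asarray(patterns, dtype=float)
    n_units = patterns.shape[1]
    weights = patterns.T @ patterns / n_units
    np.fill_diagonal(weights, 0.0)  # no self-connections
    return weights

def recall(weights, state, max_sweeps=10):
    """Update units one at a time until the state stops changing."""
    state = np.array(state, dtype=float)
    for _ in range(max_sweeps):
        previous = state.copy()
        for i in range(len(state)):
            state[i] = 1.0 if weights[i] @ state >= 0 else -1.0
        if np.array_equal(state, previous):
            break  # settled into an attractor
    return state

# Two orthogonal 8-unit patterns (hypothetical toy data, +1/-1 coding).
stored = np.array([
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
])
W = train_hopfield(stored)

# Corrupt the first pattern by flipping one unit, then let the network settle.
probe = stored[0].copy()
probe[0] = -1
recovered = recall(W, probe)
```

In this toy case the corrupted probe settles back onto the first stored pattern, which is the sense in which the network performs error correction.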
Boltzmann machines (e.g., Hinton & Sejnowski, 1986) laid the groundwork for advances in unsupervised learning and deep generative models, influencing the development of contemporary neural networks. These extend Hopfield’s ideas by incorporating probabilistic learning, enabling them to perform tasks like optimization and probabilistic inference. Both Hopfield networks and Boltzmann machines are attractor networks: Hopfield networks’ states evolve iteratively toward local minima, which correspond to stored patterns (memories). Boltzmann machine attractors are states with high probability (low energy) in the network’s energy landscape. The network probabilistically explores this landscape, sampling from various attractor basins over time instead of deterministically settling into one.
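The contrast between deterministic settling and probabilistic exploration can be sketched as follows. This is only an illustration under stated assumptions: binary 0/1 units, symmetric weights, a temperature of 1, and Gibbs sampling as the update scheme. The three-unit network and all names are hypothetical, and no learning is shown.

```python
from collections import Counter

import numpy as np

def energy(weights, biases, state):
    """Energy of a binary (0/1) state; lower energy means higher probability."""
    return -0.5 * state @ weights @ state - biases @ state

def gibbs_sweep(weights, biases, state, rng, temperature=1.0):
    """Resample each unit in turn from its conditional distribution."""
    state = state.copy()
    for i in range(len(state)):
        net_input = weights[i] @ state + biases[i]
        p_on = 1.0 / (1.0 + np.exp(-net_input / temperature))
        state[i] = 1.0 if rng.random() < p_on else 0.0
    return state

rng = np.random.default_rng(0)
# Hypothetical network: units 0 and 1 excite each other; both inhibit unit 2.
W = np.array([[0.0, 2.0, -2.0],
              [2.0, 0.0, -2.0],
              [-2.0, -2.0, 0.0]])
b = np.zeros(3)

state = np.zeros(3)
visits = Counter()
for _ in range(2000):
    state = gibbs_sweep(W, b, state, rng)
    visits[tuple(int(s) for s in state)] += 1

most_visited = visits.most_common(1)[0][0]
```

The sampler visits several states rather than stopping at one, but it spends most of its time in the lowest-energy configuration, here the state in which the two mutually excitatory units are on.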
Such networks provided a key foundation for the connectionist revolution, ushered in by networks in the early- and mid-1980s that simulated complex processes such as written (McClelland & Rumelhart, 1981) and spoken (McClelland & Elman, 1986) word recognition. In these interactive activation models, activation spreads vertically (via bottom-up and top-down excitatory and/or inhibitory connections) and laterally (typically via inhibitory connections). Interactive activation is held to be a key computational foundation allowing humans to “approximate...optimal perceptual inference in real time” (McClelland et al., 2014).
Backpropagation algorithms (Rumelhart et al., 1986) paved the way for a new wave of learning models. Backpropagation assigns credit and blame to weights in a neural network based on errors, making typically tiny weight modifications that lead the network to reduce error. Jordan (1986) and Elman (1990) recognized that a partially recurrent architecture could be trained via backpropagation to process sequential information. In Simple Recurrent Networks (SRNs), later states are passed back to form part of the input for a lower level. In a Jordan network, output states are copied to a context layer at the next time step, with fully connected, trainable weights from context nodes to hidden nodes. In an Elman SRN, the previous hidden states are copied to the context nodes. At each step, the input includes bottom-up input and context states. Regular backpropagation can be used to train all weights, including those from context to hidden nodes. Elman et al. (1996) provided a sweeping reconsideration of many problems in cognitive development, showing how the ability of SRNs to learn temporal dependencies could address development in language, motor control, memory, and decision-making.
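The difference between the two copy-back schemes can be made concrete with a short forward-pass sketch. It assumes tanh hidden units, linear outputs, and random untrained weights; these choices, and all names, are illustrative rather than those of Jordan (1986) or Elman (1990).

```python
import numpy as np

def srn_forward(inputs, W_in, W_ctx, W_out, copy_from="hidden"):
    """Run a simple recurrent network over a sequence of input vectors.

    copy_from="hidden" gives an Elman-style network (context <- hidden states);
    copy_from="output" gives a Jordan-style network (context <- output states).
    """
    context = np.zeros(W_ctx.shape[1])  # context units start at zero
    outputs = []
    for x in inputs:
        # The hidden layer sees the bottom-up input plus the context states.
        hidden = np.tanh(W_in @ x + W_ctx @ context)
        output = W_out @ hidden
        # Fixed copy-back connection: store a state for the next time step.
        context = hidden if copy_from == "hidden" else output
        outputs.append(output)
    return np.array(outputs)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 5, 2
W_in = rng.normal(size=(n_hidden, n_in))
W_out = rng.normal(size=(n_out, n_hidden))
W_ctx_elman = rng.normal(size=(n_hidden, n_hidden))  # context -> hidden
W_ctx_jordan = rng.normal(size=(n_hidden, n_out))    # context -> hidden

# The same input vector appears at steps 0 and 2.
a, b = np.eye(n_in)[0], np.eye(n_in)[1]
sequence = [a, b, a]
elman_out = srn_forward(sequence, W_in, W_ctx_elman, W_out, "hidden")
jordan_out = srn_forward(sequence, W_in, W_ctx_jordan, W_out, "output")
```

Because the context differs at steps 0 and 2, the same input produces different outputs at those steps. With the context weights set to zero, the network reduces to a feedforward mapping and the two outputs coincide.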
The development of the backpropagation through time (BPTT) algorithm (Williams & Zipser, 1989) provided a method to train networks over arbitrarily many time steps (Figure 1). This innovation allowed for the development of RNNs with greater memory capacity than SRNs as well as new kinds of networks with banks of recurrent ‘cleanup’ units that allowed for pattern separation via attractors (Hinton & Shallice, 1991). This approach has been extended to complex domains such as the Triangle Model of Reading (Harm & Seidenberg, 2004).
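A minimal sketch of the idea behind BPTT is to unroll the network over the sequence, then pass error backward through every time step while accumulating gradients for the shared weights. The sketch assumes tanh hidden units, linear outputs, and a squared-error loss; the function names and sizes are hypothetical, and the finite-difference comparison at the end is only a sanity check on the gradient.

```python
import numpy as np

def forward(xs, targets, W_x, W_h, W_y):
    """Unroll the network over the sequence; return the loss and stored states."""
    hs = [np.zeros(W_h.shape[0])]  # hs[t] is the hidden state entering step t
    ys = []
    for x in xs:
        hs.append(np.tanh(W_x @ x + W_h @ hs[-1]))
        ys.append(W_y @ hs[-1])
    loss = 0.5 * sum(np.sum((y - t) ** 2) for y, t in zip(ys, targets))
    return loss, hs, ys

def bptt(xs, targets, W_x, W_h, W_y):
    """Accumulate gradients for the shared weights across all time steps."""
    loss, hs, ys = forward(xs, targets, W_x, W_h, W_y)
    dW_x, dW_h, dW_y = np.zeros_like(W_x), np.zeros_like(W_h), np.zeros_like(W_y)
    dh_next = np.zeros(W_h.shape[0])  # error arriving from later time steps
    for t in reversed(range(len(xs))):
        dy = ys[t] - targets[t]
        dW_y += np.outer(dy, hs[t + 1])
        dh = W_y.T @ dy + dh_next           # error from this step plus the future
        dpre = dh * (1.0 - hs[t + 1] ** 2)  # back through the tanh nonlinearity
        dW_x += np.outer(dpre, xs[t])
        dW_h += np.outer(dpre, hs[t])
        dh_next = W_h.T @ dpre              # pass error to the previous step
    return loss, dW_x, dW_h, dW_y

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, n_steps = 3, 4, 2, 5
W_x = 0.5 * rng.normal(size=(n_hidden, n_in))
W_h = 0.5 * rng.normal(size=(n_hidden, n_hidden))
W_y = 0.5 * rng.normal(size=(n_out, n_hidden))
xs = rng.normal(size=(n_steps, n_in))
targets = rng.normal(size=(n_steps, n_out))

loss, dW_x, dW_h, dW_y = bptt(xs, targets, W_x, W_h, W_y)

# Finite-difference estimate for one recurrent weight, for comparison.
eps = 1e-5
W_plus, W_minus = W_h.copy(), W_h.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (forward(xs, targets, W_x, W_plus, W_y)[0]
           - forward(xs, targets, W_x, W_minus, W_y)[0]) / (2 * eps)
```

The term `dh_next` is what distinguishes this from ordinary backpropagation: error at a late step contributes to the gradient of weights used at every earlier step.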
RNNs suffer from an important limitation: As we apply BPTT over longer sequences, gradients used for learning can either shrink toward zero (vanish) or grow uncontrollably (explode), making it increasingly difficult to assign credit or blame to appropriate units at appropriate time steps. This problem motivated the innovation of Long Short-Term Memory nodes (Hochreiter & Schmidhuber, 1997), which have internal structure that allows each unit to develop different temporal sensitivities. These include a memory cell and gates that modulate how much influence is given to external inputs, the unit’s own memory cell, and the strength of its output. LSTMs mitigate gradient problems. LSTMs are being used productively as models of a variety of cognitive domains, such as developing fairly simple models of human speech recognition that operate on real speech (Magnuson et al., 2020) [see Speech Recognition].
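The gating described above can be sketched as a single LSTM time step. The sketch assumes the now-common formulation with input, forget, and output gates, which may differ in detail from the original proposal; the stacked weight layout and all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x, h_prev] to four stacked gate pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:n])          # input gate: influence of new input
    f = sigmoid(z[n:2 * n])      # forget gate: influence of the stored memory
    o = sigmoid(z[2 * n:3 * n])  # output gate: strength of the unit's output
    g = np.tanh(z[3 * n:4 * n])  # candidate value to write into memory
    c = f * c_prev + i * g       # updated memory cell
    h = o * np.tanh(c)           # gated output
    return h, c

# Hypothetical setting: biases hold the input gate closed and the forget gate open.
n_in, n_hidden = 2, 3
W = np.zeros((4 * n_hidden, n_in + n_hidden))
b = np.concatenate([
    np.full(n_hidden, -20.0),  # input gate nearly closed
    np.full(n_hidden, 20.0),   # forget gate nearly fully open
    np.zeros(n_hidden),        # output gate half open
    np.zeros(n_hidden),        # candidate pre-activation
])
h = np.zeros(n_hidden)
c_start = np.array([0.5, -1.0, 2.0])
c = c_start.copy()
for _ in range(100):
    h, c = lstm_step(np.ones(n_in), h, c, W, b)
```

With the forget gate held open and the input gate held closed, the memory cell is nearly unchanged after 100 steps. A trained network sets these gates from its inputs, so that different units can retain information over different time spans.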

Figure 1. Several types of RNNs. 'Simple attractor networks' include Hopfield networks and Boltzmann machines. Circles indicate nodes. White circles in the simple attractor network are 'visible,' input-receiving nodes; grey circles are hidden nodes. Such networks settle to attractor states as a function of their inputs and weights. In other diagrams, rectangles indicate banks of (unspecified numbers of) nodes. Arrows typically indicate full connectivity with tuneable weights (except for Interactive Activation); knobs indicate inhibitory connections. Dashed arrows indicate fixed, 'copyback' connections that copy state values from the previous timestep onto context units. Ovals with recurrent connections in the triangle model depict cleanup units, which create attractors that enable separation of similar patterns. Models on the top are mainly attractor networks (except Interactive Activation, though it has some related dynamical properties). Models on the bottom are typically applied to learning sequences (e.g., next-word prediction). In the fully recurrent network, backpropagation through time is used to train the hidden-to-hidden (recurrent) weights based on possibly many preceding time steps.
Core concepts
Interaction
Interactive activation models of word recognition naturally account for top-down effects. A key top-down effect is word superiority: Letters or phonemes are detected more readily in a word context than in isolation or a nonword context (Reicher, 1969; Rubin et al., 1976). In an interactive activation model, this phenomenon follows from letters and phonemes receiving top-down support from words that contain them in addition to bottom-up support. The joint effects of feedback and lateral inhibition make interactive activation models robust: Given noisy inputs, even a slight bottom-up advantage for a word allows that word to inhibit similar words, which moderates the pattern of lexical-to-form feedback. At the intermediate level (letters or phonemes), lateral inhibition does the same. Over time, recurrent feedback and inhibition together effectively denoise inputs (Magnuson et al., 2024).
Self-supervised learning
Elman (1990) used next-element prediction as a self-supervised method for SRN training. For example, for sentence processing, the current input is the current word in a sentence, and the training target is the next word in the sequence [see Sentence Processing; Computational Models of Language Learning]. The network ‘predicts’ the next word, and then the observed word is used to calculate error. Context layers allow SRNs to develop sensitivity to nonadjacent, long-distance dependencies (e.g., predicting singular-inflected verbs after BOY CHASED BY DOGS [RUNS, WALKS, matching BOY] but plural after BOYS CHASED BY DOGS ...). The network can learn to preserve traces of previous states that guide such predictions.
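The self-supervised setup can be sketched by pairing each word with its successor as the training target. The toy sentence, the one-hot coding, and the use of cross-entropy as the error measure are assumptions made for illustration, not details taken from Elman (1990).

```python
import numpy as np

sentence = ["boy", "chased", "by", "dogs", "runs"]  # hypothetical toy sequence
vocab = sorted(set(sentence))
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Represent a word as a vector with a single active unit."""
    vector = np.zeros(len(vocab))
    vector[index[word]] = 1.0
    return vector

# Each training pair is (current word as input, index of the next word as target).
pairs = [(one_hot(cur), index[nxt]) for cur, nxt in zip(sentence, sentence[1:])]

def prediction_error(logits, target_index):
    """Cross-entropy between the network's softmax prediction and the observed word."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_index]

# A network with no knowledge assigns equal scores to every word.
uninformed_error = prediction_error(np.zeros(len(vocab)), pairs[0][1])
```

No labels beyond the sequence itself are needed: the observed next word supplies the teaching signal, and the error shrinks as the network's predicted distribution concentrates on words that actually follow.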
Questions, controversies, and new developments
Biological plausibility
SRNs and LSTMs are unrealistically limited in what they can learn (assuming practical limits on time and computational resources; Delétang et al., 2022). They also suffer from catastrophic interference (training on new materials overwrites previous learning), pointing to the need to develop more complex, biologically inspired models.
Interpretability
Even very simple recurrent models defy easy explanation. RNNs, like other learning models, often exhibit unexpected emergent behavior. For example, the LSTM-based EARSHOT model (Magnuson et al., 2020) maps speech (spectral slices) to semantic vectors via a hidden layer without any explicit phonetic training. Remarkably, it develops an emergent phonetic coding (some hidden nodes learn to selectively respond to different classes of consonants and vowels) in its hidden LSTM layer that resembles phonetically organized cortical responses to speech. However, other hidden nodes develop complex response properties that defy ready interpretation, limiting the conclusions that can be drawn. RNNs demonstrate what may be possible for recurrent systems to learn, but a large gap remains: how to establish whether or not computations homologous or analogous to those that emerge in an RNN may be implemented in the brain.
Broader connections
Contemporary Large Language Models (LLMs) like ChatGPT (Radford et al., 2018) leverage a special ‘attention’ mechanism in feedforward transformer architectures. Self-attention models dependencies across a large chunk of a sequence in parallel rather than step by step, which mitigates the limitations of recurrence [see Large Language Models]. Although next-word prediction (Elman, 1990) remains a core underlying principle (among others) for training LLMs, a fundamental difference compared to RNNs is their simultaneous access to vast quantities of preceding and following context. That is, while information at position p within a chunk is processed, the model has access to substantial preceding and following information within the chunk, whereas a sequence-processing model would have access to information prior to p but not to future information.
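The difference in available context can be illustrated with a sketch of single-head scaled dot-product self-attention. The optional causal mask, the random weights, and all names are assumptions for illustration; the text does not specify whether a given model masks future positions.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v, causal=False):
    """Scaled dot-product self-attention over all positions at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    if causal:
        # Hide future positions so position p attends only to positions <= p.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_positions, d_model = 4, 8
X = rng.normal(size=(n_positions, d_model))  # one vector per position in the chunk
W_q, W_k, W_v = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                 for _ in range(3))

full_out, full_w = self_attention(X, W_q, W_k, W_v, causal=False)
causal_out, causal_w = self_attention(X, W_q, W_k, W_v, causal=True)
```

Without the mask, every position receives nonzero weight from every other position, including later ones. With the mask, the weights above the diagonal are zero, which mimics the past-only access of a sequence-processing model while still computing all positions in parallel.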
Despite the difficulties in understanding how RNNs operate ‘under the hood,’ RNNs (and LLMs) can productively be compared to fine-grained time-course measures of neural activity, prompting new hypotheses about the hierarchical organization of neural systems in domains such as language and vision (e.g., Schrimpf et al., 2021; Brodbeck et al., 2024).
Acknowledgments
Preparation of this article was supported in part by U.S. National Science Foundation grant BCS-PAC 2043903, by Basque Government support to BCBL through the BERC 2022-2025 program, and by the Spanish State Research Agency through BCBL Severo Ochoa excellence accreditation CEX2020-001010-S and project PID2023-149585NB-I00.
Further reading
Elman, J., Bates, E., Johnson, M. H., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking innateness: A connectionist perspective on development. MIT Press. https://doi.org/10.7551/mitpress/5929.001.0001
Kar, K., Kornblith, S., & Fedorenko, E. (2022). Interpretability of artificial neural network models in artificial intelligence versus neuroscience. Nature Machine Intelligence, 4(12), 1065–1067. https://doi.org/10.1038/s42256-022-00592-3
Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87. https://doi.org/10.1038/4580
References
Brodbeck, C., Hannagan, T., & Magnuson, J. S. (2024). Recurrent neural networks as neuro-computational models of human speech recognition. bioRxiv. https://doi.org/10.1101/2024.02.20.580731
Delétang, G., Ruoss, A., Grau-Moya, J., Genewein, T., Wenliang, L. T., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., & Ortega, P. A. (2022). Neural networks and the Chomsky hierarchy. arXiv. https://doi.org/10.48550/arXiv.2207.02098
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211. https://doi.org/10.1207/s15516709cog1402_1
Elman, J., Bates, E., Johnson, M. H., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking innateness: A connectionist perspective on development. MIT Press. https://doi.org/10.7551/mitpress/5929.001.0001
Harm, M. W., & Seidenberg, M. S. (2004). Computing the meanings of words in reading: Cooperative division of labor between visual and phonological processes. Psychological Review, 111(3), 662–720. https://doi.org/10.1037/0033-295X.111.3.662
Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 282–317). MIT Press.
Hinton, G. E., & Shallice, T. (1991). Lesioning an attractor network: Investigations of acquired dyslexia. Psychological Review, 98(1), 74–95. https://doi.org/10.1037/0033-295X.98.1.74
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), 2554–2558. https://doi.org/10.1073/pnas.79.8.2554
Jordan, M. I. (1986). Serial order: A parallel distributed processing approach (Technical report ICS 8604). Institute for Cognitive Science, University of California.
Magnuson, J. S., Crinnion, A. M., Luthra, S., Gaston, P., & Grubb, S. (2024). Contra assertions, feedback improves word recognition: How feedback and lateral inhibition sharpen signals over noise. Cognition, 242, 105661. https://doi.org/10.1016/j.cognition.2023.105661
Magnuson, J. S., You, H., Luthra, S., Li, M., Nam, H., Escabí, M., Brown, K., Allopenna, P. D., Theodore, R. M., Monto, N., & Rueckl, J. G. (2020). EARSHOT: A minimal neural network model of incremental human speech recognition. Cognitive Science, 44, e12823. https://doi.org/10.1111/cogs.12823
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86. https://doi.org/10.1016/0010-0285(86)90015-0
McClelland, J. L., Mirman, D., Bolger, D. J., & Khaitan, P. (2014). Interactive activation and mutual constraint satisfaction in perception and cognition. Cognitive Science, 38(6), 1139–1189. https://doi.org/10.1111/cogs.12146
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88(5), 375–407. https://doi.org/10.1037//0033-295X.88.5.375
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5, 115–133. https://doi.org/10.1007/BF02478259
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. https://openai.com/index/language-unsupervised
Reicher, G. M. (1969). Perceptual recognition as a function of meaningfulness of stimulus material. Journal of Experimental Psychology, 81(2), 275–280. https://doi.org/10.1037/h0027768
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. https://doi.org/10.1037/h0042519
Rubin, P., Turvey, M. T., & Van Gelder, P. (1976). Initial phonemes are detected faster in spoken words than in spoken nonwords. Perception & Psychophysics, 19, 394–398. https://doi.org/10.3758/BF03199398
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0
Schrimpf, M., Blank, I. A., Tuckute, G., Kauf, C., Hosseini, E. A., Kanwisher, N., ... & Fedorenko, E. (2021). The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45), e2105646118. https://doi.org/10.1073/pnas.2105646118
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270–280. https://doi.org/10.1162/neco.1989.1.2.270