
Computational Approaches to Morphology

Summary and Keywords

Computational psycholinguistics has a long history of investigation and modeling of morphological phenomena. Several computational models have been developed to deal with the processing and production of morphologically complex forms and with the relation between linguistic morphology and psychological word representations. Historically, most of this work has focused on modeling the production of inflected word forms, leading to the development of models based on connectionist principles and other data-driven models such as Memory-Based Language Processing (MBLP), Analogical Modeling of Language (AM), and Minimal Generalization Learning (MGL). In the context of inflectional morphology, these computational approaches have played an important role in the debate between single and dual mechanism theories of cognition. Taking a different angle, computational models based on distributional semantics have been proposed to account for several phenomena in morphological processing and composition. Finally, although several computational models of reading have been developed in psycholinguistics, none of them have satisfactorily addressed the recognition and reading aloud of morphologically complex forms.

Keywords: morphology, word recognition, inflection, distributional semantics, connectionism, exemplar-based, compounds, naive discriminative learning, rules, dual mechanism

Two broad categories of questions in morphology can be approached using computational techniques. In the area of computational linguistics, computational techniques can be used to address problems such as identifying a word’s morphological constituents in automatic processing of text. In the field of psycholinguistics, computational techniques can help to understand how morphology is psychologically represented. After a very brief overview of the role of morphology in computational linguistics, the majority of this article will be devoted to computational approaches to morphology in psycholinguistics.

1. Morphology in Computational Linguistics

In computational linguistics, morphological analysis primarily serves to solve practical problems. For instance, in English, compound words can be written without a space, with a dash, or with a space. Tokenization processes that consider anything delimited by spaces or punctuation as a word obviously end up considering different parts of spaced compounds as different tokens. A basic solution to this is to use a curated index of compound words. More advanced solutions consist of identifying compounds based on n‑gram statistics so that only likely compounds are retained (e.g., Su, Wu, & Chang, 1994). In automatic text processing, it is also useful to have methods to identify whether several morphological forms derive from the same lemma. A straightforward strategy for lemmatization is to use tables that list form–lemma correspondences. However, especially with highly inflected languages, the number of potential inflected forms can become so large that table-based lookup becomes impractical or impossible. In addition, lemmatization based on lookup can never be exhaustive, as new words are coined all the time and inflection of novel forms is a highly productive process. A solution consists of using stemming algorithms to remove inflectional affixes from forms to obtain a matching lemma (Porter, 2006; Paice, 1994).
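
As a rough illustration of the lookup-with-fallback strategy just described, the sketch below combines a hypothetical form–lemma table with a crude suffix-stripping rule; a production system would use a curated table and a full stemmer such as Porter's rather than these toy rules.

```python
# Minimal sketch of lemmatization by table lookup with a suffix-stripping
# fallback. The table and the suffix rules are toy examples, not a real
# stemmer such as Porter's.

FORM_TO_LEMMA = {"geese": "goose", "children": "child", "ran": "run"}

# Very crude inflectional suffix-stripping rules (longest match first).
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def lemmatize(form: str) -> str:
    """Return a lemma by lookup; fall back to naive suffix stripping."""
    if form in FORM_TO_LEMMA:
        return FORM_TO_LEMMA[form]
    for suffix, replacement in SUFFIX_RULES:
        if form.endswith(suffix) and len(form) > len(suffix) + 2:
            return form[: -len(suffix)] + replacement
    return form

print(lemmatize("children"))  # child  (table lookup)
print(lemmatize("walked"))    # walk   (suffix stripping)
print(lemmatize("cities"))    # city   (suffix stripping)
```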

2. Computational Approaches to Modeling Inflection

2.1 Connectionist Models

In computational psycholinguistics, one of the important questions is how language users are able to generate novel inflected, compound, and derivational forms. Rumelhart and McClelland (1986) were the first to propose a connectionist model of inflectional morphology. During the learning phase, the model was presented with English present tense forms on its input layer and corresponding past tense forms on its output layer. Eventually, the model learned to correctly produce most past tense forms from the present tense forms, suggesting that the knowledge required for this conversion can be stored in connection weights. The model also showed a U-shaped learning curve, similar to what has been observed in acquisition: Children first produce irregular forms correctly, then overgeneralize the regular patterns to irregulars, and finally start producing the irregulars correctly again. In connectionist models, the same effect can be achieved by starting training with a small set of frequent regular and irregular forms and later expanding that set to less frequent forms. Despite its apparent success, the model was strongly criticized by Pinker and Prince (1988). Many of these shortcomings were addressed in works by Plunkett and Marchman (1991, 1993), Joanisse and Seidenberg (1999), and Cottrell and Plunkett (1994).

Around the same time that connectionist models of inflection were developed, three other proposals emerged that relied on data-driven principles to explain both regular and irregular inflection. All of these models start out with a knowledge base consisting of a list of exemplars associated with a label corresponding to the operation required to produce the inflected form. For instance, the form walk would have the label ‘+ed’ while the form sing would have the label ‘i>a’. These data-driven models share the assumption that correct inflectional forms can be produced by relying solely on a database of encoded forms. However, they differ in the way in which this is achieved.

2.2 Memory-Based Language Processing

Memory-Based Language Processing (MBLP) is an application and extension of the k-nearest neighbor (“k-nn”) algorithm (Fix & Hodges, 1951) to linguistic material, which was first developed by Daelemans and his colleagues at the end of the 1980s (see Daelemans & van den Bosch, 2005, for an overview). The central tenet of MBLP is that all encountered exemplars are kept in memory and that the creation of new forms is driven by generalization on the basis of similar memorized forms. In k-nn, the symbol k refers to the number of forms that are taken into account for generalization. When k=1, a generalization will be based on the single form that is least distant from the novel form or, when more than one exemplar is at that same distance, on the class that is shared by the majority of these exemplars. Applied to the English past tense, for instance, memory-based learning would predict that the past tense form of the novel verb to spling would be splung on the basis of similar sounding verbs such as spring–sprung, cling–clung, or swing–swung. On the other hand, the past tense form of to plip would become plipped because of its similarity to forms such as to slip–slipped, clip–clipped, or flip–flipped (Keuleers, 2008).
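
The following sketch illustrates the k-nn logic on the toy examples above; it is not TiMBL itself, and the exemplar base, the slot encoding, and the overlap distance are simplifications chosen for illustration. The parameter k controls how local the generalization is; its effect on predicting attested versus novel forms is discussed below.

```python
from collections import Counter

# Toy exemplar base: (present tense form, inflectional class).
# "+ed" = regular suffixation, "i>u" = vowel change as in cling-clung.
EXEMPLARS = [
    ("spring", "i>u"), ("cling", "i>u"), ("swing", "i>u"),
    ("slip", "+ed"), ("clip", "+ed"), ("flip", "+ed"),
    ("walk", "+ed"), ("talk", "+ed"),
]

def encode(word, width=7):
    """Right-align a word in a fixed number of letter slots."""
    return ("_" * (width - len(word)) + word)[-width:]

def distance(a, b):
    """Overlap distance: number of mismatching slots."""
    return sum(x != y for x, y in zip(encode(a), encode(b)))

def predict(novel, k=3):
    """Majority class among the k least distant exemplars."""
    neighbours = sorted(EXEMPLARS, key=lambda ex: distance(novel, ex[0]))[:k]
    return Counter(cls for _, cls in neighbours).most_common(1)[0][0]

print(predict("spling"))  # "i>u"  -> splung
print(predict("plip"))    # "+ed"  -> plipped
```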

The earliest memory-based models of morphological processes were focused on predicting Dutch diminutive formation and the resulting models were able to predict diminutives encountered in corpora with high accuracy (Daelemans, Berck, & Gillis, 1997). MBLP researchers also started to look at psycholinguistic evidence, for instance child acquisition data or experimentally elicited inflections of novel forms. Dutch plural inflection was studied by Keuleers et al. (2007) and Keuleers and Daelemans (2007). Memory-based learning has also been applied to linking elements in Dutch compounds (Krott, Baayen, & Schreuder, 2001), Serbian instrumental inflection (Milin, Keuleers, & Filipović-Đurđević, 2011), and English past tense formation (Hahn & Nakisa, 2000; Keuleers, 2008; Nakisa & Hahn, 1996).

Simulations with memory-based learning have shown that relatively few examples are needed for correct generalization. In their methodological study on Dutch plural inflection, Keuleers and Daelemans (2007) showed that when accuracy is measured as the number of attested forms that can be correctly predicted, the best value for k is usually 1, while for experimentally elicited inflection of novel forms, k=7 is usually the best value. This difference in neighborhood size reflects the tension between creative use of morphology, which relies on more general patterns, and the prediction of attested complex forms, which is more exception-based.

Throughout the years, there has been a focus on offering an efficient, optimized, and user-friendly code base for memory-based language processing. This has led to numerous iterations of the Tilburg Memory Based Learner (TiMBL), which also has an accessible reference guide explaining all the parameters and options (Daelemans, Zavrel, van der Sloot, & van den Bosch, 2004).

2.3 Analogical Modeling

A second data-driven approach that was developed during the 1980s is Analogical Modeling (AM), first described in detail in Skousen (1989). While memory-based learning uses the most similar exemplars stored in memory as a basis for extrapolation, AM takes a more intricate route to determine the basis for analogy, making use of supracontexts, which are essentially abstractions over exemplars, ignoring one or more of their characteristics. A supracontext can thus be seen as a level at which exemplars can be grouped. Since supracontexts are created recursively, there exists a global supracontext that matches all exemplars in the data set. In its decision-making process, AM only considers homogeneous supracontexts, with a homogeneous supracontext being defined as a context in which there is no more disagreement about the class of the exemplars it matches than in any of its subcontexts. The collection of homogeneous supracontexts is called the analogical set. For instance, if we represent an exemplar by its sequence of letters, then sip would have supracontexts -ip, s-p, si-, -i-, s--, --p, and ---. The exemplar tip would share the supracontexts -ip, -i-, --p, and --- with sip. Since to sip and to tip both take the suffix -ed in the past tense, those supracontexts are homogeneous. When we want to determine the inflectional class of the exemplar bip, AM finds the homogeneous supracontexts and gives either the probability of the inflectional class in these contexts, or a discrete decision, reflecting the majority class. AM also has an extra parameter that can be used to vary the proportion of exemplars taken into account for a decision. This parameter is inspired by the idea of an imperfect memory and can obviously have consequences for the behavior of the model, with less frequent classes being affected more by imperfect memory.
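
The enumeration of supracontexts can be made concrete with a few lines of code; the sketch below only generates and intersects supracontexts and leaves out the homogeneity test over exemplar classes that Skousen's full algorithm requires.

```python
from itertools import combinations

def supracontexts(word):
    """All abstractions of a word obtained by masking subsets of its positions."""
    positions = range(len(word))
    contexts = set()
    for n_masked in range(len(word) + 1):
        for masked in combinations(positions, n_masked):
            contexts.add("".join("-" if i in masked else c
                                 for i, c in enumerate(word)))
    return contexts

sip = supracontexts("sip")
tip = supracontexts("tip")
# Includes the exemplar itself as the most specific (fully filled) context.
print(sorted(sip))        # ['---', '--p', '-i-', '-ip', 's--', 's-p', 'si-', 'sip']
print(sorted(sip & tip))  # ['---', '--p', '-i-', '-ip']  shared supracontexts
```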

In morphology, AM, like MBLP, has been applied mostly to inflection. An early study by Skousen (1989) covered the Finnish past tense, which is mostly regular but contains some subregularities, and showed that the existing situation in Finnish fits well with the predictions made by AM. Later applications have focused on Spanish verbal inflection (Eddington, 2009), Spanish gender assignment (Eddington, 2002b; Eddington & Lonsdale, 2007), Spanish diminutives (Eddington, 2002c), and the English past tense (Chandler, 2010; Eddington, 2004).

In a comparison of MBLP with AM on the task of German plural prediction, Daelemans, Gillis, and Durieux (1997) concluded that both algorithms not only perform the task with similar accuracy, but that the patterns of errors are also very similar. Eddington (2002a) compared AM and MBLP on the task of Spanish stress assignment and showed that while there were some minor differences, both models have about the same accuracy on predicting existing forms and that they display the same hierarchy of difficulty in assigning stress, consistent with patterns attributed to children who are learning Spanish.

AM has a canonical implementation in Perl, available from http://humanities.byu.edu/am/. In computational terms, the algorithm is much slower than MBLP because execution time increases exponentially with the number of exemplars.

2.4 Minimal Generalization Learning

Minimal Generalization Learning (MGL) is a data-driven model that, like MBLP and AM, relies on a database of exemplars (Albright & Hayes, 2003). It is conceptually different from those two approaches because the exemplars are used to build a system of rules and are not used directly in the decision process. Despite this, MGL is very similar to AM in its use of contexts to match different exemplars. In MGL, contexts are constructed by pairwise comparison of verbs which undergo the same change in inflection. MGL is called minimal generalization learning, because when exemplars with the same inflectional change are compared, the model constructs the minimal context matching both patterns. For instance, comparing the verbs spring–sprung and sting–stung leads to the context /s__ŋ/, which matches all verbs beginning with /s/ and ending in /ŋ/. By presenting each exemplar and its inflectional change to the system and comparing this to previously evaluated verbs, multiple rules are constructed indicating which change can be applied in which context. MGL deals with different changes occurring in the same context by computing a reliability for each rule. This reliability is a simple probability, computed by taking the number of exemplars in a context that have a particular change and dividing this number by the total number of items covered by the context. A problem with this approach is that all rules that cover just two exemplars have maximum reliability. In general, rules covering fewer exemplars have a high chance of being very reliable, while offering only very narrow generalizations. Because of this, MGL adjusts the reliability of a rule for its scope, with rules covering few exemplars getting a large downward adjustment. In practice, this leads to general rules having higher reliability than specific rules.
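
A minimal sketch of the two ingredients just described is given below, using orthographic forms instead of the phonological representations of Albright and Hayes and omitting their confidence adjustment for rule scope; the toy lexicon is invented for illustration.

```python
# Minimal-generalization sketch: build a context from two forms sharing a
# change, then compute the rule's raw reliability over a toy lexicon.

LEXICON = [
    ("spring", "i>u"), ("sting", "i>u"), ("swing", "i>u"),
    ("string", "i>u"), ("sing", "i>u"),
    ("ping", "+ed"), ("wing", "+ed"),
]

def minimal_generalization(a, b):
    """Shared prefix and suffix of two forms that undergo the same change."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    j = 0
    while j < min(len(a), len(b)) - i and a[-1 - j] == b[-1 - j]:
        j += 1
    return a[:i], a[len(a) - j:]

def reliability(prefix, suffix, change):
    """Raw reliability: hits over scope for a context and a change."""
    scope = [(w, c) for w, c in LEXICON
             if w.startswith(prefix) and w.endswith(suffix)
             and len(w) >= len(prefix) + len(suffix)]
    hits = sum(1 for _, c in scope if c == change)
    return hits, len(scope)

prefix, suffix = minimal_generalization("spring", "sting")
print(prefix, suffix)                      # s ing
print(reliability(prefix, suffix, "i>u"))  # (5, 5): narrow but very reliable
print(reliability("", "ing", "i>u"))       # (5, 7): broader but less reliable
```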

When a novel form is presented to the MGL model, it will usually be covered by different rules suggesting the same or different changes. From all the rules covering the form, MGL will select the rule with the highest reliability and output its associated inflectional change. MGL can also give the probability of the most reliable matching output for each inflectional change.

Albright and Hayes (2003) have argued that there is a fundamental difference between MGL and models that are based on analogy, such as MBL and AM, because, in MGL, a rule context is a structural description of which forms may match. They claim that this structured similarity uniquely allows MGL to identify islands of reliability (IORs): contexts in which there is an unusually high support for a particular inflectional pattern. For instance, in the context /s__ŋ/, which matches the group of irregular verbs like sing–sung, the structural change /ɪ/–/ʌ/ is exceptionally reliable. Analogy-based models, which do not use such structural descriptions, would be unable to identify these islands. However, data comparing MGL to MBL and AM do not strongly support the idea that structured similarity is essential for data-driven models of inflection (Chandler, 2010; Keuleers, 2008).

Although MGL is cast as a rule-based model by its authors, it shares many features with the exemplar-based approaches discussed above. In particular, the minimal generalization procedure shares with AM the emphasis on avoiding direct comparison between exemplars by looking for broader contexts. This distinguishes MGL and AM from MBLP: The first two models structure the lexicon by contexts, while the latter model assumes that no such structuring is necessary. In addition, while MGL can be implemented as a rule-based system, it can also be implemented as an exemplar-based approach, where comparisons are done at runtime instead of relying on prederived rules (Keuleers, 2008).

2.5 Dual Mechanism Models

A substantial amount of computational modeling in morphology has been driven by the claim that the data on how language users choose to inflect existing and novel forms cannot be explained by models that have only an exemplar-based component. The dual mechanism view of morphology, based mostly on observation of English and German inflection, holds that inflection is characterized by a symbolic rule component that applies in noncanonical cases, such as borrowings and novel forms, and an exemplar-based component that handles only the cases that are not covered by the default rule (e.g., Marcus et al., 1995; Prasada & Pinker, 1993).

Interestingly, the evidence brought by proponents of the dual mechanism approach does not rely so much on computational models implementing the approach, but on observing the shortcomings of single mechanism models (e.g., Pinker & Prince, 1988). Consequently, a large body of work has shown that the phenomena for which a dual mechanism is theoretically posited can actually be explained by single mechanism models and that computational implementations of dual mechanism models usually do not offer a better account (e.g., Albright & Hayes, 2003; Hahn & Nakisa, 2000; Keuleers et al., 2007). The debate about single vs. dual mechanism models of morphology can be seen as reflecting a general tension between purely theoretical models and computational implementations, where theoretical analysis tends to lead to the postulation of additional mechanisms without exploring how simpler computational implementations can explain results in a way that is unforeseen by the theoretical analysis. Veríssimo and Clahsen’s (2014) study of Portuguese verbal inflection is the only study so far in which proponents of the dual mechanism view have offered a computational implementation of their model.

3. Modeling Morphology Using Distributional Semantics

Distributional semantics is a generic term for different methods that derive semantic representations from word co-occurrence relations in corpora. The product of a distributional semantic analysis is a vector space—typically with hundreds of dimensions—in which words are represented as numerical vectors. In the most basic implementation of such a vector space, each number in a word vector indicates whether that word occurs in a particular context, such as a document. In more complex implementations, the numbers in a vector space can, for instance, represent how well the word can be predicted by other words. In any case, the similarity between the vectors, and therefore between the words, can be approximated using any of a number of mathematical distance measures.

Recent developments in methods to generate these vector spaces have spurred various computational investigations about the role of semantics in morphology. Marelli, Amenta, and Crepaldi (2015) proposed an orthographic-semantic consistency (OSC) measure that quantifies how well a word’s orthography predicts its meaning. When all words in which a particular form occurs have similarity in meaning—as is the case with transparent word relations such as baker and bakery—then the vectors for these words should be close in semantic space and OSC will be high. In the case that there is similarity in form but not in meaning—as is the case with opaque word relations such as crypt and cryptic—the vectors should be distant in semantic space and OSC should be low. Marelli et al. showed that OSC is a significant predictor of word recognition times, and accounts, at least partially, for the well-known but little-studied effect that transparent words are processed faster than opaque words.
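
As an illustration of how a measure of this kind can be computed, the sketch below takes OSC to be the average semantic similarity between a word and the words that contain it orthographically; the toy vectors are invented, and the simplification ignores details of the published measure, such as how relatives are selected and weighted (see Marelli et al., 2015, for the exact operationalization).

```python
import numpy as np

# Toy semantic vectors; real OSC uses corpus-derived distributional vectors.
VECTORS = {
    "bake":    np.array([0.9, 0.1, 0.0]),
    "baker":   np.array([0.8, 0.2, 0.1]),
    "bakery":  np.array([0.7, 0.3, 0.1]),
    "crypt":   np.array([0.1, 0.9, 0.0]),
    "cryptic": np.array([0.0, 0.2, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def osc(target):
    """Mean semantic similarity between a word and its orthographic relatives."""
    relatives = [w for w in VECTORS if target in w and w != target]
    if not relatives:
        return None
    return np.mean([cosine(VECTORS[target], VECTORS[w]) for w in relatives])

print(round(osc("bake"), 2))   # high: the form predicts the meaning well
print(round(osc("crypt"), 2))  # lower: 'cryptic' is semantically distant
```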

Recently, Mikolov, Yih, and Zweig (2013) showed that vector spaces built using recurrent neural networks can be used to extrapolate relationships between morphologically related forms. As an example, let us assume that, based on the relationship between the singular year and the plural years, we want to infer the plural form for law. In a vector space, the words year, years, law, and laws—like all other words—are represented as equal-length vectors of real numbers. The method proposed by Mikolov et al. consists of first subtracting the vector for year from the vector for years. Then, the result of that subtraction is added to the vector for law. The vector resulting from this addition is the predicted vector for the plural of law. Since it would be highly improbable that a vector exists at exactly these coordinates, the final step is to find the closest vector in the multidimensional space. If the method is successful, this should be the vector for laws. Mikolov et al. report simulations on verbal, nominal, and adjectival inflection with varying degrees of success. They also demonstrate that vector spaces based on recurrent neural networks produce much better results than vector spaces based on latent semantic analysis.
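
The vector-offset method can be made concrete with a toy example; in practice the vectors come from a model trained on a large corpus and have hundreds of dimensions, and the input words are usually excluded from the candidate set, as below.

```python
import numpy as np

# Toy vectors standing in for vectors learned by, e.g., a recurrent network.
VECTORS = {
    "year":  np.array([1.0, 0.0, 0.2]),
    "years": np.array([1.0, 1.0, 0.2]),
    "law":   np.array([0.2, 0.0, 1.0]),
    "laws":  np.array([0.2, 1.0, 1.0]),
    "dog":   np.array([0.9, 0.1, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Offset method: years - year + law should land near the vector for "laws".
target = VECTORS["years"] - VECTORS["year"] + VECTORS["law"]
candidates = {w: cosine(target, v) for w, v in VECTORS.items()
              if w not in ("year", "years", "law")}
print(max(candidates, key=candidates.get))  # laws
```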

Marelli and Baroni (2015) have shown that vector space models can also be used to compute the meaning of derived forms. The contrast with other computational models of morphology that include a semantic component is that the meanings of the words in Marelli and Baroni's model are completely data-driven. Marelli and Baroni show that affixes can be represented as matrices that encode an optimal mapping from unaffixed words to their affixed versions. For instance, a matrix for the affix re- would be constructed by computing the matrix that best maps vectors for forms such as consider and apply to the vectors for their affixed forms reconsider and reapply. When this matrix is multiplied with the vector for a stem, the result is a vector representing the composed meaning. For instance, multiplying the matrix for re- with the vector for finalize would result in a semantic vector for the novel word refinalize. The model correctly predicts semantic intuitions about these novel forms: the existing words whose vectors are most similar to a constructed vector are also judged by participants to be more similar to the novel form than its stem is. For instance, the vector most similar to that of the constructed form insultable is the vector of the existing form reprehensible. When asked whether the form insultable is more similar to the stem insult or to reprehensible, participants choose the latter, showing that a plausible derivational meaning vector was computed. In addition to these findings, the model can replicate semantic transparency effects from the experimental literature.
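
A minimal sketch of the underlying idea follows, with invented three-dimensional vectors standing in for real distributional vectors and ordinary least squares standing in for the regression procedure used by Marelli and Baroni.

```python
import numpy as np

# Toy stem and derived-form vectors; real vectors have hundreds of dimensions.
stems   = np.array([[0.9, 0.1, 0.0],    # consider
                    [0.2, 0.8, 0.1]])   # apply
derived = np.array([[0.5, 0.1, 0.6],    # reconsider
                    [0.1, 0.7, 0.5]])   # reapply

# Find the matrix M that best maps stem vectors onto derived vectors
# (least squares); M approximates the meaning contribution of re-.
M, *_ = np.linalg.lstsq(stems, derived, rcond=None)

# Apply the affix matrix to a new stem to compose a meaning for a novel form.
finalize = np.array([0.6, 0.4, 0.2])
refinalize = finalize @ M
print(refinalize.round(2))  # predicted semantic vector for 'refinalize'
```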

4. Computational Approaches to Morphology in Reading and Recognizing Words

While several computational models of human reading ability have been proposed, the investigation of how morphologically complex words are read within these models has been very limited. One reason is that the most prominent models, which will be discussed below, have focused primarily on modeling the reading of monomorphemic and monosyllabic words.

The Dual Route Cascaded (DRC) model of reading (Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001) and the Connectionist Dual Process (CDP+) model (Perry, Ziegler, & Zorzi, 2007) both have a lexical route, in which a word's spelling is tied to its full pronunciation, and a sublexical route that can sound out letter strings. The models share the same lexical route, but the former uses grapheme-to-phoneme conversion (GPC) rules in its sublexical route, while the latter uses a two-layer connectionist network to map between orthography and phonology. These models treat stems and affixes completely differently: stems can be handled by the lexical route, but frequent affixes (e.g., pre-, co-, -ity, -ness) are always handled by the sublexical route when they occur in words that are not themselves stored.

The triangle model (Harm & Seidenberg, 1999, 2004; Seidenberg & McClelland, 1989) is a connectionist model that contains bidirectional mappings between three components: orthography, phonology, and semantics. In this model, identifying the meaning of a written word can occur directly via the orthography-to-semantics mapping or indirectly via the phonology mapping. Reading a written word out loud can happen directly via the orthography-to-phonology mapping or can be mediated via the semantic mapping. Early incarnations of the triangle model also set aside the problem of reading morphologically complex words. Harm and Seidenberg (2004) implemented the semantic component of the model to include feature labels for stems as well as for morphological features, such as affixes representing number or gender. They demonstrated that when such a model is presented with novel forms that contain a potential affix, the corresponding morphological labels in the semantic component are strongly activated, suggesting that representing the semantics of stems as well as of affixes is important when addressing the role of morphology in reading.

Another branch of work using connectionist models has focused on artificial languages and has established that letter sequences corresponding to morphological affixes reliably predict a particular pronunciation, in contrast to sublexical patterns that do not correspond to morphemes (Plaut & Gonnerman, 2000; Rueckl & Raveh, 1999).

Sibley, Kello, Plaut, and Elman (2009) have noted that a problem with all the implementations above is that they rely on slot-based codes, in which each letter is assigned a particular position-specific slot (e.g., CAT = [C1, A2, T3]). These models are therefore limited to dealing with short words that can easily be aligned using a slot-based representation. Because slot-based representation schemes force a word's characters to be assigned to a particular position, they ignore the similarities between letters at different positions. Models using slot-based coding schemes therefore have inherent problems with extracting common patterns in multimorphemic words. When trying to fit the words represented, representation, and presented into a slot-based coding scheme, left-alignment would make clear that representation and represented share morphological structure but would fail to uncover the similarity between presented and represented. Right-alignment would acknowledge the similarity between presented and represented but would ignore any similarity between representation and represented.

Several alternatives to slot-based representations have been proposed (see Davis, 2006, for an overview). Most prominent in the psycholinguistic literature are open-bigram coding (Grainger & Whitney, 2004), which represents a word by all its ordered combinations of two characters (e.g., CATS = [CA, CT, CS, AT, AS, TS]), and spatial coding (Davis, 2010), which gives the first letter the highest activation, the second letter the second-highest activation, and so on. Sibley, Kello, Plaut, and Elman (2008) developed an alternative based on a simple recurrent network (Elman, 1990). Their model learns to encode words of variable length as fixed-length sequences and to output those fixed-length sequences again as variable-length sequences. Unfortunately, there have been only limited attempts to apply these coding schemes specifically to the reading of morphologically complex words.
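
The contrast between slot-based coding and open-bigram coding is easy to make concrete; the sketch below is illustrative only and ignores the constraints (such as a maximum letter distance) that some open-bigram proposals impose.

```python
from itertools import combinations

def slot_code(word):
    """Position-specific slot coding: each letter is bound to its position."""
    return [(letter, i + 1) for i, letter in enumerate(word)]

def open_bigrams(word):
    """Open-bigram coding: all ordered letter pairs, regardless of distance."""
    return ["".join(pair) for pair in combinations(word, 2)]

print(slot_code("cat"))      # [('c', 1), ('a', 2), ('t', 3)]
print(open_bigrams("cats"))  # ['ca', 'ct', 'cs', 'at', 'as', 'ts']

# Left-aligned slot codes of 'presented' and 'represented' share almost
# nothing, whereas their open-bigram sets overlap substantially.
shared = set(open_bigrams("presented")) & set(open_bigrams("represented"))
print(len(shared), len(set(open_bigrams("presented"))))
```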

A recent approach that focuses specifically on morphology in word recognition is naive discriminative learning (Baayen, Milin, Filipović-Đurđević, Hendrix, & Marelli, 2011). This model has many similarities to connectionist models, but it is based on the principles of discriminative learning (Rescorla & Wagner, 1972). Like the triangle model in Harm and Seidenberg (2004), it represents the meaning of stems as well as affixes. However, stems and affixes are both linked to the representation of the whole word, not to a specific part of it. Since it does not include phonology, it is not a model of reading aloud. It focuses on simulating word identification response times or fixation durations during reading. The model has predicted paradigmatic effects in sentential reading in Serbian, a morphologically highly complex language, as well as many other experimental findings from English.
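
The discriminative learning principle at the heart of the model can be sketched with the Rescorla–Wagner update rule, here with letters and letter pairs as cues and lexical meanings as outcomes; the toy corpus, the learning rate, and the cue coding are illustrative choices, not the published model.

```python
from collections import defaultdict

# Toy learning events: written forms paired with the meanings they express.
corpus = [("hand", {"HAND"}), ("hands", {"HAND", "PLURAL"}), ("band", {"BAND"})]
all_outcomes = {"HAND", "PLURAL", "BAND"}

def cues(word):
    """Orthographic cues: single letters and letter pairs."""
    return set(word) | {word[i:i + 2] for i in range(len(word) - 1)}

weights = defaultdict(float)   # (cue, outcome) -> association strength
RATE, LAMBDA = 0.1, 1.0        # learning rate and maximum association

def learn(word, meanings):
    """One Rescorla-Wagner update for every outcome, given the word's cues."""
    present = cues(word)
    for outcome in all_outcomes:
        target = LAMBDA if outcome in meanings else 0.0
        error = target - sum(weights[(c, outcome)] for c in present)
        for c in present:      # all present cues share the prediction error
            weights[(c, outcome)] += RATE * error

for _ in range(100):           # repeated exposure to the toy corpus
    for word, meanings in corpus:
        learn(word, meanings)

def activation(word, outcome):
    return sum(weights[(c, outcome)] for c in cues(word))

# The plural meaning is discriminated by the cues unique to 'hands' (s, ds).
print(round(activation("hands", "PLURAL"), 2))  # high
print(round(activation("hand", "PLURAL"), 2))   # low
```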

5. Critical Analysis and Future Directions

One of the reasons why computational approaches to word formation have flourished is that morphology seems to present a clear testing ground for the view that symbolic processing is a requirement for language (Fodor, 1975; Fodor & Pylyshyn, 1988; Newell & Simon, 1976; Pinker & Prince, 1988). Some morphological domains, such as English past tense formation and German plural formation, appear to present evidence characteristic of symbolic processing, with one regular inflectional suffix applying across the board, regardless of sound (Marcus et al., 1995; Pinker, 1998; Prasada & Pinker, 1993). At the same time, these domains also contain irregular inflectional processes that seem to apply in limited cases and are characteristic of nonsymbolic sound-driven processes. Starting with Rumelhart and McClelland (1986), one of the main motivations for developing computational models was to show that the evidence that seemed to be characteristic of symbolic processing could be explained in a model that only assumed sound-driven learning of morphology. While Pinker and Prince's (1988) critique of the early connectionist models remains valuable, later developments (e.g., Hahn & Nakisa, 2000; Keuleers et al., 2008) repeatedly demonstrate that what seem to be hallmarks of symbolic processing are actually completely in line with a nonsymbolic data-driven view of morphology. At the same time, there are many differences within these computational approaches, an important one being whether they appeal to intermediate levels of organization (Albright & Hayes, 2003; Skousen, 1989) or to direct form-to-form comparison (Daelemans & van den Bosch, 2005; Keuleers & Daelemans, 2007; Keuleers et al., 2008). The question whether these intermediate contexts are required to explain word formation is still extremely relevant and is bound to carry over to theorizing in other areas of psycholinguistics such as sentence formation. At the same time, it is important to understand that while these models are data-driven, they still make a lot of assumptions that are symbolic at the subform level: Form representations rely on phonetic symbols, inflectional endings are represented as explicit classes, and so forth. Whether these representations have any psychological reality is not a side issue that can simply be set aside. Future research will have to address the formation of representations from sound input. If this development takes place, it will lead to more comprehensive theories of word formation that may be very far removed from current accounts.

In an attempt to simplify the problem of reading, computational psycholinguistic models were initially developed to deal with very short and simple words. When it comes to morphology, many leading models of reading are still hindered by this unfortunate legacy. Even today, the reading of long and morphologically complex forms is not seen as a core problem by models such as the DRC (Coltheart et al., 2001) and CDP+ (Perry, Ziegler, & Zorzi, 2007), which, by design, consider reading as independent of the acquisition and development of relations among words. However, it is becoming more and more clear that a model of reading and word identification cannot be complete without also being a model of word acquisition and that models that do not incorporate acquisition are stretched beyond their limits when put to the task of reading and recognition of morphologically complex words. The triangle model (Harm & Seidenberg, 2004) makes a step in that direction by including a semantic component in the model, but only in the naive discriminative model (Baayen et al., 2011) is the implicit learning of form relations a central feature of the model.

Future research in computational approaches to morphology may benefit most from explicitly addressing learning. In this respect, it is telling that the naive discriminative learning model (Baayen et al., 2011) and the distributional semantics models developed for meaning identification (Marelli & Baroni, 2015) have a lot in common. The naive discriminative learning model is based on the discriminative learning model of Rescorla and Wagner (1972), which has close mathematical correspondences to the learning rules used in state-of-the-art distributional semantics models based on recurrent neural networks (Mikolov, Yih, & Zweig, 2013). There are also some obvious differences: The focus in neural network implementations of distributional semantics models is on the intermediate representations formed in hidden layers; naive discriminative learning, on the other hand, does not make use of hidden layers at all. Still, these models offer a valuable insight by showing that problems that are posed in explicitly linguistic morphological terminology can be addressed using very simple computational learning principles, without requiring any explicit morphological information.

Further Reading

Albright, A., & Hayes, B. (2003). Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90(2), 119–161. doi:10.1016/S0010-0277(03)00146-X

Chandler, S. (2010). The English past tense: Analogy redux. Cognitive Linguistics, 21(3). doi:10.1515/COGL.2010.014

Daelemans, W., & Van den Bosch, A. (2005). Memory-based language processing. Cambridge: Cambridge University Press.

Marelli, M., Amenta, S., & Crepaldi, D. (2015). Semantic transparency in free stems: The effect of Orthography-Semantics Consistency on word recognition. The Quarterly Journal of Experimental Psychology, 68(8), 1571–1583. doi:10.1080/17470218.2014.959709

Pinker, S. (1998). Words and rules. Lingua, 106(1), 219–242.

Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28(1–2), 73–193.

Rueckl, J. G. (2011). Connectionism and the role of morphology in visual word recognition. The Mental Lexicon, 5(3), 371–400. doi:10.1075/ml.5.3.07rue

Skousen, R. (1989). Analogical modeling of language. Dordrecht, The Netherlands: Kluwer.

References

Albright, A., & Hayes, B. (2003). Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90(2), 119–161. doi:10.1016/S0010-0277(03)00146-X

Baayen, R. H., Milin, P., Đurđević, D. F., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118(3), 438–481. doi:10.1037/a0023851

Chandler, S. (2010). The English past tense: Analogy redux. Cognitive Linguistics, 21(3). doi:10.1515/COGL.2010.014

Coltheart, M., Rastle, K., Perry, C., Langdon, R., & Ziegler, J. (2001). DRC: A dual route cascaded model of visual word recognition and reading aloud. Psychological Review, 108(1), 204.

Cottrell, G. W., & Plunkett, K. (1994). Acquiring the mapping from meaning to sounds. Connection Science, 6(4), 379–412.

Daelemans, W., Berck, P., & Gillis, S. (1997). Data mining as a method for linguistic analysis: Dutch diminutives. Folia Linguistica, 31(1–2), 57–76.

Daelemans, W., Gillis, S., & Durieux, G. (1997). Skousen's analogical modeling algorithm: A comparison with lazy learning. In New methods in language processing (pp. 3–15). London: University College Press.

Daelemans, W., & Van den Bosch, A. (2005). Memory-based language processing. Cambridge: Cambridge University Press.

Daelemans, W., Zavrel, J., van der Sloot, K., & Van den Bosch, A. (2004). TiMBL: Tilburg memory-based learner. The Netherlands: Tilburg University. Retrieved from http://ilk.uvt.nl/downloads/pub/papers/ilk1001.pdf.

Davis, C. J. (2010). The spatial coding model of visual word identification. Psychological Review, 117(3), 713.

Davis, C. J. (2006). Orthographic input coding: A review of behavioural evidence and current models. In S. Andrews (Ed.), From inkmarks to ideas: Current issues in lexical processing (pp. 180–206). Hove: Psychology Press.

Eddington, D. (2002a). A comparison of two models: Tilburg memory-based learner versus analogical modeling of language. In R. Skousen, D. Lonsdale, & D. B. Parkinson (Eds.), Analogical modeling: An exemplar-based approach to language (pp. 141–155). Amsterdam: John Benjamins.

Eddington, D. (2002b). Spanish gender assignment in an analogical framework. Journal of Quantitative Linguistics, 9(1), 49–75.

Eddington, D. (2002c). Spanish diminutive formation without rules or constraints. Linguistics, 40(2), 395–420.

Eddington, D. (2004). Issues in modeling language processing analogically. Lingua, 114(7), 849–871.

Eddington, D. (2009). Spanish verbal inflection: A single- or dual-route system? Linguistics, 47(1), 173–199.

Eddington, D., & Lonsdale, D. (2007). Analogical modeling: An update (Unpublished manuscript). Brigham Young University, Provo, UT.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.

Curriculum Vitae 2018


I am interested in words: their internal structure, their meaning, their distributional properties, how they are used in different speech communities and registers, and how they are processed in language comprehension and speech production. There are four main themes in my research:

  1. morphological productivity
  2. morphological processing
  3. language variation
  4. statistical data analysis

Morphological Productivity


The number of words that can be described with a word formation rule varies substantially. For instance, in English, the number of words that end in the suffix -th (e.g., warmth) is quite small, whereas there are thousands of words ending in the suffix -ness (e.g., goodness). The term 'morphological productivity' is generally used informally to refer to the number of words in use in a language community that a rule describes.

For a proper understanding of the intriguing phenomenon of morphological productivity, I believe it is crucial to distinguish between (a) language-internal, structural factors, (b) processing factors, and (c) social and stylistic factors. (For a short review, see my contribution to the HSK handbook of corpus linguistics.) Formal linguists tend to focus on (a), psychologists on (b), and neither group likes to think much about (c). Sociologists and anthropologists would probably only be interested in (c).

In order to make the rather fuzzy notion of quantity that is part of the concept of morphological productivity more precise, I have developed several quantitative measures based on conditional probabilities for assessing productivity (Baayen 1992, 1993, Yearbook of Morphology). These measures assess the outcome of all three kinds of factors mentioned above, and provide an objective starting point for interpretation given the kind of materials sampled in the corpora from which they are calculated.
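
The best known of these measures expresses productivity as the proportion of hapax legomena among the tokens of a morphological category, interpretable as the conditional probability of encountering a new type given a token of that category; a minimal sketch (with a toy corpus and naive suffix matching) is given below.

```python
from collections import Counter

# Sketch of the hapax-based productivity measure P = V1 / N: the number of
# hapax legomena with an affix divided by the total token count of that
# affix in the sample. The corpus and the suffix matching are toy examples.
def productivity(tokens, suffix):
    counts = Counter(w for w in tokens if w.endswith(suffix))
    n_tokens = sum(counts.values())
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / n_tokens if n_tokens else 0.0

corpus = ["goodness", "darkness", "darkness", "sweetness", "warmth",
          "warmth", "warmth", "growth", "growth", "kindness"]
print(productivity(corpus, "ness"))  # 3/5 = 0.6: many hapaxes, productive
print(productivity(corpus, "th"))    # 0/5 = 0.0: no hapaxes, unproductive
```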

In a paper in Language from 1996, co-authored with Antoinette Renouf, we show how these productivity measures shed light on the role of structural factors. Hay (2003, Causes and Consequences of Word Structure, Routledge) documented the importance of phonological processing factors, which are addressed for a large sample of English affixes in Hay and Baayen (2002, Yearbook of Morphology) as well as in Hay and Baayen (2003) in the Italian Journal of Linguistics.

Social and stylistic factors seem to be at least as important as the structural factors determining productivity; see, e.g., Baayen (1994) in the Journal of Quantitative Linguistics, the study of Plag et al. (1999) in English Language and Linguistics, and, for register variation and productivity, Baayen and Neijt (1997) in Linguistics.

Hay and Plag (2004, Natural Language and Linguistic Theory) presented evidence that affix ordering in English is constrained by processing complexity. Complexity-based ordering theory holds that an affix that is more difficult to parse should occur closer to the stem than an affix that is easier to parse. (This result is related to the productivity paradox observed by Krott et al., 1999, published in Linguistics, according to which words with less productive affixes are more likely to feed further word formation.)

More recently, however, a study surveying a broader range of affixes, co-authored with Ingo Plag and published in Language in 2009, documented an inverse U-shaped functional relation between suffix ordering and processing costs in the visual lexical decision and word naming tasks. Words with the least productive suffixes showed, on average, the shortest latencies, and words with intermediate productivity the longest latencies. The most productive suffixes showed a small processing advantage compared to intermediately productive suffixes. This pattern of results may indicate a trade-off between storage and computation, with the costs of computation overriding the costs of storage only for the most productive suffixes.

In terms of graph theory, complexity-based ordering theory holds that the directed graph of English suffixes is acyclic. In practice, one typically observes a small percentage (10% or less) of affix orders violating acyclicity. In a recent study, The directed compound graph of English: An exploration of lexical connectivity and its processing consequences, I showed that constituent order for two-constituent English compounds is also largely acyclic, with a violation rate similar to that observed for affixes. The rank ordering for compounds, however, is not at all predictive of lexical processing costs. This suggests that acyclicity is not necessarily the result of complexity-based constraints. What emerged as important for predicting processing costs in this study were graph-theoretical concepts such as membership of the strongly connected component of the compound graph and the shortest path length from the head to the modifier.
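
These graph-theoretical notions can be illustrated with a toy compound graph; the sketch below assumes the third-party networkx library is available, uses invented compounds, and is not the analysis reported in the paper.

```python
import networkx as nx

# Toy directed compound graph: an edge modifier -> head for each two-
# constituent compound (e.g., 'boat' -> 'house' for 'boathouse').
compounds = [("boat", "house"), ("house", "boat"), ("house", "hold"),
             ("fire", "man"), ("man", "hole")]

G = nx.DiGraph(compounds)

# Is the graph acyclic, and which constituents form a strongly connected core?
print(nx.is_directed_acyclic_graph(G))              # False (boathouse/houseboat)
core = max(nx.strongly_connected_components(G), key=len)
print(sorted(core))                                 # ['boat', 'house']

# Shortest path length from a head back to its modifier, one of the
# predictors discussed above.
print(nx.shortest_path_length(G, "house", "boat"))  # 1
```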

Morphological Processing


How do we understand and produce morphologically complex words such as hands and boathouse? The classical answer to this question is that we would use simple morphological rules, such as the rule adding an s to form the plural, or the rule allowing speakers to form compounds from nouns. I have always been uncomfortable with this answer, as so much of what makes language such a wonderful vehicle for literature and poetry is the presence of all kinds of subtle (and sometimes not so subtle) irregularities. Often, a 'rule' captures a main trend in a field in which several probabilistic forces are at work. Furthermore, an appeal to rules raises the question of how these supposed rules would work, would be learned, and would be implemented in the brain.

Nevertheless, morphological rules enjoy tremendous popularity in formal linguistics, which tends to view the lexicon as the repository of the unpredictable. The dominant metaphor is that of the lexicon as a calculus, a set of elementary symbols and rules for combining those symbols into well-formed expressions. The lexicon would then be very similar to a pocket calculator. Just as a calculator evaluates expressions such as 3+5 (returning 8), a morphological rule would evaluate hands as the plural of hand. Moreover, just as a calculator does not have in memory that 3+5 is equal to eight, irrespective of how often we ask it to calculate 3+5 for us, the mental lexicon (the structures in the brain subserving memory for words and rules) would not have a memory trace for the word hands. Irrespective of how often we would encounter the word hands, we would forget having seen, heard, or said the plural form. All we would remember is having used the word hand.

The problem with this theory is that the frequency with which a word is encountered in speech and writing co-determines the fine acoustic details with which it is pronounced, as well as how quickly we can understand it. For instance, Baayen, Dijkstra, and Schreuder published a study in the Journal of Memory and Language in 1997 indicating that frequent plurals such as hands are read more quickly than infrequent plurals such as noses. Baayen, McQueen, Dijkstra, and Schreuder (2003) later reported the same pattern of results for auditory comprehension. Further discussion of this issue can be found in a book chapter co-authored with Rob Schreuder, Nivja de Jong, and Andrea Krott from 2002. Even for speech production, there is evidence that the frequencies with which singulars and plurals are used co-determine the speed with which we pronounce nouns (Baayen, Levelt, Schreuder, & Ernestus, 2008).

In recent studies using eye-tracking methodology, first fixation durations were shorter for more frequent compounds, even though the whole compound had not yet been scanned visually (see Kuperman, Schreuder, Bertram, and Baayen, JEP:HPP, 2009, for English, and Miwa, Libben, Dijkstra, and Baayen, submitted, available on request, for Japanese). These early whole-word frequency effects are incompatible with pocket calculator theories, and with any theory positing that reading critically depends on an initial decomposition of the visual input into its constituents.

Lexical memory also turns out to be highly sensitive to the fine details of the acoustic signal and the probability distributions of these details; see, for instance, Kemps, Ernestus, Schreuder, and Baayen, Memory and Cognition, 2005. Work by Mark Pluymaekers, Mirjam Ernestus, and Victor Kuperman in the Journal of the Acoustical Society of America indicates, furthermore, that the degree of affix reduction and assimilation in complex words correlates with frequency of use. Apparently, a word's specific frequency co-determines the fine details of its phonetic realization.

In summary, the metaphor of the brain working essentially as a pocket calculator is not very attractive in the light of what we now know about the sensitivity of our memories to frequency of use and to phonetic detail of individual words, even if they are completely regular.

Given that our brains have detailed memories of regular complex words, it seems promising to view the mental lexicon as a large instance base of forms that serve as exemplars for analogical generalization. Given pairs such as hand/hands, dot/dots and pen/pens, the plural of cup must be cups. Exemplar theories typically assume that the memory capacity of the brain is so vast that individual forms are stored in memory, even when they are regular. Exemplar theories can therefore easily accommodate the fact that regular complex words can have specific acoustic properties, and that their frequency of occurrence co-determines lexical processing. They also offer the advantage of being easy to implement computationally. In other words, in exemplar-based approaches to the lexicon, the lexicon is seen as a highly redundant, exquisite memory system, in which 'analogical rules' become, possibly highly local, generalizations over stored exemplars. Analogy (a word detested by linguists believing in calculator-like rules) can be formalized using well-validated techniques in statistics and machine learning. Techniques that I find especially insightful and useful are Royal Skousen's Analogical Modeling of Language (AML), and the Tilburg Memory Based Learner (TiMBL) developed by Walter Daelemans and Antal van den Bosch.

These statistical and machine learning techniques turn out to work very well for phenomena that resist description in terms of rules, but where native speakers nevertheless have very clear intuitions about what the appropriate forms are. Krott, Schreuder, and Baayen (Linguistics, 1999) used TiMBL's nearest neighbor algorithm to solve the riddle of Dutch interfixes; Ernestus and Baayen studied several algorithms for understanding the subtle probabilistic aspects of the phenomenon of final devoicing in Dutch. Ingo Plag and his colleagues have used TiMBL to clarify the details of stress placement in English (see Plag et al. in Corpus Linguistics and Linguistic Theory, 2007, and in Language, 2008).

Especially for compounds, a coherent pattern emerges for how probabilistic rules may work in languages such as English and Dutch. For compound stress, for interfixes, and, as shown by Cristina Gagne, for the interpretation of the semantic relation between modifier and head in a compound (see Gagne and Shoben, JEP:LMC 1997 and subsequent work), the probability distribution of the possible choices given the modifier appears to be the key predictor.

The importance of words' constituents as domains of probabilistic generalization ties in well with results obtained on the morphological family size effect. Simple words that occur as a constituent in many other words (in English, words such as mind, eye, fire) are responded to more quickly in reading tasks than words that occur in only a few other words (e.g., scythe, balm). This effect, first observed for Dutch, has been replicated for English and for typologically unrelated languages such as Hebrew and Finnish. Words with many paradigmatic connections to other words are processed faster, and these other words in turn constitute exemplar sets informing analogical generalization.

Although exemplar-based approaches to lexical processing are computationally very attractive, they come with their own share of problems. One question concerns the redundancy that comes with storing many very similar exemplars; a second question concerns the vast numbers of exemplars that may need to be stored. With respect to the first question, it is worth noting that any workable and working exemplar-based system, such as TiMBL, implements smart storage algorithms to alleviate the problem of looking up a particular exemplar in a very large instance base of exemplars. In other words, in machine learning approaches, some form of data compression tends to be part of the computational enterprise.

The second question, however, is more difficult to wave aside. Until recently, I thought that the number of different word forms in languages such as Dutch and English is so small, for an average educated adult native speaker well below 100,000, that the vast storage capacity of the human brain should easily accommodate such numbers of exemplars. However, several labs have recently reported frequency effects for sequences of words. For instance, Antoine Tremblay (http://www.ncilab.ca/people) studied sequences of four words in an immediate recall task and observed frequency effects in the evoked response potentials measured at the scalp. Crucially, these frequency effects (with frequency effects for subsequences of words controlled for) emerged not only for full phrases (such as on the large table) but also for partial phrases (such as the president of the). There are hundreds of millions of different sequences of word pairs, word triplets, quadruplets, etc. It is, of course, possible that the brain does store all these hundreds of millions of exemplars, but I find this possibility unlikely. This problem of an exponential explosion of exemplars does not occur only when we move from word formation into syntax; it also occurs when we move into the auditory domain and consider acoustic exemplars for words. Even the same speaker will never say a given word in exactly the same way twice. Assuming exemplars for the acoustic realizations of words again amounts to assuming that staggering numbers of exemplars would be stored in memory (see, e.g., Ernestus and Baayen, 2011, for discussion).

There are several alternatives to exemplar theory. A first option is to pursue the classic position in generative linguistics, and to derive abstract rules from experience, obviating the need for individual exemplars. Whereas exemplar models such as TiMBL can be characterized as lazy learning, with analogical generalization operating at run time over an instance base of exemplars, greedy learning is characteristic of rule systems such as the Minimal Generalization Learner (MGL) of Albright and Hayes (Cognition, 2003), which derive large sets of complex symbolic rules from the input. Such systems are highly economical in memory, but this approach comes at its own price. First, I find the complexity of the derived systems of rules unattractive from a cognitive perspective. Second, once trained, the system works but can't learn: it ends up functioning like a sophisticated pocket calculator. Third, such approaches leave unexplained the existence of frequency and family size effects for (regular) complex words. Interestingly, Emmanuel Keuleers, in his 2008 Antwerp doctoral dissertation, points out that from a computational perspective, the MGL and a nearest neighbor exemplar-based model are equivalent, the key difference being the point at which the exemplars are evaluated: during learning for the MGL, at runtime for the exemplar model.

A second alternative is to turn to connectionist modeling. Artificial neural networks (ANNs; see, e.g., the seminal study of Rumelhart and McClelland, 1986) offer many advantages. They are powerful statistical pattern associators, they implement a form of data compression, they can be lesioned for modeling language deficits, they are not restricted to discrete (or discretized) input, etc. On the downside, it is unclear to what extent current connectionist models can handle the symbolic aspects of language (see, e.g., Levelt, 1991, Sprache und Kognition), and the performance of ANNs is often not straightforwardly interpretable. Although the metaphor of neural networks is appealing, the connectionist models that I am aware of make use of training algorithms that are mathematically sophisticated but biologically implausible. Nevertheless, connectionist models can provide surprisingly good fits to observed data on lexical processing. I have had the pleasure of working with Fermin Moscoso del Prado Martin for a number of years, and his connectionist models capture frequency effects, family size effects, as well as important linguistic generalizations.

Recently, in collaboration with Peter Milin, Peter Hendrix and Marco Marelli, I have been exploring an approach to morphological processing based on discriminative learning as defined by the Rescorla-Wagner equations. These equations specify how the strength of a cue to a given outcome changes over time as a function of how valid that cue is for that outcome. Ramscar, Yarlett, Dye, Denny and Thorpe (2010, Cognitive Science) have recently shown that the Rescorla-Wagner equations make excellent predictions for language acquisition. Following their lead, we have been examining the potential of these equations for morphological processing. The basic intuition is the following. Scrabble players know that the letter pair qa can be used for the legal scrabble words qaid, qat and qanat. The cue validity of qa for these words is quite high, whereas the cue validity of just the a for these words is very low - there are lots of more common words that have an a in them (apple, a, and, can, ...). Might it be the case that many of the morphological effects in lexical processing arise due to the cue strengths of orthographic information (provided by letters and letter pairs) to word meanings?

To address this question, we have made use of the equilibrium equations developed by Danks (Journal of Mathematical Psychology, 2003), which make it possible to estimate the weights from the cues (letters and letter pairs) to the outcomes (word meanings) from corpus-derived co-occurrence matrices, in our case co-occurrence matrices derived from 11,172,554 two- and three-word phrases (comprising in all 26,441,155 word tokens) taken from the British National Corpus. The activation of a meaning was obtained by summing the weights from the active letters and letter pairs to that meaning. Response latencies were modeled as inversely proportional to this activation, and log-transformed to remove the skew from the resulting distribution of simulated reaction times for statistical analysis. We refer to this model as the Naive Discriminative Reader, where 'naive' refers to the fact that the weights to a given outcome are estimated independently of the weights to other outcomes. The model is thus naive in the naive Bayes sense.
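
A minimal sketch of this estimation and simulation pipeline is given below; it uses a three-word toy corpus, solves the equilibrium system by least squares, and applies an arbitrary scaling to the simulated latencies, so it illustrates the logic rather than reproducing the Naive Discriminative Reader itself.

```python
import numpy as np

# Toy learning events: (written form, meaning). In the actual model the
# events are millions of short phrases from the British National Corpus.
events = [("hand", "HAND"), ("hands", "HANDS"), ("band", "BAND")] * 10

def cues(word):
    """Orthographic cues: letters and letter pairs."""
    return sorted(set(word) | {word[i:i + 2] for i in range(len(word) - 1)})

all_cues = sorted({c for w, _ in events for c in cues(w)})
all_outcomes = sorted({o for _, o in events})
ci = {c: i for i, c in enumerate(all_cues)}
oi = {o: i for i, o in enumerate(all_outcomes)}

# Cue-cue and cue-outcome co-occurrence counts over the learning events.
CC = np.zeros((len(all_cues), len(all_cues)))
CO = np.zeros((len(all_cues), len(all_outcomes)))
for word, outcome in events:
    idx = [ci[c] for c in cues(word)]
    for i in idx:
        CC[i, idx] += 1
        CO[i, oi[outcome]] += 1

# Equilibrium weights: for every cue i, the predicted P(outcome | cue_i)
# given the co-occurring cues should match the observed conditional
# probability; solve the resulting linear system by least squares.
counts = CC.diagonal().reshape(-1, 1)
W, *_ = np.linalg.lstsq(CC / counts, CO / counts, rcond=None)

def activation(word, meaning):
    """Summed weight from the word's active cues to a meaning."""
    return sum(W[ci[c], oi[meaning]] for c in cues(word) if c in ci)

# Simulated latency: inversely proportional to activation, log-transformed.
a = activation("hands", "HANDS")
print(round(a, 2), round(float(np.log(1.0 / max(a, 1e-6))), 2))
```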

It turns out that this very simple, a-morphous architecture captures a wide range of effects documented for lexical processing, including frequency effects, morphological family size effects, and relative entropy effects. For monomorphemic words, the Naive Discriminative Reader provides excellent predictions without any free parameters. For morphologically complex words, good prediction accuracy requires a few free parameters. For instance, for compounds it turns out that the meaning of the head noun is less important than the meaning of the modifier (which is read first). The model captures frequency effects for complex words without there being representations for complex words in the model. The model also predicts phrasal frequency effects (Baayen and Hendrix 2011 (pdf)), again without phrases receiving their own representations in the model. When used as a classifier for predicting the dative alternation, the model performs as well as a generalized linear mixed-effects model and as a support vector machine (pdf).

The naive discriminative reader model differs from both subsymbolic and interactive activation connectionist models. It does not belong to the class of subsymbolic models, as the representations for letters, letter pairs, and meanings are all straightforward symbolic representations, and there are no hidden layers and no backpropagation. It is also not an interactive activation model, as there is only one forward pass of activation. Furthermore, its weights are not set by hand, but are derived from corpus input. In addition, it is extremely parsimonious in the number of representations required, especially compared to interactive activation models incorporating representations for n-grams. Thus far, the results we have obtained with this new approach are encouraging. To our knowledge, there are currently no other implemented computational models of morphological processing that correctly predict the range of morphological phenomena handled well by the naive discriminative reader. It will be interesting and informative to discover where its predictions break down.

Language Variation


My interest in stylometry was sparked by the seminal work of John Burrows, who documented stylistic, regional, and authorial differences by means of principal components analyses applied to the relative frequencies of the highest-frequency function words in literary texts. Burrows' research inspired a study published in the Journal of Quantitative Linguistics entitled Derivational Productivity and Text Typology (pdf), which explored the potential of a productivity measure for distinguishing different text types. Baayen, Tweedie, and Van Halteren (1996, Literary and Linguistic Computing) compared function words with syntactic tags in an authorship identification study. Baayen, Van Halteren, Neijt, and Tweedie (2002, Proceedings JADT, (pdf)) and Van Halteren, Baayen, Tweedie, Haverkort, and Neijt (JQL, 2005, (pdf)) showed that in a controlled experiment the writings of 8 students of Dutch language and literature at the University of Nijmegen could be correctly attributed to a remarkable degree (80% to 95% correct attributions); these data were reanalyzed with Patrick Juola in a CHUM paper entitled A Controlled-Corpus Experiment in Authorship Identification by Cross-Entropy (pdf). For a study on the socio-stylistic stratification of the Dutch suffix -lijk in speech and writing, see Keune et al. (2006, (pdf)), and for the application of mixed-effects modeling in sociolinguistic studies of language variation, see Tagliamonte and Baayen (2012, (pdf)). I am also interested in individual differences between speakers and writers in experimental tasks. For a study suggesting individual differences between speakers with respect to storage and computation of complex words, see my paper with Petar Milin (pdf). For register variation in morphological productivity, see, e.g., Plag et al. (1999) (pdf) in English Language and Linguistics.
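
For readers unfamiliar with the technique, the following is a minimal sketch in R of a Burrows-style analysis. The counts are made up, and a real study would of course use many more texts and many more function words.

# Toy term-frequency matrix: rows are texts, columns are counts of
# high-frequency function words.
counts <- matrix(c(120, 80, 60, 30,
                   150, 70, 40, 45,
                    90, 95, 55, 20),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("textA", "textB", "textC"),
                                 c("the", "of", "and", "in")))

# Relative frequencies, so that differences in text length are factored out.
relfreq <- counts / rowSums(counts)

# Principal components analysis; texts that cluster on the first two
# components share a similar function-word profile.
pca <- prcomp(relfreq, scale. = TRUE)
round(pca$x[, 1:2], 3)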

Statistical Analysis of Language Data


My book on word frequency distributions (Kluwer, 2001) gives an overview of statistical models for such distributions. Stefan Evert developed the LNRE extension of the Zipf-Mandelbrot model (Evert, 2004, JADT proceedings), which offers a flexible and often much more robust continuation of the line of work presented in my book.

I have found R to be a great open-source tool for data analysis. My book with Cambridge University Press, Analyzing Linguistic Data: A Practical Introduction to Statistics Using R, provides an introduction to R for linguists and psycholinguists. An older version of this book, with various typos and small errors, is available (see my list of publications for a link). The data sets and convenience functions used in this textbook are available in the 'languageR' package on the CRAN archives. Recent versions of this package have some added functionality that is not documented in my book, such as a function for plotting the partial effects of mixed-effects regression models (plotLMER.fnc) and a function for exploring autocorrelational structure in psycholinguistic experiments (acf.fnc).

Mixed-effects modeling is a beautiful technique for understanding the structure of data sets with both fixed and random effects, as typically obtained in (psycho)linguistic experiments. My favorite reference for nested random effects is Pinheiro and Bates (2000, Springer). For data sets with crossed random effects, the lme4 package of Douglas Bates provides a flexible tool kit for R. A paper co-authored with Doug Davidson and Douglas Bates in JML, Mixed-effects modeling with crossed random effects for subjects and items (pdf), and chapter 7 of my introductory textbook provide various examples of its use. A highly recommended, more technical book on mixed-effects modeling that provides detailed examples of crossed random effects for subjects and items is Julian J. Faraway (2006), Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, Chapman and Hall. Practical studies discussing methodological aspects of mixed-effects modeling that I was involved in are Capturing Correlational Structure in Russian Paradigms: a Case Study in Logistic Mixed-Effects Modeling (with Laura Janda and Tore Nesset) (pdf), Analyzing Reaction Times (with Petar Milin) (pdf), Models, Forests and Trees of York English: Was/Were Variation as a Case Study for Statistical Practice (with Sali Tagliamonte) (pdf), and Corpus Linguistics and Naive Discriminative Learning (pdf). A general methodological paper (pdf) illustrates, by means of some simulation studies, the disadvantages of using factorial designs, still highly popular in psycholinguistics, for investigations for which regression designs are more appropriate.
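
By way of illustration, here is a minimal example of a mixed-effects model with crossed random effects for subjects and items, using the lexdec data set that ships with languageR. The choice of Frequency as the only predictor is just for the sake of the example.

# Crossed random intercepts for subjects and words, fitted to the (log)
# lexical decision latencies in languageR's lexdec data set.
library(lme4)
library(languageR)

m <- lmer(RT ~ Frequency + (1 | Subject) + (1 | Word), data = lexdec)
summary(m)

# A by-subject random slope for Frequency is added in the same way:
# m2 <- lmer(RT ~ Frequency + (1 + Frequency | Subject) + (1 | Word), data = lexdec)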



The probability of encountering a new, previously unseen word type in a text after N word tokens have been read is given by the slope of the tangent to the vocabulary growth curve at the point (N, V(N)). This slope (a result due to Turing) is equal to the number of word types occurring exactly once in the first N tokens, divided by N. The curves in the figure show how the number of different words ending in -ness keeps increasing when reading through an 18-million-word corpus, while the number of different words ending in -ment quickly reaches a horizontal asymptote, indicating that the probability of observing new formations with this suffix has dropped to virtually zero. Baayen (1992) develops this approach to measuring productivity in further detail.
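
A minimal sketch in R of this hapax-based estimate, with made-up token samples standing in for a real corpus:

# Good-Turing estimate of the probability of sampling an unseen type after
# N tokens: the number of hapax legomena divided by N.
productivity <- function(tokens) {
  N  <- length(tokens)
  V1 <- sum(table(tokens) == 1)   # types attested exactly once
  V1 / N
}

ness_tokens <- c("happiness", "darkness", "sadness", "awareness", "happiness",
                 "togetherness", "greenness")
ment_tokens <- c("government", "government", "movement", "treatment",
                 "government", "movement", "treatment")
productivity(ness_tokens)   # many hapaxes: high estimated productivity
productivity(ment_tokens)   # no hapaxes: estimated productivity of zero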


A cycle in the English compound graph: (woodcock, cockhorse, horsehair, hair oil, oil-silk, silkworm, wormwood). Each modifier is also a head, and each head is also a modifier.

Cost surface for the reading aloud of compounds, as a function of the number of outgoing edges for the modifier and the shortest path length from the head to the modifier. The observed surface (left) is well approximated by a theoretical surface (right) derived from the hypothesis that activation traveling from the head through the network to the modifier interferes most strongly when the path is neither too long nor too short. Very short paths (houseboat/boathouse) don't interfere because of strong support from the visual input; very long paths don't interfere because activation decays with each step through the cycle. Further details are available in Baayen (2010).

The calculus of formal languages does not provide a useful metaphor for understanding how the brain deals with words. The brain is not a pocket calculator that never remembers what it computes.

Frequency dominance effects in reading and listening (top) differ from those in speaking (bottom). A singular-dominant noun is a noun for which the singular is more frequent than the plural (e.g., nose); a plural-dominant noun is characterized by a plural that is more frequent than its singular (e.g., hands). In reading and listening, a plural-dominant plural is understood more quickly than a singular-dominant plural matched for stem frequency, for both high-frequency and low-frequency nouns. In speech, surprisingly, both plurals and singulars require more time to pronounce when the plural is dominant. Baayen, Levelt, Schreuder, and Ernestus (2008) show that this delay in production is due to the greater information load (entropy) of the inflectional paradigms of plural-dominant nouns.
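
As a toy illustration of this entropy measure (the frequencies are made up, and the calculation in the paper is over full inflectional paradigms):

# Shannon entropy of a paradigm's relative frequencies: a more balanced
# paradigm carries a higher information load.
paradigm_entropy <- function(freqs) {
  p <- freqs / sum(freqs)
  -sum(p * log2(p))
}
paradigm_entropy(c(singular = 90, plural = 10))   # singular-dominant: low entropy
paradigm_entropy(c(singular = 40, plural = 60))   # plural-dominant: higher entropy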


Complex words that are used frequently tend to be shortened considerably in speech, to such an extent that out of context the word cannot be recognized. In context, the brain restores the acoustic signal to its canonical form. Kemps, Ernestus, Schreuder, and Baayen (2004) (pdf) demonstrated this restoration effect for Dutch suffixed words. For English, the following example illustrates the phenomenon. First listen to this audio file, which plays a common high-frequency English word. Heard by itself, the word makes no sense at all. Now listen to this word in an audio file that also plays the sentential context. The sentence and the reduced word are spelled out here. A leading specialist on acoustic reduction is Mirjam Ernestus.

Analogical selection of an interfix for the Dutch compound schapenoog is based primarily on the distribution of interfixes in the set of compounds sharing the modifier schaap. Since en is the interfix attested most often in this set, it is the preferred interfix for this compound. See Krott, Schreuder, and Baayen (1999) for further discussion. The assignment of stress in English compounds and the interpretation of the semantic relation between modifier and head are based on exactly the same principle, with the modifier as the most important domain of probability-driven generalization.
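
As a minimal illustration of this constituent-family-based analogy (the little compound database below is made up, and the actual modeling is done with full exemplar-based models over a lexical database):

# Choose the interfix for a novel compound with modifier 'schaap' by taking
# the interfix attested most often in the compounds sharing that modifier.
compounds <- data.frame(
  modifier = c("schaap", "schaap", "schaap", "boek", "boek"),
  interfix = c("en",     "en",     "s",      "",     "en"),
  stringsAsFactors = FALSE
)
choose_interfix <- function(mod, db) {
  family <- db$interfix[db$modifier == mod]   # the modifier's compound family
  names(which.max(table(family)))             # most frequently attested interfix
}
choose_interfix("schaap", compounds)          # "en", as in schaap + en + oog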

The frequency of a four-word sequence (n-gram) modulates the evoked response potential 150 milliseconds post stimulus onset in an immediate recall task at electrode Fz. The top panel shows the main trend, with the standard negativity around 150 ms. The bottom panel shows how this general trend is modulated by the frequency of the four-word phrase (FreqABCD on the vertical axis). For the lower n-gram frequencies, higher-amplitude theta oscillations are present. Tremblay and Baayen (2010) provide further details.

A popular model in the psychology of reading is that orthographic morphemes (e.g., win and er for winner) mediate access from the orthographic level to word meanings. This theory has, as far as I know, never been implemented in a computational model.


In the naive discriminative reader model (computationally implemented in R in the ndl package), there is no separate morpho-orthographic representational layer. Orthographic representations for letters and letter pairs are connected directly to basic meanings such as WIN and AGENT. The weights on the links are estimated from the co-occurrence matrix of the cues on the one hand, and the co-occurrence matrix of outcomes and cues on the other, using the equilibrium equations for the Rescorla-Wagner equations developed by Danks (Journal of Mathematical Psychology, 2003). The estimates are optimal in the least-squares sense.

Why no morphemes? Many morphologists (e.g., Matthews, Anderson, Blevins) have argued that affixes are not signs in the Saussurean sense. For instance, the Serbian case ending a in žena (woman) represents either NOMINATIVE and SINGULAR, or GENITIVE and PLURAL. Normal signs such as tree may have various shades of meaning (such as 'any perennial woody plant of considerable size', 'a piece of timber', 'a cross', 'gallows'), but these different shades of meaning are usually not intended simultaneously.

The naive discriminative reader predicts activation of all grammatical meanings associated with a case ending, depending on the frequencies with which these meanings are used, on how these meanings are used across the other words in the same inflectional class (cf. Milin et al., JML, 2009), and on their use in other inflectional classes. The red triangles indicate the activations predicted by the naive discriminative reader model for the correct meanings; the blue circles represent the activations of meanings that are incorrect. As hoped for, the red triangles tend to be to the right of the blue circles, indicating higher activation for correct meanings. For ženi, however, we have interference from the -i case ending expressing the nominative plural in masculine nouns.

The coefficients of a regression model fitted to the observed reaction times to 1289 English monomorphemic words in visual lexical decision correlate well (r = 0.88) with the coefficients of the same regression model fitted to the simulated latencies predicted by the naive discriminative reader model. The model not only captures the effects of a wide range of predictors (from left to right, Word Frequency, Inflectional Entropy, Morphological Family Size, Number of Meanings, Word Length, Noun-Verb Ratio, N-count, Prepositional Relative Entropy, and Letter Bigram Frequency), but also correctly captures the relative magnitude of these effects.

Scatterplot matrix for by-subject intercepts and slopes according to a mixed-effects regression model for self-paced reading of Dutch poetry. Slower readers have larger intercepts. For most participants, the slope for Word Form Frequency is negative: they read more frequent words faster, as expected. Participants are split with respect to how they process complex words: some subjects read words with more constituents (Nmorphemes) more slowly, while others read them faster. Interestingly, the slower readers are more likely to show stronger facilitation from word frequency, and they also dwell longer on a word the more constituents it has. Conversely, the readers who read words with more constituents fastest are the ones who show no facilitation from word frequency. Apparently, within the population of our informants, there is a trade-off between constituent-driven processing and memory-based whole-word processing. Most participants are balanced, but at the extremes we find both dedicated lumpers and avid splitters. Further details on this experiment can be found in Baayen and Milin (2011).

A conditional inference tree predicting the choice between was (S) and were (R) in sentences such as There wasn’t the major sort of bombings and stuff like that but there was orange men. You know there em was odd things going on. for York (UK) English. A random forest based on conditional inference trees (fitted with the party package for R) outperformed a generalized linear mixed model fitted to the same data in terms of prediction accuracy. See Tagliamonte and Baayen (2012) for further details.
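
For readers who want to experiment with this technique, here is a minimal sketch with the party package; the data frame below is randomly simulated stand-in data, not the York English material.

# Conditional inference tree and random forest with the party package,
# fitted to simulated stand-in data for a binary was/were choice.
library(party)

set.seed(1)
n <- 200
waswere <- data.frame(
  variant      = factor(sample(c("was", "were"), n, replace = TRUE)),
  polarity     = factor(sample(c("affirmative", "negative"), n, replace = TRUE)),
  subject_type = factor(sample(c("existential", "full NP", "pronoun"), n, replace = TRUE)),
  speaker      = factor(sample(paste0("S", 1:20), n, replace = TRUE))
)

tree <- ctree(variant ~ polarity + subject_type + speaker, data = waswere)
plot(tree)

forest <- cforest(variant ~ polarity + subject_type + speaker, data = waswere,
                  controls = cforest_unbiased(ntree = 500, mtry = 2))
preds <- predict(forest, OOB = TRUE)      # out-of-bag predictions
mean(preds == waswere$variant)            # out-of-bag prediction accuracy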
