Harald Baayen
University Of Tuebingen, Linguistics, Faculty Member
The discriminative lexicon is introduced as a mathematical and computational model of the mental lexicon. This novel theory is inspired by word and paradigm morphology but operationalizes the concept of proportional analogy using the mathematics of linear algebra. It embraces the discriminative perspective on language, rejecting the idea that words’ meanings are compositional in the sense of Frege and Russell and arguing instead that the relation between form and meaning is fundamentally discriminative. The discriminative lexicon also incorporates the insight from machine learning that end-to-end modeling is much more effective than working with a cascade of models targeting individual subtasks. The computational engine at the heart of the discriminative lexicon is linear discriminative learning: simple linear networks are used for mapping form onto meaning and meaning onto form, without requiring the hierarchies of post-Bloomfieldian ‘hidden’ constructs such as phonemes, morphemes, and stems. We show that this novel model meets the criteria of accuracy (it properly recognizes words and produces words correctly), productivity (the model is remarkably successful in understanding and producing novel complex words), and predictivity (it correctly predicts a wide array of experimental phenomena in lexical processing). The discriminative lexicon does not make use of static representations that are stored in memory and that have to be accessed in comprehension and production. It replaces static representations by states of the cognitive system that arise dynamically as a consequence of external or internal stimuli. The discriminative lexicon brings together visual and auditory comprehension as well as speech production into an integrated dynamic system of coupled linear networks.
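The idea of mapping form onto meaning with a simple linear network, and of treating proportional analogy as linear algebra, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the four cue labels, the three semantic dimensions, the three training words, and the use of the Widrow-Hoff (delta) rule to estimate the weights. It is not the authors' implementation.

```python
# Toy sketch of a linear form-to-meaning mapping. All data are invented.

def predict(weights, cues):
    """Map a cue vector onto a semantic vector: s_hat = cues x weights."""
    n_dims = len(weights[0])
    return [sum(c * row[j] for c, row in zip(cues, weights)) for j in range(n_dims)]

def train(forms, meanings, rate=0.1, epochs=500):
    """Estimate the weight matrix incrementally with the delta rule."""
    n_cues, n_dims = len(forms[0]), len(meanings[0])
    weights = [[0.0] * n_dims for _ in range(n_cues)]
    for _ in range(epochs):
        for cues, target in zip(forms, meanings):
            output = predict(weights, cues)
            for i, c in enumerate(cues):
                for j in range(n_dims):
                    # Move each weight in proportion to the prediction error.
                    weights[i][j] += rate * c * (target[j] - output[j])
    return weights

# Hypothetical form cues (columns): "wa", "ta", "lk", "s#"
forms = [
    [1, 0, 1, 0],  # walk
    [1, 0, 1, 1],  # walks
    [0, 1, 1, 0],  # talk
]
# Hypothetical semantic dimensions (columns): WALK, TALK, THIRD_PERSON
meanings = [
    [1, 0, 0],  # walk
    [1, 0, 1],  # walks
    [0, 1, 0],  # talk
]

weights = train(forms, meanings)

# "talks" was never seen in training. Its cue vector equals
# talk + walks - walk, so a linear map sends it to the matching
# combination of meanings: TALK plus THIRD_PERSON.
talks = [0, 1, 1, 1]
novel = predict(weights, talks)
print([round(v, 3) for v in novel])
```

Because the mapping is linear, the analogy walk : walks :: talk : talks is solved by the matrix itself, with no stored entry for the novel form.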
In the past few years, speech recognition has become a new standard for state-of-the-art technology. We now talk to our phones as much as we talk on them. How can helping machines learn to listen improve our understanding of how our own brains work? Dr Harald Baayen at Eberhard Karls University Tübingen and his collaborators work at the intersection of linguistics, psychology, and computational data science to illuminate elegant solutions for processing speech.
We present the Chinese Lexical Database (CLD): a large-scale lexical database for simplified Chinese. The CLD provides a wealth of lexical information for 3913 one-character words, 34,233 two-character words, 7143 three-character words, and 3355 four-character words, and is publicly available through http://www.chineselexicaldatabase.com. For each of the 48,644 words in the CLD, we provide a wide range of categorical predictors, as well as an extensive set of frequency measures, complexity measures, neighborhood density measures, orthography-phonology consistency measures, and information-theoretic measures. We evaluate the explanatory power of the lexical variables in the CLD in the context of experimental data through analyses of lexical decision latencies for one-character, two-character, three-character and four-character words, as well as word naming latencies for one-character and two-character words. The results of these analyses are discussed.
The relation between speed and curvature provides a characterization of the spatio-temporal orchestration of kinematic movements. For hand movements, this relation has been reported to follow a power law with exponent. The same power law has been claimed to govern articulatory movements. We studied the functional form of speed as predicted by curvature using electromagnetic articulography, focusing on three sensors: the tongue tip, the tongue body, and the lower lip. Of specific interest to us was the question of whether the speed-curvature relation is modified by articulatory practice, gauged with words’ frequencies of occurrence. Although analyses imposing linearity a priori indeed supported a power law, relaxation of this linearity assumption revealed that the effect of curvature on speed levels off substantially for lower values of curvature. A modification of the power law is proposed that takes this curvature into account. Furthermore, controlling statistically for number of phones and word duration, we observed that the speed-curvature function was further modulated by an interaction of lexical frequency by curvature, such that for increasing frequency, speed decreased slightly for low curvatures while it increased slightly for high curvatures. The modulation of the balance between speed and curvature by lexical frequency provides further evidence that the skill of articulation improves with practice on a word-to-word basis, and challenges theories of speech production.
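An analysis that imposes linearity a priori can be read as an ordinary least-squares fit in log-log space, where a power law speed = k · curvature^β becomes a straight line with slope β. The sketch below works under that reading, which is an assumption. The data are synthetic and noise-free, and the exponent of −1/3 used to generate them is an illustrative assumption, since the abstract does not state the value.

```python
import math

def fit_power_law(curvature, speed):
    """Least-squares fit of log(speed) = log(k) + beta * log(curvature).

    Returns the pair (k, beta).
    """
    xs = [math.log(c) for c in curvature]
    ys = [math.log(v) for v in speed]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta

# Synthetic data: k = 5.0 and beta = -1/3 are assumed values for illustration.
curvature = [0.01 * 2 ** i for i in range(10)]
speed = [5.0 * c ** (-1.0 / 3.0) for c in curvature]

k, beta = fit_power_law(curvature, speed)
print(k, beta)
```

A fit of this kind cannot detect the levelling-off at low curvature that the abstract reports; that requires a model that lets the slope vary with curvature.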
Research into the phenomenon of morphological productivity, “the possibility for language users to coin, unintentionally, a number of formations which are in principle uncountable” (Schultink 1961), has mainly focused on the qualitative factors which jointly determine the productivity of word formation rules. It is well known that word formation processes are subject to various syntagmatic conditions. Booij (1977) develops a typology of such conditioning factors, distinguishing between rule-specific and rule-independent restrictions on the one hand, and between restrictions pertaining to phonological, stratal and syntactic characteristics on the other. The rôle of paradigmatic factors is discussed in van Marle (1985). He points out that (roughly) synonymous affixes tend to select their base words from complementary domains. Hence they can be analyzed as mutually affecting their respective degrees of productivity.
This mega-study reports a word naming experiment addressing the production of Vietnamese compounds. Instead of considering only the naming latency as response variable, we also investigate the acoustic duration of the speech produced. Effects of compound frequency, word length, and constituent family size were present in both latencies and durations, but effects of constituent frequency were absent. This sets Vietnamese apart from languages such as English and Dutch, for which constituent frequency effects are well attested. We attribute the absence of constituent frequency effects to bisyllabic structures constituting the basic, unmarked phonological form of the Vietnamese word. Our data also challenge models of speech production holding that the onset of speech production would not be affected by the properties of glottal stop initial syllables, as we observed naming latencies to be co-determined by the family size, tone, and syllable type of the second syllable. A remarkable convergence of the effects of frequency and family size for response latencies and acoustic durations validates the word naming task combining these two response variables as an experimental paradigm for the study of phonological encoding in speech production.
A method for estimating word frequencies on the basis of lexical dispersion data that makes use of some results in occupancy theory is outlined. The accuracy of the method, which assumes that words occur independently in texts, is tested on Lewis Carroll's “Alice in Wonderland”. Although intra-textual and inter-textual cohesion can be traced as sources of misfit, the probabilistic aspect of the relation between frequency and dispersion is strong enough to allow lower boundary estimates of word frequency to be obtained for roughly 75% of the words.
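A standard occupancy result gives a feel for how dispersion can be turned into a frequency estimate. If the f tokens of a word fall independently and uniformly into n equally sized texts, the expected number of texts containing the word is d = n(1 − (1 − 1/n)^f), and solving for f gives an estimate from an observed dispersion. The sketch below rests on those assumptions (equal text sizes, independence, uniform placement) and is not necessarily the estimator used in the paper.

```python
import math

def expected_dispersion(frequency, n_texts):
    """Expected number of texts containing a word with the given frequency,
    assuming tokens land independently and uniformly in equal-sized texts."""
    return n_texts * (1.0 - (1.0 - 1.0 / n_texts) ** frequency)

def estimate_frequency(dispersion, n_texts):
    """Invert the occupancy formula to estimate frequency from dispersion.

    A word found in every text gives no finite estimate, so the
    dispersion must be strictly between 0 and n_texts.
    """
    if not 0 < dispersion < n_texts:
        raise ValueError("dispersion must be strictly between 0 and n_texts")
    return math.log(1.0 - dispersion / n_texts) / math.log(1.0 - 1.0 / n_texts)

# Round trip with hypothetical numbers: a word occurring 10 times in 20 texts.
d = expected_dispersion(10, 20)
f_hat = estimate_frequency(d, 20)
print(round(d, 3), round(f_hat, 3))
```

When tokens of a word cluster within a few texts, the observed dispersion is lower than independence predicts, so an estimate of this kind tends to fall below the true frequency.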
Two visual world eyetracking experiments investigated how acoustic cue value and statistical variance affect perceptual uncertainty during Cantonese consonant (Experiment 1) and tone perception (Experiment 2). Participants heard low- or high-variance acoustic stimuli. Euclidean distance of fixations from target and competitor pictures over time was analysed using Generalised Additive Mixed Modelling. Distance of fixations from target and competitor pictures varied as a function of acoustic cue, providing evidence for gradient, nonlinear sensitivity to cue values. Moreover, cue value effects significantly interacted with statistical variance, indicating that the cue distribution directly affects perceptual uncertainty. Interestingly, the time course of effects differed between target distance and competitor distance models. The pattern of effects over time suggests a global strategy in response to the level of uncertainty: as uncertainty increases, verification looks increase accordingly. Low variance generally creates less uncertainty, but can lead to greater uncertainty in the face of unexpected speech tokens.
This study investigates functional interpretations of left anterior negativities (LANs), a language-related EEG effect that has been found for syntactic and morphological violations. We focus on three possible interpretations of LANs caused by the replacement of irregular affixes with regular affixes: misapplication of morphological rules, mismatch of the presented form with analogy-based expectations, and mismatch of the presented form with stored representations. Event-related brain potentials were recorded during the visual presentation of existing and novel Dutch compounds. Existing compounds contained correct or replaced interfixes (dame+s+salons > damessalons vs. *dame+n+salons > *damensalons “women’s hairdresser salons”), while novel Dutch compounds contained interfixes that were either supported or not supported by analogy to similar existing compounds (kruidenkelken vs. ?kruidskelken ‘herb chalices’) - earlier studies had shown that interfixes are selected by analogy instead of rules. All compounds were presented with correct or incorrect regular plural suffixes (damessalons vs. *damessalonnen). Replacing suffixes or interfixes in existing compounds both led to increased (L)ANs between 400 and 700ms without any evidence for different scalp distributions for interfixes and suffixes. There was no evidence for a negativity when manipulating the analogical support for interfixes in novel compounds. Together with earlier studies, these results suggest that LANs had been caused by the mismatch of the presented forms with stored forms. We discuss these findings with respect to the single/dual route debate of morphology and LANs found for the misapplication of syntactic rules.
This paper provides an introduction to mixed-effects models for the analysis of repeated measurement data with subjects and items as crossed random effects. A worked-out example of how to use recent software for mixed-effects modeling is provided. Simulation studies illustrate the advantages offered by mixed-effects analyses compared to traditional analyses based on quasi-F tests, by-subjects analyses, combined by-subjects and by-items analyses, and random regression. Applications and possibilities across a range of domains of inquiry are discussed.
In this study we present a novel set of discrimination-based indicators of language processing derived from Naive Discriminative Learning (NDL) Theory (Baayen, Milin, Filipović Đurđević, Hendrix, & Marelli, 2011). We compare the effectiveness of these new measures with classical lexical-distributional measures (in particular, frequency counts and form similarity measures) to predict lexical decision latencies when a complete morphological segmentation of masked primes is or is not possible. Data derive from a re-analysis of a large subset of decision latencies from the English Lexicon Project (Balota et al., 2007), as well as from the results of two new masked priming studies. Results demonstrate the superiority of discrimination-based predictors over lexical-distributional predictors alone, across both the simple and primed lexical decision tasks. Comparable priming after masked CORNER and CORNEA type primes, across two experiments, fails to support early obligatory segmentation into morphemes as predicted by the morpho-orthographic account of reading. Results fit well with NDL theory, which, in conformity with word and paradigm theory (Blevins, 2003), rejects the morpheme as a relevant unit of analysis. Furthermore, results indicate that readers with greater spelling proficiency and larger vocabularies make better use of orthographic priors and handle lexical competition more efficiently.
Generalized additive mixed models are introduced as an extension of the generalized linear mixed model which makes it possible to deal with temporal autocorrelational structure in experimental data. This autocorrelational structure is likely to be a consequence of learning, fatigue, or the ebb and flow of attention within an experiment (the ‘human factor’). Unlike molecules or plots of barley, subjects in psycholinguistic experiments are intelligent beings that depend for their survival on constant adaptation to their environment, including the environment of an experiment. Three data sets illustrate that the human factor may interact with predictors of interest, both factorial and metric. We also show that, especially within the framework of the generalized additive model, in the nonlinear world, fitting maximally complex models that take every possible contingency into account is ill-advised as a modeling strategy. Alternative modeling strategies are discussed for both confirmatory and exploratory data analysis.
Sound units play a pivotal role in cognitive models of auditory comprehension. The general consensus is that during perception listeners break down speech into auditory words and subsequently phones. Indeed, cognitive speech recognition is typically taken to be computationally intractable without phones. Here we present a computational model trained on 20 hours of conversational speech that recognizes word meanings within the range of human performance (model 25%, native speakers 20–44%), without making use of phone or word form representations. Our model also successfully generates predictions about the speed and accuracy of human auditory comprehension. At the heart of the model is a ‘wide’ yet sparse two-layer artificial neural network with some hundred thousand input units representing summaries of changes in acoustic frequency bands, and proxies for lexical meanings as output units. We believe that our model holds promise for resolving longstanding theoretical problems surrounding the notion of the phone in linguistic theory.
The present study uses electromagnetic articulography, by which the position of tongue and lips during speech is measured, for the study of dialect variation. By using generalized additive modeling to analyze the articulatory trajectories, we are able to reliably detect aggregate group differences, while simultaneously taking into account the individual variation of dozens of speakers. Our results show that two Dutch dialects show clear differences in their articulatory settings, with generally a more anterior tongue position in the dialect from Ubbergen in the southern half of the Netherlands than in the dialect of Ter Apel in the northern half of the Netherlands. A comparison with formant-based acoustic measurements further reveals that articulography is able to reveal interesting structural articulatory differences between dialects which are not visible when only focusing on the acoustic signal.
The age-related declines observed in scores on Paired Associate Learning tests are widely taken as evidence that supports the idea that human cognitive capacities decline across the lifespan. In a computational simulation, we show that the patterns of change in PAL scores are actually predicted by the models that formalize the associative learning process in other areas of behavioral and neuroscientific research. These models also predict that manipulating language exposure can reproduce the experience-related performance differences erroneously attributed to age-related decline in age-matched adults. Consistent with this, old bilinguals outperformed native speakers in a German PAL test, an advantage that increased with age. These analyses and results show that age-related PAL performance changes reflect the predictable effects of learning on the associability of test items, and indicate that failing to control for these effects is distorting our understanding of cognitive and brain development in adulthood.
Affixes display massive variability in morphological productivity. Some affixes (such as English -ness) are highly productive, and regularly used to create new words. Other affixes are completely non-productive (e.g., -th). Individual affixes can be differently productive with different kinds of bases (see, e.g., Baayen and Lieber 1991), and even across different registers (Plag et al. 1999). This type of variable behavior makes the phenomenon very complex to model, and has even led some linguists to dismiss it as linguistically uninteresting.
Many studies report that word recognition in a second language is affected by the native language. However, little is known about the role of the specific language combination of the bilinguals. To investigate this issue, native speakers of French, German, and Dutch carried out a word identification task (progressive demasking) on 1025 monosyllabic English (L2) words. In contrast to previous studies, a regression approach was adopted, including a large number of within- and between-language variables as predictors. Remarkably, a substantial overlap of RT patterns was found across the groups of bilinguals, showing that word recognition results obtained for one group of bilinguals generalize to bilinguals with different mother tongues. Moreover, among the set of predictors that contributed significantly to RT variance, only one between-language variable was present (cognate status); all others reflected characteristics of the target language. Thus, although influences across languages exist, word recognition in L2 by proficient bilinguals is primarily determined by within-language factors, while cross-language effects appear to be limited. An additional comparison of the bilingual data with a native control group showed that there are subtle, but significant differences between L1 and L2 processing.
Letter transpositions are relatively harmless for reading English and other Indo-European languages with an alphabetic script, but severely disrupt comprehension in Hebrew. Furthermore, masked orthographic priming does not produce facilitation as in English (Frost, 2012). This simulation study compares the costs of letter transpositions and of letter exchanges for Modern English and Classical Hebrew, using the framework of naive discriminative learning (Baayen, Milin, Filipovic Durdjevic, Hendrix, & Marelli, 2011). The greater disruption of transpositions for Hebrew as compared to English is correctly replicated by the model, as is the relative immunity of loanwords in Hebrew to letter transpositions. Furthermore, the absence of facilitation of form priming in Hebrew is correctly predicted. The results confirm the hypothesis that the distributional statistics of the orthographic cues in the two languages are the crucial factor determining the experimental hallmarks of orthographic processing, as argued by Frost (2012).
Corpus surveys have shown that the exact forms with which idioms are realized are subject to variation. We report a rating experiment showing that such alternative realizations have varying degrees of acceptability. Idiom variation challenges processing theories associating idioms with fixed multi-word form units (Bobrow & Bell, 1973), fixed configurations of words (Cacciari & Tabossi, 1988), or fixed superlemmas (Sprenger, Levelt & Kempen, 2006), as they do not explain how it can be that speakers produce variant forms that listeners can still make sense of. A computational model simulating comprehension with naive discriminative learning is introduced that provides an explanation for the different degrees of acceptability of several idiom variant types. Implications for multi-word units in general are discussed.
This study is a critical review of the role of frequency of occurrence in lexical processing, in the context of a large set of collinear predictors including not only frequencies collected from different sources, but also a wide range of other lexical properties such as length, neighborhood density, measures of valence, arousal, and dominance, semantic diversity, dispersion, age of acquisition, and measures grounded in discrimination learning. We show that age of acquisition ratings and subtitle frequencies constitute (reconstructed) genres that favor frequent use for very different subsets of words. As a consequence of the very different ways in which collinear variables profile as a function of genre, the fit between these variables and measures of lexical processing depends on both genre and task. The methodological implication of these results is that when evaluating effects of lexical predictors on processing, it is advisable to carefully consider what genres were used to obtain these predictors, and to consider the system of predictors and potential conditional independencies using graphical modeling.
All current theories of auditory comprehension assume that the segmentation of speech into word forms is an essential prerequisite to understanding. We present a computational model that does not seek to learn word forms, but instead decodes the experiences discriminated by the acoustic contrasts in the input. At the heart of this model is a discrimination learning network (Ramscar et al., 2010; Ramscar and Baayen, 2013), trained not on isolated words, but on full utterances. This network constitutes an atemporal long-term memory system. A fixed-width short term memory buffer projects a constantly updated moving window over the incoming speech onto the network's input layer. In response, the memory generates temporal activation functions for each of the output units. Output units (lexical contrasts, or lexomes) with high extended activation reflect a high degree of confidence that the cues that discriminate it from other possible lexomes are present in the external world. Lexomes that are not encoded in the signal give rise to little or no interference. We show that this new discriminative perspective on auditory comprehension is consistent with young infants' sensitivity to the statistical structure of the input. Simulation studies, both with artificial language and with English child directed speech, provide a first computational proof of concept and demonstrate the importance of utterance-wide co-learning.
Recent studies have documented frequency effects for word n-grams, independently of word unigram frequency. Further studies have revealed constructional prototype effects, both at the word level and for phrases. The present speech production study investigates the time course of these effects for the production of prepositional phrases in English, using event-related potentials (ERPs). For word frequency, oscillations in the theta range emerged. By contrast, phrase frequency showed a persistent effect over time. Furthermore, independent effects with different temporal and topographical signatures characterized phrasal prototypicality. In a simulation study we demonstrate that naive discrimination learning provides an alternative account of the data that is at least as powerful as a standard lexical predictor analysis. The implications of the current findings for models of language processing are discussed.
Although Vietnamese has a long history of linguistic research, as yet no psycholinguistic studies addressing lexical processing in this language have been carried out. This paper is the first to investigate lexical processing in Vietnamese, and addresses the reading of Vietnamese bi-syllabic compound words. A large single-subject experiment with 20,000 words was complemented by a smaller multiple-subject experiment with 550 words. We report the novel finding of an inhibitory, anti-frequency effect of Vietnamese compounds’ constituents. We show that this anti-frequency effect is predicted by a computational model of lexical processing grounded in naive discrimination learning. We also show that predictors derived from this model provide a much better fit to the observed reaction times than traditional lexical-distributional predictors. Effects of the density of the compound graph, previously observed for English, were replicated for Vietnamese. Furthermore, tone diacritics were found to be important predictors of silent reading, providing further evidence for the role of phonology in reading.
In this paper, some electronically gathered data on the presence of the past in newspaper texts are presented and analyzed. In ten large text corpora of six different languages, all dates in the form of years between 1930 and 1990 were counted. For six of these corpora this was done for all the years between 1200 and 1993. Depicting these frequencies on the timeline, we find an underlying regularly declining curve, deviations at regular places and culturally determined peaks at irregular points. These three phenomena are analyzed.
Mathematically speaking, all the underlying curves have the same form. Whether a newspaper gives much or little attention to the past, the distribution of this attention over time turns out to be inversely proportional to the distance between past and present. It is shown that this distribution is largely independent of the total number of years in a corpus, the culture in which it is published, the language and the date of origin of the corpus. The phenomenon is explained as a kind of forgetting: the larger the distance between past and present, the more difficult it is to connect something of the past to an item in the present day. A more detailed analysis of the data shows a breakpoint in the frequency vs. distance from the publication date of the texts. References to events older than approximately 50 years are the result of a forgetting process that is distinctly different from the forgetting of more recent events.
Pandel's classification of the dimensions of historical consciousness is used to answer the question how these investigations elucidate the historical consciousness of the cultures in which the newspapers are written and read.
A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures has been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
This auditory lexical decision study shows that cohort entropies, conditional root uniqueness points, and morphological family size all contribute to the dynamics of the auditory comprehension of prefixed words. Three entropy measures calculated for different positions in the stem of Dutch prefixed words revealed facilitation for higher entropies, except at the point of disambiguation, where we observed inhibition. Morphological family
Most studies addressing lexical processing make use of factorial designs. For many researchers in this field of inquiry, a real experiment is a factorial experiment. Methods such as regression and factor analysis would not allow for hypothesis testing and would not contribute substantially to the advancement of scientific knowledge. Their use would be restricted to exploratory studies at best. This paper is an apology coming to the defense of regression designs for experiments including lexical distributional variables as predictors.
In this study we introduce an information-theoretical formulation of the emergence of type- and token-based effects in morphological processing. We describe a probabilistic measure of the informational complexity of a word, its information residual, which encompasses the combined influences of the amount of information contained by the target word and the amount of information carried by its nested morphological paradigms. By means of re-analyses of previously published data on Dutch words we show that the information residual outperforms the combination of traditional token- and type-based counts in predicting response latencies in visual lexical decision, and at the same time provides a parsimonious account of inflectional, derivational, and compounding processes.
This study examines the productivity of five English derivational affixes in a British newspaper, the Times (London), in the period from September 1989 to December 1992. This diachronic corpus of roughly 80 million word tokens contains large numbers of neologisms. Thus, this corpus offers a good opportunity to test both qualitative and quantitative theories of morphological productivity. Our investigations support the usefulness of the quantitative formalization of the notion DEGREE OF PRODUCTIVITY developed in Baayen 1992, 1993a. At the same time, they illustrate that productivity is a function of both text type and real time. An investigation of the morphological structure of the neologisms provides strong support for Aronoff’s (1976) claim that the productivity of an affix may vary significantly with the morphological structure of the base word to which it attaches.
It has recently been shown that listeners use systematic differences in vowel length and intonation to resolve ambiguities between onset-matched simple words (Davis, Marslen-Wilson, & Gaskell, 2002; Salverda, Dahan, & McQueen, 2003). The present study shows that listeners also use prosodic information in the speech signal to optimize morphological processing. The precise acoustic realization of the stem provides crucial information to the listener about the morphological context in which the stem appears and attenuates the competition between stored inflectional variants. We argue that listeners are able to make use of prosodic information, even though the speech signal is highly variable within and between speakers, by virtue of the relative invariance of the duration of the onset. This provides listeners with a baseline against which the durational cues in a vowel and a coda can be evaluated. Furthermore, our experiments provide evidence for item-specific prosodic effects.
This study addresses the comprehension of reduced words, taking as point of departure two lexical decision experiments reported in Ernestus (2009). Ernestus discusses the consequences of segment reduction in auditory comprehension in terms of exemplars for reduced forms and generalization processes reconstructing the unreduced form. A different approach is explored in the present study, using a computational model based on discriminative learning to explain the pattern of results in the experimental data. This new modeling approach, which provides the researcher with detailed information into what distributional properties of the language input may drive the observed effects, suggests that the unusual biphones in reduced words are the key to understanding why reduced words can be learned and understood.
The number of different words expected on the basis of the urn model to appear in, for example, the first half of a text, is known to overestimate the observed number of different words. This paper examines the source of this overestimation bias. It is shown that this bias does not arise due to sentence-bound syntactic constraints, but that it is a direct consequence of topic cohesion in discourse. The nonrandom, clustered appearance of lexically specialized words, often the key words of the text, explains the main trends in the overestimation bias both quantitatively and qualitatively. The effects of nonrandomness are so strong that they introduce an overestimation bias in distributions of units derived from words, such as syllables and digrams. Nonrandom word usage also affects the accuracy of the Good-Turing frequency estimates which, for the lowest frequencies, reveal a strong underestimation bias. A heuristic adjusted frequency estimate is proposed that, at least for novel-sized texts, is considerably more accurate.
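As a rough illustration of the overestimation described in this abstract, the sketch below compares the vocabulary size expected under a random (urn-like) model with the vocabulary size actually observed in the first part of a text. It assumes the binomial approximation, in which a word occurring m times in a text of N0 tokens is absent from a random sample of N tokens with probability (1 - N/N0)^m. The function names and the toy token list are hypothetical, and this is not necessarily the exact computation used in the paper.

```python
from collections import Counter


def expected_vocabulary_size(tokens, sample_size):
    """Expected number of types in a random sample of `sample_size` tokens.

    Assumption: binomial interpolation, i.e. a type with frequency m in the
    full text is missing from the sample with probability (1 - N/N0)**m.
    """
    n_total = len(tokens)
    if n_total == 0 or not 0 <= sample_size <= n_total:
        raise ValueError("sample_size must lie between 0 and len(tokens)")
    keep = sample_size / n_total
    # Frequency spectrum: maps m to the number of types occurring m times.
    spectrum = Counter(Counter(tokens).values())
    return sum(v_m * (1.0 - (1.0 - keep) ** m) for m, v_m in spectrum.items())


def observed_vocabulary_size(tokens, sample_size):
    """Number of distinct types in the first `sample_size` tokens."""
    return len(set(tokens[:sample_size]))


# Toy text in which each word is clustered rather than spread out randomly.
toy_text = ["x", "x", "y", "y"]
expected = expected_vocabulary_size(toy_text, 2)
observed = observed_vocabulary_size(toy_text, 2)
```

In this toy case the random model expects more types in the first half than are actually found there, which is the direction of bias the abstract attributes to the clustered use of topic-specific words.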
Given a form that is previously unseen in a sufficiently large training corpus, and that is morphologically n-ways ambiguous (serves n different lexical functions), what is the best estimator for the lexical prior probabilities for the various functions of the form? We argue that the best estimator is provided by computing the relative frequencies of the various functions among the hapax legomena, the forms that occur exactly once in a corpus; in particular, a hapax-based estimator is better than one based on the proportion of the various functions among words of all frequency ranges. As we shall argue, this is because when one computes an overall measure, one is including high-frequency words, and high-frequency words tend to have idiosyncratic properties that are not at all representative of the much larger mass of (productively formed) low-frequency words. This result has potential importance for various kinds of applications requiring lexical disambiguation, including, in particular, stochastic taggers. This is especially true when some initial hand-tagging of a corpus is required: for predicting lexical priors for very low-frequency morphologically ambiguous types (most of which would not occur in any given corpus), one should concentrate on tagging a good representative sample of the hapax legomena, rather than extensively tagging words of all frequency ranges.
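The hapax-based estimator described in this abstract can be sketched in a few lines. The example assumes a corpus represented as (form, function) pairs, which is a simplification introduced here for illustration; the function name `hapax_priors` and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter


def hapax_priors(tagged_tokens, functions=None):
    """Estimate lexical priors from the hapax legomena of a tagged corpus.

    tagged_tokens: iterable of (form, function) pairs (assumed format).
    functions: optional set restricting the estimate to the competing
               functions of the ambiguous form of interest.
    Returns a dict mapping each function to its relative frequency among
    the forms that occur exactly once in the corpus.
    """
    tagged_tokens = list(tagged_tokens)
    form_freq = Counter(form for form, _ in tagged_tokens)
    # Count functions only over forms with corpus frequency 1.
    hapax_counts = Counter(
        func
        for form, func in tagged_tokens
        if form_freq[form] == 1 and (functions is None or func in functions)
    )
    total = sum(hapax_counts.values())
    if total == 0:
        return {}
    return {func: count / total for func, count in hapax_counts.items()}


# Toy corpus: "walked" occurs twice, so it is not a hapax and is ignored.
toy_corpus = [
    ("walked", "past"),
    ("walked", "participle"),
    ("jogged", "past"),
    ("sprinted", "participle"),
    ("ambled", "past"),
]
priors = hapax_priors(toy_corpus)
```

Here the priors are computed from the three hapax forms only, so the high-frequency form does not influence the estimate, in line with the abstract's argument that high-frequency words are unrepresentative of productively formed low-frequency words.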
Reaction times (RTs) are an important source of information in experimental psychology. Classical methodological considerations pertaining to the statistical analysis of RT data are optimized for analyses of aggregated data, based on subject or item means (cf. Forster & Dickinson, 1976). Mixed-effects modeling (see, e.g., Baayen, Davidson, & Bates, 2008) does not require prior aggregation and allows the researcher the more ambitious goal of predicting individual responses. Mixed-effects modeling calls for a reconsideration of the classical methodological strategies for analysing RTs. In this study, we argue for empirical flexibility with respect to the choice of transformation for the RTs. We advocate minimal a-priori data trimming, combined with model criticism. We also show how trial-to-trial, longitudinal dependencies between individual observations can be brought into the statistical model. These strategies are illustrated for a large dataset with a non-trivial random-effects structure. Special attention is paid to the evaluation of interactions involving fixed-effect factors that partition the levels sampled by random-effect factors.
The history of generative grammar has been such that exploration of the structure and meanings of words has long stayed on the back burner. With respect to the structure of words, this picture began to change in the late 1970s and early 1980s with the appearance of the first substantial works on the structure of word formation within the theory (e.g., Aronoff 1976; Lieber 1980; Williams 1981; Selkirk 1982). Following this there was a gradual increase in interest in the area of morphology that has led to a virtual explosion in recent years. A somewhat less direct trajectory has been followed in the history of lexical semantics within generative grammar. After an initial burst of activity as part of the Generative Semantics/Interpretive Semantics debate of the late 1960s, interest in general issues of lexical semantics flagged within the generative tradition, with the notable exception of the lines of work pursued by Bierwisch (1989, 1997) or Jackendoff (1983, 1987, 1990, 1991, 1996).
In a 50,000-word corpus of spoken British English the occurrence of words embedded within other words is reported. Within-word embedding in this real speech sample is common, and analogous to the extent of embedding observed in the vocabulary. Imposition of a syllable boundary matching constraint reduces but by no means eliminates spurious embedding. Embedded words are most likely to overlap with the beginning of matrix words, and thus may pose serious problems for speech recognisers.
Large Number of Rare Events (LNRE) models for word frequency distributions (e.g., Orlov, 1983, and Sichel, 1986; see Chitashvili and Baayen, 1993, for a review) build on the assumption that words occur randomly in texts. Non-randomness in actual word use, however, leads to considerable discrepancies between the theoretical predictions for the total number of different word types (E[V(N)]) and the numbers of different types with frequencies of 1, 2, 3, ... (E[V(m, N)]) at a given sample size N (in word tokens) from a text of length N0, and the corresponding empirical values. Hubert and Labbé (1988a, 1988b) have shown that binomial interpolation can be significantly enhanced by partitioning the vocabulary into specialised and non-specialised words. Unlike the binomial model, LNRE models allow not only interpolation to smaller sample sizes but also extrapolation to larger sample sizes. The purpose of this paper is to show that the interpolation and extrapolation accuracy of LNRE models can be enhanced by incorporating Hubert and Labbé’s insights. Conditional on the goodness-of-fit of an LNRE model to the frequency spectrum V(m, N0) of a given text, enriching the model with a parameter for the proportion of lexically specialised words may lead to a significant increase in interpolation and extrapolation accuracy. The accuracy of what we will refer to as partition-based adjustment will be compared with an adjustment technique in which the parameters of LNRE models themselves are directly adjusted for non-randomness in word use. The advantages and disadvantages of both methods will be discussed.
Word frequency distributions are generally extremely skewed and are described as having a Large Number of Rare Events (LNRE). LNRE distributions have been found to provide fits to many examples of word frequency distributions. However, Baayen and Tweedie (1998b) present a distribution of the Dutch suffix -heid which cannot be fitted by standard methods. In this case, the data come from a composite source and we introduce the idea of mixture distributions to deal with this. We present expressions for the expected word frequency distribution, the expected number of tokens, and the number of types in the population. An acceptable fit to the -heid data is presented.
This paper provides an overview of the ongoing development of a large corpus of spoken Dutch in Flanders and the Netherlands. We outline the design of this corpus and the various layers of annotation with which the speech signal is enriched. Special attention is paid to the problems we have encountered, and to the tools and protocols developed for obtaining consistent and reliable annotations. We also discuss the outcome of a recent external evaluation of our project by an international committee of experts.
This paper reports an experiment in authorship attribution that reveals considerable authorial structure in texts written by authors with very similar background and training, with genre and topic being strictly controlled for. We interpret our results as supporting the hypothesis that authors have ‘textual fingerprints’, at least for texts produced by authors who are not consciously changing their style of writing across texts. What this study has also taught us is that discriminant analysis is a more appropriate technique to use than principal components analysis when predicting the authorship of an unknown (held-out) text on the basis of known (training) texts of which the authorial provenance is available. Finally, standard discriminant analysis can be enhanced considerably by using an entropy-based weighting scheme of the kind used in latent semantic analysis (Landauer et al., 1998).
This paper provides an overview of how a series of different distributional properties of irregular and regular verbs affect lexical processing in single-word comprehension and production. Tabak et al. (2005) show that it is possible to predict whether a verb is (ir)regular from not only frequency, but also from its neighborhood density, inflectional entropy, morphological family size, number of synsets, its auxiliary, and its number of argument structures. These variables were observed to be predictive for both response latencies and errors in a visual lexical study of 286 Dutch verbs. Interestingly, the greater number of synsets characterizing irregulars led to especially short response latencies for irregular past plurals. Moreover, a higher information complexity, as estimated by the inflectional entropy measure developed in, led to shorter response latencies, and especially so for irregular verbs. In this study, we investigated whether such measures of semantic density could be observed to play a role in word naming also. Two word naming experiments were carried out, simple word naming as well as cross-tense naming. Semantic variables were predictive primarily in cross-tense naming, a task which also revealed effects suggesting competition between the form read and the form said. This competition challenges dual route models of production (Pinker, 1991; Pinker, 1999) and argues for exemplar models of direct lexical access.
This paper describes ongoing research aiming at the description of variation in speech as represented by asynchronous articulatory features. We will first illustrate how distances in the articulatory feature space can be used for event detection along speech trajectories in this space. The temporal structure imposed by the cosine distance in articulatory feature space coincides to a large extent with the manual segmentation on phone level. The analysis also indicates that the articulatory feature representation provides better such alignments than the MFCC representation does. Secondly, we will present first results that indicate that articulatory features can be used to probe for acoustic differences in the onsets of Dutch singulars and plurals.
This study investigates whether the acoustic durations of derivational affixes in Dutch are affected by the frequency of the word they occur in. In a word naming experiment, subjects were presented with a large number of words containing one of the affixes ge-, ver-, ont-, or -lijk. Their responses were recorded on DAT tapes, and the durations of the affixes were measured using Automatic Speech Recognition technology. To investigate whether frequency also affected durations when speech rate was high, the presentation rate of the stimuli was varied. The results show that a higher frequency of the word as a whole led to shorter acoustic realizations for all affixes. Furthermore, affixes became shorter as the presentation rate of the stimuli increased. There was no interaction between word frequency and presentation rate, suggesting that the frequency effect also applies in situations in which the speed of articulation is very high.
This study explores socio-geographic variation in morphological productivity in spoken Dutch. For 72 affixes, we extracted the hapax legomena from 24 sub-corpora of the Corpus of Spoken Dutch, which we defined by the speaker’s country (Flanders versus The Netherlands), education level (High versus Non-High), sex (Women versus Men), and age (Young, Mid or Old). The large number of cells with zero counts, and the substantial variation in the sizes of the sub-corpora underlying the cell counts, posed a special challenge for the statistical analysis. We fitted three different kinds of models to our data: an ordinary least squares linear model with a transformation of the proportion of hapax legomena in the sub-corpus as dependent variable, a linear mixed effects model with affix as random effect and the transformed proportions as dependent variable, and a generalized linear model with a binomial link which considered the hapax legomena as successes, and all remaining words as failures. The generalized linear model outperformed the others, in spite of the extremely small probabilities of success. We discuss why the generalized linear model is superior, and show how generalized linear models can be used to visualize by-affix variability in productivity.
This study addresses the roles of segment deletion, durational reduction, and frequency of use in the comprehension of morphologically complex words. We report two auditory lexical decision experiments with reduced and unreduced prefixed Dutch words. We found that segment deletions as such delayed comprehension. Simultaneously, however, longer durations of the different parts of the words appeared to increase lexical competition, either from the word’s stem (Experiment 1) or from the word’s morphological continuation forms (Experiment 2). Increased lexical competition slowed down especially the comprehension of low frequency words, which shows that speakers do not try to meet listeners’ needs when they reduce especially high frequency words.
We studied the frequencies of phone and syllable deletions in spontaneous Dutch, and the extent to which such deletions are influenced by the various linguistic and sociolinguistic factors represented in the transcriptions, word segmentations and metadata of the Spoken Dutch Corpus. In addition to providing insight into the frequencies of phone and syllable deletions and the factors influencing them, our study illustrates the new opportunities for analysing rich and therefore complex corpus data offered by a recently developed statistical modelling technique: the possibility to model the effects of random factors as crossed instead of nested with generalised linear mixed effects models.
We observed average phone and syllable deletion rates of 7.57% and 5.46% respectively. 20.32% of the words had at least one phone missing, and 6.89% of the words had at least one syllable deleted. The mixed effects models for phone and syllable deletion had several effects in common, which implies that both types of deletion are to a large extent influenced by the same factors. The strongest factors across both models were lexical stress, word duration and the segmental context of the syllable onset of the following word.
In many of the world’s languages, nouns are inflected for number. In general, the singular is simpler than the plural, both with respect to form and with respect to meaning. For instance, the English singular nose consists of just the bare stem nose, while the plural is created from the singular by adding the suffix -s. This difference in formal complexity is mirrored in the complexity of the corresponding semantics, with the singular typically referring to one and the plural to two or more instances of the noun’s referent. Using the terminology of structuralist linguistics, the singular is the unmarked, and the plural the marked form.
Given that word translation equivalents in different languages can give rise to different word structures which can cast different shades of meaning, this study investigates whether such cross-linguistic differences influence how speakers of different languages compare two objects. Picture comparison tasks revealed that speakers of Japanese and English utilize distinct cognitive processes when asked to evaluate how similar two objects are, influenced by various lexical properties of the word translation equivalents of the names for the objects in each language. This result provides partial support for the Linguistic Relativity Hypothesis, which holds that the language speakers are exposed to influences their conception of reality, a hypothesis which allows us to explore empirically the relationship between language and thought in the mental lexicon.
Research Interests:
In the application of any statistical analysis method to the modeling of linguistic phenomena, a recurring question is how to understand the statistical results from a cognitive perspective. Although quantitative models may provide detailed and useful insights into which factors enhance the probability of particular linguistic phenomena, they tend to leave unanswered how actual speakers come to learn and use their language in the way they do.
Research Interests:
A frequently replicated finding is that higher frequency words tend to be shorter and contain more strongly reduced vowels. However, little is known about potential differences in the articulatory gestures for high vs. low frequency words. The present study made use of electromagnetic articulography to investigate the production of two German vowels, [i] and [a], embedded in high and low frequency words. We found that word frequency differently affected the production of [i] and [a] at the temporal as well as the gestural level. Higher frequency of use predicted greater acoustic durations for long vowels; reduced durations for short vowels; articulatory trajectories with greater tongue height for [i] and more pronounced downward articulatory trajectories for [a]. These results show that the phonological contrast between short and long vowels is learned better with experience, and challenge both the Smooth Signal Redundancy Hypothesis and current theories of German phonology.
Research Interests:
Proper name systems provide individuals with personal identifiers, and convey social and hereditary information. We identify a common information structure in the name grammars of the world’s languages, which makes this complex information processing task manageable, and evaluate the impact that the re-engineering of naming practices for legal and political purposes has had on the communicative and psychological properties of these socially evolved systems. While East-Asian naming systems have been largely unaffected by state legislation, legal interference has transformed Western naming practices, making individual names harder to process and remember. Further, the structural collapse of Western naming systems has not affected all parts of society equally: In the US, it has had a disproportionate impact on those sections of society that are least successful in economic and social terms. We consider the implications of these changes for name memory across the lifespan, and for future naming practices.
Research Interests:
Across a range of psychometric tests, reaction times slow as adult age increases. These changes have been widely taken to show that cognitive-processing capacities decline across the lifespan. Contrary to this, we suggest that slower responses are not a sign of processing deficits, but instead reflect a growing search problem, which escalates as learning increases the amount of information in memory. A series of computational simulations show how age-related slowing emerges naturally in learning models, as a result of the statistical properties of human experience and the increased information-processing load that a lifetime of learning inevitably brings. Once the cost of processing this extra information is controlled for, findings taken to indicate declines in cognitive capacity support little more than the unsurprising idea that choosing between or recalling items becomes more difficult as their numbers increase. We review the implications of this for scientific and cultural understanding of aging.
Research Interests:
The perception of prosodic prominence is influenced by different sources like different acoustic cues, linguistic expectations and context. We use a generalized additive model and a random forest to model the perceived prominence on a corpus of spoken German. Both models are able to explain over 80% of the variance. While the random forests give us some insights on the relative importance of the cues, the generalized additive model gives us insights on the interaction between different cues to prominence.
Research Interests:
A frequently replicated finding is that the frequency of words affects their phonetic shape. In English, high frequency words have been shown to contain more centralized vowels than low frequency words. By contrast, a recent study on vowel articulation in German has shown a contrary finding. At the gestural level, tongue movements in high frequency words showed more extensive vowel targets and less coarticulation with consonants. This paper further evaluates the latter finding by taking into account a large set of verbs covering the continuum between high and low frequency. In addition to frequency, the effects of two factors were analyzed: inflection (sagt vs. sagen) and speech rate (normal vs. fast). Our results imply that language experience increases the proficiency with which words are articulated: speakers are able to plan and target tongue movements earlier.
Research Interests:
We report on an experimental study of the processing of noun-noun compounds by native and non-native speakers of English, based on Event-Related Potentials recorded during a mask-primed lexical decision task. Analysis was by generalised linear mixed-effect modelling and generalised additive mixed modelling. Non-native processing is found to display headedness effects induced by the mother tongue. The frequencies of the constituent nouns and of the intended compounds are also shown to have an effect on processing.
Research Interests:
The present study introduces articulography, the measurement of the position of tongue and lips during speech, as a promising method for the study of dialect variation. By using generalized additive modeling to analyze articulatory trajectories, we are able to reliably detect aggregate group differences, while simultaneously taking into account the individual variation across dozens of speakers. Our results on the basis of Dutch dialect data show clear differences between the southern and the northern dialect with respect to tongue position, with a more frontal tongue position in the dialect from Ubbergen (in the southern half of the Netherlands) than in the dialect of Ter Apel (in the northern half of the Netherlands). Thus articulography appears to be a suitable tool to investigate structural differences in pronunciation at the dialect level.
Research Interests:
Listeners rely on highly variable, non-discrete acoustic information to understand spoken messages. The present ‘visual world’ eye tracking study investigated whether the amount of acoustic cue variation affected Cantonese listeners’ perception of speech contrasts. Participants saw pictures of word pairs which were identical except for initial consonants (unaspirated versus aspirated). Auditory stimuli were continua of increasing VOT presented in bimodal distributions. The amount of acoustic variation varied between conditions: high-variance versus low-variance. Generalised Additive Modelling analyses showed that, in the low-variance condition, eye movements reflected cue values: there was differential fixation behaviour for category means, boundaries and peripheries. In contrast, in the high-variance condition, the acoustic cue had little effect: fixation behaviour was similar across the different acoustic cue values. This demonstrates listeners’ high sensitivity to the discriminative value of acoustic cues. How much cue dimensions are utilised depends on their variance.
Research Interests:
A stochastic model based on insights of Mandelbrot (1953) and Simon (1955) is discussed against the background of new criteria of adequacy that have become available recently as a result of studies of the similarity relations between words as found in large computerized text corpora.
Research Interests:
A problem that tends to be ignored in the statistical analysis of experimental data in the language sciences is that responses often constitute time series, which raises the problem of autocorrelated errors. If the errors indeed show autocorrelational structure, evaluation of the significance of predictors in the model becomes problematic due to potential anti-conservatism of p-values.
Research Interests:
The mathematical and computational tools available for the study of word frequency distributions have become increasingly powerful since Zipf published his seminal studies some 60 years ago (Zipf 1935, 1949). The first frequency counts were obtained manually, either by going through a text and filing new words and updating the frequencies of words already encountered on slips of paper, or by going through (manually compiled) concordances. The first statistician to study word frequency distributions, G. U. Yule, obtained the data for his book on „The statistical study of literary vocabulary“ (Yule, 1944) in this way. The first frequency dictionary of Dutch, „De meest voorkomende woorden en woordcombinaties in het Nederlandsch“, was similarly compiled manually by De la Court in 1937.
Research Interests:
This study starts from the hypothesis, first advanced by McDonald and Shillcock (2001), that the word frequency effect for a large part reflects local syntactic co-occurrence. It is shown that indeed the word frequency effect in the sense of pure repeated exposure accounts for only a small proportion of the variance in lexical decision, and that local syntactic and morphological co-occurrence probabilities are what make word frequency a powerful predictor for lexical decision latencies. A comparison of two computational models, the cascaded dual route model (Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001) and the Naive Discriminative Reader (Baayen, Milin, Filipovic Durdjevic, Hendrix, & Marelli, 2010), indicates that only the latter model properly captures the quantitative weight of the latent dimensions of lexical variation as predictors of response times. Computational models that account for frequency of occurrence by some mechanism equivalent to a counter in the head therefore run the risk of overestimating the role of frequency as repetition, of overestimating the importance of words’ form properties, and of underestimating the importance of contextual learning during past experience in proficient reading.
Research Interests:
The history of mankind is characterized by constant change. One aspect of this change is the rise, spread, and demise in time and space of civilizations and religions. Another, perhaps more systematic, aspect of this constant change is that technological innovations and, thanks to these innovations, the amount of information available to agents in human societies have been increasing exponentially.
Societal changes lead to changes in language. Meibauer et al. (2004) and Scherer (2005) observed in a diachronic corpus of German newspapers that the use of multilexemic words increased over time, in response to increasing onomasiological needs for new technologies and a growing body of knowledge.
Research Interests:
In English, as in most of the world’s languages, the majority of words are multimorphemic. In the psycholinguistic literature on lexical processing in bilinguals, however, multimorphemic words have thus far received relatively little treatment. In this chapter, we discuss the opportunities that the study of multimorphemic words affords. We also consider the consequences that a multimorphemic perspective may have on the conceptualization of the bilingual mental lexicon in general and on the links among lexical representations within it. We present a study of compound processing among Hebrew-English bilinguals. These bilinguals performed a lexical decision task with constituent priming in both their languages. We investigated within- and between-language priming effects as well as differences among compound word types. Results point to a highly integrated lexical organization but also illustrate the complexity of such experimental studies.
Research Interests:
Healthy ageing is associated with cognitive changes that are commonly thought to reflect diminished processing power in our minds and brains. Formally, however, it is impossible to establish whether cognitive processing capacities actually do decline over the lifespan in the absence of functional models of cognitive processing, and without controlling for the way that learning tends to result in increases in the amount of information processed by the cognitive system. When the cognitive processes measured by psychometric tests of behavior are formally defined, the performance of older and younger adults can be seen to show little evidence of decline; rather, changes in test scores closely reflect the predictable performance of a system with consistent capacity processing increasing information loads as age and experience increase.
Main Text
It is widely believed that adult aging is inevitably accompanied by a progressive loss of cognitive function. This belief is apparently confirmed by the fact that, as adults' ages increase, their reaction times on many tasks get slower, and their memories, especially for names, appear to fade (Salthouse, 2009). Recently, however, a number of findings have challenged conventional wisdoms. The idea that aging is necessarily coupled with declines in cognitive capacity has come under scientific challenge, as indeed has the notion that the neural structures that implement mental functions simply "atrophy" in our later years (Ramscar et al., 2013, 2014; Burke & Barnes, 2006). Seen from the perspective of our attempts to understand and model ordinary, healthy cognitive functioning across the lifespan, the idea that age-related changes in reaction time and memory performance should necessarily be seen as signs of declines in performance or capacity runs counter to another, widely-accepted conventional wisdom, namely that our minds and brains can be best understood by treating them as a natural information processing device.
It turns out that a simple fact about information processing systems is that speed and accuracy are, by definition, a function of the amount of information that is being processed and the capacity of the processing device (Shannon, 1948). Put simply, if processing capacity is held constant, and the amount of information that has to be processed is increased, something has to give. Information processing systems are digital, not just in the commonly understood sense that they make use of binary codes of ones and zeros, but also in the more interesting sense that in information theory, the "information" that is communicated in systems is broken down into a set of discrete, discriminable states that can be encoded by various combinations of ones and zeros.
Research Interests:
Research in Humanities Computing 5: Selected Papers from the Ach/Allc Conference, University of California, Santa Barbara, August 1995, editor Giorgio Perissinotto. This is the fifth volume of selected papers from the joint annual conference of the ACH and the ALLC which are the two major associations for the use of computers in the humanities.
Research Interests:
Aspecten van Bijbelvertalen (Aspects of Bible Translation). Lectures given for the Bible translation working group of the Vrije Universiteit. Lectures by: Harald Baayen, Riens de Haan, Adrianus Koster, Pieter van Reenen, and Taco Smit.
Research Interests:
Word frequency distributions have been studied from a variety of perspectives. In literary studies, word frequency distributions have attracted the attention of scholars interested in authorship attribution and vocabulary richness (Orlov 1983b, Muller 1977, 1979, Menard 1983, Thisted and Efron 1987, Herdan 1960, 1964). Psychologists have long been interested in word frequencies since word frequency is one of the most robust and important predictors of response time in a variety of experimental tasks addressing on-line word production and word recognition (Carroll 1969, 1970, Scarborough et al. 1977, Whaley 1978). Recently, word frequency distributions have also been exploited for the study of morphological productivity, the extent to which various word formation processes are alive in the language and may be expected to give rise to new (morphologically complex) formations (Baayen 1992, 1993a). This paper focuses on the probabilistic properties of word frequency distributions and on the statistical techniques developed for their analysis. Some attempt will be made, however, to understand the typical statistical properties of word frequency distributions of running texts in terms of the morphological structure of the constituent words and the productivity of the underlying word formation processes.
Research Interests:
We present a model in which the crucial function of morphological processing is seen as calculating meaning. The process by which meaning is obtained from form is analyzed in terms of a hybrid architecture in which symbolic computations are carried out on representations that have become available through spreading activation. We introduce a mechanism of activation feedback from semantic and syntactic representations to so-called concept nodes and from these concept nodes to form-based access representations. This allows the model to become tuned to the requirements imposed on the recognition system by the language-specific structural and distributional characteristics of the morphological system to which it is exposed. The model is compared with other models proposed in the literature.
Research Interests:
reviews a number of linguistic and psycholinguistic computational models of morphological processing / survey the basic linguistic facts that any model should attempt to account for / describes 2 linguistic computational models, K. Koskenniemi's (1983) 2-level morphology model and G. Ritchie et al's (1992) unification-based model, [as well as M. Gasser's (1994) connectionist models of morphological processing of complex words in the auditory modality] / after that, the results of a number of psycholinguistic experiments are summarized, followed by an outline of a number of verbal (not implemented) models of morphological processing / review the main linguistic devices by which semantic or syntactic information is morphologically expressed, and discuss the role of phonology / some of the complexity issues that may be encountered are briefly summarized / focus on the semantic aspects of word formation / trace the processing consequences of the linguistic facts discussed (PsycINFO Database Record (c) 2016 APA, all rights reserved)
Research Interests:
The present study addresses the relation between productivity and markedness using lexical statistics and psycholinguistic experimentation. Generally, unmarked affixes are taken to be more productive than their marked counterparts. Hence it is somewhat counterintuitive to find that the productivity measure P developed in Baayen (1989, 1991) assigns to marked -ster (loopster ‘female walker’) a higher degree of productivity than to unmarked -er (loper ‘walker’), as pointed out by van Marle (1991:154). The aim of the present paper is to investigate in some detail what factors give rise to this exceptional P value. The discussion is structured as follows. After focusing on the markedness relation between the Dutch suffixes -er and -ster in section 1, and following a brief discussion of the quantitative operationalizations of the notion „degree of productivity“ developed in Baayen (1989, 1991, 1992) in section 2, I will argue that it is the marked nature of -ster that gives rise to the unexpectedly high value of P (section 3). Section 4 discusses a production experiment bearing out the P-based prediction that subjects should be able to coin more new formations in marked -ster than in unmarked -er.
Research Interests:
This paper describes three adjectivizing affixes in British newspaper text, -type, mock-, and -shape, which are widely used but have thus far escaped detailed documentation in the literature. This study is part of a larger project on lexical innovation in which we combine the methodology followed by the first author in the earlier AVIATOR project, in which successive chunks of news data were analyzed, with the approach to morphological productivity taken by the second author. In Baayen and Renouf (1996), we discuss the productivity of five well-established derivational affixes in The Times: -ness, -ity, un-, in-, and -ly. The present study focuses on three 'vogue' affixes and their use in The Independent, a newspaper for which a corpus has been compiled of roughly 280 million word tokens covering the years 1988-1997. We present our findings for the prefix mock- in Section 1. In Section 2, we turn to -type, a suffix which is very productive in our data. Finally, Section 3 deals with the suffix -shape, which appears to be becoming available as a new alternative to -shaped.
Research Interests:
This paper addresses the semantics of a fully regular and productive Dutch suffix, -heid, which, like -ness in English, creates abstract nouns from adjectives. We present evidence from lexicography, lexical statistics, and psycholinguistics in support of the hypothesis that -heid serves two semantic functions: anaphoric reference and term formation. The lexicographic evidence shows that these two functions give rise to different kinds of translation equivalents. The lexical statistics show that an adequate word frequency model must assume that words in -heid come from very different distributions. The psycholinguistic evidence shows a different balance of storage and computation depending on the semantic function of the -heid formations.
Research Interests:
This chapter outlines a model for morphological processing in the mental lexicon for visual word recognition. The model is designed to reflect a number of basic functional properties of the human processing system, such as sensitivity to frequency of occurrence, the temporal nature of human lexical processing, and the fact that during word recognition various partially matching lexical candidates become activated and co-influence the time required for word recognition.
Research Interests:
In Dutch, past-tense forms are created by suffixing -te or -de to the verb stem. The suffix -te is added when the stem ends in an underlying voiceless obstruent, while -de is suffixed elsewhere (e.g. Booij 1995:61).
This description is not completely correct, since speakers sometimes suffix -te after underlyingly voiced obstruents, and -de after underlyingly voiceless ones. For instance, 838 out of the 1086 tokens of the past-tense form of glans „gleam“ present on the internet on 8 February 2001 were spelled as glanste, instead of glansde, and 52 out of the 424 tokens of the past-tense form of krab „scratch“ were spelled as krabte, instead of krabde (search engine: Alta Vista). Apparently, the choice between -de and -te is not only directed by the underlying (voice)-specification of the stem-final obstruent.
In this paper, we investigate violations of the standard description, henceforth referred to as „rule“, that -te follows underlyingly voiceless obstruents and -de all other types of segments.
Research Interests:
This chapter addresses the balance of storage and computation in the mental lexicon for fully regular and productive inflectional processes in Dutch. We present evidence that both regular inflected nouns and regular inflected verbs show clear and robust effects of storage, but that at the same time on-line parsing also plays a role. We argue that the balance of storage and computation cannot be predicted on the basis of economy of linguistic description. Instead, a range of cognitive and linguistic factors are crucial determinants.
The Dutch low-frequency morphologically complex word branding, ‘surf, the rolling and splashing of the waves’ (7 occurrences per million), consists of two high-frequency constituents, brand, ‘fire, to burn’ (111 occurrences per million), and the nominalizing suffix -ing (13330 occurrences per million). Semantically opaque words such as branding pose an interesting challenge for dual route models of morphological processing in word recognition that allow complex words to be recognized by means of both a direct route and a parsing route (Baayen, Dijkstra, & Schreuder, 1997; Burani & Laudanna, 1992; Frauenfelder & Schreuder, 1992; Laudanna & Burani, 1995; Schreuder & Baayen, 1995).
The vocabulary of Dutch shows traces of contact with a wide range of languages including classical Greek, Latin, Hebrew, Arabic and French. In the description of the vocabulary from a synchronic point of view, information about the origin of words does not play a role. However, the dichotomy of the native and non-native strata is indispensable, cf. De Haas and Trommelen (1993:11-13), Booij (2002a: 94-5). Native bases combine freely with native affixes (1a) and non-native bases likewise combine with non-native affixes (1b). Various native affixes also attach to non-native bases (1c). However, non-native affixes hardly ever combine with native base words (1d).
Traditionally, formal rewrite rules are understood as the normal way to create novel words, while analogy is taken as an unformalizable and exceptional way to create a new word on the basis of an existing word (see e.g., Anshen & Aronoff 1988). The rule-based approach appears to be adequate for phenomena with strong systematicities which can be easily captured by deterministic rules. However, the very same phenomena can often be described equally well by means of formal and computational models of analogy. In the analogical approach, all novel words are modeled on one or more similar existing forms which serve as the analogical set. Especially in the case of gradual phenomena, where rules often capture only the more or less deterministic sub-patterns in the data, the rule-based approach becomes unsatisfactory. It is these phenomena above all which form a testing ground for the two kinds of approaches.
Words occur as morphological constituents in other words. The number of complex words (e.g., great-ness, great-ly, ... ) in which a base word (e.g., great) occurs, its morphological family size, is a strong co-determinant of response latencies in visual lexical processing. Words that occur in many other words are responded to faster than words that occur in only a few other words. Surprisingly, the morphological family size effect is independent of the frequencies of use of the base word and the frequencies of its family members. We report two experiments with adjectives such as great presented in different morphological and phrasal contexts. A partition of the morphological family members into nouns, verbs, and two kinds of adjectives revealed different effects on the response latencies across these contexts. These results imply that the observed effects can be understood as the result of activation resonance in contextually restricted networks of morphologically related words in the mental lexicon. Possibly, the contextually determined co-activation of a word’s family members is part and parcel of its overall meaning percept in the brain.
Ever since Gernsbacher (1984), it has been widely believed that word frequency counts based on corpora are unreliable, particularly for the highest and the lowest frequency words, due to regression towards the mean. In this study, however, we show that word frequency counts across corpora are not subject to regression towards the mean, neither in theory nor in practice. Sampling error due to underdispersion, however, remains a serious concern.
Six experiments examined how inflected Dutch words are recognized. The auditory lexical decision task was used in Experiments 1, 3, and 5, using, respectively, nouns which take the plural affix –en, nouns which take the plural affix –s, or a mixture of nouns and verbs. Experiments 2, 4, and 6 were visual analogs of the three auditory experiments. In the first four experiments, the relative frequency of the singular and plural forms of words influenced response latencies to plurals, but not to singulars. In the last two experiments, higher frequency singular nouns and verbs were responded to more rapidly than their corresponding lower frequency plurals. The results suggest that there are independent representations of plural forms for nouns and verbs, in both the auditory and visual modalities, even for forms with fully regular affixes. They argue against the view that storage in the mental lexicon is reserved for irregular forms only.
In structuralist and generative theories of morphology, probability is a concept that, until recently, has not had any role to play. By contrast, research on language variation across space and time has a long history of using statistical models to gauge the probability of phenomena such as t-deletion as a function of age, gender, education, area, and morphological structure. In this chapter, I discuss four case studies that illustrate the crucial role of probability even in the absence of sociolinguistic variation. The first case study shows that by bringing probability into morphological theory, the intuitive notion of morphological productivity can be made more precise. The second case study considers a data set that defies analysis in terms of traditional syntagmatic rules, but that can be understood as being governed by probabilistic paradigmatics. The third case study illustrates how the use of item-specific underlying features can mask descriptive problems that can only be resolved in probabilistic morphology. Finally, the fourth case study focuses on the role that probability plays in understanding morphologically complex words. However, before discussing the different ways in which probability emerges in morphology, it is useful to first ask why probability theory, until very recently, has failed to have an impact in linguistic morphology, in contrast to, for instance, biological morphology.
This chapter begins with a discussion of linking rules across languages. It then discusses the semantics of unaccusatives and an experiment to test the semantic factors in linking rules. It argues that syntactic unaccusativity is determined by meaning in both German and Dutch. Two semantic factors appear to determine unaccusativity: [telicity] and [actor]. Subjects use the Telicity Linking Rule for verbs with detectable endpoints, classifying them as unaccusative. They also sometimes use the Actor Linking Rule to classify verbs with detectable actors as unergative. When both an endpoint and an actor are present for a given verb, subjects classify the verb as unaccusative. Thus, the Telicity Linking Rule appears to take priority over the Actor Linking Rule. This was related to the geometry of their conceptual structure representations.
The spelling of linking elements in Dutch compounds such as boekenkast ‘bookScase’ and slangenbeet ‘snakeSbite’ has been an issue since the introduction of an extensive set of rules in De Vries and Te Winkel (1884), the publication that received legal status in 1947 and offers the foundations of present-day Dutch orthography. Though most of De Vries and Te Winkel’s spelling system is still in force today, their spelling of linking elements no longer is. This aspect of Dutch spelling was changed in 1954 and in 1995, cf. the overview of words changed and not changed since 1884 in (1).
In this study we introduce the Accumulation of Expectations technique to build vectorial representations of the orthographic and phonetic forms of all the words in a language for use in connectionist models. We demonstrate how this technique can be used to build realistic orthographic representations for all Dutch and English words from the CELEX database, and realistic phonetic representations for all Dutch words in CELEX.
A series of studies (Pinker 1991, 1997; Kim et al. 1991; Pinker 1999; Pinker and Ullman 2002) have argued that differences between irregular and regular verbs are restricted to form. However, there are also studies which suggest that a strict separation of form and meaning may be counterproductive for theories of the mental lexicon. Bybee and Slobin (1982) observed that under time pressure, irregular past tense forms such as sat were produced upon presentation of seat, instead of the correct regular past tense form seated. Apparently, high-frequency irregular past tense forms served as attractors for the past-tense of semantically related regulars. Furthermore, Ramscar (2002) reported that when subjects were asked to say the past tense for pseudowords, the semantic context co-determined whether a regular or an irregular past tense was produced. This result also suggests that regularity interacts with semantics.
Existing techniques for vector-space semantic representations have provided useful tools for the automatic building of semantic type systems. However, these models tend to pay little attention to the position of each element in the sequence of words. This leads to the loss of valuable information. We present an unsupervised technique that extracts rich representations encoding morphological, syntactic and semantic information from large corpora. We present results based both on a small artificial corpus, and larger realistic samples of text.
Morphology is the branch of linguistics that studies the internal structure of words. Comparisons of sequences of words such as strong/strength, long/length, warm/warmth, deep/depth and great/greatness, ready/readiness, glad/gladness, sad/sadness show that systematic changes in form (the suffixation of –th or –ness) are accompanied by systematic changes in meaning (changing an adjective into an abstract noun). The question of productivity arises as soon as the numbers of word pairs belonging to a word formation pattern are counted. In the case of –th, the CELEX lexical database (Baayen/Piepenbrock/Gulikers: 1995) lists only three other such formations, broad/breadth, true/truth, wide/width, while in the case of –ness, thousands of other formations are attested. Speakers of English are reluctant to extend the –th series with new forms such as coolth, and when they do so, it is likely to be on purpose to achieve some special effect (Schultink 1961, Aronoff 1983). By contrast, neologisms in –ness are difficult to identify as such, and they are seldom used with the explicit intention of foregrounding. Descriptively, –th is said to be unproductive, and –ness productive. Used in this way, the term productivity denotes a qualitative dichotomy between extendable and non-extendable morphological patterns. This qualitative use of the term productivity runs into problems when less extreme morphological patterns are considered. The patterns employ/employee, legate/legatee, deport/deportee and active/activity, actual/actuality, neutral/neutrality are supported by 36 and 496 pairs in the CELEX lexical database respectively. The former pattern is intuitively judged not to be unproductive, whereas the latter pattern seems not to be really productive. Apparently, a type count of the number of attested formations is not a reliable indicator of whether a word formation pattern is productive.
Words differ with respect to how often they are used in speech or writing. Words such as eat and hand are common, while words such as scythe and supersensoriness are rare. A characteristic property of words as they are used in everyday language is that there are relatively few words with high frequencies of use, and large numbers of words with very low frequencies of use. Not only does one tend to find many rare words in a given sample of textual materials; when the sample is increased, large numbers of new rare words are also observed. This holds for small samples just as well as for large samples of tens of millions of words. This implies that there are great numbers of low-probability words, many of which will not be seen in actual textual samples. Word frequency distributions, in other words, are highly skewed, asymmetrical distributions.
Large data resources play an increasingly important role in both linguistics and psycholinguistics. The first data resources used by both psychologists and linguists alike were word frequency lists such as Thorndike and Lorge (1944) and Kučera and Francis (1967). Although the Brown corpus on which the frequency counts of Kučera and Francis were based was very large for its time, comprising some one million word forms carefully sampled from different registers of English, many common words did not appear in the frequency lists, while others appeared with counterintuitive frequencies of use.
The historical development of periphrastic do in different types of English sentences is well-documented in Ellegard (1953), where the periphrastic-do construction is analyzed in 107 texts between 1390 and 1710. Ellegard’s examples illustrating this syntactic change can also be found in Kroch (1989a,b), Ogura (1993), and Vulanovic (2005, to appear). These papers use Ellegard’s data to discuss the change further. Kroch (1989a,b) and Ogura (1993) give plausible linguistic explanations of the development of periphrastic do in different types of sentences. Vulanovic (2005) uses his grammar efficiency model to confirm Ellegard’s hypothesis that emphatic do influenced the development in affirmative declarative sentences.
This study shows that the morphological structure of a word influences the acoustic realization of its affixes. In a corpus of spoken Dutch, the consonant cluster [xh] of the suffix –igheid is pronounced with a shorter duration if the two consonants are separated by a morphological boundary. This observation can be explained within an information-theoretic approach to morphology, according to which the information carried by a suffix depends on the density of the morphological paradigm. The pronunciation of words and affixes is characterized by immense intra- and interspeaker variation. In the current study, we investigated whether some of this variation can be accounted for by morphological structure. Our focus was on the Dutch suffix –igheid (/əxhEit/), which occurs in different types of words. Three morphological types were considered, which differ in whether the base word and the form ending in –ig occur in isolation (see Table 1).
In many languages, underlyingly voiced obstruents are realized as voiceless in word-final position (Final Devoicing). Previous research has shown that this neutralization of the underlying [voice]-specification may be phonetically incomplete, with underlyingly voiced obstruents being realized as slightly voiced. Listeners are sensitive to incomplete neutralization, since they can tell apart above chance level the phonetic realizations of words underlyingly ending in voiced and voiceless obstruents. This study discusses to what extent incomplete neutralization is functional. It reports a series of experiments showing that incomplete neutralization can be induced in Dutch just by the way words are spelled, and that no minimal word pairs are required for incomplete neutralization to emerge. In addition, these experiments also show that Dutch listeners take advantage of incomplete neutralization, even when there are no task-specific reasons for them to do so. Incomplete neutralization appears to be a subphonemic cue to past-tense formation, and has a more substantial role in language processing than has been assumed thus far. Finally, our data show that Dutch listeners and speakers dynamically adapt their production and interpretation of the acoustic signal to the voicing properties of the orthographic and acoustic forms encountered previously in the experiment.
Words with internal morphological structure are processed differently in the mental lexicon than monomorphemic words, both in language comprehension and in speech production. How morphological structure is represented in the brain is hotly disputed. Dissociations between the processing of regular and irregular complex words on the one hand, and gradient morphological phenomena on the other, are key issues in the debate between single and dual route approaches, and between symbolic and connectionist approaches.
In Dutch, all morpheme-final obstruents are voiceless in word-final position. As a consequence, the distinction between obstruents that are voiced before vowel-initial suffixes and those that are always voiceless is neutralized. This study adds to the existing evidence that the neutralization is incomplete: neutralized, alternating plosives tend to have shorter bursts than non-alternating plosives. Furthermore, in a rating study, listeners scored the alternating plosives as more voiced than the non-alternating plosives, showing sensitivity to the subtle subphonemic cues in the acoustic signal. Importantly, the participants who were presented with the complete words, instead of just the final rhymes, scored the voiceless alternating plosives as even more voiced. This shows that listeners' perception of voice is affected by their knowledge of the obstruent's realization in the word's morphological paradigm. Apparently, subphonemic paradigmatic levelling is a characteristic of both production and perception. We explain the effects within an analogy-based approach.
Theoretical linguists have traditionally relied on linguistic intuitions such as grammaticality judgments for their data. But the massive growth of computer-readable texts and recordings, the availability of cheaper, more powerful computers and software, and the development of new probabilistic models for language have now made the spontaneous use of language in natural settings a rich and easily accessible alternative source of data.
In the 1970s, the mathematical properties of formal languages provided a key source of inspiration to morphological theory. Models such as those developed by Lieber (1980) and Selkirk (1984) viewed the lexicon as a calculus, a formal system combining a repository of morphemes with rules for combining these morphological atomic units into complex words.
Old English morpho-syntax allows a degree of word order flexibility that is exploited by discourse strategies. Key elements here are: adverbs functioning as discourse partitioners, and a wider range of pronominal elements, extending the number of strategies for anaphoric reference. The syntactic effect is an extended range of subject and object positions, which are exploited for discourse flexibility. In particular, a class of high adverbs, including primarily þa “then” and þonne “then”, define on their left an area in which discourse-(linked) elements occur, including a range of pronouns, but also definite nominal subjects. The latter occur here because the Old English weak demonstrative pronouns that serve to mark definiteness also allow specific anaphoric reference to a discourse antecedent. We also develop a model of quantitative analysis that brings out the relationship between the narrowly circumscribed syntactic system and the relative diffuseness of the discourse referential facts.
The vocabulary of English and most other languages contains many words that have internal structure (see also Articles 26 and 31). Words such as STRANGENESS, WEAKNESS, and SOFTNESS contain the formative NESS, which is usually found to the right of adjectives. Words ending in NESS almost always are abstract nouns. We refer to the sets of words sharing aspects of form structure and aspects of meaning as morphological categories.
It is generally accepted that we store representations of words in a mental dictionary, which we call the lexicon. However, what exactly is stored in the mental lexicon remains an open question. For example, do we store the word dog as well as its plural form dogs, or do we only store dog and have a rule (NOUN + -s = plural) to compute the plural form? A similar question arises regarding the storage vs. computation of multi-word units, wherein a single meaning is attached to a string of words. The canonical examples are phrasal verbs (give up), compounds (jailbird), and idioms (kick the bucket). By their very nature, these items offer us an opportunity to understand the interplay between storage and computation. Corpus-based research has shown that the tendency for words to occur together in discourse extends far beyond the canonical (e.g., Biber, Johansson, Leech, Conrad, and Finegan, 1999; Bod, Scha, and Sima’an, 2003). In fact, other sequences of words, such as in the middle of, pattern together with such frequency that it may be enough to treat them as single units in their own right (Biber et al., 1999). There is a good psycholinguistic basis for proposing that the mind stores and processes these multi-word units as wholes (e.g., Bod, 2001; Schmitt and Underwood, 2004; Underwood, Schmitt, and Galpin, 2004; Jiang and Nekrasova, 2007; Conklin and Schmitt, 2008; Tremblay, Derwing, Libben, and Westbury, 2008). The main reason may be the structure of the mind itself, which stores a vast number of information units in long-term memory, but is only able to process about 4-7 of them online, in working memory (Miller, 1956). In effect, the mind might make use of a relatively unlimited resource (long-term memory) to compensate for a relatively limited one (working memory) by storing a number of frequently needed/used multi-word units as wholes.
This chapter discusses the role of compound token frequency, head and modifier token frequency, and head and modifier compound family sizes (type frequencies) in the reading of English and Dutch compounds, using data from word naming, visual lexical decision, and eye-tracking studies. Using generalized additive regression modeling, it is shown that these frequency measures enter into many complex interactions. These interactions argue against current staged models, favoring morphological processing as part of a complex dynamic system.
This study explores the consequences of morphological connectivity for English compounds, combining tools from graph theory with measures of lexical processing costs as available in the English Lexicon Project (Balota et al., 2007). The directed compound graph reveals a significant trend to acyclicity just as the directed affix graphs of Hay and Plag (2004); Plag and Baayen (2009); Zirkel (2010), and similar correlations of rank and productivity. Rank in the directed graph, however, fails to correlate with measures of processing complexity. In order to understand the high degree of acyclicity, it is hypothesized that the activation of more distant neighbors in the lexical network is disadvantageous. A measure for more distant lexical neighbors, secondary family size, is proposed, and shown to have an inhibitory effect in visual lexical decision and word naming. Furthermore, an inhibitory effect of the shortest path from head to modifier is documented, and shown to depend on a specific time window within which activation reaching the modifier disrupts the process of compound interpretation.
This chapter reviews the role of corpora in phonological research, as well as the role of exemplars in phonological theory. We begin with illustrating the importance of corpora for phonological research as a source of data. We then present an overview of speech corpora, and discuss the kinds of problems that arise when corpus data have to be transcribed and analyzed. The enormous variability in the speech signal that emerges from speech corpora, in combination with current experimental evidence, calls for more sophisticated theories of phonology than those developed in the early days of generative grammar. The importance of exemplars for accurate phonological generalization is discussed in detail, as well as the characteristics of and challenges to several types of abstractionist, exemplar, and hybrid models.
Mixed-effect modeling is recommended for data with repeated measures, as often encountered in designed experiments as well as in corpus-based studies. The mixed-effect model provides a flexible instrument for studying data sets with both fixed-effect factors and random-effect factors, as well as numerical covariates, that allows conclusions to generalize to the populations sampled by the random-effect factors. Mixed-effect models can straightforwardly incorporate two or more random-effect factors. By providing shrinkage estimates for the effects associated with the units sampled with a given random-effect factor, the mixed model provides enhanced prediction accuracy. Mixed-effect models also make available enhanced instruments for modeling interactions of random-effect and fixed-effect predictors. As mixed-effects models do not depend on prior aggregation, they also offer the researcher the possibility to bring longitudinal effects into the statistical model.
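The shrinkage idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes a simple random-intercept setting in which the two variance components are already known; a real mixed model estimates them from the data. The function name `shrunken_means`, the variance values, and the reaction times are hypothetical and chosen only for illustration.

```python
from statistics import mean


def shrunken_means(groups, var_between, var_within):
    """Partial-pooling estimate of each unit's mean.

    groups: dict mapping a unit (e.g. a subject) to its list of observations.
    var_between, var_within: variance components, assumed known in this
    sketch (an assumption; a fitted mixed model would estimate them).
    """
    # A simple grand mean: the unweighted average of the unit means.
    grand = mean(mean(obs) for obs in groups.values())
    estimates = {}
    for unit, obs in groups.items():
        # Weight between 0 and 1: more observations for a unit means
        # more trust in that unit's own mean and less pull to the grand mean.
        weight = var_between / (var_between + var_within / len(obs))
        estimates[unit] = grand + weight * (mean(obs) - grand)
    return estimates


# Hypothetical reaction times (ms) for three subjects.
rts = {
    "s1": [520, 540, 530, 510],
    "s2": [700],
    "s3": [600, 620],
}
est = shrunken_means(rts, var_between=2500.0, var_within=10000.0)
for unit, value in est.items():
    print(unit, round(value, 1))
```

The subject with a single observation is pulled most strongly toward the grand mean, which is the sense in which shrinkage guards against overfitting to sparsely sampled units.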
Lectures given for the Bible translation working group of the Vrije Universiteit.
This volume brings together a series of studies of morphological processing in Germanic (English, German, Dutch), Romance (French, Italian), and Slavic (Polish, Serbian) languages. The question of how morphologically complex words are organized and processed in the mental lexicon is addressed from different theoretical perspectives (single and dual route models), for different modalities (auditory and visual comprehension, writing), and for language development. Experimental work is reported, as well as computational and statistical modeling. Thus, this volume provides a useful overview of the range of issues currently attracting research at the intersection of morphology and psycholinguistics.
Statistical analysis is a useful skill for linguists and psycholinguists, allowing them to understand the quantitative structure of their data. This textbook provides a straightforward introduction to the statistical analysis of language. Designed for linguists with a non-mathematical background, it clearly introduces the basic principles and methods of statistical analysis, using 'R', the leading computational statistics programme. The reader is guided step-by-step through a range of real data sets, allowing them to analyse acoustic data, construct grammatical trees for a variety of languages, quantify register variation in corpus linguistics, and measure experimental data using state-of-the-art models. The visualization of data plays a key role, both in the initial stages of data exploration and later on when the reader is encouraged to criticize various models. Containing over 40 exercises with model answers, this book will be welcomed by all linguists wishing to learn more about working with and presenting quantitative data.
This book is an introduction to the statistical analysis of word frequency distributions, intended for linguists, psycholinguists, researchers working in the field of quantitative stylistics, and anyone interested in quantitative aspects of lexical structure. Word frequency distributions are characterized by very large numbers of rare words. This property leads to strange statistical phenomena such as mean frequencies that systematically keep changing as the number of observations is increased, relative frequencies that even in large samples are not fully reliable estimators of population probabilities, and model parameters that emerge as functions of the text size. Special statistical techniques for the analysis of distributions with large numbers of rare events can be found in various technical journals. The aim of this book is to make these techniques more accessible for non-specialists. Chapter 1 introduces some basic concepts and notation. Chapter 2 describes non-parametric methods for the analysis of word frequency distributions. The next chapter describes in detail three parametric models, the lognormal model, the Yule-Simon Zipfian model, and the generalized inverse Gauss-Poisson model. Chapter 4 introduces the concept of mixture distributions. Chapter 5 explores the effect of non-randomness in word use on the accuracy of the non-parametric and parametric models, all of which are based on the assumption that words occur independently and randomly in texts. Chapter 6 presents examples of applications.
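The observation that mean frequencies keep changing with sample size can be made concrete with a small sketch. It is not taken from the book: it assumes a hypothetical population of 50,000 word types whose probabilities are proportional to 1/rank, and it assumes that tokens are drawn independently, which is the same assumption the abstract attributes to the models discussed. Under that assumption the expected number of distinct types in a sample of N tokens is the sum over types of 1 - (1 - p)^N.

```python
def expected_types(probabilities, n_tokens):
    """Expected number of distinct word types in a sample of n_tokens,
    assuming tokens are drawn independently with the given probabilities."""
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probabilities)


# Hypothetical Zipf-like population: probability proportional to 1 / rank.
n_population_types = 50_000
weights = [1.0 / rank for rank in range(1, n_population_types + 1)]
total = sum(weights)
probs = [w / total for w in weights]

sample_sizes = [1_000, 10_000, 100_000]
mean_freqs = []
for n in sample_sizes:
    v = expected_types(probs, n)
    mean_freqs.append(n / v)  # mean frequency: tokens per expected type
    print(f"N={n:>7}  E[V(N)]={v:10.1f}  N/E[V(N)]={n / v:6.2f}")
```

Because the expected vocabulary size grows more slowly than the sample, the tokens-per-type ratio rises with N instead of settling on a fixed value.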
Thesis, Vrije Universiteit Amsterdam.
Released under the GNU General Public License for Linux and Unix at the 1999 meeting of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing.
The analysis of experimental data with mixed-effects models requires decisions about the specification of the appropriate random-effects structure. Recently, Barr et al. (2013) recommended fitting 'maximal' models with all possible random effect components included. Estimation of maximal models, however, may not converge. We show that failure to converge typically is not due to a suboptimal estimation algorithm, but is a consequence of attempting to fit a model that is too complex to be properly supported by the data, irrespective of whether estimation is based on maximum likelihood or on Bayesian hierarchical modeling with uninformative or weakly informative priors. Importantly, even under convergence, overparameterization may lead to uninterpretable models. We provide diagnostic tools for detecting overparameterization and guiding model simplification. Finally, we clarify that the simulations on which Barr et al. base their recommendations are atypical for real data. A detailed example is provided of how subject-related attentional fluctuation across trials may further qualify statistical inferences about fixed effects, and of how such nonlinear effects can be accommodated within the mixed-effects modeling framework.
GAMM (Generalized Additive Mixed Modeling; Lin & Zhang, 1999) as implemented in the R package 'mgcv' (Wood, 2006, 2011) is a nonlinear regression analysis which is particularly useful for time course data such as EEG, pupil dilation, gaze data (eye tracking), and articulography recordings, but also for behavioral data such as reaction times and response data. As time course measures are prone to autocorrelation, GAMMs implement methods to reduce its impact. This package includes functions for the evaluation of GAMM models (e.g., model comparisons, determining regions of significance, inspection of autocorrelational structure in residuals) and for the interpretation of GAMMs (e.g., visualization of complex interactions, and contrasts).
CELEX is the Dutch Centre for Lexical Information. It was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. Over the years it has been funded mainly by the Netherlands Organisation for Scientific Research (NWO) and the Dutch Ministry of Science and Education. CELEX is now part of the Max Planck Institute for Psycholinguistics.
Intuitions about the spontaneous extensibility of morphological categories play a very important role in determining productivity. In light of the noteworthy results achieved through careful study of what is intuitively felt to be possible, a corpus-based, quantitative approach seems at first sight to have few new insights to offer. In this article I hope to show that a statistical analysis of the frequencies with which the types occur in the corpus can indeed be fruitful, and can even lead to new insights.
In the classical approach to the phenomenon of productivity, found in the work of Uhlenbeck and Schultink among others, productive word formation processes are characterized in part by the fact that the number of distinct formations they produce, the types, is in principle not countable. Productive morphological categories are marked by their extensibility in principle. Whereas an inventory of the number of types is feasible for unproductive categories, given the limited size of such categories, the extensibility in principle of a productive morphological category means that a count of the number of types cannot settle the question of productivity. As the counts in Table 1 of the numbers of types found in the Uit den Boogaart (1975) corpus show, it is impossible to determine on the basis of the number of types, henceforth denoted V, whether a process is productive or unproductive.
Van Santen, re-evaluating and extending Aronoff's (1976) approach to morphological productivity, develops a theoretical framework in which the notion ‘degree of productivity’ is strictly defined as the probability of actuation in the derivational domain(s) for which a word formation process is defined. Within this framework she explores in detail the often variable effects of semantic frameworks on word formation. The present paper critically examines Van Santen's analysis of morphological productivity, especially with respect to the role of the theoretical notions ‘competence’ and ‘performance’ on the one hand, and her interpretation of the concept ‘degree of productivity’ on the other. It is also suggested that in addition to semantic factors, pragmatic factors should be recognized as co-determining probabilities of actuation.
The prefix ont- attaches to adjectives, nouns, and verbs. For formations of the type ontzadelen, this raises the problem that the word class of the base cannot be established straightforwardly. These structurally ambiguous derivations may have either a noun or a verb as their base. De Vries (1975:172-173) opts for a denominal disambiguation, a choice he motivates by appeal to the productivity of this type of derivation. According to De Vries, the productive type ontzadelen (he himself cites formations such as ontnieten, ontkaften, and ontverven as examples of recent coinages) patterns, as far as productivity is concerned, with the unambiguously denominal productive derivations of the type ontvlezen, and differs in this respect from unproductive deverbal derivations such as ontbranden and ontvluchten. On these grounds De Vries concludes that the type ontzadelen is denominal.
What use are corpora to theoretical linguistic research? According to a professor emeritus of Harvard University, writing last year in a pithy piece on the Linguist List, none at all: since Chomsky, everyone should know that everything you could ever want to know about language is in our own heads. Anyone who thinks a little and consults their own intuitions has no need of a corpus.
Research on language and the human language faculty has in the past received enormous impetus from mathematical disciplines such as formal language theory and formal logic. Harald Baayen is convinced that it is essential for present-day linguistic research to give insights from probability theory, statistics, and machine learning a place at the heart of theory building as well. Only then will it be possible to do full justice to the subtle probabilistic systematicity that turns out to be so characteristic of the dynamics of words among one another.
This paper presents a detailed critique of some current gold standards for the statistical analysis of experimental data in psycholinguistics. A series of examples illustrates (1) the disadvantages of reducing numerical variables to factors and the importance of ...
How does the poet's brain deal with similarities in the form and meaning of words? Does working poetically with language leave measurable traces in memory, and does it affect the way words are understood and perceived? Three experiments, carried out in collaboration with the cultural venue LUX in Nijmegen, revealed remarkable differences between poets and non-poets. Reaction times showed that poets considered more associative lexical connections, but their eventual choices turned out to be just as rational as those of non-poets. Paradoxically, poets were less influenced by end rhyme in their judgments. They emerged as careful and thoughtful word experts. A number of dimensions could be identified along which poets differed individually from one another. A question for follow-up research in literary studies is whether these individual differences say something not only about how words are processed in the poets' brains, but also about their poetry.
At first sight, ’t kofschip seems an easy rule. But for an easy rule it demands a great deal of teaching time. The authors of this article investigated the problems and suggest a solution.
This review article falls into two parts. The first part offers a critical discussion of the book mentioned above. The second examines more closely Booij's proposal to let the feature [long] take over the phonological function of the old feature [tense].
The CELEX centre for lexical information in Nijmegen is one of the Dutch scientific centres of expertise. In this contribution I first give some background information about CELEX, followed by a brief overview of the wealth of lexical information that CELEX makes available. Finally, I discuss a number of problems, particularly in the area of morphology, that I have encountered as a regular user of CELEX.
This book is a collection of papers dealing with various aspects of morphological processing and representation. The range of languages and phenomena discussed is somewhat broader than suggested by the title. Although the main emphasis is on German inflectional morphology, Italian and Dutch are the subject languages in one study, while derivational morphology is also touched upon in various papers. The main focus of the book is on the role of morphology in language comprehension. The papers are preceded by an introduction by the editor in which the methodology of psychological experimental research is briefly outlined and where some of the main issues in ongoing research are summarized.
This book (henceforth LA) is a well-edited collection of papers presented at the First International Workshop on Lexical Acquisition, held in 1989. Its central theme is how to build one or more possibly domain-specific lexicons containing the right amount of information to allow natural language processing (NLP) systems to handle real language data. Presently available machine-readable dictionaries are not suited to this task. Their format is not always consistent, important information is often lacking, and obsolete words and word senses may lead NLP systems astray. Building better lexicons, however, is a formidable task, as the "missing data" on a word's lexical semantics, its allocations, its idiomatic uses, and its various senses and shades of meaning are simply unavailable. If the "lexical bottleneck" in NLP is to be eliminated, ways have to be found to obtain this missing information.
Review of R. Köhler and B.B. Rieger (Eds.), Contributions to Quantitative Linguistics. Proceedings of the First International Conference on Quantitative Linguistics, Qualico, Trier, 1991. Kluwer Academic Publishers, Dordrecht, 1993. US$ 158.00.
Glottometrika 14. Quantitative Linguistics, Vol. 53, G. Altmann, ed., Wissenschaftlicher Verlag Trier, 1993, 231 pp
In 1989, Skousen outlined a radically new theory of language in a book called Analogical Modeling of Language. According to this theory, rule-based accounts of language are intrinsically flawed, in that they are incapable of coming to grips with nondeterministic linguistic rules and the analogical forces shaping language behavior and language change. By introducing a rigorous mathematical definition of analogy, S was able to develop a unitary explanation both for data that seem governed by categorial linguistic rules and for nondeterministic data of the kind that have been analyzed by means of variable rules in sociolinguistics.
This is an impressively well-written textbook on statistical techniques for resolving syntactical and lexical ambiguities in natural language processing. Ch. 1 introduces context-free grammars and the chart-parsing algorithm, an algorithm implementing a parser for context-free grammars. Immediately following the formal definition of a context-free grammar, Charniak calls attention to the limitations of context-free grammars, notably with respect to agreement and long-distance dependencies. He points out that it is not clear whether the full syntax of English can be adequately accounted for by the context-free formalism. But as this formalism supports highly efficient parsing algorithms, and as it is widely used, C adopts it as a starting point for developing probabilistic grammars.
Variation in language is ubiquitous. Language use differs from dialect to dialect, from social group to social group, from individual speaker to individual speaker, and, for individual speakers, from situational context to situational context. It is the situational variation that is the subject of Biber's thorough cross-linguistic study.
Text and Corpus Analysis is a collection of case studies in English linguistics. The first part of the book introduces the basic concepts and methods of a tradition in British linguistics to which Stubbs attaches the names of Firth, Halliday, and Sinclair. The second part of the book shows how methods from corpus linguistics can be applied to texts in order to come to grips with their semantics.
