When Teepee Becomes Peepee

My nibling is producing real, actual words! …mostly. Like most toddlers, the quality of production leaves something to be desired. Teepee, for example, comes out more like “peepee,” and coffee sounds more like “cocky.”

Obviously, the parents are having a great time, and I’m dusting off my background in phonology to analyze every single word they say.

It’s a fairly common stage in language development: children go through phases where they consistently replace the consonants in a word (the “target form”) with one from elsewhere in the word. This practice is called consonant harmony, and it was very intriguing for linguists in the 90s because it rarely happens in adult languages, except when the consonants are right next to each other (e.g. pronouncing input as “imput”). On the other hand, vowel harmony — where all the vowels in a word become the same — is a pretty common process across the world’s languages, even when the target vowels are separated by one or more consonants.

This observation is a Big Deal for linguists who believe that child language never deviates from the possible permutations seen in adult language. One of the key ideas in generative grammar is that, even if children are making lots of mistakes as they acquire their language, the grammatical choices they are making actually correspond to completely valid possibilities in some other language. A kid who says “Want cookie” is not following the rules of English (which requires a subject), but is perfectly fine in Spanish.

So, if a child says “peepee” instead of teepee, and that kind of change doesn’t happen in adult languages, that’s pretty dang interesting.

What’s going well

They have successfully identified and produced lots of important phonetic features: they know that the /p/ and /t/ sounds are both produced without vibrating the vocal cords; they know that /p/ and /t/ should be followed by a small puff of air (= aspiration) when they appear at the beginning of a word; and they know that both of these sounds are the result of fully preventing air from escaping the lungs for a split second (= stop consonants), as opposed to only partially limiting airflow as you would in a sound like /f/.

In fact, the only difference between /p/ and /t/ is that one is produced on the lips, while the other is produced with the tongue behind the teeth.

For a long time, everyone more or less assumed that these supposed speech errors were random, the result of children being unable to perceive or produce the subtle distinctions between sounds of the same category (i.e. stops).

However, as I’ve mentioned previously on this blog, it’s not true that babies can’t hear the differences between sounds. They’re actually really good at it. More counterevidence comes from the “fis” phenomenon, which was a study in the 60s where researchers showed a toy fish to a child who consistently said “fis” instead of fish. When they asked him, “Is this a fis?” the child answered no, rejecting the adult’s mispronunciation. This is further evidence that the ability to perceive sounds precedes the ability to produce them.

This probably means that whatever goes awry happens somewhere in the production process: that is, the child has a mental representation of the word teepee in their head, but something gets jumbled up when they try to coordinate all the muscles needed to produce the word.

What’s interesting, though, is that there seem to be fairly consistent rules governing child consonant harmony. The fact that these rules exist is a testament to just how much children already know about the language(s) spoken around them.

Why doesn’t “teepee” end up as “teetee?”

Let’s take a quick dive into linguistic theory. Optimality Theory, proposed by Prince and Smolensky in the early 90s, is used in phonology to explain why certain sound alterations occur in a given context. The idea is that there are underlying “rules” that everyone has in their heads, ranked in order of importance. Higher-ranked rules cannot be broken, whereas lower-ranked rules can be broken if there is no other choice. The ultimate goal: produce a word that is as close to the representation in your head as possible.

Lots of these rules target not just one sound, but a group of sounds that share a feature (such as place of articulation). When we group the consonants in question according to where they are produced, we get the following categories.

Name | Place of articulation | Stop consonants
Labials | The lips | p, b, (m)
Coronals | The tip of the tongue | t, d, (n)
Velars | The velum (soft palate) | k, g

An example of a rule that contributes to consonant harmony could be something like “Preserve all labials in the target word.” When a child cannot stay true to the adult form of the word (for whatever reason), they might refer back to the hierarchy of rules to decide which consonants they should prioritize.
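
To make the idea a bit more concrete, here is a minimal, purely illustrative sketch of how strictly ranked constraints can pick a winning form. The constraint names, the candidate forms, and the ranking are all my own toy assumptions for the teepee example, not a real Optimality Theory analysis.

```python
# Toy sketch of ranked-constraint evaluation (illustrative only).
# The constraints, candidates, and ranking are invented for the "teepee" example.

def preserve_labials(target, candidate):
    """One violation for every labial in the target that is missing from the candidate."""
    labials = set("pbm")
    return sum(1 for seg in target if seg in labials and seg not in candidate)

def faithfulness(target, candidate):
    """One violation for every segment that differs from the target."""
    return sum(1 for t, c in zip(target, candidate) if t != c)

# Higher-ranked constraints come first; a strict ranking means the first
# constraint that distinguishes two candidates settles the matter.
RANKED_CONSTRAINTS = [preserve_labials, faithfulness]

def best_candidate(target, candidates):
    def violations(candidate):
        return [constraint(target, candidate) for constraint in RANKED_CONSTRAINTS]
    # Comparing the violation lists lexicographically mirrors strict ranking.
    return min(candidates, key=violations)

print(best_candidate("tipi", ["tipi", "titi", "pipi"]))  # "tipi" (fully faithful)
# If the faithful form isn't producible, drop it from the candidate set:
print(best_candidate("tipi", ["titi", "pipi"]))          # "pipi" (labial preserved)
```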

Once they decide which segments can be changed and which ones have to stay the same, they can transform the word in some way that makes it easier to produce. This is a topic that I looked at for my Master’s thesis, so I can tell you firsthand that lots of children prefer to simply omit the challenging sound(s) altogether. This is why spot might become “pot,” or green might become “geen.”

Another option, of course, is to repeat a sound that is already available to you, especially if it shares some phonetic features with the sound you want to (but can’t) produce. This is where consonant harmony may occur.

Since kids show different orders of preference for labials, coronals, and velars at this stage, the segment they tend to preserve can show us which consonant types they find most important (or maybe just easier to pronounce). This, in turn, can give us a hint about the ranking of rules in their head at any given stage. It’s time to analyze some data!

Looking back at “peepee,” it seems like labial consonants (/p, b, m/) outrank coronals (/t, d/), because the /p/ sound is what remains. Based on this information, we can make some predictions about how my nibling would pronounce other words at this stage:

  • Potty > “poppy”
  • Table > “babu” (/l/ is a whole other can of worms, so just go with me on this)
  • Tuba > “buba”
  • Boot > “boop” or “boo,” depending on whether deletion or assimilation is preferred at the end of a word

Pesky vowels

Now we have our first set of predictions, but they might need to be adjusted as we get more info. Where do velars (/k, g/) rank in comparison to labials and coronals?

Again, we can determine the order of priority based on how certain words are pronounced. Luckily, said nibling is in the middle of a vocabulary explosion and has given us two pieces of data to work with:

  • Doggy > “goggy”
  • Digger > “dadu”

Uhh… what gives?! In doggy, the /g/ stays, but in digger, the /d/ stays! There is no preference — it’s all just random after all!

This is the fun of analyzing linguistic data: just when you think you have a working hypothesis, something comes along to throw you off course.

You might be happy (or disappointed, if you were rooting against me) to know that there is a potential explanation for these two data points, and it has to do with the phonetic environment — that is, the other sounds that are occurring around the consonants in question. In our case, I suspect the first vowel is the culprit.

The /o/ in doggy is produced toward the back of the mouth, quite close to the velar ridge. On the other hand, the /i/ in digger (or the /æ/ in “dadu”) is articulated toward the front of the mouth. So, it seems plausible that in words featuring a front vowel like /i/, the coronal sound might be preserved, whereas in words containing a back vowel like /o/, the velar sound wins out. Let’s make some more predictions based on this hypothesis:

  • Kitty > “titty” (…it’s what the science dictates, okay?)
  • Taco > “kako” (according to American English vowels)
  • Gouda > “duda”

Since we only have, like, three words to work with at the moment, it’s hard to say for certain whether the vowels would also have an impact on consonant harmony in words containing a labial consonant. If the vowel has no impact, then we can assume that the labials would win out over velars. If the vowel does have an effect, then we can assume that front vowels would trigger labials, while back vowels would trigger velars.
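
Just for fun, here is a toy sketch of what these competing hypotheses might look like as an explicit prediction rule. The vowel classes, the place rankings, and the simplified spellings are all rough guesses of mine based on the handful of forms above, not an actual model of child phonology.

```python
# Toy consonant-harmony predictor for the hypotheses above (purely illustrative).
# Assumes rough CVCV-ish transcriptions and a two-way front/back vowel split.

PLACE = {"p": "labial", "b": "labial", "m": "labial",
         "t": "coronal", "d": "coronal", "n": "coronal",
         "k": "velar", "g": "velar"}

FRONT_VOWELS = set("ie")
BACK_VOWELS = set("aou")

# Guessed preference rankings: labials beat coronals overall, but the first
# vowel decides whether velars jump to the top or sink to the bottom.
PREFERENCE = {
    "front": ["labial", "coronal", "velar"],
    "back":  ["velar", "labial", "coronal"],
}

def harmonize(word):
    """Predict a child form with every stop assimilated to the preferred one."""
    stops = [seg for seg in word if seg in PLACE]
    vowels = [seg for seg in word if seg in FRONT_VOWELS | BACK_VOWELS]
    if len(stops) < 2 or not vowels:
        return word
    ranking = PREFERENCE["front" if vowels[0] in FRONT_VOWELS else "back"]
    winner = min(stops, key=lambda seg: ranking.index(PLACE[seg]))
    # Replace every other stop with the winner (voicing is ignored for simplicity).
    return "".join(winner if seg in PLACE else seg for seg in word)

for target in ["tipi", "dogi", "diga", "kiti", "tako"]:
    print(target, "->", harmonize(target))
# tipi -> pipi, dogi -> gogi, diga -> dida, kiti -> titi, tako -> kako
```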

A few more predictions:

  • Piggy > “pibby”
  • Copper > “kako”
  • Bugle > “bugu”
  • Baggy > “babby”

As an aside, it’s very hard to come up with words that a baby might know that also have a (roughly) CVCV structure. So please excuse the ridiculous examples above.

How is this useful?

Sometimes it’s just nice to be reminded that your child is actually supremely intelligent and already pretty adept at this “language” thing. They’ve laid a foundation in their native language(s) and know how it should sound; the implementation is the only problem.

You can also start to appreciate the steps that they take on their unique language journey. Instead of just waking up one day as a fully competent communicator, they make small strides every day that all eventually add up to adult competence. It’s cool to observe the process.

Or maybe you’re just like me and you love finding patterns in everything and solving puzzles. That’s also a perfectly good reason to learn anything. If you’re bored with the monotony of childcare one day, you can always whip out a notebook and start trying to figure out the patterns in their speech errors.

Until next time!

Why can’t we “hear” sounds from another language?

You’ve been learning a language for months, you’ve built up a pretty decent vocabulary, but there are just some words that cause you trouble almost any time you use them. UGH.

Unfortunately and ironically for my parents, one of these words was part of our address in China. A decade of struggles later, nearly half the time we got in a taxi, my dad would make his best attempt and the following conversation would ensue:

  • Dad: “Xiù yáng lù.”
  • Driver (confused): “Xiù yáng lù?”
  • Dad: “Duì. Xiù yáng lù.”
  • Driver: “There’s no such thing as xiù yáng lù.”
  • Mom: “Xiù yáng lù.”
  • Driver (perplexed): “Eh??????”
  • Me: “It’s xiù yán lù.”
  • Driver: “Oooooh, okay!”
  • Dad: “THAT’S EXACTLY WHAT I SAID.”

Various taxi drivers and I tried to clue them in over the years, but alas — the psychology of language was against them.

Why oh why can’t we hear sounds that native speakers can distinguish easily? It all comes down to our unique neuropsychology.

You see, humans are capable of something called categorical perception, which basically means that we can tell when two things belong to the same category, even if they aren’t exactly the same. Two leaves may not look, feel, or smell identical, but we can examine each one and conclude that they are both, indeed, leaves. Categorical perception allows us (and a selection of other animals including birds and chinchillas) to focus on the features that make them similar, and ignore the differences.

Categorical perception is important for perceiving (and interpreting) the sounds of oral language and the handshapes/locations of signed language. When you say “meow” and I say “meow,” for instance, we will not sound completely identical. One of us might have a higher or lower voice; you might produce the e vowel for just a millisecond longer; or I might raise my intonation at the end to make it a question: “meow?” Although we can hear all of those differences, we also know that they are not important for determining whether the same word (and thus, meaning) is intended. We, as adult English speakers, know that both of our productions of “meow” are equally valid instances of the word that means the sound a cat makes.

On the other hand, there’s no reason that someone who has very little experience with spoken language (i.e., a baby) would assume that every little pronunciation difference doesn’t constitute an entirely new word. If they hear me say “Meow!” and “Meow?” with different intonations, how do they know those aren’t two completely different words?

We are born with the ability to perceive every sound used by oral language, and to distinguish even between sounds that are acoustically very similar. A famous study from the 70s showed that 4-month-old infants exposed only to English can still perceive the difference between three Korean consonants which English-speaking adults usually cannot distinguish. However, by 8 months old, they lose that ability. This is a result of certain neurological and psychological processes that help us focus in on the special features of the language(s) that are used around us — the development of categorical perception.

Categorical perception allows us to draw boundaries along a continuum to establish discrete categories. Vowels are a good example of such a continuum in language. They are essentially produced by opening your mouth and vibrating your vocal cords, using the tensing of your tongue and lips to alter the sound that comes out. The difference between /i/ and /e/ depends only on which part of the tongue is tensed; this fact means that you can switch back and forth between the two sounds in a single breath, in one continuous sound. In fact, all the vowels in oral language can be produced in a single breath, simply by varying the position of your tongue and lips. The same cannot be said of consonants.

I’ll pause here to let you test all that out.

Babies gradually learn to perceive vowels and consonants categorically, meaning that they learn to tolerate variations and figure out where to establish a barrier between sounds like /i/ and /e/. Different languages might have slightly different barriers, even between similar-sounding vowels. Other languages might distinguish sounds that don’t exist in English at all, like a vowel between /i/ and /u/. As a baby, your brain learns to pay attention to acoustic signals (sound waves, frequencies, etc.) that are important for determining whether a sound is /e/ or /i/ — in other words, to take something continuous like an acoustic signal and assign it to discrete categories.
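
As a crude illustration of what “assigning a continuous signal to discrete categories” can mean, here is a toy nearest-prototype classifier. The formant-like numbers and the category inventories are invented for the example; real vowel perception involves several acoustic dimensions at once.

```python
# Toy model of categorical perception: a continuous acoustic value is snapped
# to the nearest learned category. The numbers below are invented placeholders,
# not real formant measurements.

ENGLISH_VOWELS = {"i": 300, "e": 450, "a": 750}        # pretend one-dimensional prototypes
MANDARIN_VOWELS = {"i": 300, "mid_a": 600, "a": 800}   # pretend inventory with an extra category

def categorize(value, inventory):
    """Assign a continuous value to the closest category in the inventory."""
    return min(inventory, key=lambda vowel: abs(inventory[vowel] - value))

ambiguous = 630  # somewhere between the English /e/ and /a/ prototypes

print(categorize(ambiguous, ENGLISH_VOWELS))   # 'a': forced into an English category
print(categorize(ambiguous, MANDARIN_VOWELS))  # 'mid_a': a category English simply lacks
```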

Returning to my parents’ Mandarin conundrum, the problem was as follows: Mandarin has a vowel whose acoustic features are a mixture of the English /a/ in “father” and /e/ as in “egg.” I can hear the difference clear as day, but unfortunately, my parents’ brains long ago decided that they would perceive vowels only according to the boundaries that are relevant for English. The acoustic information that distinguishes this vowel in Mandarin is misinterpreted as /a/ because their brains do not have a better category to assign it to.

As a non-linguistic example, imagine you have never seen a butterfly (or any other flying insect) before. You are tasked with sorting pictures of bugs and birds into their respective categories. It’s pretty straightforward until you encounter the butterfly, and then suddenly it’s not. On one hand, this creature has more than 2 legs, just like the other bugs you’ve seen thus far. On the other hand, it also has wings — that isn’t very bug-like, in your experience. You absolutely have to sort these cards, and you cannot create a new category. What do you do?

(It has just occurred to me that this is the exact problem faced by biologists when they first discovered the platypus, which is an egg-laying mammal, so there’s another example for your records.)

When you learn a new language as an adult, you don’t start from scratch. You approach it with years of experience using, interpreting, and categorizing the components of another language. Unfortunately, this can make for some pretty strong biases. Even if you consciously know that this is not your native language, your subconscious is playing a completely different game, using the tools it has honed for the past however-many years of your life. This can make it very difficult for you to perceive and produce sounds that do not exist in your native language.

Some studies have shown that explicit training on the phonetic differences between non-native sounds can help adult learners perceive and produce them, but the results tend to vary across language pairs, sounds, learner types, and individuals.

Perhaps one day we’ll be able to tap into the mechanisms that babies use to master their native language(s). Until then, though, try to comfort yourself with the fact that the same neurological processes that make you a poor Mandarin speaker also made you an excellent English speaker — or vice versa.

How could your virtual assistant “spell” in Chinese? (Part 1/2)

** For relevant background on Chinese orthography, read my last post. **

As you may or may not know, in my spare/procrastinating time I like to read about machine learning and computational linguistics. Doing so has made me more and more curious about how certain things in our daily lives work… like virtual assistants. Back when I first started taking a computational linguistics class during my Master’s program, I drafted a whole blog post about how much it taught me to appreciate the mechanisms behind virtual assistants. That appreciation does indeed continue to this day.

A few weeks ago I was reminiscing about a particular Mandarin teacher I had in middle school. Whenever we asked her how to write something, instead of writing it on the board, she would verbally dictate it to us using strokes, radicals, or other characters. It drove us crazy as teen language learners but of course, in hindsight, it really helped me understand the way that characters are constructed.

Anyway, that sparked an interesting idea: namely, what if I could ask the lady in my phone for help whenever I forget a character these days? That would be pretty neat! I started wondering how we could actually make that happen from a linguistic perspective. That, of course, is precisely what I wanted to talk to you about today.

As far as I’m aware, this isn’t an actual function that’s available on any of the major current virtual assistants (although I’m not as familiar with the current offerings in Mainland China). So of course, this is all just a thought exercise – the best kind of exercise, IMO.

What goes into speech recognition

Natural language understanding (NLU) is a huge and rapidly-expanding field that I certainly can’t condense into a single blog post. But just for some appreciation, let’s think for a moment about how computers have to go about understanding spoken human language.

At its core, oral language is just a bunch of acoustic signals organized in a certain way. There are patterns and rules that govern it: for a simple example, the two sounds /p/ and /f/ never appear together at the beginning of a word in English. At the sentence level, there are also certain combinations of words that just never appear together – like the happily. And in other cases, there are groups of words that appear together quite often, like the and woman, or language and acquisition (okay, that one might be a stretch, but it is certainly the case in my sphere of influence). Discovering many of these rules and patterns is a simple matter of statistics and probability: babies amass a huge amount of firsthand data throughout the first few years of life and use their predictive abilities to determine what is and is not a possible word/sentence in their language(s) before they can even speak for themselves.

Computers, as you may know, are even better at math and statistics than babies… or at least, they’re faster. That means that we can teach them to recognize speech using statistical learning, just like babies! All they need is a bunch of data (and maybe some rules).

When it comes to creating a virtual assistant, the basic task is to teach the computer to link a particular sequence of sounds to an action. The tricky thing is that it also has to tolerate natural variation in the speech signal (because every voice is different and can be impacted by environmental factors like the position of the body), syntax (because there are many ways to say the same thing), and word choice. At the same time, the model you’re using shouldn’t be so tolerant that anything can trigger a response. This is the basic task that computational linguists working on NLU need to accomplish.

Generally, an ideally tolerant NLU system can be accomplished with a whole lot of data and machine learning algorithms. Since I’m first and foremost a linguist, the details I’m interested in have more to do with the way the training data can be prepared, understood, and validated, rather than the actual algorithms being used.

Suffice it to say that you need a lot of data (also known as “training data,” which is what you feed into the model), in the form of annotated audio clips to help the algorithm learn how to segment speech, interpret variation, and translate all of that into actionable requests. For a more concrete example of what that means, let’s look back at my original question and consider some potential problems it could present for NLU and, by extension, a virtual assistant.

Setting up the Answer

Before your favorite virtual assistant can respond to a request for information (like how to write a character), it of course needs to be able to access the answer. In this case, that means you need to compile a database of individual characters, their components, and rules for breaking them down. If you need a refresher on the ways that Chinese characters are written, check out my last post.

Learning Radicals

As we know, there are a finite number of radicals (somewhere around 200) used in Chinese characters, many of which evolved from pictographs thousands of years ago. Some radicals have different appearances depending on their position in the character: for example, 日 “sun” may appear narrower when it makes up one side of a character, as in 明, or it can be short and stout when it appears on the top, as in 早. It might also be smaller or bigger depending on how much real estate it takes up compared to other parts of the character, since all characters should fit into a square box. If we want the computer to be able to recognize all the radicals, then the different possible forms of each radical should be included in the training dataset. That means you need several instances of each radical so that the algorithm has an opportunity to note its appearance in multiple characters.

Learning Strokes

There are also a finite number of strokes, each with their own name, along with rules for the order in which you combine them. This means that the virtual assistant has to know the names of strokes as well as the proper ways to list them when describing a given character. That isn’t necessarily straightforward because the rules that govern stroke order tend to depend primarily on visuospatial characteristics. In other words, it isn’t an absolute rule that all horizontal strokes have to precede vertical strokes: a vertical stroke might come first if it is to the left of the horizontal one, or some such thing. Certain rules must give way to others, depending on the particular character in question. That feels quite tricky to teach a computer.

Two different learning mechanisms come to mind, depending on a variety of factors. The first option is to teach the virtual assistant these rules, including a rough hierarchy of how they are applied (i.e. when one trumps the other), and then have it dissect a set of characters in a supervised learning paradigm. The independent variables would be the list of rules, along with a set of characters decomposed into their individual strokes (that’s a separate task for another program – let’s just assume it’s already been done). The dependent variable is the order of strokes for a given character. Humans would then need to verify the output to check whether the appropriate stroke order has been proposed for each character. This might be less time-consuming to implement, since only a handful of rules need to be fed into the model, but would require more post-hoc verification. It would also require another program that first dissects characters into an unordered list of strokes.

The second option would essentially require the model to deduce stroke order given a data set consisting of characters and their corresponding stroke orders. This would require more data preparation initially: someone would need to determine the appropriate number of characters to include in the training data (to avoid under- or overfitting the model) and ensure that these characters accurately represent the types of possible layouts in Chinese characters. That could require taking the time to do a visual analysis of the types of possible structures and their frequency. Then they would need to make sure that the set of characters shown to the model as training data is proportional.

In this case, the independent variable would be a complete character, and the dependent variable would be the way the model breaks it down into its stroke, radical, and/or smaller character components. Training data would explicitly provide the computer with examples of how characters can be segmented (e.g. 饿 = [饣, 我]).
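
As a rough idea of what those training pairs could look like in practice, here is a tiny sketch. The 饿 decomposition comes from the example above; the other entries and the placeholder lookup function are my own illustrative additions, not an actual training pipeline.

```python
# Sketch of supervised training pairs for the second option: whole characters
# paired with a human-provided decomposition. Only the 饿 entry comes from the
# text above; the rest is illustrative.

training_data = [
    {"character": "饿", "components": ["饣", "我"]},
    {"character": "明", "components": ["日", "月"]},
    {"character": "早", "components": ["日", "十"]},
]

def naive_lookup(character, data=training_data):
    """Stand-in for a trained model: just look the character up in the data."""
    for entry in data:
        if entry["character"] == character:
            return entry["components"]
    return None  # a real model would have to generalize instead of giving up

print(naive_lookup("饿"))  # ['饣', '我']
```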

Once you have a way to break down characters, you need another database that links the written forms with their pronunciations. To account for things like interspeaker variation, you might even have multiple speakers of different demographics (old/young, different genders, different geographic locations, etc) saying the same thing. Diverse data helps the model ignore the red herrings (like tone of voice) and focus on what’s really important. Again, assuming you have a virtual assistant that already understands Mandarin, the capacity to segment speech and identify words should already be present, and just needs to be tweaked a bit for our specific purposes.

Understanding the Words

Most words in Modern Mandarin are polysyllabic, combining the meaning of two or more single-syllable morphemes (e.g. 电 “electricity” + 脑 “brain” → 电脑 “computer”). There are also a ton of homophones: a single syllable can actually map onto different morphemes and, therefore, characters. For instance, shì can mean “to be”  是, “thing” 事, “person” 士, or “generation” 世. If this seems unnecessarily confusing, think about bear and bare in English and how easy it is for native speakers to figure out the intended meaning in an actual sentence. Lots of languages deal with homophony to some degree or another. Just like in English, in Chinese, we can clarify which particular meaning of a homophone is meant by using it in a word, sentence, or phrase in which it frequently appears, or where only one alternative makes sense. Going back to the English example, this would be like saying “How do you spell bear, as in ‘I cannot bear it’?”

When confronted with a particular phonetic (= sound) form that could map onto multiple characters, the virtual assistant can do the exact same thing. It can prompt the user according to the most likely target, or else ask the user for clarification.

If the assistant guesses the context (e.g. “Do you mean shì as in shì rén [‘scholar’]?”), then the user simply has to answer yes or no. But how would the assistant decide which one to guess first?

Once again, we can use statistics and probability! Looking back at the list of possible meanings of shì, you might have an intuition that some are more frequently used than others. The meaning “to be,” for instance, probably comes up a lot more often than “generation.” This information could help the virtual assistant determine which meaning to target first: it can ask about the more frequently occurring words first, and then move down the list of possibilities in order of likelihood.
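
A minimal sketch of that “most frequent first” ordering might look like the following. The frequency counts are invented placeholders, not real corpus numbers.

```python
# Toy version of "ask about the most frequent character first".
# The counts are made up; a real assistant would use corpus frequencies.

shi_candidates = {
    "是": 100000,  # "to be"
    "事": 40000,   # "thing"
    "士": 5000,    # "person/scholar"
    "世": 3000,    # "generation"
}

prompt_order = sorted(shi_candidates, key=shi_candidates.get, reverse=True)
print(prompt_order)  # ['是', '事', '士', '世']
```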

Astute readers may have already noticed that this presents another psycholinguistic issue: more frequently-used characters should also be more likely to be remembered; therefore, users are presumably less likely to need help writing them by hand. Is the solution then to start with the least frequent character? For some homophones, starting with one or the other might not make a huge difference, if there are only a couple possibilities. However, for a syllable like shì, there are probably 2 dozen possible characters, some of them very niche or archaic. If you instruct the virtual assistant to always start with the most (in)frequent possibility, it might take several minutes of dialogue to arrive at the real target – not a very user-friendly experience.

Perhaps more effective would be simply allowing the user to specify the target character right off the bat, or with prompting if necessary. Most Mandarin speakers would do this anyway when room for confusion exists (for what it’s worth, I would never just ask someone “How do you write shì?” with no other context).

If the virtual assistant either prompts the user to provide context or the user does so in the initial query, then this issue can already be solved by whatever mechanisms exist for handling homophones in other contexts. Like I said, Mandarin has a whole lot of homophones, but in most cases the intended meaning can be discerned easily from context and world knowledge. If the virtual assistant already supports Mandarin as a language, then mechanisms to handle homophones would already be in place. At a basic level, n-gram probabilities (i.e. the likelihood of a certain word appearing in a certain context), the same algorithm that is used for predictive text, could process the user’s request to determine which character is the target: namely, the one that most frequently appears in combination with the words around it.
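
Here is a minimal sketch of that kind of context-based disambiguation. The bigram counts are invented for illustration, and a production system would of course use much larger statistics (or a full language model) rather than a hard-coded table.

```python
# Minimal sketch of bigram-style homophone disambiguation: pick the character
# that most often appears next to the context the user supplied. All counts are
# invented placeholders.

bigram_counts = {
    ("世", "界"): 9000,   # 世界 "world"
    ("是", "界"): 5,
    ("事", "界"): 10,
    ("是", "的"): 20000,  # 是的 "yes"
    ("事", "情"): 8000,   # 事情 "matter"
}

def pick_character(candidates, context_char):
    """Return the candidate with the highest count next to the context character."""
    return max(candidates, key=lambda c: bigram_counts.get((c, context_char), 0))

# User: "How do you write the shì in shìjiè?" -> context character 界
print(pick_character(["是", "事", "士", "世"], "界"))  # 世
```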

These are some of the word-level considerations that would need to go into a functional character “speller” for virtual assistants. There is, however, yet another level that we have to consider when programming our robot: the syntactic and semantic level. Namely, how does the virtual assistant interpret different word orders, vocabulary choices, and syntactic dependencies?

Tune in next week to find out!

Why It’s Really, Really Hard to Pass as a Native Speaker

Ask any language learner to list the challenges of acquiring a foreign language, and “accent” will inevitably come up. Most would agree that, even at very advanced levels of proficiency, it’s difficult to recreate the sounds or, in the case of signed languages, handshapes and movements that are not present in your native language with the precision and effortlessness of a native user. The good news is that you’re not a fool; there are aspects of your implicit linguistic knowledge that cause you to struggle more with the perception and production of non-native sounds/shapes than the sounds that exist in your native language(s). (The bad news is, we have yet to find a “cure” for such difficulties, aside from the tried-and-true method of constant, unrelenting exposure.) The point of today’s post is to discuss some of the possible reasons that acquiring a foreign accent is so difficult, and to remind all my fellow language learners out there to not be so hard on themselves.

Crash course in phonetics

Phonemes are the smallest units that make up a word. In some languages, like Spanish, each phoneme corresponds to a letter (or group of letters), so that you can break a word down into its phonemes roughly based on how many letters it has (e.g. gato contains four phonemes: /g/ /a/ /t/ /o/). English is a bit more complicated, as we have far more phonemes than letters – about 44, give or take depending on your variety. This is why a letter like a has so many different pronunciations depending on the word it appears in – the phoneme it represents in apple is distinct from the phoneme represented in arm, which in turn is distinct from the one in alien. I could go on and on about English spelling but that’s not what I set out to talk about today, so let’s put that aside for now. The takeaway is that words can be broken down into phonemes; they don’t mean anything on their own, but when combined in a certain way, they create a word.

Different languages (and, actually, different varieties of the same language) have different phonemic inventories, which basically means the words in Language X might not be built using the same individual sounds as the words in Language Y. In more concrete terms, English does not have the same phonemic inventory as Spanish. While you may not be explicitly aware of these things, you probably know them because, if you’re reading this, you speak English. Much of our knowledge about the language(s) we speak is implicit, meaning you probably wouldn’t be able to list the number of English phonemes to someone if asked, but you know when something sounds “right” in your native tongue. You also know implicitly, though you might not be able to explain why, that certain phonemes like the trilled /r/ in perro don’t sound like English (in most varieties, anyway).

You are an expert in your native language(s)

This implicit knowledge about how your language is supposed to sound has been accumulating since before you were born (if you are hearing) or immediately after (if you are d/Deaf raised by signing adults). In other words, you have a lifetime of experience with the precise way that your native language(s) sounds, looks, and varies from user to user. You are able to detect relatively tiny changes in the way those around you produce words and, although you may not be able to explain exactly what those changes are, you might even be able to imitate different varieties of your native language with some degree of accuracy. You are able to do this, in short, because you have a general template for how words in your language are supposed to look/sound, and you can also tolerate deviations from that template. The prototype of a word might be a heavily enunciated version, with every phoneme clearly produced, but we can also understand when people slur their words together in day-to-day speech, or people with different accents, or even speech errors (like my friend the other day who said “kestup and muchard”).

All of these fantastic linguistic abilities are developed across years of practice and exposure; small children and babies aren’t nearly as adept at recognizing the same word produced by different speakers, or parsing the speech of someone with an unfamiliar accent. When I worked with kindergarteners acquiring Latin American Spanish, for example, it took them weeks to stop getting upset whenever I said cinco with a lisp (because I learned Spanish in Spain, where c’s are produced like the /th/ in thin). Gradually, of course, and with continued exposure to different types of language use in a huge variety of contexts, they get better and better at tolerating these kinds of alternations.

Learning a foreign accent

When you learn another language, the circumstances differ in a number of important regards. First, if you are learning in school, you may only have a handful of different speakers to listen to, many of whom are deliberately tailoring their speech to foreign language learners (i.e. they speak more clearly than the average person). They might also teach you a standardized version of the language, which might not be particularly helpful if you want to understand native speakers who were never formally taught this dialect, or if you travel to a region whose variety differs significantly from the standardized version. Then there is length of exposure: unless you are in a fully immersive context, you probably aren’t receiving the same amount of input (from a similarly diverse repertoire of speakers) as small children. All of these things might make it more difficult to tolerate deviations from the expected template of a word.

Furthermore, when you learn a language as an adult, you are starting from a fundamentally different point than infants: your brain is used to picking up the different acoustic or visual cues that are important for distinguishing one word from another, and discarding “unimportant” cues. We were all born with the capacity to detect lexical tone; however, by around 8 months old, if you were only exposed to a non-tonal language, your developing brain decided that tone was unimportant for distinguishing one word from another, and thus changed the way that incoming speech was processed, for efficiency’s sake. In other words, you’ve been trained to only focus on the relevant features for your native language.

When acquiring the sounds of a new language, one of the trickiest parts is learning to discard what is only important for your native language, and focus instead on features that distinguish one phoneme from another in your target language. This is why it might be hard for English speakers to hear the difference between Korean ㅋ k, ㄲ kk, and ㄱ g: the acoustic features that contrast those consonants aren’t relevant in English, so their brains decided long ago to not even look for them.

Most of the differences between similar phonemes in a given language are minuscule; it’s honestly impressive that we can hear them. English /p/ and /b/, for example, differ only in the amount of time (measured in milliseconds) it takes for the vocal cords to start vibrating, but we are aware this is a meaningful distinction. /s/ and /sh/ are produced by limiting the airflow between the tip of your tongue and the roof of your mouth; however, /sh/ places your tongue just a few millimeters further back in your mouth. If you are an ASL user, the L handshape and ‘one’ handshape differ only by the position of your thumb – objectively not a massive contrast, but effective nonetheless. Our brains are really, really good at noticing these tiny distinctions when we are babies, but that only comes after months of nonstop exposure, followed by years of practice.
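
To make the /p/ vs /b/ example a bit more concrete, here is a toy classifier that treats the contrast as a single boundary on voice onset time (VOT). The 25 ms cutoff is only a rough ballpark for English word-initial stops, used here purely for illustration.

```python
# Toy illustration of the /p/ vs /b/ contrast as a single cutoff on voice onset
# time (VOT), the lag before the vocal cords start vibrating. The 25 ms boundary
# is a rough ballpark for English, not a precise measurement.

VOT_BOUNDARY_MS = 25

def classify_bilabial_stop(vot_ms):
    """Categorize a word-initial bilabial stop by its VOT, English-style."""
    return "p" if vot_ms >= VOT_BOUNDARY_MS else "b"

for vot_ms in [5, 20, 30, 60]:
    print(vot_ms, "ms ->", classify_bilabial_stop(vot_ms))
# A listener whose boundaries were tuned by a different language would carve up
# the same continuum differently.
```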

My SO and I have spent a lot of time talking about the variety of German spoken where we live in Germany. After a couple weeks of living here and struggling to understand locals’ accents, I realized that people often produce the g or ch at the end of words like lustig and endlich as /sh/, rather than /x/, the velar fricative I was taught on Duolingo. Again, the difference is minor: /sh/ occurs behind your teeth, with the tip of the tongue, whereas /x/ is made with the back of the tongue in the same place in your mouth as /g/ and /k/. After that discovery, I started recognizing dozens more words than I previously did, all because of one little sound.

When I point this out to native speakers, they’re typically surprised, even if they regularly make the same sound change. It’s not that they didn’t know this at all; it’s that they didn’t know it explicitly. A lot of foreign language learning, especially when it comes to accents and phonology, involves knowledge that is implicit for native speakers, and that is part of what makes it so difficult to a) teach your native language and b) reach a native-like level of proficiency.

So, why is it difficult to pass as a native speaker, even at very high levels of proficiency? Much of it boils down to the way your brain processes speech sounds or signs. From infancy, your brain began to specialize in the sound or visual patterns of your native language, pruning away information that did not help you to quickly and efficiently identify words in that language. Now that you’re an adult learning a foreign language, you essentially have to rewire your brain to not ignore cues that, up until this point, it did ignore with roaring success.

As many have said before me, there’s unfortunately no quick and easy solution to acquiring a foreign accent. Personally, learning about phonology and phonetics has helped me, but that also isn’t particularly appealing to everyone. Like most aspects of language learning, the only real way to overcome this is to engage with as wide a variety of native speakers as possible, and to regularly practice in a relaxed, supportive setting. You may never be mistaken for a native, but that shouldn’t be the goal anyway: so long as you are understood, you should consider your learning a success.

Explaining Homophony in English

EDIT 24/03/20: It has occurred to me that some people may not know how to read IPA, because I didn’t explain how to read IPA. The vowels concerned in this post are as follows:

  • /i/ as in feet
  • /e/ as in cake
  • /ɛ/ as in egg
  • /æ/ as in apple
  • /ɑ/ as in father
  • /a/ as in awesome when Crush says it in Finding Nemo
  • /ə/ as in incredible when you say it fast
  • ː is a diacritic meaning that the preceding vowel lasts longer than it would otherwise

Here’s another fan request for you: explaining the homophony between most General American (GA) and Received Pronunciation (RP) productions of bear, bear, and bare. This one comes from my father, who asked me to do an experiment on this distinction, but there’s only so much you can experiment on during a lockdown, so instead I’m going to provide a general overview of the diachronic changes that led to the whole bear/bear/bare merger. Hopefully, this satiates his interest for the time being.

First, for anyone who doesn’t know (including speakers of a dialect that doesn’t have this homophony), there are three distinct semantic meanings and parts of speech for bear (an animal), bear (to carry), and bare (uncovered), but all three share the pronunciation /bɛːr/ (GA) or /bɛːə/ (RP). Occasionally this ambiguity is exploited in comic strips and one-liners (“I went outside with bare feet–” “Bear feet?!”) but for the most part it flies under the radar, because native English speakers are able to differentiate between the different uses based on syntactic, semantic, and contextual cues, as well as minor prosodic differences on the rare occasion that more than one interpretation may be appropriate (think of how you would say “bare feet” versus “bear feet” – in the latter case, emphasis falls on the first word).

I am not a historical linguist, but I do enjoy reading about historical linguistics and particularly the history of the English language, so I figured I would at least be able to synthesize and provide a general overview of the process that converged the pronunciation of these three words. Unsurprisingly, many cases of homophony can be traced back to a series of sound changes over the years that just happened to affect certain items in just the right way so as to make them phonetically indistinguishable. This is exactly what happened in the present case; bear, bear, and bare began as three distinct lexical items that gradually converged on a single pronunciation.

Proto-Indoeuropean

The association between bears and honey doesn’t start or end with Winnie the Pooh; this is a link that spans back perhaps as far as the first Proto-Indoeuropean (PIE) tribes, who lived somewhere in the area north of the Black Sea which is made up of present-day Romania, Ukraine, Kazakhstan, and Moldova c. 4000 BC. They, like the majority of prehistoric civilizations, were illiterate, so there is no record of their language or how it was used in day-to-day interactions. However, since PIE is the ancestor of nearly every language currently spoken in continental Europe, historical linguists who work on this period have reconstructed a good deal of its syntax, morphology, and phonology by comparing its descendants and applying their general linguistic knowledge to the findings, essentially working backwards to create a model of what PIE might have sounded like. These scholars have spent thousands of hours studying the earliest records of Germanic, Slavic, Baltic, Celtic, Indo-Iranian, and other languages that evolved from PIE, comparing individual words and phrases crosslinguistically and hypothesizing various explanations as to how they may have developed. Thus, while we have no primary sources detailing the structure of PIE, we can get a pretty solid idea based on the detective work of historical linguists.

Circling back to bears, the exact etymology of the word remains a mystery, although its relation to bees and honey appears rather well-established in the literature. A number of languages descended from PIE used roots for bee and/or honey in their term for bear, e.g. honey-lover (Old Irish), honey-eater (Common Slavic and possibly Germanic), honey-pig (Middle Welsh), bee-wolf (Old English i.e. Beowulf), and bee-bear (Lithuanian). This common thread has led linguists to conclude that the original PIE word for bear likely also made semantic reference to bees/honey, although the precise reconstruction varies slightly from scholar to scholar (one possibility is hr̥dko). As for the verb bear (i.e. to bear children), its origin has been reconstructed as PIE bher-, meaning approximately the same thing as its modern English equivalent. This is the same root as the word barrow (e.g. a wheelbarrow) as well as a number of other related words in modern IE languages.

Old and Middle English

Moving beyond PIE, the word bear (the animal) first appears concretely in Old English, which was spoken by the Anglo-Saxon settlers of what is now Great Britain from around 500 AD until about 1100 AD. It was pronounced /berɑ/, while the verb “to bear [a load/a child]” was /berɑn/. The word bare arrives on the scene at around the same time, also from a PIE root meaning “uncovered” or “naked.” However, at this point it was pronounced /bær/, with the short A sound as in apple, as opposed to the long A sound as in hair. So in Old English, the three words were not homophones.

In the transition from Old to Middle English (c. 1100–1500), several changes occurred that brought bear, bear, and bare closer together, but they were still not identical. First, the /æ/ and /ɑ/ vowels merged into a single sound, /a/, changing /bær/ to /bar/, and /berɑn/ to /beran/. I’m not sure whether /berɑ/ also underwent this change, because due to another sound change that targeted unstressed vowels during this period, the vowel at the end of /berɑ/ was reduced to a schwa and then eventually deleted altogether – so whether it was first changed to /a/ is moot. Another process known as open-syllable lengthening lowered the main /e/ vowel and then lengthened it to give us /bɛːr(ə)/. The same process happened with the /e/ in /beran/, giving /bɛːran/.

On top of all these sound changes, English also began to lose its rich inflectional system on verbs; if you speak another Indoeuropean language, particularly one of the Romance languages, you may be familiar with the concept of verbal inflections, which change the word depending on who is performing the action. PIE and many of its early descendants had an even richer system of inflection for both nouns and verbs. However, by the time we reach the end of the Middle English period, thanks to extended contact between Old English and Old Norse, English was rapidly losing its inflectional system for nouns and replacing it with a stricter word order to indicate each word’s role in the sentence. This was the beginning of the end of English inflections in general, which would extend to verbs as well by the time we get to Modern English.

Modern English

Now we are at the end of the Middle English period, and the three terms are still pronounced distinctly: /bɛːrɑn/ for to bear, /bɛːr/ for bear, and /baːr/ for bare. The next major event in the phonological development of English brings them still one step closer: the Great Vowel Shift (c. 1400–1700), one of the most well-known cases of vocalic chain shift, affected every long vowel in the language at the time and is the root of the modern bear/bear/bare merger. During the course of the GVS, the /ɛː/ vowel raised back to /eː/ for the noun and verb bear. Thus we get /beːr/ bear and /beːran/ to bear. The GVS also caused the /aː/ vowel to raise to /æː/ and ultimately /ɛː/ by the beginning of the 18th century, giving us first /bæːr/ and then /bɛːr/ for bare. Now the merger is nearly complete.

How do we account for /beːran/? As I mentioned, English gradually lost its inflectional system as it evolved through the years, including verbal inflections (except for the 3rd person singular -s ending, which we still use today). Early Modern English (c. 1600) maintained a handful of distinctions, such as the -st ending for 2nd person indicative (e.g. Dost thou love me?) and a few other inflections for person/number/tense. At some point, then, speakers must have converged on a single form of the verb to bear, which of course ultimately ended up as /bɛːr/. I would hazard a guess that one of two things happened: i) only the stem of the verb (without the -ɑn ending denoting it as an infinitive) was preserved and all other endings except -s were eliminated; ii) the first person singular conjugation /beːre/ prevailed and the -e was promptly deleted due to the same rule that deleted it from /bɛːre/ in Late Middle English. I’m not sure which, if either, is the proper explanation, but both seem reasonable to me – if I come across later information I’ll update the post.

In summary, many of the homophonic relationships seen in English can be attributed to centuries of sound change and phonological mergers, which in some instances result in a loss of distinction between certain words. Of course, 99% of the time in Modern English, you can tell which form of /bɛːr/ someone means based on context, so when this does happen, the consequences are null.

Word | Old English | Middle English | Late Middle English | Early Modern English | Modern (American) English
bear (v) | berɑn | bɛːrɑn | bɛːrɑn | beːr(e?) | bɛːr
bear (n) | berɑ | bɛːre (later /bɛːr(ə)/) | bɛːr | beːr | bɛːr
bare | bær | baːr | baːr | bæːr | bɛːr

The convergence of the /bɛr/s

Bonus Info

Some dialects may have raised the vowel in each of the instances of /bɛːr/ to an /eː/, so your results may vary slightly if you are a speaker of another English dialect.

On the other hand, in some varieties, an additional sound change occurred which actually disambiguates the pronunciation of bear/bear from bare. This is because the bear and bear vowel is actually not representative of the GVS’s effect on most words which started off with /ɛː/ in Middle English; most often, this vowel actually became /iː/ in Modern English, as is the case with words like meat and feet (pronounced like mate and fate prior to the GVS). However, in RP and GA dialects, this for some reason didn’t happen for certain words (often those ending in -r), including bear. Therefore, for dialects like Scots English, bear and bear remain homophones, but they are distinguished from bare by their vowel: the bears would be pronounced /biːr/, while bare is still /beːr/.

Like I said, I’m not a historical linguist, but I would be a terrible daughter if I didn’t at least attempt to address my father’s favorite linguistic ambiguity during this time of crisis. Hope you enjoyed this foray into the history of English.

Plosive Epenthesis in Some Pronunciations of ‘Downstairs’

epenthesis (noun)

The insertion of a letter or a sound in the body of a word (such as \ə\ in \ˈa-thə-lēt\ athlete)

-“epenthesis.” Merriam-Webster, 2020.

Today we’re going to be comparing five sample pronunciations of the word downstairs. Having all this quality time to spend with my roommates as of late, the other day I noticed that, when I say “downstairs,” it sounds like I’m adding the tiniest little /t/ in between the /n/ of down and the first /s/ of stairs. At least one of them, on the other hand, does not. I was intrigued, so I decided to examine the actual phonetic differences to see what’s going on there, and whether there is actually some epenthesis going on with me and possibly others when we say this word. Now you, dear readers, get to examine the results.

I recorded myself, my partner, and our three roommates (N = 5; 2 male) saying “I’m putting it downstairs,” then imported the recordings into Praat and spliced the last word out of each file to be analyzed independently. While I’m sure there’s plenty to unpack in other segments of downstairs, this is a blog post, not a thesis, so what I’m interested in is solely the transition between the /n/ and the /s/, and whether my feeling that I epenthesize a stop (aka plosive) has any basis in the acoustic facts.

Below you’ll find spectrograms (visual representations of human speech) for each of us, with the phonemes labeled very approximately against their corresponding portions of the spectrogram (middle) and waveform (top).

Now it’s time for a brief phonetics lesson: If you look at the general shape of the waveform (the top bit of the image), you’ll notice it changes in accordance with the particular phoneme being produced. Different consonants and vowels produce a number of distinguishing acoustic features, but two very commonly-cited ones are Formant values and amplitude. Formants are found in the spectrogram (middle bit), and are divided into four or five “bands” of frequencies (measured in Hz). The lowermost red dotted line in the spectrogram below indicates Formant 1 (F1), while the top row of dots is F5 (but we don’t really care about that one – only F1 through F4). Changes in formant values correspond to changes in the frequency of the sound being measured, which is one of the ways our brains distinguish different phonemes from one another. On the other hand, amplitude is measured in the waveform, and it basically measures how loud the sound is. More sonorous sounds like vowels tend to have a larger amplitude, while the least sonorous sounds (stops) have a very small or no amplitude.
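
I did all of this in Praat, but for anyone who’d rather poke at their own recordings in Python, here is a rough sketch of how you might plot the same kind of waveform-plus-spectrogram view with scipy and matplotlib. The filename and the time window for the /n/-to-/s/ transition are made up; you’d read the real values off your own annotations.

```python
# Rough Python alternative to the Praat workflow: plot the waveform (top) and
# spectrogram (bottom) of a short stretch of audio. The filename and the time
# window for the /n/-to-/s/ transition are placeholders.

import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("downstairs_me.wav")  # hypothetical recording
if samples.ndim > 1:
    samples = samples[:, 0]                        # keep one channel if stereo
samples = samples.astype(float)

# Zoom in on the (made-up) stretch containing the /n/-to-/s/ transition.
start, end = 0.35, 0.55
segment = samples[int(start * rate):int(end * rate)]

freqs, times, power = spectrogram(segment, fs=rate, nperseg=512, noverlap=384)

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, sharex=True)
ax_wave.plot(np.arange(len(segment)) / rate, segment)
ax_wave.set_ylabel("Amplitude")
# A log scale makes the formant bands and any stop burst easier to see.
ax_spec.pcolormesh(times, freqs, 10 * np.log10(power + 1e-12), shading="auto")
ax_spec.set_ylim(0, 5000)
ax_spec.set_xlabel("Time (s)")
ax_spec.set_ylabel("Frequency (Hz)")
plt.show()
```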

Looking at the part of the spectrogram and waveform where we transition from the /n/ to the /s/, I found that there is indeed some epenthesis happening in most of our productions – I’ll explain why in a moment. The interesting thing is that it’s almost like our alternations for this word exist on a continuum – L and J sit at opposite ends of the spectrum with regard to the strength of their inserted stops, while everyone else falls somewhere in between: J has basically no epenthesis while L’s is the most salient.

Moving on to what exactly gives me this impression: Stops, such as /p, t, k/, share the feature of completely cutting off the flow of the air from the vocal apparatus (thus the name “stop”). As I mentioned, this means that they have a small amplitude on the waveform. On the spectrogram, we should expect to see a portion of silence, where there is no air being emitted, followed by a “burst” of energy at the point where you transition from a stop to the next phoneme in the word. If you look at the middle section of the image below (the part with the red dots), this characteristic translates to a wide band of white (meaning minimal output to the target Formants) followed by a very thin band of black, which indicates the release of the stop. As you can see, there is a very obvious band of black right at the onset of the /s/ for both me (second image) and my roommate L (first):

…however, this is not to say that the other three don’t show any of the features of stops. In fact, as you can see, we all produce some burst of energy in transitioning between the two consonants; the difference is that there is no complete occlusion in our case (meaning the air has not been entirely cut off), while there is in L’s. The variation is most obvious when you compare the two extremes of the spectrum; J’s spectrogram (left below) transitions gradually from the /n/ to the /s/, while I (right below) have an intermediate stage after the /n/ where the spectrogram turns mostly white and is then interrupted by a thin black band before the onset of the /s/.

Everybody else falls somewhere along this continuum. I produce almost as clear of a pre-/s/ burst as L does, but mine maintains some output at F2, whereas a full stop would have none. R and W both have faint black bands prior to /s/, but are still outputting sound at F1, F2 and F3 (meaning there is sound occurring at those particular frequencies – not a complete closure). In sum, the acoustic data supports my hypothesis that I – as well as others – epenthesize a stop between the two syllables of downstairs.

What is perhaps more interesting, though, is why we would do such a thing. There is no phonological reason, so it must be phonetic. I looked at a handful of abstracts on glottal stop epenthesis cross-linguistically and found that it seems to be a strategy for avoiding two highly sonorous segments back-to-back. Sonority is basically “vowel-ness” and both nasals (like /n/) and fricatives (like /s/) are among the more sonorous consonants. There is a theory called the Sonority Sequencing Generalization (SSG) which says that syllables like to be more sonorous in the middle (i.e. the vowel) and least sonorous on the edges (beginning and end phonemes).
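
As a toy way of writing that idea down: treat each consonant as having a sonority score, and insert a homorganic stop whenever a coda and the following onset are both on the sonorous side. The scores, the threshold, and the stop choices below are simplifications of mine, not an actual implementation of the SSG; the same basic logic is what turns hamster into “hampster” for a lot of speakers.

```python
# Toy sketch of sonority-driven stop epenthesis. Sonority values, the threshold,
# and the homorganic-stop choices are all simplifications for illustration.

SONORITY = {"stop": 1, "fricative": 3, "nasal": 4, "liquid": 5, "glide": 6, "vowel": 7}

MANNER = {"n": "nasal", "m": "nasal", "s": "fricative", "z": "fricative",
          "t": "stop", "d": "stop", "p": "stop", "k": "stop",
          "l": "liquid", "r": "liquid"}

HOMORGANIC_STOP = {"n": "t", "m": "p"}  # a stop made at the same place as the nasal

def maybe_epenthesize(coda, onset, threshold=3):
    """Insert a stop when both flanking consonants sit high on the sonority scale."""
    if SONORITY[MANNER[coda]] >= threshold and SONORITY[MANNER[onset]] >= threshold:
        return coda + HOMORGANIC_STOP.get(coda, "") + onset
    return coda + onset

print(maybe_epenthesize("n", "s"))  # "nts" -- the down[t]stairs effect
print(maybe_epenthesize("t", "s"))  # "ts"  -- nothing to fix after a stop
```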

Honestly, though, I have no idea what’s going on aside from possible avoidance of excess sonority. I also considered that it has to do with the positions and shapes the tongue has to assume to form each sound: /n/ is produced with the tip of the tongue fully touching the roof of your mouth, while /s/ is produced with the body slightly concave to allow air to escape from the center. I suspect, though, that it has something to do with the transition from a nasal consonant to an oral consonant. Nasals are actually also a type of stop, but the air is routed through the nasal passage instead of the mouth, so they don’t show the total cutoff of airflow seen in orally-articulated stops like /p t k/. It also happens that the /n/ sound is articulated using the same tongue position and shape as the /t/ and /d/ sounds, just behind the teeth at the alveolar ridge. If the air is re-routed from the nasal passage to the oral cavity before the tongue has moved into the necessary position for /s/, the result would sound something like a /t/ or /d/. Basically, the epenthesis I feel in myself could be the result of not having synchronized the transition for all the necessary components of my vocal apparatus.

Thanks for tuning in to this week’s episode of Unnecessary Linguistic Explorations. I’m sure there will be plenty more in the weeks to come. Stay home, stay safe, and stay curious. Peace out.

J’s (smooth and gradual) n-to-s transition
R’s n-to-s transition, with a minor burst of energy at the onset of /s/
W’s n-to-s transition, with a clearer burst of energy but continued output at F1-F3
My n-to-s transition, which still has output to F2 prior to the /s/ onset
L’s n-to-s transition, which has minimal output at F1-F4 immediately preceding the black band

Ventriloquism for dummies

It’s that time of week again: I’ve spent the last five days only conducting Serious Research™ and now it’s time for me to put those skills to another use. Last night, just as I was drifting off, my best friend sent me a very simple question:

…so naturally I didn’t go to sleep until after midnight. Here I present to you the fruits of my research into the science of ventriloquism.

In case you’ve never had the privilege of attending a performance, ventriloquism refers to the act of speaking without moving one’s mouth (or moving it very minimally) to give the appearance that the sound originates from a different source, often a puppet (a ‘dummy’) or some other inanimate object. The word originates from the Latin venter ‘belly’ and loqui ‘to speak’, although it has departed rather significantly from its original use, which described the so-called ‘internal speech’ of those possessed by demons and/or the spirits of the dead – spirits which, according to the ancient Greeks, lived in the stomach. Its modern meaning emerged sometime after the 17th century.

The key component of ventriloquism – producing speech without any visible indication – has, however, been used for centuries for various purposes: by prophets of various religions, as well as by individuals who fancied themselves sorcerers, exorcists, con artists, and the like. In other words, ventriloquism is nothing new. Scientific explanation for the phenomenon, on the other hand, is much more recent.

Last night’s pre-bed research led me to this lovely thread from the Linguist List circa 2002, in which linguists from around the world discuss the phonetic aspects of ventriloquism. Unsurprisingly, one of the most oft-addressed questions is how these performers are able to produce labial sounds – those produced with the lips, such as /b/, /p/, and /m/ – without arousing suspicion. The standard recommendation in the ventriloquism textbooks is to replace them with another sound with the same manner of articulation and voicing, i.e. /d/ for /b/, /t/ for /p/, and /n/ for /m/. This technique is actually not unique to ventriloquism; foreign language learners often substitute phonemes from their native language into the target language using similar criteria, choosing a sound that shares as many features with the target sound as possible. This is why, for example, some German learners of English substitute /z/ for /th/ in words like ‘the’: the voiced interdental fricative /th/ doesn’t exist in their language, but they do have another voiced fricative that is produced in a similar part of the mouth. Thus, it is not surprising that ventriloquists also adopt such a technique to avoid producing labial consonants while still letting the audience perceive them as such. How, then, do they trick the audience into hearing /p/ instead of /t/?
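Just to make the substitution strategy concrete, here’s a tiny sketch of the ‘same manner and voicing, minus the lips’ mapping described above – obviously real ventriloquists are doing something far subtler than a lookup table.

```python
# A sketch of the 'keep manner and voicing, drop the lips' substitution the
# textbooks recommend, using only the mapping described above.
LABIAL_SUBSTITUTES = {
    "p": "t",   # voiceless stop -> voiceless (non-labial) stop
    "b": "d",   # voiced stop    -> voiced stop
    "m": "n",   # nasal          -> nasal
}

def ventriloquize(word):
    """Replace labial stops/nasals with their closest non-labial counterparts."""
    return "".join(LABIAL_SUBSTITUTES.get(ch, ch) for ch in word)

print(ventriloquize("pretty baby"))   # -> 'tretty dady'
```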

This is an opportune time to introduce something called the McGurk effect, which is very clearly illustrated in videos like this one. If you don’t feel like watching, the TL;DR is as follows: our brains rely on input from a combination of mediums in order to interpret speech sounds, including visual information. Researchers have done experiments where they had participants identify the consonant at the beginning of a syllable spoken by a person in a video saying, for example, “ba ba ba…”. At first, the participants identified the initial sound as /b/. However, when the researchers showed them a new video with the same audio overlaid onto the mouth movements associated with a labiodental sound, i.e. /f/ or /v/, their answers changed – now they reported, quite confidently, that they were hearing /f/. The audio had not changed, only the video.

Another speech perception phenomenon, the Ganong Effect, shows that we are also biased towards interpreting auditory input as a real word, given sufficient evidence. This is a little more complicated to explain, but here goes nothing: in English, one of the acoustic cues by which we differentiate voiced and voiceless stops is something called voice onset time (VOT), which is essentially the amount of time it takes for the vocal cords to begin vibrating after the stop has been released (i.e. your lips open). Voiced stops have shorter VOT than voiceless stops. There is a very particular threshold of VOT at which English speakers stop perceiving a /b/ and start perceiving a /p/, namely 25ms. If the VOT is shorter than 25ms, they will perceive a /b/ sound. If it is longer than 25ms, they will perceive a /p/. However, if that phoneme is put into the context of a word – for example, ‘beef’ – where the alternative is not a word in English (so, ‘peef’), they essentially increase their tolerance for variation in the VOT in favor of whichever interpretation gives a real word. So even if ‘beef’ has a VOT of, say, 27ms, they still perceive the first sound as a /b/ because that is what results in an actual word.
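To see how that lexical bias works mechanically, here’s a toy model: a hard 25ms boundary between /b/ and /p/ that gets nudged by a few milliseconds whenever only one of the two readings is a real word. The size of the shift and the one-word ‘lexicon’ are made up purely for illustration.

```python
# A toy model of the Ganong effect as described above. Numbers are illustrative.
LEXICON = {"beef"}                   # toy one-word lexicon
BOUNDARY_MS = 25                     # the /b/-/p/ VOT boundary mentioned above
LEXICAL_SHIFT_MS = 5                 # made-up size of the lexical nudge

def perceive(vot_ms, b_reading, p_reading):
    boundary = BOUNDARY_MS
    if b_reading in LEXICON and p_reading not in LEXICON:
        boundary += LEXICAL_SHIFT_MS     # tolerate longer VOTs and still hear /b/
    elif p_reading in LEXICON and b_reading not in LEXICON:
        boundary -= LEXICAL_SHIFT_MS     # tolerate shorter VOTs and still hear /p/
    return "b" if vot_ms < boundary else "p"

print(perceive(27, "beef", "peef"))      # -> 'b': 27 ms would normally read as /p/,
                                         #    but only 'beef' is a real word
```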

Thanks to the Ganong and McGurk effects, we know that humans are considerably biased towards perceiving actual words in our language, and that our sensory systems can be easily tricked. Both of these facts lend themselves quite readily to a ventriloquist’s task, allowing them to use the ‘T for P, D for B’ trick because their audience is already biased towards hearing real words like ‘pretty’ as opposed to non-words like ‘tretty’, and they are not receiving any contradictory visual cues from either the dummy or the ventriloquist’s mouth.

Other phonetic ‘tripping hazards’ in ventriloquism may include the production of rounded vowels and interdentals. Rounded vowels, such as /o/ and /u/, require you to round your lips. Luckily for dummies in the Anglosphere, English doesn’t have the unrounded counterparts of /o/ and /u/, so producing these vowels without rounding would in all likelihood be enough to mask the difference, so long as the other feature specifications (backness and openness) were met. Interdentals and labiodentals – the /th/, /f/, and /v/ sounds – can be dealt with according to the same rules as stops: simply replace them with the nearest phoneme that has the same manner of articulation and voicing, in this case /s/ and /z/. Depending on your anatomy, though, /f/ and /v/ may not pose much of a problem for ventriloquizing, since they can be more or less effectively produced without much visible movement of the lips.

Most linguistic research that I could find on ventriloquism has, unsurprisingly, dealt with the phonemic substitutions mentioned above, or with the perceptual effects that make it possible for a skilled performer to appear as though they are not moving their lips. I would also venture to attribute at least some of the effectiveness of ventriloquism to the audience’s suspension of disbelief – unlike in times gone by, nobody nowadays goes to a ventriloquist’s show expecting to see actual witchcraft, yet they enjoy themselves nonetheless. The desire to be entertained and amazed can far outweigh even the most basic scientific principles.

A Phonology of Simlish

This may in fact be the most pointless research I’ve ever conducted, but sometimes you just have to carpe diem and do pointless things instead of working on your real research.

If you have never and will never play The Sims (or you have no interest whatsoever in phonology), then this is not the post for you. That said, almost everybody I know has at least dabbled in the game once in their lives, and has commented in passing on the features of the Sims’ language. A very brief overview, just in case you forgot: in The Sims, you are essentially God (or your omnipotent deity of choice), creating and controlling the lives of little virtual people who live in an idealized version of suburban America. Some people play exclusively to build houses, others to live out their wildest dreams, and still others to release their inner psychopath. Regardless of the reason you play, it is more than likely that in your time as a Simmer, you’ve encountered the fictional language used in the game, known as Simlish.

Although it doesn’t actually have a complete syntax or lexicon, Simlish is still an impressive feat and an ode to the game’s creator’s attention to detail. According to a recently-published article on the evolution of Simlish, it was originally devised by the voice actors hired to provide the male and female vocalizations for the original edition of The Sims, released in 2000. The creator, Will Wright, had originally envisioned a ‘language’ that was a hodgepodge of features from languages such as Navajo, Ukrainian, and Latin; eventually, this idea was scrapped in favor of the actors’ ad-libbing using gibberish syllables that (mostly) conformed to English phonetics and phonotactics.

Simlish is not intended to be a ‘learnable’ language; its purpose is to make the game accessible for players around the world, regardless of their mother tongue. It allows players to superimpose their own details onto their Sims’ conversations and outbursts, resulting in a much more customizable and imaginative storyline than if the characters spoke a real language. However, Simlish does appear to follow a number of rules as far as its pronunciation is concerned, and that is what I’ve spent the better part of my evening looking at in detail.

The data for this post is taken from the following sources:

Phonemic Inventories

|             | Bilabial | Labiodental | Alveolar | Palatal | Velar | Glottal |
|-------------|----------|-------------|----------|---------|-------|---------|
| Plosive     | p b      |             | t d      |         | k g   |         |
| Nasal       | m        |             | n        |         | ŋ     |         |
| Fricative   |          | f v         | s z      | ʃ       |       | h       |
| Lateral     |          |             | l        |         |       |         |
| Approximant |          |             | r        |         |       |         |
| Glide       | w        |             |          | j       |       |         |
| Affricate   |          |             | t͡ɕ       | tʃ dʒ   |       |         |

The consonant inventory of Simlish. In keeping with IPA tradition, voiceless phonemes are on the left and voiced on the right within each cell. All nasals are voiced.
|           | Front | Central | Back |
|-----------|-------|---------|------|
| Close     | i ɪ   |         | u    |
| Close-Mid | e     | ə       | o    |
| Open-Mid  | æ     | ʌ       | ɔ    |
| Open      |       |         | ɑ    |

The vowel inventory of Simlish. Vowels to the left are unrounded.

If you’re at all familiar with the phonemes of English, you’ll probably notice several similarities. Unsurprisingly, considering both of the original voice actors were American, the phonology of Simlish bears a striking resemblance to Mainstream American English. It is worth noting, however, that the Simlish in the original Sims game also sounds significantly different from later iterations – for example, in the first minute of this video, you can hear a flap/trill, a /ʒ/, a /ɣ/, and a /t/ that more closely resembles the Spanish /t/ than its English counterpart. Sometimes I suspect a bit of retroflexion on some of the plosives, but it’s hard to say for sure. Regardless, this is an artificial example of the influence that language contact can have on sound shifts over time: in the past 20 years, Simlish phonology has taken on more English-like features as a result of extended contact with the English-speaking world, especially since most of its ‘users’ (i.e. actors and writers) are native English speakers.

Next, we’ll break down some of the phonotactics (= rules governing how individual phonemes are combined) of Simlish; a toy syllable-checker sketch follows the list below.

Phonotactics

  • The most complex syllable is (C)(C)V(C)(C)
  • Permissible simple onsets: /p/, /b/, /t/, /d/, /k/, /g/, /m/, /n/, /f/, /v/, /j/, /w/, /r/, /l/, /s/, /z/, /ʃ/, /t͡ɕ/, /dʒ/, /tʃ/
  • Permissible simple codas: /l/, /m/, /r/, /g/, /f/, /b/, /t/, /ʃ/, /ŋ/, /k/, /s/
  • Permissible complex onsets: /bl/, /pl/, /sp/, /kl/, /gl/, /mj/, /sk/, /fl/, /fw/, /tw/, /bw/, /mw/
  • Permissible complex codas: /mg/, /nd/, /ps/, /bz/, /kt/, /lt/
  • Allows syllabic /r/, e.g. [gr̩b]
  • Primary stress tends to fall on the last syllable
  • Some of the canonical vowels are diphthongized similarly to MAE: /e/ becomes [eɪ] and /o/ becomes [oʊ] in an open syllable
  • Other diphthongs include /aɪ/ (as in hi) and /au/ (how)
  • /v/ sometimes seems to be in free variation with /b/, e.g. /’bʌdiʃ/ thank you can also be pronounced /’vʌdiʃ/
  • Word-initial voiceless plosives (/p/, /t/, and /k/) are aspirated
  • Word-final nasal + consonant clusters may delete the second consonant
  • Light and dark /l/ are in complementary distribution with one another, like in MAE
  • Intervocalic /t/ and /d/ may become flaps, like in MAE
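And since I promised a toy syllable checker: here’s a quick-and-dirty sketch of the constraints above. Syllables are passed in already split into onset, nucleus, and coda, and none of this is official Simlish – just my own reading of the data.

```python
# A quick-and-dirty checker for the phonotactic constraints listed above.
# Each syllable is given as (onset, nucleus, coda), with onset and coda as
# tuples of phoneme symbols.
SIMPLE_ONSETS = {"p","b","t","d","k","g","m","n","f","v","j","w","r","l","s","z","ʃ","t͡ɕ","dʒ","tʃ"}
COMPLEX_ONSETS = {("b","l"),("p","l"),("s","p"),("k","l"),("g","l"),("m","j"),
                  ("s","k"),("f","l"),("f","w"),("t","w"),("b","w"),("m","w")}
SIMPLE_CODAS = {"l","m","r","g","f","b","t","ʃ","ŋ","k","s"}
COMPLEX_CODAS = {("m","g"),("n","d"),("p","s"),("b","z"),("k","t"),("l","t")}
NUCLEI = {"i","ɪ","u","e","ə","o","æ","ʌ","ɔ","ɑ","eɪ","oʊ","aɪ","au"}

def is_simlish_syllable(onset, nucleus, coda):
    onset_ok = (onset == () or
                (len(onset) == 1 and onset[0] in SIMPLE_ONSETS) or
                onset in COMPLEX_ONSETS)
    coda_ok = (coda == () or
               (len(coda) == 1 and coda[0] in SIMPLE_CODAS) or
               coda in COMPLEX_CODAS)
    return onset_ok and nucleus in NUCLEI and coda_ok

print(is_simlish_syllable(("b",), "ʌ", ()))            # 'bʌ' as in /'bʌdiʃ/ -> True
print(is_simlish_syllable(("s","t"), "ɑ", ("r","z")))  # English 'stars' -> False
```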

Comparison to English

Although the influence of American English on Simlish phonology is extremely evident, there are some minor differences between the two. First, Simlish lacks dental phonemes, particularly /θ/ and /ð/ (the sounds at the beginning of thin and the, respectively), which are prevalent in many dialects of English. Simlish is also more conservative when it comes to syllable structure: English permits up to three sequential consonants in the onset of a word (e.g. strong) and four or five in the coda (e.g. sixths /siksθs/ or angsts /æŋksts/, depending on dialect); Simlish, on the other hand, allows a maximum of two consonants in each position. There are other, more minor differences regarding the distribution of phonemes in words and syllables. Simlish allows many, but not all, of the same phonemes in a simple coda as English: the /d/, /l/, and /z/ phonemes are unattested in the data I looked at, although /l/ and /t/ appear together in a complex coda.

Many English speakers have compared Simlish to ‘baby talk’ and, after examining the phonology of the language, the reason becomes clear: Simlish allows plenty of consonant + /w/ clusters that are typically associated with children’s early attempts to produce words containing a consonant + /l/ or /r/ (e.g. /bwu/ for blue) and that don’t exist in adult English. It also seems to contain a greater number of words that begin with glides (don’t quote me on that – I haven’t actually run a formal analysis), which may also lead English-speaking listeners to perceive it as infantile.

Conclusion

There is an obvious influence of English phonology and syllable structure on the phonology of Simlish. The Simlish phonemic inventory consists almost exclusively of English vowels and consonants, and many of its rules governing stress assignment and allophonic variation are borrowed directly from English. However, there are some non-trivial differences between the languages: Simlish syllables are maximally CCVCC, whereas English allows up to three consonants in the onset and four/five consonants in the coda of a single syllable. The types of consonant clusters also differ, with Simlish allowing several more C + glide combinations as well as the /mg/ cluster in final position. These rather salient features of Simlish phonology may be to blame for English speakers’ assessment of the language as ‘baby talk.’

Of course, the above assessment has been made on the basis of limited acoustic data and without the input of native Simlish speakers (mostly because they don’t exist). And again, seeing as the language lacks a concrete syntax or morphology, it is rather difficult to postulate underlying forms for the surface forms of words and phrases presented in The Sims – rather, I have operated on the assumption that Simlish is a fully faithful language when it comes to mapping underlying forms onto surface forms. Future researchers of the language would do well to analyze sample utterances using acoustic analysis software such as Praat, or otherwise contact potential informants for deeper insights into its structure.

And with that, I’m off to finish some reading for my dissertation. Dag dag!

From Mouth to Pen

Orthography (n.) – “1. a) the art of writing words with the proper letters according to standard usage; […] (b) the representation of the sounds of a language by written or printed symbols 2. a part of language study that deals with letters and spelling”

In the last couple weeks, I’ve gotten really into a podcast called History of English which, well, is exactly what it sounds like: a history of the English language. I’m fifteen episodes in and we haven’t even reached A.D./C.E. yet, so to say it goes in depth might be an understatement — but it’s super informative and absolutely fascinating all the same. This, combined with having revived my middle school efforts at learning Korean, has naturally prompted me to think some more about the different ways that cultures represent speech orthographically, and how much of a marvel the invention of writing truly is. So today, I’m going to share some semi-organized thoughts on writing systems and their relation to the language they represent.

First, let me emphasize the fact that linguistics in general does not normally deal with written language – writing is, after all, a social construct used to represent spoken language, and actually hasn’t existed all that long in the grand scheme of things. While there are certainly linguists who study orthography both independently and in relation to spoken language, when linguists speak of “language,” the intended referent is typically spoken language rather than written. I just happen to be casually interested in orthography – possibly due to my upbringing, possibly because it’s just objectively a cool thing to study.

Most English speakers think of “writing” as fundamentally dependent on the phonology of a language – that the written symbols should somehow relate to the way the word itself is pronounced. English and most Indo-European writing systems are phonemic: they use a set of symbols (26, in the case of English) to represent individual phonemes, and combine these symbols to represent a sequence of phonemes, i.e. a word. If you have only studied Spanish or German, it is easy to see why one might infer that all orthographies follow this pattern – it is efficient, after all. So it’s not all that surprising that when I tell people I speak Mandarin, more often than not, they’ll ask me to explain how the alphabet works, to which I must dutifully explain that Chinese is a character-based language and doesn’t actually have an alphabet. At this point their eyes typically glaze over and they start to regret asking to begin with, probably because the concept of “characters” isn’t as self-evident to the average outsider as it may seem to users of such languages. Perhaps even more shocking is the fact that characters are not the only alternative to an alphabetic system. Syllabic orthographies, such as Cherokee and Japanese, break their words down into constituent syllables, rather than individual phonemes (so cookie would be represented using two symbols: one for coo– and one for -kie). Chinese is kinda-sorta syllabic, but mostly logographic – individual characters represent words, but these words also happen to be monosyllabic. There are also orthographies that highlight other salient features of a language’s phonology, such as syllable boundaries, while excluding other features deemed less important, such as vowels. It all seems very complicated, but don’t worry friends, we’re going to work through it together, starting with perhaps the oldest form of writing on Earth: Chinese characters.

Mandarin is my second language and one of my majors in undergrad, but I started learning it in China when I was eleven, so I wouldn’t say my acquisition experience has been representative of the norm. When my partner started learning it at uni, it made me start to think about the language from the perspective of an adult learner, especially the writing system. Explaining the makeup of characters proved to be exceptionally difficult — how does one actually define a character? They’re not just tiny pictures, although a handful of modern characters may have evolved from early drawings of the item they represent. Even these have been abstracted far beyond recognition to the uninitiated observer; nobody would look at the contemporary 马 and peg it as horse, although it very likely did look like an actual drawing of a horse at one point thousands of years ago. Chinese characters are so old, in fact, that nobody really knows exactly when they were created, or when exactly they crossed the threshold from “symbol” into “writing”; the earliest archaeological evidence from 6500 B.C. remains controversial. Over the past six thousand years or so, the written language has been systematized and reduced to a certain subset of permissible strokes and patterns; however, it still bears no relation to the spoken language. In a way, this is useful because it means that Chinese scholars can still decode texts written thousands of years ago, but it also means that it is harder to know exactly how the spoken language sounded at any given time, since the characters remained the same regardless of how you said the word they represent (if you’re wondering how the phonologies of prior iterations of Chinese were reconstructed, look up the Qieyun and Chinese rhyming dictionaries). Not to mention, it requires a lot of memorization in order to fluently read and write – but that’s exactly how it works, much to the horror of English speakers the world over. Learning to read and write Chinese is simply a matter of memorization and learning the rules which govern character composition.

So how do you actually read Chinese? As I’ve said, the Chinese writing system has no relation whatsoever to the way the words are actually pronounced, barring the approximations you can sometimes derive from 形声字 (lit. ‘form-sound characters’). Each individual character corresponds to a single syllable, which is also a morpheme (= the smallest unit of a language that conveys a meaning). For example, 人 rén is ‘person,’ and 女 nǚ is ‘female,’ so 女人 is ‘woman.’ Neither of these characters has any relation to the way that the word is pronounced, but they do carry their meaning into other contexts: 妈 mā means ‘mom’ and, if you’ll notice, contains the character meaning ‘female.’ So, although characters do not relate to the phonetic content of the words they represent, they often relate in some way to the semantic (= meaning) content. Sometimes, however, the relation between a character and its meaning is more opaque, often due to sociocultural developments and natural changes to the way a particular word is used. For example, 男 nán, meaning ‘man’, can be broken down into 田 tián ‘field’ and 力 lì ‘strength’ – referring to men working the fields.

Suffice it to say, the Chinese language, both spoken and written, has been around for a hot second, and in that time, it has influenced a myriad of linguistic systems in other cultures. Korean, for one, has taken a heap of words from Chinese due to an extensive history of contact between the two cultures; as I recently found out, there is a whole class of verbs in Korean which are borrowed from Chinese and pronounced in a remarkably similar manner. Having also taken a class on historical Chinese phonology, I find myself noticing more and more features of Korean pronunciation that existed at some point in the extensive history of the Chinese language, but have been subjected to considerable sound change over the years. For example, one of the words for ‘food’ is 食 shí (pronounced like ‘sure’ without the /r/) in Mandarin, 식 sik in Korean, and sik in Cantonese; in Middle Chinese, the word has been reconstructed as something like /ʂik/ or /ʂiŋ/, depending on the particular reconstruction you’re using. Both /ŋ/ and /k/ were permitted final sounds in Middle Chinese, but at some point in the course of its evolution into Standard Mandarin, the /k/ sound (along with the other voiceless stops /p/ and /t/) was lost from syllable-final position and replaced with one of several possible endings, depending on the main vowel and other phonological features of the word. On the other hand, it was preserved in Cantonese (and several other dialects) and Korean. These types of similarities are what linguists look for when determining whether a relationship exists between two or more languages; Korean and Cantonese syllable structures mirror each other in a number of ways, but it is Middle Chinese phonology that allows us to generate a full explanation for these similarities. Most linguists now believe that Sino-Korean (the language spoken on the Korean peninsula prior to the emergence of Early Modern Korean) was in near-constant contact with both Old and Middle Chinese, so it makes sense that, for a long time, Korean used Chinese characters as its primary writing system.

Hanzi ‘Han characters’ in Chinese, known as Hanja in Korean, were used for thousands of years on the Korean peninsula but, seeing as they lack any inherent relation to pronunciation, were used to represent spoken Korean instead of spoken Chinese, and were organized differently according to Korean grammar. The word Hanja is actually another shining example of phonological change: the syllable zi in Modern Mandarin is pronounced /tɕa/ in Korean, again likely stemming from a common pronunciation in Middle Chinese. You see the same zi -> ja transformation in other Korean words like 남자 namja ‘man’ (from Chinese 男子 nanzi) and 여자 yeoja ‘woman’ (from Chinese 女子 nüzi). In the 15th century A.D., the Korean king Sejong the Great personally created the Hangul writing system used today, which is completely distinct from the logographic system of Hanja. Although to the untrained eye Hangul writing may look similar to characters, it is actually a quite brilliantly-constructed alphabetic system that assigns each phoneme of the Korean language to a specific symbol (like English), allowing Koreans to represent every possible word in the language with a comparatively small inventory of written correspondents. As I mentioned, the logographic orthography of Chinese takes far longer for the average learner to master than, say, the Latinate alphabet, because in a logographic system one must assign every possible morpheme (and/or syllable) to a unique symbol. This means that, while a Korean or English speaker only needs to know 20-some unique letters in order to read a newspaper in their language, a Mandarin speaker must know a few thousand different characters in order to read the same text. With this fact in mind, it is no wonder that many civilizations across history (including the Greeks, Romans, and Etruscans) abandoned their earlier writing systems once the notion of an alphabet was introduced – it is far more efficient, and adapts much more readily to natural language change.

The Korean orthography possesses a bonus feature that Latinate alphabets lack: it automatically marks syllable boundaries. Here’s how it works. First of all, Korean phonology allows far less variation in its syllable structure than English or German; the maximal syllable is (C)V(C)(C), where C is a consonant and V is a vowel, and segments in parentheses are optional. So, the writing system represents words as both phonemes and syllables: each syllable is contained within the boundaries of a square, which is then divided into quadrants, one for each constituent phoneme. The syllable 읽, for example, is made up of the individual letters ㅇ, ㅣ, ㄹ, and ㄱ, read from left to right, top to bottom. The next syllable is written in a new square: 읽다. You may notice that 다 is only two letters, so its constituent letters “stretch out” to fill the excess quadrants. Spaces are used to separate words, so at first glance, you can see both how many words are in a given utterance and how many phonemes are in each syllable. Pretty cool, in my admittedly biased opinion.
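Fun side note: this block-packing is baked directly into Unicode, which encodes every modern Hangul syllable with a simple formula over the indices of its lead consonant, vowel, and (optional) final consonant. A minimal sketch, composing 읽 and 다 from their letters:

```python
# Compose Hangul syllable blocks using the standard Unicode formula:
# the Hangul Syllables range starts at U+AC00, with 21 vowel slots and
# 28 final-consonant slots per lead consonant.
LEADS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def compose(lead, vowel, tail=""):
    code = 0xAC00 + (LEADS.index(lead) * 21 + VOWELS.index(vowel)) * 28 + TAILS.index(tail)
    return chr(code)

print(compose("ㅇ", "ㅣ", "ㄺ"))   # -> 읽
print(compose("ㄷ", "ㅏ"))          # -> 다 (only two letters, no final consonant)
```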

As I mentioned, in the podcast I’ve been listening to, the host just gave a brief history of the alphabet used by the vast majority of Indo-European languages today. Interestingly, he asserts that nearly all modern alphabets are descended from a common ancestor: the Phoenician writing system, which spread from the Phoenician civilization to the ancient Greeks in the north and the Hebrews in the southeast, and which would eventually evolve into the alphabets used by the Indo-European and Semitic languages, respectively. An intriguing difference exists between these two branches, which continued to split off as they were adapted to represent the unique sounds of each new language: only the Indo-European alphabets had vowels. This is because the Phoenicians omitted vowels from their writing system — the consonants were sufficient to indicate which word was intended. The same was true of many of the Semitic languages, which had a comparatively sparse vowel inventory and a high number of consonants, so they didn’t feel the need to add additional symbols for vowel sounds. On the other hand, the Greeks and later adopters of the Phoenician alphabet had quite a rich vowel inventory, and therefore needed their writing to reflect vowel sounds as well. Of course, English and the other modern Indo-European languages have both vowels and consonants in their writing systems, while many modern Semitic languages like Arabic and Hebrew primarily write consonants, leaving most vowel sounds to be inferred.

It is truly incredible, when you think about it, how much the invention of writing has impacted the human race and its linguistic development. Nowadays, literacy is a given in most first-world countries, regardless of whether that refers to an alphabetic, syllabic, or logographic writing system. Much of twenty-first-century communication is fundamentally dependent on written language, which has contributed to the rapid spread of information and Internet-specific ways of expressing oneself. Plenty of civilizations existed for thousands of years without written language, but now it is almost unfathomable that a society could exist without centering itself around writing. It is remarkable to think about how few cultures independently developed their own systems of writing, yet this invention was so groundbreaking that it managed to spread to all corners of the globe. On a more somber note, it is also important to remember that the contemporary prevalence of the Latin alphabet is attributable to colonization and direct efforts to impose European culture on oppressed nations. In many cases, colonized peoples’ lack of a written language was cited as evidence that their societies were uncivilized, and therefore deserving of European intervention. In other countries, indigenous scripts were replaced by the Latin alphabet in order to make things easier on the colonizers. Although academically intriguing, the widespread use of the Latin alphabet today also symbolizes the loss of linguistic and cultural autonomy in many parts of the world.

This post links to external sites which may offer further explanation or clarification of the topic at hand. These are links that I personally found helpful at the time of writing, and are certainly not exhaustive of the linguistics/language-learning resources available online. Unless otherwise noted, I am not affiliated with any of the linked authors and am not responsible for the content of their blogs/websites.

The Mystery of the Caterkillar

Anyone who’s spoken at length to a small child knows that they tend to make a lot of mistakes — not just grammatically, but in the pronunciation of individual words and phrases. Caregivers the world over are surprisingly good at inferring what exactly their infant means when they say “buh” and point at the fridge, even though they might be far less forgiving if an adult speaker were to do the same. Despite the general responsiveness of their immediate family, of course, children don’t remain in this phase forever; eventually, almost overnight, they develop into articulate speakers of their native language. The path that children take to acquiring not just the sounds of their language, but the rules for how those sounds combine, is paved with difficulties and occasional confusion on the part of the adults around them, yet in the vast majority of cases, they somehow come out of it just as competent in their language as all the generations before. It’s pretty fascinating, if I do say so myself.

I still remember taking my siblings-in-law to the park one summer when we were plagued with these disgusting, tiny caterpillars that gave a nasty rash to anyone they touched. Being the responsible caregiver I am, as we arrived, I reminded the kids not to mess with any caterpillars they encountered on the play structure, and off they went to frolic in the summer sun. Not five minutes later, the two-year-old was shrieking to me with a mixture of horror and elation that there were, in fact, “caterkillars” crawling up and down one of the slides. As I marched over with a wood chip in hand, ready to scrape the buggers off and save the day, caterkillars echoed in my mind — not only was it adorable, it was linguistically intriguing.

At the time I was a baby linguist, and a barely-viable fetus when it came to language acquisition — I was bordering on clueless when it came to any sort of theory-based explanation. What I did know, however, was that a) certain sounds are more easily acquired than others; b) small children tend to flub a few particular consonants; and c) this particular instance was a shining example of what linguists call consonant harmony — the phenomenon by which one consonant assimilates to the articulatory features of another consonant in the same word. Consonant harmony (CH) is common-ish in the world’s languages, but not nearly to the extent that it is in child language. Back when caterkillar first made its debut, though, I was only familiar with CH in name, so I chalked it up to “one of those acquisition things” and moved on.

Now that I’m more of a teenage linguist, I have taken some time to delve into the literature to try and figure out exactly why kids do this. I’ve been mulling over caterkillar for more than four years, which is far too long to have not written a single thing about it, so without further ado, here is my attempt at explaining the caterkillar phenomenon.

First, let’s start with a few foundational concepts in phonology. It is generally assumed that, somewhere deep in the recesses of our brains, there is a catalogue of rules that govern the way that sounds in language interact with one another. The exact way that they interact is a topic of contentious debate in the world of phonology. At its most basic level, there are two camps: generativists (also called Chomskyans, or those who subscribe to Noam Chomsky’s theory of generative grammar) and… everyone else. For our purposes, the main difference in the way that these two sets of researchers approach phonology is that generativists believe that we as humans possess specific, innate knowledge that allows us to acquire and manipulate language in a way that other species can’t – this is called Universal Grammar, or UG. They focus on analyzing data from a variety of spoken languages in order to derive the underlying rules that govern not just English, but all the world’s languages. On the other hand, “everyone else” asserts that humans’ unique ability to produce language and encode phonological rules can be attributed to more general cognitive processes such as statistical learning, theory of mind, etc. They emphasize the study of language in its natural, spoken form, which can be readily observed without resorting to abstract theories of how it is mentally represented.

Regardless of which perspective you prefer, it is evident that there are rules governing the way the sounds of our language combine: a very simple example is the plural -s in English, which children learn fairly early and can apply to novel words by generalizing the rule to all nouns to indicate plurality. Phonetically, however, this suffix varies depending on the phonology of the root word it attaches to – it is subject to voicing assimilation, which means that it adopts the voicing feature of the preceding segment. In practical terms, this just means that the -s in the word cats is different from the -s in the word dogs. Why? The last sound in the word cat is a /t/, which is a voiceless consonant (your vocal cords do not vibrate when producing it – try saying /t/ with a hand on your throat and you’ll feel the difference), while the last sound in dog is a /g/, a voiced consonant (now your vocal cords do vibrate). So, when -s is added to dog, it assimilates to the voicing feature of the /g/ and becomes its voiced counterpart, which is actually a /z/ sound. The phonetic transcription of dogs, then, is something along the lines of [dagz]. This process is an example of partial consonant harmony in adult language, and it actually occurs in many languages other than English as well. Notice, however, that it is limited to consonants that are directly next to one another.
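If you wanted to spell the rule out explicitly, it’s about as small as phonological rules get. Here’s a sketch that only handles the voicing half described above (the real rule also inserts [ɪz] after sibilants, as in buses, which I’m ignoring here).

```python
# A tiny sketch of the voicing-assimilation rule for plural -s: [s] after a
# voiceless final segment, [z] after a voiced one. The [ɪz] allomorph after
# sibilants is deliberately left out to keep the focus on voicing.
VOICELESS_FINALS = {"p", "t", "k", "f", "θ"}

def plural_allomorph(final_segment):
    return "s" if final_segment in VOICELESS_FINALS else "z"

print(plural_allomorph("t"))   # cats -> [s]
print(plural_allomorph("g"))   # dogs -> [z]
```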

Caterkillar is an intriguing instance of “progressive CH,” which just means that the harmonization moves from left to right across the word’s segments. Generative phonologists, who subscribe to the view that language-specific sound patterns are all just different “rankings” of the same universal constraints (AKA Optimality Theory), would explain CH as a symptom of a non-adult constraint ranking. Some of the most well-known generativists who’ve studied child CH have concluded that part of our linguistic capacity as humans obliges us to assimilate the consonants of a word to the same place (or manner) of articulation. Basically, there is some pre-programmed constraint (often called “AGREE”) that drives kids to make the consonants in a word more similar to one another. The generativist argument is that the phonological systems of children who harmonize consonants where adults do not differ from adult speakers in two important ways:

1. They view vowels and consonants as completely different entities that do not interact at all and are subject to different rules/processes (= “planar segregation”)

2. They have ranked one universal constraint, AGREE, above another constraint which blocks CH in many environments — whereas adult English speakers rank them in the opposite way.

This explains why child consonant harmony seems to ignore intervening vowels (which CH in adult language, as we saw above, doesn’t do – it only applies to consonants that are directly next to each other), and why it appears at all in acquisition. If we apply this analysis to caterkillar, it seems that my sibling-in-law is ignoring the intervening vowels because the highly-ranked constraint AGREE does not apply to them – only to consonants.
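To see how the re-ranking story actually picks a winner, here’s a toy Optimality Theory evaluation using a made-up two-consonant word (so the violation counts stay readable). The constraint definitions are drastically simplified and are my own illustration, not the analysis from any of the papers cited at the end of this post.

```python
# A toy OT evaluation: the same two constraints, ranked differently, select
# different winners. Violation counts are heavily simplified.
PLACES = {"p": "labial", "b": "labial", "t": "coronal", "d": "coronal",
          "k": "dorsal", "g": "dorsal"}

def agree(candidate, target):
    """Violated once for every pair of stops in the word that disagree in place."""
    places = [PLACES[c] for c in candidate if c in PLACES]
    return sum(1 for a, b in zip(places, places[1:]) if a != b)

def faith(candidate, target):
    """Violated once for every segment that differs from the target (adult) form."""
    return sum(1 for c, t in zip(candidate, target) if c != t)

def winner(target, candidates, ranking):
    # Compare candidates constraint by constraint, highest-ranked first.
    return min(candidates, key=lambda c: tuple(con(c, target) for con in ranking))

target = "tok"                   # an imaginary adult form with two different stops
candidates = ["tok", "kok"]      # faithful output vs. place-harmonized output

print(winner(target, candidates, [faith, agree]))   # adult ranking -> 'tok'
print(winner(target, candidates, [agree, faith]))   # child ranking -> 'kok'
```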

Non-generativists have proposed that CH is little more than an articulatory error — a slip of the tongue. Various authors have argued that an underdeveloped articulatory system is to blame, and that children simply do not have the years of practice coordinating their vocal apparatus that adults do. Hansson (2001), for example, notes the similarities between child CH and speech errors in adults, though adults slip up far less frequently. This doesn’t seem to be the case for caterkillar, seeing as it’s consistently produced this way, whereas a speech error would be more of a one-off. Similarly, connectionist theories argue that the issue lies in the connection between the child’s mental representation of the word and the physical production – perhaps some features get “lost” along the way, causing the child to fill in features from a segment that didn’t lose any. So, somewhere along the way from the child’s mental database to their mouth, the labial feature of the /p/ sound got lost, and the equivalent place feature was filled in from the closest available consonant, which in this case was a /k/.

Last but not least, in one of the largest studies to date, Marilyn Vihman analyzed the CH patterns of 13 children acquiring six different languages. In her dissertation, she examined the apparent triggers of CH as well as how consonants were assimilated. She looked at each child individually as well as at trends across the group, and found that CH was used in the following contexts:

  • When a child could not yet pronounce a certain sound, such as /r/, it was replaced with a contextually-appropriate segment that was easier to produce
  • When attempting to produce a newly-acquired segment in the same word as an older, similar segment (such as the /k/ and /t/ sounds in cat, if the child is just learning the /k/ sound), children would assimilate the newer consonant to the older one
  • When dealing with multi-syllabic words
  • When dealing with alveolar and palatoalveolar sounds (those that are produced at the ‘ridge’ just behind your teeth, such as /d/, /z/, and /sh/)

Given the above findings, what is the best explanation for caterkillar? It’s not a gap in productive abilities; I’ve heard the same kid produce /p/ in a huge array of phonological environments. /p/ is also not an alveolar sound, so that rules out that explanation. In all likelihood, if we assume Vihman’s findings are representative of typical child CH patterns, it must be the length of the word itself that causes the error. Admittedly, caterpillar is a hefty word – four syllables and eight constituent phonemes – so it’s no wonder that my younger SIL would have trouble with it at two years old.

However, we haven’t yet discussed a crucial aspect of the word and its production; there is one glaring detail based on which the entire argument may need to be altered. Can you guess what it is?

That’s right – the /t/. So far, we’ve been discussing the progressive assimilation of the /k/ to the /p/, completely ignoring another consonant in the middle. If planar segregation explains why vowels aren’t assimilated, what’s to be said for the /t/?

Lucky for us, the child in question still produces caterkillar consistently, so I called them up to confirm my suspicions and, sure enough, found that the intervening /t/ has been deleted, giving us [kæ’ə’kɪ’lə]. Phoneme deletion is far from unusual in child speech, and is often used to simplify consonant clusters or reduce the syllable count of a complex word – more evidence for Vihman’s analysis of complex word simplification. This particular /t/ is situated between two vowels and heads a non-stressed syllable, making it a prime candidate for deletion.

I wish I could say with 100% certainty that this little exploration has put my mind at ease regarding the origins of caterkillar, but honestly, this whole process has just raised more questions for me. It’s impossible to decide which explanation best fits the given data without, well, more data – and unfortunately I’m some 3,000 miles away from the originator of caterkillar, at the moment. Maybe one day, we’ll have an answer… and maybe I’ll be the one to find it!

Below are the main articles I consulted for this post:

Generativist authors:
Goad, Heather. “Consonant Harmony in Child Language: An Optimality-Theoretic Account.” Focus on Phonological Acquisition, edited by S.J. Hannahs and M. Young-Scholten, John Benjamins, 1997, pp. 113–42, https://roa.rutgers.edu/files/213-0897/roa-213-goad-2.pdf.
Gerlach, Sharon Ruth. The Acquisition of Consonant Feature Sequences: Harmony, Metathesis and Deletion Patterns in Phonological Development. University of Minnesota, Dec. 2010.
Pater, Joe, and Adam Werle. Direction of Assimilation in Child Consonant Harmony.
Pater, Joe. “Minimal Violation and Phonological Development.” Language Acquisition, vol. 6, no. 3, July 1997, pp. 201–53, doi:10.1207/s15327817la0603_2.

Functionalist authors:
Gormley, Andrea Lucienne. The Production of Consonant Harmony in Child Speech. University of British Columbia, 2003, https://open.library.ubc.ca/collections/ubctheses/831/items/1.0090893.
Hansson, Gunnar Ólafur. “On the Evolution of Consonant Harmony: The Case of Secondary Articulation Agreement.” Phonology, vol. 24, no. 1, May 2007, pp. 77–120, doi:10.1017/S0952675707001121.
McAllister Byun, Tara, and Sharon Inkelas. Child Consonant Harmony and Phonologization of Performance Errors. 2012, https://linguistics.berkeley.edu/~inkelas/Papers/NELS43McAllisterByun-Inkelas.pdf.
Vihman, Marilyn May. Consonant Harmony: Its Scope and Function in Child Language. Jan. 1978.