The research is clear, even if the industry hasn't absorbed it. Code-switching — the practice of alternating between two or more languages within a single conversation or even a single sentence — is not a linguistic error or a sign of incomplete language acquisition. It's a stable, rule-governed communication practice used by bilingual and multilingual speakers worldwide, and it is the dominant pattern in informal digital communication across Southeast Asia. Any NLP system that treats it as noise or tries to force-normalize it into a single language before processing is discarding semantically significant information and introducing systematic errors into its predictions.
What Code-Switching Actually Looks Like in Support Chats
Consider a real example from a Level3 AI deployment at a Malaysian telco. A customer's message reads: "Hi, boleh tolong check my account ke? My internet plan dah expire but I tak dapat the notification." This single sentence mixes Malay verb phrases (boleh tolong, dah expire), colloquial Malay particles (ke, tak), and English nouns and verbs (check, account, internet plan, notification) in a pattern that is entirely natural and immediately intelligible to any Malaysian reader, but that poses significant challenges for language identification and intent classification systems.
A language identification API will typically return ambiguous results or incorrectly classify the entire message as English or Malay, depending on which words dominate numerically. An intent classifier trained on monolingual data will attempt to parse this mixed-language text with its monolingual vocabulary, missing the pragmatic content of the Malay particles and potentially misidentifying the core intent (plan expiry inquiry plus complaint about missed notification) as a generic account inquiry.
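To make that failure mode concrete, here is a minimal sketch (not Level3 AI's implementation) that tags each token against tiny hand-built Malay and English word lists and then takes a naive majority vote, the way a document-level classifier effectively does. The word lists are illustrative placeholders, not a real lexicon:

```python
from collections import Counter

# Illustrative mini-lexicons; a real system would use trained models,
# not word lists.
MALAY = {"boleh", "tolong", "ke", "dah", "tak", "dapat"}
ENGLISH = {"hi", "check", "my", "account", "internet", "plan",
           "expire", "but", "i", "the", "notification"}

def tag_token(tok):
    t = tok.lower().strip("?,.!")
    if t in MALAY:
        return "ms"
    if t in ENGLISH:
        return "en"
    return "unk"

def document_label(message):
    """Naive document-level label: majority vote over token tags."""
    tags = [tag_token(t) for t in message.split()]
    return Counter(tags).most_common(1)[0][0], tags

msg = ("Hi, boleh tolong check my account ke? My internet plan "
       "dah expire but I tak dapat the notification.")
label, tags = document_label(msg)
print(label)          # the single label hides the mix entirely
print(Counter(tags))  # the token-level view shows both languages are substantial
```

The majority vote returns "en" and silently discards the six Malay tokens, including the particles that carry the message's pragmatic content.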
The Three Distinct Code-Switching Patterns in APAC
Code-switching in APAC customer support contexts follows three primary patterns that require different NLP handling approaches. The first is intersentential switching — language alternation between full sentences or clauses. "My package still haven't arrive. Dah 5 hari ni." This pattern is the easiest to handle because sentence-level language detection can identify each segment separately before intent classification is applied.
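A minimal sketch of that handling: split on sentence boundaries, then detect each segment independently. The toy word-count detector below stands in for a real sentence-level language ID model:

```python
import re

# Toy per-segment lexicons; placeholders for a trained detector.
MALAY = {"dah", "hari", "ni", "tak", "boleh"}
ENGLISH = {"my", "package", "still", "haven't", "arrive"}

def segment_language(segment):
    toks = [t.lower().strip(".?!,") for t in segment.split()]
    ms = sum(t in MALAY for t in toks)
    en = sum(t in ENGLISH for t in toks)
    return "ms" if ms > en else "en"

def split_and_tag(message):
    # Split on sentence-final punctuation; each segment gets its own
    # language label before intent classification sees it.
    segments = [s.strip() for s in re.split(r"(?<=[.?!])\s+", message)
                if s.strip()]
    return [(seg, segment_language(seg)) for seg in segments]

print(split_and_tag("My package still haven't arrive. Dah 5 hari ni."))
```

Each clause comes back with its own tag ("en" for the first, "ms" for the second), which is exactly why intersentential switching is the tractable case.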
The second pattern is intrasentential switching — language mixing within a single clause. "I nak cancel my subscription tapi the button tak muncul." This requires token-level language identification before any semantic analysis, which most production NLP pipelines don't implement because the compute overhead is significant and the training data for token-level bilingual language identification in APAC language pairs is limited.
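Token-level identification for intrasentential mixing looks roughly like the sketch below. The lexicon lookup plus "inherit the previous tag" fallback is a crude stand-in for a trained sequence labeling model (e.g. a CRF or BiLSTM tagger), which is what a production pipeline would actually need:

```python
# Illustrative lexicons only; a real tagger learns these from data.
MALAY = {"nak", "tapi", "tak", "muncul"}
ENGLISH = {"i", "cancel", "my", "subscription", "the", "button"}

def tag_tokens(message):
    tags = []
    for tok in message.split():
        t = tok.lower().strip(".?!,")
        if t in MALAY:
            tags.append("ms")
        elif t in ENGLISH:
            tags.append("en")
        else:
            # Unknown token: inherit the previous tag, a simple stand-in
            # for the contextual smoothing a sequence model provides.
            tags.append(tags[-1] if tags else "unk")
    return list(zip(message.split(), tags))

print(tag_tokens("I nak cancel my subscription tapi the button tak muncul"))
```

The output is a per-token tag sequence that downstream semantic analysis can consume, rather than a single per-message language label.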
The third pattern is pragmatic or discourse switching, where the language choice carries emotional or relational meaning rather than purely propositional content. A Singaporean customer might switch from English to Mandarin mid-conversation specifically because they're frustrated — code-switching into their first language in a high-stress moment is a documented sociolinguistic behavior. An NLP system that doesn't flag this switch as a sentiment signal is missing information that a skilled human agent would use to adjust their response tone.
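For the English-to-Mandarin case specifically, the switch is detectable from Unicode scripts alone, so even a lightweight heuristic can surface it as a signal. The sketch below (an assumption about how such a flag might be wired up, not a described Level3 AI feature) compares the dominant script across conversation turns and flags the change:

```python
def dominant_script(text):
    """Classify a message by dominant script: Latin vs CJK."""
    latin = sum("a" <= c.lower() <= "z" for c in text)
    cjk = sum("\u4e00" <= c <= "\u9fff" for c in text)
    return "cjk" if cjk > latin else "latin"

def switch_signals(turns):
    """Flag turns where the customer changes dominant script; such a
    switch can be routed to sentiment analysis or escalation logic."""
    signals = []
    prev = None
    for turn in turns:
        cur = dominant_script(turn)
        signals.append(prev is not None and cur != prev)
        prev = cur
    return signals

turns = [
    "Hi, my broadband has been down since Monday.",
    "I already restarted the router twice.",
    "到底什么时候可以修好?",   # switch to Mandarin mid-conversation
]
print(switch_signals(turns))  # only the final turn is flagged
```

Intrasentential Malay-English mixing cannot be caught this way (both are Latin-script), which is why the script heuristic complements, rather than replaces, token-level tagging.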
Why Standard Language Detection Fails
The standard approach to language detection in NLP pipelines — libraries like langdetect, fastText language identification, or the language detection built into major cloud AI platforms — was designed to classify entire documents or paragraphs, not mixed-language customer support messages of 10-30 words. These systems work by comparing character and word n-gram distributions against trained language profiles. When a 15-word message contains 7 Malay words, 6 English words, and 2 morphologically mixed compounds, the n-gram distribution is genuinely ambiguous — the message has no dominant language at the document level.
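The ambiguity is visible even in a toy version of the n-gram approach. Here the "profiles" are character-trigram sets built from one short sentence per language (real profiles are trained on large corpora), scored against a mixed message:

```python
def char_trigrams(text):
    text = " " + text.lower() + " "
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Toy profiles from a single sentence each; real profiles are trained
# on large corpora, but the ambiguity shows up even at this scale.
EN_PROFILE = char_trigrams(
    "my internet plan expired and i did not get the notification "
    "please check my account")
MS_PROFILE = char_trigrams(
    "boleh tolong semak akaun saya pelan internet saya sudah tamat "
    "tempoh tetapi saya tidak dapat notifikasi")

def overlap(message, profile):
    grams = char_trigrams(message)
    return len(grams & profile) / len(grams)

msg = "boleh tolong check my account ke my internet plan dah expire"
print(round(overlap(msg, EN_PROFILE), 2), round(overlap(msg, MS_PROFILE), 2))
```

Both profiles overlap substantially with the mixed message, so the document-level "winner" is decided by whichever language happens to dominate numerically — not by anything semantically meaningful.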
FastText's lid.176 model, which is commonly used for production language identification, achieves 87% accuracy on standard Wikipedia article excerpts across 176 languages. On code-switched customer support messages from APAC language pairs, its accuracy on correctly identifying the languages present in a mixed message drops to approximately 34% based on our internal testing. This is not a criticism of fastText — the model wasn't designed for this task. The problem is that many production AI support systems are using it as if it were.
The NUS Code-Switched Corpus and Why It Matters
Singapore has produced several useful resources for Singlish and code-switched text, including IMDA's National Speech Corpus and academic datasets from the NUS Department of Linguistics. These corpora, combined with publicly available Taglish data from Philippine social media research and Manglish samples from Malaysian NLP studies, form the academic foundation for building code-switching-aware NLP models for APAC language pairs. The problem is that academic corpora are not customer support corpora — the register, vocabulary, and intent categories are entirely different.
Level3 AI's approach has been to build proprietary training datasets for each language pair by collecting and labeling code-switched customer support conversations from enterprise deployments, with customer consent. Over 14 enterprise deployments, we've accumulated labeled examples of code-switched Singlish, Bahasa Indonesia-English, Tagalog-English (Taglish), Manglish, and Thai-English support conversations. The models trained on these domain-specific code-switched datasets significantly outperform models trained on academic code-switching corpora for the specific intent categories that appear in customer support contexts.
The Tokenization Problem in Thai and Vietnamese
Thai presents a structurally different challenge from code-switching: it uses no spaces between words. Thai sentence segmentation requires word boundary detection before any intent classification can occur, and word boundary detection in Thai is context-dependent — the same character sequence can represent different word boundaries depending on surrounding context. The standard approach uses dictionary lookup combined with statistical models, but dictionary-based approaches fail on product names, brand names, and technical terminology that isn't in the dictionary.
When Thai is mixed with English (as it commonly is in customer support chats where product names are typically English), the tokenization boundary between Thai text and English text creates additional segmentation ambiguity. A product name like "TrueMove H" embedded in a Thai sentence has no spaces separating it from the Thai words on either side in informal typing. Tokenizers trained primarily on formal Thai text will frequently mis-segment these boundaries, producing tokens that are fragments of the brand name or mis-merged sequences of Thai characters and English letters.
How Level3 AI's Pipeline Addresses These Issues
Our multilingual processing pipeline uses a three-stage architecture specifically designed for APAC code-switched content. Stage one is script-aware tokenization, which identifies character scripts within the message (Latin, Thai, CJK, Devanagari) and applies script-appropriate tokenization rules per segment before any language identification step. This approach handles Thai word boundaries without requiring language identification of the full message first, and correctly segments mixed-script messages without forcing a single tokenization strategy across different script blocks.
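The core of stage one can be sketched as a script-run splitter: walk the string, bucket each character by Unicode range, and emit maximal same-script runs, each of which then gets its own tokenization strategy. This is a simplified illustration, not the production implementation (which would use full Unicode script properties rather than a few hard-coded ranges):

```python
def script_of(ch):
    """Coarse script bucket via Unicode block ranges."""
    if "\u0e00" <= ch <= "\u0e7f":
        return "thai"
    if "\u4e00" <= ch <= "\u9fff":
        return "cjk"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"

def script_runs(text):
    """Split text into maximal runs of the same script, so each run can
    get script-appropriate tokenization (dictionary-based word
    segmentation for Thai, whitespace splitting for Latin, etc.)."""
    runs = []
    for ch in text:
        s = script_of(ch)
        if runs and runs[-1][1] == s:
            runs[-1][0] += ch
        else:
            runs.append([ch, s])
    return [(seg, s) for seg, s in runs]

# A Latin-script brand name embedded in Thai text with no separating
# spaces (illustrative fragment: "the TrueMove internet is very slow").
print(script_runs("เน็ตTrueMoveช้ามาก"))
```

The Thai-Latin boundary falls out of the script change itself, so the brand name survives as one run even though there are no spaces around it.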
Stage two is token-level language tagging using a sequence labeling model trained specifically on APAC code-switched data, not adapted from document-level language identification. Each token receives a language tag that is used in the intent classification stage. Stage three is intent classification using a model that was trained on multilingual labeled data with language-mix information as an additional input feature — not a monolingual model applied to translated or normalized input.
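One plausible shape for the language-mix feature handoff between stages two and three is sketched below. The feature names and the specific statistics are illustrative assumptions, not Level3 AI's actual feature set; the point is that the intent classifier receives mix information alongside the usual lexical features:

```python
def mix_features(tagged_tokens):
    """Derive language-mix features from token-level language tags.
    `tagged_tokens` is a list of (token, lang_tag) pairs, i.e. the
    kind of output a stage-two sequence tagger produces."""
    n = len(tagged_tokens)
    tags = [t for _, t in tagged_tokens]
    switches = sum(a != b for a, b in zip(tags, tags[1:]))
    return {
        "len": n,
        "ratio_ms": tags.count("ms") / n,
        "ratio_en": tags.count("en") / n,
        # Switch density distinguishes heavy intrasentential mixing
        # from mostly-monolingual messages.
        "switch_density": switches / max(n - 1, 1),
    }

tagged = [("I", "en"), ("nak", "ms"), ("cancel", "en"), ("my", "en"),
          ("subscription", "en"), ("tapi", "ms"), ("the", "en"),
          ("button", "en"), ("tak", "ms"), ("muncul", "ms")]
print(mix_features(tagged))
```

A message that is 60% English, 40% Malay, with five switch points in ten tokens, looks very different to the classifier than a monolingual message with the same content words — which is exactly the information a normalize-then-classify pipeline throws away.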
Where the Field Is Heading
The current generation of large language models (GPT-4, Claude 3, Gemini 1.5 Pro) has significantly better multilingual capabilities than its predecessors, and these models handle many code-switched inputs gracefully when prompted appropriately. The issue for customer support applications isn't understanding — it's accuracy at the specific intent classification task, with the calibration required for automated action. A model that correctly understands a message 90% of the time but is overconfident in its predictions 100% of the time will issue automated responses and actions in the 10% of cases where it's wrong, with no reliability signal to trigger human review.
The research problem that the field hasn't solved cleanly is calibration of code-switched intent classifiers — specifically, producing reliable confidence scores for code-switched inputs that trigger appropriate escalation when confidence is genuinely low. This is distinct from accuracy: a model can be highly accurate on average but produce high-confidence wrong predictions on specific code-switching patterns it hasn't seen before. For production AI support systems handling millions of conversations, calibration matters as much as accuracy, and the calibration problem is harder in code-switched contexts than in monolingual ones.
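The standard way to quantify this gap is Expected Calibration Error (ECE): bin predictions by confidence and average the gap between each bin's accuracy and its mean confidence, weighted by bin size. This is a textbook metric, not something specific to this article:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average
    |accuracy - mean confidence| per bin, weighted by bin size.
    A well-calibrated classifier has ECE near 0."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# An overconfident model: always predicts at 0.95 confidence but is
# right only 70% of the time -> large ECE, and no low-confidence
# signal ever fires to trigger human review.
confs = [0.95] * 10
correct = [True] * 7 + [False] * 3
print(round(expected_calibration_error(confs, correct), 3))  # → 0.25
```

The failure mode described above is precisely this: high average accuracy, uniformly high confidence, and therefore an escalation threshold that never triggers on the code-switched inputs where the model is actually wrong.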