LINGUISTIC OBFUSCATION TECHNIQUES




By


Matthew A. Mairs






A DISSERTATION




Submitted to

The University of Liverpool




in partial fulfillment of the requirements

for the degree of



MASTER OF SCIENCE









ABSTRACT




Linguistic Obfuscation (LO) has been, is and will continue to be an effective method to hide, disguise and/or authenticate communication. This dissertation surveys and evaluates past and current examples, then demonstrates several methods by which LO can augment or replace algorithmic and mathematical cryptology. By implementing LO of this sort users can produce encoded texts which remain meaningful to themselves and to the parties with whom they exchange them, and which offer enhanced resistance to automated data-mining and to human interceptors lacking the requisite linguistic knowledge. They can additionally produce communications which, when encrypted, present greater difficulties for statistical analysis and offer increased resistance to brute-force cryptanalysis.

This paper uses both qualitative and quantitative methods. It documents some of the history of LO, its psycho-linguistic effects and the cognitive foundations of its effectiveness, and the environments in which it has been, is and may be useful. Criticism of successful and unsuccessful methods and their psycholinguistic and cultural foundations, insofar as those are known, or may be veridically conjectured, is also integral to this work.

It also investigates the possibility that cognitive biases result in overconfidence in mathematical and algorithmic encryption methods. Due to the structure and strictures of human cognition, every human enterprise is colored by perception and expectations (Reimer, 2000, p72). There is no reason to believe that cryptology is immune to this effect. Technological progress occurs at an accelerating pace that almost certainly outstrips our naturally evolved capacity to apprehend it (Smith, 2011). The history of the interdependence of computing and cryptology is already exhaustively documented. How quickly both are now changing may not be so well documented; creating such documentation, or truly understanding it, may no longer be entirely humanly possible (Kupiec et al, 1995).

In addition, several prototypes have been created to demonstrate possible applications of LO. Proofs are offered to support the assertions that the texts produced by these programs have different entropy than the plaintexts from which they are derived while still resembling them linguistically, that they are less likely to match any dictionary apt to be used in brute-force attacks, and/or that they appear innocuous while containing messages that may not be.


DECLARATION

I hereby certify that this dissertation constitutes my own product, that where the language of others is set forth, quotation marks so indicate, and that appropriate credit is given where I have used the language, ideas, expressions, or writings of another.

I declare that the dissertation describes original work that has not previously been presented for the award of any other degree of any institution.


Chapter 1. Introduction

Scope

Problem Statement

Approach

Outcome

Chapter 2. History of Linguistic Obfuscation

Prehistory

Evolution of LO

Written Examples

The Voynich Manuscript

The Rohonc Codex

Others

Spoken Examples

Native American code-talkers

Code-switching

Chapter 3. Psycho-linguistic Bases of Linguistic Obfuscation

Causality

Biases

Conditioning

Emotion

Perception of Writing

Why Written LO

Chapter 4. LO Based Solutions

Overview

Transliterator

Scrambler

Substituter

Chapter 5. Methods and Realization

Initial Conception

Concept Refinement

Additional Developments

Chapter 6. Results and Evaluation

Transliterator

Scrambler

Substituter

Overall

Evaluation of Adversaries

Attacks on LO

Chapter 7. Conclusions

Lessons Learned

Future Activity

Prospects for Further Work






  1. Introduction

Scope

This dissertation surveys, criticizes and proposes historical, current and new means of Linguistic Obfuscation (LO). Humans' need to conceal, disguise and authenticate information will continue so long as we persist in having adversarial relationships with one another (Anderson, 2009, Chapter 1). It is possible that this is inherent in our psychological makeup due to the forces of natural selection (Tooby & Cosmides, 2001); it is certainly inevitable given our current social, political and economic systems (Bourdieu & Wacquant, 2000). Insofar as these systems are the result of biological, psycholinguistic and cultural evolution they were and are inevitable and inescapable (Tooby & Cosmides, 2005). Very little appreciable effort is being expended on their reform, because those with the ability to foster change tend to lack the motivation to do so (Lammers et al, 2010). Therefore it is worthwhile to explore and perhaps leverage any and all means by which we can protect the privacy, integrity and identifiability of our communications.

This paper reviews the evolution of linguistic information concealment, disguise and authentication from its hypothesized prehistory and known roots through its continued use alongside mathematical and algorithmic developments. It asserts and supports that there are now technologies available and more being developed which directly impugn the effectiveness of mathematical and algorithmic cryptology as currently practiced. It therefore proposes a reevaluation of linguistic methods to identify when they, or a hybrid approach, may be useful.

Problem Statement

Cryptology as practiced today may often be both overkill from a usability standpoint (Anderson, 2009, Chapter 2) and inadequate from the perspective of absolutely provable protection (Fisher, 2012). For many people it is not easy to set up, use or maintain correctly (Rescorla, 2012), and the ciphertexts it produces are usually useless as communications until deciphered. At the same time much of cryptography relies on problems that are presumed to be “NP-hard” (nondeterministic polynomial-time hard) but which have not been proven to be so (Lund et al, 2007). Proving NP-hardness is a non-trivial problem in itself (Ibid.); to a certain degree the easier a problem is to describe, the easier it is to prove the polynomiality of the complexity of its solution or the lack thereof (Aloupis et al, 2012). Many of the problems used in cryptography are simple enough to describe in outline, but many of the possible permutations and details of solution algorithms may as yet remain undiscovered and/or unpublicized. Insofar as people have come to rely on the prime-factorization of large numbers in encryption they are relying on a problem whose “NP-hardness” is unproven. As mathematical solutions to it have evolved from brute force through the sieve of Eratosthenes and the Kraitchik family of algorithms to Pollard's sieve, they have continued to provide ever more efficient methods to work the problem. And Shor's algorithm promises to make it a polynomial-time problem once a computer exists that can implement it (Shor, 1999).

LO offers a means by which communications may be made relatively more secure than plaintext while still being readily intelligible to the proper parties. In combination with mathematical encryption it also means that, should that encryption be broken, the text may still be unrecognizable, or at least not easily comprehensible, to other parties. This can also impede brute-force cryptanalysis: the “plain-texts” generated by LO won't generally match dictionaries or other standard corpora, and even if they do, that match may not reveal the true meaning. Brute force may therefore yield “correct” solutions that will not be recognized as such, or accept solutions that are actually not correct.

Difficult as many mathematically and algorithmically based encryption and authentication technologies are to use perfectly, they still cannot always absolutely guarantee protection and authenticity (Kleinjung et al, 2010). Cryptanalysis improves at a rate that regularly threatens commonly accepted ciphers with solution. Cryptology is as prone to the Red Queen effect as any other competitive enterprise (Diffie & Hellman, 1977). While cryptography generally does appear to stay a little ahead of cryptanalysis, this is an endless race, and it is quite possible that some of the technologies and processes of the latter are not public knowledge (Bamford, 2012). If Moore's law continues to be accurate it assures a continuous improvement in processing power which, in combination with technologies like parallel and cloud computing, affords problem-solving capacity that is improving at, conservatively, an exponential rate. If it doesn't, this will likely be because of a paradigm shift associated with an explosion of computing power, quite likely from the harnessing of sub-atomic quantum phenomena (LaFlamme, 2004). Assertions like “for a 400 digit product ... the computer would have to run for 10^176 times life of the universe” (Davis, 2003) are simply not meaningful without specifying a time horizon for their validity. But that time estimate must necessarily be imprecise. Future events are simply not, for the most part, predictable with any appreciable accuracy. Quantum computing is making notable progress (Mearian, 2011) but the true dawn of its usability is as yet unknown, and while some of its first applications will be cryptological (Bernstein, 2009) the advent of their true utility can't be firmly predicted. But the paradigm shift this will induce, if and when it does occur, is almost certain to render obsolete many, if not most, of our current encryption algorithms (Dagdalen, 2010).

Approach

This paper begins by surveying the history of LO, looking for the most effective methods implemented so far and critically appraising the possible reasons for their success. It focuses on written LO, for reasons to be explained in Chapter 3, and on the English language, for reasons to be explained in Chapter 4. It then explores the intrinsic and extrinsic ramifications of cognitive bias to LO and cryptology. It concludes by demonstrating some ways in which LO can be applied and evaluating their efficacy.

One application of LO this paper then proposes is graphemic (as much as practicable) transliteration. The most basic protection this method offers is that numerous alphabets are available and it may be possible to choose one that is not well known generally but familiar to the communicating parties. Such communications will therefore be incomprehensible to a majority of people while remaining meaningful to those who know the alphabet(s) and the transliterated language. This method can also be shown to afford greater resistance to statistical analysis after additional encryption, because mappings between alphabets are not one-to-one, and to brute-force attacks, because comprehensive and standardized corpora of transliterated text are lacking.
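The lack of one-to-one mappings between alphabets can be illustrated with a minimal sketch. The Latin-to-Greek table below is a hypothetical one invented for this illustration (it is not the mapping used by the prototype described in later chapters); the function name `transliterate` and the particular letter choices are likewise assumptions.

```python
# Hypothetical Latin-to-Greek table, invented for illustration only.
# Digraphs are checked first so that "th" becomes one theta rather than
# tau followed by eta.
DIGRAPHS = {"th": "θ", "ph": "φ", "ch": "χ", "ps": "ψ"}
SINGLES = {
    "a": "α", "b": "β", "c": "κ", "d": "δ", "e": "ε", "f": "φ",
    "g": "γ", "h": "η", "i": "ι", "j": "ι", "k": "κ", "l": "λ",
    "m": "μ", "n": "ν", "o": "ο", "p": "π", "q": "κ", "r": "ρ",
    "s": "σ", "t": "τ", "u": "υ", "v": "β", "w": "ω", "x": "ξ",
    "y": "υ", "z": "ζ",
}


def transliterate(text: str) -> str:
    """Rewrite Latin-alphabet text in Greek letters (case is discarded)."""
    lowered = text.lower()
    out = []
    i = 0
    while i < len(lowered):
        pair = lowered[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            # Characters without a mapping (spaces, digits) pass through.
            out.append(SINGLES.get(lowered[i], lowered[i]))
            i += 1
    return "".join(out)


print(transliterate("the cat"))  # θε κατ
```

Because "c", "k" and "q" all collapse onto kappa, and "th" collapses onto a single theta, the output cannot be mechanically inverted without knowledge of the underlying language, which is the property the paragraph above relies on.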

Another application to be explored is intra-word letter scrambling. This affords resistance to statistical analysis by enhancing the entropy of a text (Shannon, 1950) and to brute force analysis because the 'plain-text' itself is not truly recognizable as such by most automated methods. It remains, however, comprehensible to human beings with minimal effort (Rayner et al, 2009).
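A minimal sketch of intra-word scrambling follows, assuming that the first and last letters of each word are kept in place and only the interior is shuffled; the function names and the seeded random generator are assumptions made for illustration, not the prototype's actual implementation. One caveat the sketch makes visible: scrambling only rearranges letters, so single-letter frequencies (and hence single-letter entropy) are unchanged. Any change in entropy appears only in higher-order statistics such as letter pairs.

```python
import math
import random
import re
from collections import Counter


def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters, keeping the first and last in place."""
    if len(word) < 4:
        return word  # nothing to shuffle in words of three letters or fewer
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text: str, seed: int = 0) -> str:
    """Scramble every alphabetic run in the text; other characters are kept."""
    rng = random.Random(seed)
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(), rng), text)


def entropy(symbols) -> float:
    """Shannon entropy, in bits per symbol, of an iterable of hashable items."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def bigrams(text: str):
    """Overlapping letter pairs, for measuring second-order statistics."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


plain = "linguistic obfuscation remains readable to human beings"
mixed = scramble_text(plain, seed=1)
print(mixed)
print(entropy(plain), entropy(mixed))                    # equal up to rounding
print(entropy(bigrams(plain)), entropy(bigrams(mixed)))  # may differ
```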

A substitution not unlike transliteration, operating at the word rather than the letter level, will also be demonstrated and evaluated. Its resistance (or susceptibility) to statistical analysis at the word level should be perfectly standard; however, “solutions” at this level are highly unlikely to reveal the message. Statistical analysis at the sentence and higher levels will also still appear within normal bounds. However, again the messages revealed will not necessarily be the messages obfuscated.
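Word-level substitution can be sketched with a shared codebook that maps ordinary words onto other ordinary words. The codebook below is entirely hypothetical and far smaller than anything usable in practice; it is meant only to show why a recovered "plaintext" can be well-formed English without being the message.

```python
# Hypothetical shared codebook: every value is itself an ordinary word, and
# no value is also a key, so the mapping can be inverted unambiguously.
CODEBOOK = {
    "meet": "call",
    "tonight": "tomorrow",
    "bridge": "office",
    "bring": "send",
    "money": "report",
}
DECODE = {cover: secret for secret, cover in CODEBOOK.items()}


def substitute(text: str, table: dict) -> str:
    """Replace each word found in the table; leave all other words alone."""
    return " ".join(table.get(word, word) for word in text.lower().split())


secret = "meet at the bridge tonight"
cover = substitute(secret, CODEBOOK)
print(cover)                      # call at the office tomorrow
print(substitute(cover, DECODE))  # meet at the bridge tonight
```

An interceptor who decrypts the cover sentence sees grammatical English and has no statistical reason to suspect that further decoding is needed.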

Since it advocates leveraging linguistic techniques this paper must involve some linguistic terminology, including: phoneme (“... abstract units of the phonetic system of a language that correspond to a set of similar speech sounds ...” (Webster.1, 2012)), morpheme (“a distinctive collocation of phonemes … having no smaller meaningful parts” (Webster.2, 2012)), grapheme (“a set of units of a writing system … that represent a phoneme” (Webster.3, 2012)), glyph (“a symbolic figure or character” (Webster.4, 2012)), phonotactics (“the analysis and description of the permitted sound sequences of a language” (Webster.5, 2012)), semiotics (the “theory of signs and symbols that deals especially with their function in both artificially constructed and natural languages” (Webster.6, 2012)), transliteration (“to represent or spell in the letters of a different alphabet” (Webster.7, 2012)), alphabet (“a set of letters or other characters with which one or more languages are written” (Webster.8, 2012)), abjad (“... writing systems having symbols for consonants only” (OED, 2012)), syllabary (“... set of written characters each one of which is used to represent a syllable” (Webster.9)) and logograph (“a letter, symbol, or sign used to represent an entire word” (Webster.10)). These terms are defined here, before linguistics is proposed for use in a field currently dominated by algorithms and mathematics, in the event that they are unfamiliar to some readers.

In order to make meaningful assertions and propose language-based solutions in an efficient manner some linguistic terminology, or jargon, must be used: phonemes, so that we can address spoken obfuscation and the interior phonemic representation that literate people generate when reading; morphemes, so that we can group this auditory and/or internal representation into units that convey meaning in some language; graphemes, so that we can refer to these units in their written forms; glyphs, so that we can differentiate between characters with known graphemic correspondences, whether letter, syllabary, abjad or logograph, and those without; phonotactics, so that we can deal with such phenomena as 'accents' and also with rare and/or difficult to reproduce phonemes and morphemes; semiotics, so that we can apply all of this terminology to language, what it symbolizes and its obfuscation; transliteration, because it is common in LO and one of the proposed solutions leverages it; and alphabets, abjads, syllabaries and logographs, so that we can discuss various of the graphemic representational systems available.

Outcome

Numbers are semiotic, much the same as words. They stand for ordinalities and cardinalities as words stand for objects, actions and concepts. But since the nearly universal acceptance of the Arabic numeral system they have generally been single characters in a base ten system. They do not represent portions of sound, as do the letters of an alphabet, symbols in a syllabary or abjad. They are certainly not strictly phonemic and arguably not even morphemic; they might in fact properly be analyzed as logographs. However, since the advent of digital computing technologies, letters are very commonly represented as strings of numbers. In most digital computers these are binary numbers. Ideas are generally framed in sentences which are composed of words which may be broken down further into letters in languages which use alphabetic systems, and these may then be broken down even further into strings of ones and zeroes. Today Unicode is a fairly mature way to achieve this mapping from binary into almost any alphabetical and semiotic numerical system. The proposed examples leverage that encoding. Alternatives like ASCII, EBCDIC and the ISO family of standards are mentioned in passing.
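The chain from letter to number to binary string described above can be made concrete with a short sketch. It assumes UTF-8 as the Unicode encoding form (the text does not say which form the prototypes use), and the helper names `to_bits` and `from_bits` are invented for illustration.

```python
def to_bits(text: str, encoding: str = "utf-8") -> str:
    """Encode text and show each resulting byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


def from_bits(bits: str, encoding: str = "utf-8") -> str:
    """Invert to_bits: parse space-separated bytes and decode them."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode(encoding)


print(ord("A"))        # 65, the Unicode code point of Latin capital A
print(to_bits("A"))    # 01000001
print(to_bits("α"))    # 11001110 10110001 (Greek alpha needs two bytes)
print(from_bits(to_bits("LO α")))
```

The same abstract letter therefore has a different numeric form in every alphabet and encoding, which is what allows transliteration to change the byte-level statistics of a message without changing its linguistic content.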

Surveying the history of this evolution allows us to compare and contrast the successes and failures of representational systems and the ways in which they have been, and are, algorithmically and/or linguistically manipulated to obscure messages. Quantum and massively parallel computing change how effective algorithmic manipulations are (Bernstein, 2009), and LO is one option for making such manipulations more potent. The race between purely mathematical algorithms may not be leveraging the natural strengths of human intelligence: we are generally pretty good at language; computers, as yet, are not. Without a fuller understanding of cognition and fuzzy algorithms it is unlikely they will be very soon (Wang & Hao, 2007). So, given technology's continued progress and the probability that massively parallel and quantum computers are rendering obsolete, or have already rendered obsolete, numerous cryptological algorithms (Tosi et al, 2011), it may well be worth reexamining LO. Since computers are not yet highly adept at natural language processing (Bolshakov & Gelbukh, 2004, P5) and humans are (Chomsky, 1957), it is likely beneficial to analyze successful examples so as to determine why they were or are successful and to derive from this those aspects that may enhance, or in some cases supplant, algorithmic numerical cryptology. Quantum computing may admittedly offer some promise in the areas of Artificial Intelligence and Natural Language Processing as well (Elitzur et al, 2009) (Vitello, 2001), and classical computers are also achieving greater fuzziness (Branavan, 2011). Still, this dissertation focuses on attempts to solve, by leveraging linguistic manipulations, the privacy and authenticity problems currently solved most commonly by algorithmic numerical manipulations.

  2. History of Linguistic Obfuscation

Prehistory

It is almost certain that the history of Linguistic Obfuscation (LO) is as old as humans' use of language. As other animals have been observed to engage in dishonesty and misdirection, it seems safe to assert that our history of prevarication extends further than that of our use of language (Smith, 2005). Therefore it is likewise highly probable that as soon as humans learned to speak we learned to lie. It can even be argued that speech developed at least partially as a means to augment dishonesty (Ibid.). Interpersonal communication is very, if not most, often a means to convince or coerce (Priest, 2004). To maximize the effectiveness of these uses some prevarication is frequently beneficial. From “politically correct” speech to unproven (or unprovable, or provably wrong) aphorisms, language, culture and social relationships are rife with LO. To abstractly represent reality words are to some degree required (Clark, 1993); to truly misrepresent it they are supremely powerful. In the absence of telepathy, language is our only means of communication across time and distances greater than the reach of our immediate senses. Asynchronous communication generally requires some form of recorded language. Transmission of communications beyond the range of our sight, hearing and other senses likewise nearly always requires a means to transmit language. Language can be used for deception and for authentication, and at the same time it is integral to cognition itself (Clark, 2000).

Evolution of LO

It is also quite likely that when humanity became numerate LO rapidly evolved to include numeric obfuscations, encodings and manipulations. However, since most human beings are more readily capable of linguistic manipulation than mathematical, the latter has generally required some external tools, from a code stick to a supercomputer. As technology has improved, numeric obfuscations have evolved into the science known as cryptology. This science is a major force driving the progress of computing; in fact there are numerous examples of how it could be credited with computing's genesis and development.

LO occurs primarily in two forms, as does much language: spoken and written. Language very often occurs in the context of interaction between people. There are many social cues and environments that contextualize and circumscribe it. Written language encompasses media from public organs like newspapers and websites through semi-private ones like email and postal service to 'truly' private ones like those protected by strong encryption or delivered under tamper-proof seals. We can expect to find written language altered both by the social context in which it is used and the perceived level of privacy (Klaehn, 2002). Spoken language is similar. Public speech is often intentionally obfuscated to some degree, whether because an individual doesn't want to communicate too bluntly (or accurately), because press organs don't want to cause civil unrest or for a plethora of other reasons (Antilla, 2008) (Bourdieu, 2000) (USCC, 2010). We might reasonably expect to find semi-public speech somewhat less concealing. And if people believe they are communicating in perfect privacy we might expect almost perfect honesty and accuracy, insofar as they are capable of it. Of course, the very make-up of human psychology and language makes that unlikely, as very few communicants can or want to perfectly transmit their internal cognitive content to others, nor is any person fully cognizant of all aspects of any situation, or even of their own involvement in it. Furthermore, truly private verbal communications are nearly impossible. Sound itself is the biological perception of energy transferred through a gas, and that energy is rather difficult to contain. While numerous technologies (NSA, 2007) (Van Eck, 1996) demonstrate the vulnerability of electro-magnetic devices to snooping, audio eavesdropping is a much older and therefore even more mature application that now involves numerous highly advanced technologies (Google, 2012). In fact it has become so advanced that it may be used as another attack vector against computer security (Yang, 2005).

Written Examples

Circa 52 BC Julius Caesar wrote a message to Cicero in Latin using the Greek alphabet (Caesar, ~58). Presumably this was done in the hope that, were the communication intercepted, its contents would be unintelligible. Whether that was a reasonable expectation given the rather widespread use of the Greek alphabet is likely irrelevant. There is no record of its being looked at by anyone but Cicero. However, its very existence supports two points of this paper: that people believe transliteration heightens security and that belief strongly influences behavior.

Today written LO encompasses propaganda, misinformation, a growing corpus of transliterations akin to Caesar's but of considerably greater complexity, various steganographies, watermarking (Topkara et al, 2005), letter substitution/intentional misspelling and, arguably, source-code obfuscation and botnet command and control.

The Voynich Manuscript

The Voynich Manuscript is quite probably an excellent example of strong written LO. It is also possible that it is a hoax, and that there is actually no meaningful information contained in it. However, if evaluations like Agata & Agata (2009) and Reddy & Knight (2011) are correct then statistical analysis does indicate a near certitude that the text actually does represent some kind of cipher of actual information. The odds of a writer of the time of its creation having the requisite linguo-statistical knowledge to have created such patterns as it evinces are low. That it remains without a generally accepted solution hundreds of years after its creation is a testament to the effectiveness of some forms of LO.

Most scholars agree it was probably written in the 15th or 16th century CE in Europe, judging from such forensic evidence as the parchment on which it was written and the cultural milieu that the content and style of its illustrations appear to reflect. It is a hand-written document and the characters used bear little resemblance to any known alphabetic, abjad or logographic system. Because of these facts it is actually a little hard to say how many unique characters are truly represented.

It consists of 240 pages, 225 of which contain textual information (Ibid.). Currier's proposal that there are approximately 37 unique glyphs (Currier, 1976) is broadly recognized, and has allowed the generation of machine readable transliterations for cryptological analysis. However, neither letter- nor word-level statistical analysis, nor brute-force dictionary comparison, has yielded any generally accepted plaintext.

Nonetheless, it is actually possible that the manuscript has been deciphered. Stojko's assertion that it maps cleanly to Ukrainian without vowels is fairly well-supported by his “translations” (Stojko, 2011). This also points to some of the strengths and weaknesses in LO. If the encoding is strong enough it may be impossible to prove that any particular decipherment is correct. Knight evaluates numerous “translations”, none of which are entirely convincing (Knight, 2009). While this might be considered a strength it is directly related to the liability that, without some idea what the plain-text pertains to, any interpretation may be just about as good as any other. This can render the 'solved' document as good as meaningless (Anderson, 2008). However, if the recipient of a message does understand the context and can assume some of the probable content this is a great strength of LO indeed.

The Voynich Manuscript is so effective, first, because its glyphs are unknown. Some bear a strong resemblance to characters in several well known alphabets and abjads, but none are quite perfect and they certainly don't match any particular system in any appreciable number. Because it is hand-written it is even a little difficult to say which are perfectly unique. This makes mapping for statistical and brute-force analyses inexact. In numerous alphabets very small variations in characters make for completely different letters. How and when, exactly, this occurs in the Voynich is very difficult to say.

Secondly, the illustrations in the Manuscript are not very useful clues; they serve as distractions at least as much as “cribs”. They don't match known plants, cosmology or behaviors with any exactitude. The illustrations are beautiful and fanciful by and large, but they are themselves obfuscations if they bear any relationship to known reality at all. They may well be viewed simply as a distraction; however they likely do bear some relationship to the text and therefore most probably will have to be accounted for in any definitive solution.

Lastly, the provenance of the Voynich Manuscript is unknown. Its pedigree is fairly well documented after it arrived in the possession of Baresch. But its creator and reason for creation remain a mystery; without these facts, authoritatively deciphering it continues to be an unsurmounted challenge.

The Rohonc Codex

The Rohonc Codex is probably another good example of written obfuscation. Again, the characters in which it is written are not readily recognizable. They may be an alphabet, a short-hand, a code or meaningless. Probably written around the 16th or 17th century CE, it remains undeciphered and, relative to the Voynich, not well examined. Lang's analysis is probably the most thorough to date, but he does not describe anyone developing a concordance even as adequate as Stojko's for the Voynich (Lang, 2010). In fact, most of the work on it that he does report is more descriptive than analytic, or incoherent and unrealistic (Ibid.).

Despite its illustrations bearing a rather clear resemblance to known biblical stories, not even these “cribs” have led to even a partial understanding. If it is a hoax, or written in a synthetic language, then there isn't much point in trying to decode it. However, Lang has demonstrated some patterns in its content, relations to the illustrations and of text recurrence, that show it is quite likely worthy of further study; decoding of some sort is likely possible. That little of such has occurred demonstrates how strong a deterrent LO is for messages of little perceived value.

However, Lang's assertion that the obfuscated language is probably European because the illustrations appear to be is not entirely supportable. International trade routes far pre-date the earliest date that he supports for its creation. Lang establishes fairly convincingly that its earliest possible date of creation would have been 1538. However, that does not make it safe to discount the chance that the encoded text is at root an eastern language (Encyclopedia Britannica, 2012), or even, remotely possibly, an American or African one. If such a language has been transliterated or encoded here it has contributed to the production of an as yet completely impenetrable cipher. Again, this demonstrates the considerable power of LO, assuming this is in fact a valid example. And again it demonstrates a liability of it: that without knowledge of context or key material, the message can very easily be lost. Algorithmic/numerical cryptography might be said to share this weakness; however, at this time it does seem that the march of cryptanalysis is very nearly keeping pace, and that all existing algorithmically and/or mathematically based ciphertexts will, in principle at least, be cracked.

The Rohonc remains unsolved partially because, like the Voynich, its alphabetic encoding system is unknown. Unlike the Voynich, however, its illustrations may actually be useful “cribs”. It is likely that if and when it receives sufficient attention it can be solved. Therefore it also remains unsolved because few professional code-breakers have spent much time on it, probably not least because it has no specifically known value (Schmen, 2012).

Others

The Copiale Cipher is another written example that had not been well studied. It has now, however, most likely been deciphered (Knight et al, 2011). It carries an owner's mark dated 1866, although Knight et al date the manuscript itself to roughly 1760-1780, and it uses about 90 distinct characters over 105 pages (Ibid.). That it has yielded to statistical analysis proves that this method can be effective against LO. Knight, Megyesi and Schaefer began with a table of character occurrences, similar to what has been tried against the Voynich. They then performed statistical analysis of letter frequency. After theorizing that all non-Roman characters were meaningless and discarding them, they found this analysis to match natural language character distributions (Ibid.). However, when they brute-forced this text for both language identification and decipherment they found no particular resemblance. Thereupon they reintroduced the non-Roman characters and the brute-force revealed a slight statistical resemblance to German. From that point on they had an idea of the underlying language, and focusing their efforts on its phonemic rules yielded a solution. The Copiale may have been easier to solve than the preceding examples because of its neat handwriting, more recent creation, and/or because it is based on a well-known language.
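The first step described above, building a table of character occurrences and comparing its shape with that of natural language, can be sketched as follows. This is not Knight et al's procedure; the function names and the simple rank-profile distance are assumptions chosen only to show why a one-to-one symbol substitution leaves the frequency profile intact and therefore exposed to this kind of analysis.

```python
from collections import Counter


def frequency_table(symbols) -> dict:
    """Relative frequency of each symbol, most common first."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.most_common()}


def rank_profile(symbols, size: int = 10) -> list:
    """Frequencies sorted from highest to lowest, ignoring which symbol is which."""
    freqs = sorted(frequency_table(symbols).values(), reverse=True)
    return (freqs + [0.0] * size)[:size]


def profile_distance(a: list, b: list) -> float:
    """Sum of absolute differences between two rank profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


plain = "attack at dawn"
# A one-to-one substitution with invented symbols.
cipher = plain.translate(str.maketrans("atckdwn ", "qzxvbnm_"))
print(cipher)
print(profile_distance(rank_profile(plain), rank_profile(cipher)))  # 0.0
```

A scheme that maps one plaintext letter to several glyphs, or several letters to one, flattens or distorts this profile, which is one reason such schemes take more work to break.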

Rongorongo is an as yet largely undeciphered script found primarily on Easter Island. Probably graphemically and phonemically representative of the Rapanui language at the time it was written, it has not yet yielded a definitive, comprehensive translation (Rjabchikov, 1998). There are numerous examples of it, but the current pidgin state of Rapanui and the failure to produce a concordance, in the face of continuing loss of information about Rongorongo's mapping to the language, virtually assure that it will never be completely or accurately deciphered. At the same time, the facts that it maps to a somewhat living language and that there is a rather large corpus of material written in it raise the question of why it has not been deciphered to any appreciable degree with any wide acceptance. Like the Rohonc Codex, Rongorongo has no specifically identified value as yet.

Lorem Ipsum is a sequence of apparently nonsense words used by some publishers, web designers and other content providers to hold the place of meaningful text. It was actually derived from a Latin text (Lorem Ipsum, n.d.) and therefore approximately matches Latin languages in letter frequencies and graphemic rules, such as that three consonants are unlikely to occur without an intervening vowel. It also follows word frequency rules like Zipf's (Newman, 2008), that words of fewer characters are statistically more common. As such its statistical analysis scores at the letter and word level should roughly match those of most Latin alphabet languages. Therefore it could also be used as a decoy, via Unicode substitution or other difficult to distinguish modification, to obfuscate messages steganographically or via watermarking.
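One way the Unicode substitution mentioned above might work is sketched below: a Latin letter in the decoy text is either left alone (bit 0) or swapped for a visually similar Cyrillic letter (bit 1). The choice of homoglyph pairs, the bit convention and the function names are all assumptions made for this illustration; the text does not specify a particular scheme.

```python
# Latin letters paired with visually similar Cyrillic code points.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "c": "\u0441",  # Cyrillic small es
    "p": "\u0440",  # Cyrillic small er
}
REVERSE = {cyr: lat for lat, cyr in HOMOGLYPHS.items()}


def embed(cover: str, bits: str) -> str:
    """Hide a bit string in cover text: a swapped letter is 1, an unswapped one is 0."""
    queue = list(bits)
    out = []
    for ch in cover:
        if ch in HOMOGLYPHS and queue:
            out.append(HOMOGLYPHS[ch] if queue.pop(0) == "1" else ch)
        else:
            out.append(ch)
    if queue:
        raise ValueError("cover text too short for the payload")
    return "".join(out)


def extract(stego: str, n_bits: int) -> str:
    """Read back the first n_bits carrier letters as a bit string."""
    bits = []
    for ch in stego:
        if ch in HOMOGLYPHS:
            bits.append("0")
        elif ch in REVERSE:
            bits.append("1")
        if len(bits) == n_bits:
            break
    return "".join(bits)


cover = "lorem ipsum dolor sit amet consectetur adipiscing elit"
stego = embed(cover, "101101")
print(stego == cover)     # False, although the two render almost identically
print(extract(stego, 6))  # 101101
```

A reader sees only placeholder text, while the recipient who knows the convention recovers the payload; the cost is that any tool which inspects code points rather than rendered glyphs will notice the mixed scripts.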

Unintentional written Linguistic Obfuscation demonstrates how easily meaning can be obscured, whether it is ultimately preserved or lost entirely. That this corpus continues a seemingly never-ending growth in a mammoth amount of perfectly public media virtually assures that the complete solution of all LO is prohibitive. Currently in the author's possession is a DVD of presumably Chinese manufacture of a reissue of an American movie. Its summary reads: “A new Yorker moves to Los Angeles in order to figure out his life while he housesits for his brother, and he soon sparks with his brother's assistant. Florence Mall is the Greenberg family's personal assistant. His day job is to help her deal with all sorts of things employers to meet their various requirements. And Greenberg's luxury home in Hollywood high-profile, elegant life in stark contrast to the Florence Mall lives a simple, low-key life. She was a person living in a small apartment room. Time to time, the Florence will sing in the night markets. Look, she seems to be a promising new singer” (Focus, n.d.). This example is worth examining for several reasons. From a native English speaker's grammatical perspective, it is safe to say that almost every sentence is flawed, and that, overall, it is wrong: it does not properly convey its intended meaning. However, some understanding of Chinese grammar helps in decoding it, and it may be a nearly correct instance of Chinglish (Wenzhong, 1993): Chinese sentences written with English words. The first sentence is, on the surface, merely ungraceful. It does not obey the standard of proper noun capitalization, uses excessive pronouns and doesn't clearly delineate precedent and antecedent. However, it is in fact rendered perfectly ambiguous by its uncontextualized usage of the word “spark”.
Whether this is a direct translation of the Chinese or lifted from another English summary is unknown at this time; however, without the additional context of the remaining sentences it is impossible to say (and even with it, not entirely clear) which of the six definitions of “spark” (Webster.11) this one is using. The second sentence is fine, demonstrating no obvious grammatical or semantic failures. The third sentence, however, isn't and does. Presumably the pronouns have been switched in gender; this isn't terribly unusual considering that “he”, “she” and “it” are homophones in Mandarin Chinese. At about the two nouns “employers” and “things” the syntactic meaning starts getting fuzzy. By the end of the sentence it is lost entirely. The next sentence appears to contrast the life of Greenberg's home with that of Florence. The next sentence is complete and correct, although the switch to past tense is interesting. The following is marred only by the definite article, but “night markets” is more likely a direct translation of Chinese idiom than a transplant from the original DVD cover. And the last sentence is fine, except the imperative “Look” is a little out of place. The point of this analysis is not to criticize Chinglish, or whatever pastiche one believes this to have been written in. It is rather to point out that for all of its mistakes, the gist of it remains fairly clear. But it seems almost certain that any sort of automated analysis would have some trouble making sense of it. It seems equally likely that brute-force analysis of this text, were it encrypted, might have some trouble identifying the “correct” solution. The effects on entropy and statistical analysis of this type of LO will not be explored in this paper.

Bayesian models have made it possible to use computers for language decipherment. They can simultaneously capture both low-level character mappings and high level morphemic correspondences (Snyder et al, 2009). If simple weighting algorithms can be productive in this application, it is quite likely that more complex weighting systems like those available in neural networks will be even more so (O'Reilly & Munakata, 2000). If such systems are brought to bear against the aforementioned unsolved examples of LO it seems certain that solutions should be found eventually, assuming they actually exist.

There are several methods of LO being explored via digital automation today. If one considers computer programming languages to have semantic content, then code obfuscation is linguistic obfuscation. Its applications range from amusement (Wikipedia, 2012) to attempts at enhancing efficiency and protecting code from reverse engineering or copying (Chow et al, 2001) (Linn & Debray, 2003).

Spammers often obfuscate their spam to try to avoid detection (Liu & Stamm, 2007). Unicode is an effective means to achieve this as the duplication of symbols between code pages allows identical or nearly identical graphemic representations to be created which do not clearly match spam signatures. Identifying this type of obfuscation and distinguishing the underlying unsolicited message has led to an interesting algorithm (Ibid.) that could be reversed to produce another form of LO.
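The look-alike substitution described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the algorithm of Liu & Stamm: the HOMOGLYPHS table is a small hand-picked set of Cyrillic letters that resemble Latin ones, and the names disguise and reveal are hypothetical.

```python
# Hand-picked Latin -> Cyrillic look-alikes (an illustrative subset only).
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "x": "\u0445",  # CYRILLIC SMALL LETTER HA
    "y": "\u0443",  # CYRILLIC SMALL LETTER U
}


def disguise(text):
    """Replace each mapped Latin letter with its visual twin."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def reveal(text):
    """Undo disguise() by mapping the look-alikes back to Latin."""
    reverse = {twin: latin for latin, twin in HOMOGLYPHS.items()}
    return "".join(reverse.get(ch, ch) for ch in text)


if __name__ == "__main__":
    hidden = disguise("cheap offer")
    print(hidden, "cheap" in hidden, reveal(hidden))
```

The disguised string looks the same to a human reader but no longer matches a plain substring signature, which is the property the paragraph above attributes to this kind of spam. Note that reveal would also convert genuine Cyrillic letters, so the round trip is only safe for text that was Latin to begin with.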

Botnet command and control is obfuscated to some degree, whether to hide its existence and methods or simply to improve communications efficiency (Command Five, 2012). Insofar as these communications carry semantic information this obfuscation may be classified as LO. Most botnets rely on some form of malware on unwitting victims' computers and are used to perform ethically suspect or illegal activities. Hiding their existence is important from the perspective of protecting them from the computer's legitimate user, who would most likely destroy the malware on discovery; from computer and network security personnel, who would do likewise or at least corral it and examine its behavior; and from law-enforcement or cyber-activists, who might bring charges against the perpetrators or attack their computing infrastructure.

Steganography is fairly easy and common in binary files. So long as the added data doesn't cause a notable corruption of the audio, picture, video or what have you, it is unlikely to be discovered. And by adding encryption and/or algorithmically scattering the secret data within the file(s) it's possible to make it hard to recover even if it is revealed. But steganography is a fairly mature application of LO as well. From specific word choices (Muhammad et al, 2009) to overall syntactic replacement (Chand & Orgun, 2006), language is a fertile field in which to conceal meaning.

Spoken Examples

Any visitor to or resident of a region where their primary language(s) are not the common ones can attest to the difficulty of understanding others and of making themselves understood. This effect is possible to empirically quantify, but it is challenging given the subjective definition of difficulty and the widely varying human capacity for language acquisition (Takashima, 2009). All the same, the validity is measurable; non-native language processing requires more effort than native language processing (Undorf, 2011).

There are numerous examples of spoken LO in human history: from the 'Shibboleth' incident of the Tanakh (Jewish Virtual Library, 2012) to modern code-switching in polyglot populations (Nilep, 2006). One of the most well documented examples is that of the Native American “Code Talkers” employed in the First and Second World Wars. Private and authenticated verbal communication may be possible even on such a public medium as unencrypted radio waves if one can employ speakers of adequately obscure languages.

Simple code-switching (Ibid.), chosen inappropriate idiom and phonotactic steganography are just a few among the many possible obfuscations available in speech, but this paper focuses on written language. The threats posed to current encryption paradigms more directly affect written communications at this time. The technologies for intercepting, recording and analyzing audio are certainly powerful, but spoken language has proven flexible enough, and volatile enough, that automated decipherment of any of the above methods and many others will probably remain a challenge for the foreseeable future.

The ancient Jewish Tanakh describes an incident of linguistic authentication that is commonly referred to as the “Shibboleth” incident (Jewish Virtual Library, 2012). The phonemic repertoire of the Ephraimite tribes did not include a “sh” sibilant. As a result, the Gileadites were able to identify them and kill them. In so doing they not only executed a successful genocide, they did so by leveraging linguistic information in another domain, namely that of ethnic authentication.

Native American code-talkers

The “code talkers” employed by the United States Army in the First and Second World Wars are among the most well-researched and documented cases of spoken LO in history. Those of the Second are particularly well documented: because their use is relatively recent, the language in which it was recorded remains clear, and some relatively modern scientific method has been applied both to their usage and to the documentation of it (Meadows, 2002). Therefore we have both relatively unambiguous documentation of what they did and why, and a reasonably accurate evaluation of how effective it was.

America was unusual in the early twentieth century in having not only indigenous populations that spoke languages which were not widely documented or understood, but also the resources to mount large-scale warfare. This created a situation where speakers of such languages could be effectively employed and supported as Linguistic Obfuscators. Native American languages are phonemically challenging to non-native speakers. Many have proven to be impossible to transcribe accurately in the Latin alphabet (Ibid.).

The phonemic challenges presented by Comanche, Navajo and several other Native American languages extend both to comprehension (the morphemes can be hard to distinguish) and to reproduction (they are difficult to enunciate correctly). As such, code-talkers presented not only the obvious obstacle to deciphering intercepted communications; their use also provided an intrinsic means of authentication, since these communications were exceedingly difficult to falsify.

The success of the code-talkers also supports the assertion that cognitive bias is significant in both discouraging adversaries and in misjudging the difficulty of problems. Bias prevented eavesdroppers from identifying patterns that might have been obvious in standard encryptions of the day. At the same time, evidently bias kept the Axis armies from hiring a translator who could make sense of the communications. Certainly such an acquisition was not beyond their means, but whether because they were convinced of their superiority and therefore judged any communication as of little value or for some other reason based on emotion or logical fallacy, the record shows that they missed deciphering a great deal of LO information.

Code-switching

When Fodor avers that “understanding what someone says typically requires knowing what form of words he uttered” (Fodor, 2007) he is stating more than the obvious philosophical/psycholinguistic point about observation, cognition and inference. He is also making a point about how not to be understood. Polyglots can also testify how easy it is to mask their meaning to some listeners by changing what language they speak mid-utterance. Code switching is the practice of shifting languages in a “single” discourse.

Code switching can be a rather complicated phenomenon. People may change languages mid-discourse as a sociological class-tagging behavior, to obfuscate some words or phrases from some listeners, or simply because the words they want to say are not readily available in the first language or in their vocabulary of it. The motivation may consist of some part(s) of any or all of these (Nilep, 2006). But, pragmatically speaking, code-switching in public is almost certainly LO to some listeners.

  3. Psycho-linguistic Bases of Linguistic Obfuscation

Causality

Language is very likely used even more for intra-personal cognitive processing than it is for interpersonal communication. Stream of consciousness continues whether we are alone or interacting with one another. Whether one chooses to argue for strong causality (Lau, 2004) or weak (Walshe, 1963), it is generally an accepted proposition that language is fundamentally integral to human cognition (Gopnik, 2009). Causality may be defined in this context as the degree to which one's spoken (and thought) language(s) define(s) the thoughts one is able to have. “Specifically, language is the vehicle of non-modular, non-domain specific, conceptual thinking” (Carruthers, 2002). Human beings have been so successful as a species at least in part due to a capacity for this sort of abstracted cognition (Dupoux et al, 2001). If language is fundamental to human communication and thought itself, then the ability to obfuscate it is foundational in concealing, disguising and/or verifying the authenticity of such.

It is also clear that humans are capable of self-deception. It may, in fact, be an inherent part of our cognition, mental health or both (Trivers, 2001). Language is part of the cultural scaffolding by which people have supported and continue to support their activities (Clark, 2000). Without culture, historical progress as we know it would not exist. Without language, culture as we know it would not exist. Since human sensory inputs are too limited to fully apprehend reality, to some extent language must fail to describe reality. How much of this failing is intentional or accidental, how it is recovered from or not, and whether it dovetails into further failures are all questions of LO. Therefore, as discourse is possible within one human being, LO must occur even within individuals.

Biases

The mere act of describing reality requires language. Whether the representation is internal or shared between people, some words are likely to be involved. Human cognition in itself can be described as an exercise, at some level, of linguistic analogy formation (Hofstadter, 2009). Choosing where to focus the analogizing power of language determines the nature of reality as perceived by individuals (Gopnik, 2009). This focus is, to some extent, itself an act of misrepresentation (Trivers, 2001). People choose what to believe, partly based on cultural conditioning, partly on emotion and somewhat filtered by cognitive bias.

The very language one uses is a psychological framework for what can be expressed and how it will be received (Good, 2001). This may be a conscious choice, but it is often merely reactive. Bakhtin's assertion that “verbal discourse is a social phenomenon – social throughout its range and in each and every of its factors” (University of Texas, 1981) is fairly axiomatic; there isn't much cause to dispute it as it is directly supported by the evidence. Then, combined with the above-mentioned analogizing it can be reasonably extended to posit that language is a socially shared analogizing process and product. So some common and important means of LO are jargons and lingos. But all language is unclear to some audience, and it unifies some social groups as it delineates differences. Jargons and lingos also enable brevity, and possibly clarity in many cases. The use of linguistic terminology in this paper is an example.

Fundamental to all cognitive bias may well be the Dunning-Kruger effect. Cognitive biases are probably somewhat less likely to affect decision-making and perception if one is aware of them. Dunning-Kruger states that human beings are very often ignorant of their own ignorance (Dunning, 2011). Another foundation of bias is the human emotional system. Perception itself is colored by emotional states and events (Betti et al, 2009). And as this is plainly a feedback loop, with perceptions further affecting emotion, emotional manipulation can be an effective means to exacerbate cognitive bias.

Another rudimentary cognitive bias is the egocentric (Paul, 2007). The cause of this bias may be somewhat complicated, but is likely rooted in the impression that the self is distinct from others and the environment and the emotional investment in supporting this belief; and it is probably of some benefit to mental health (Gilovich et al, 2000) and has, most likely, like all extant cognitive biases, been positively naturally selected for. However, its symptom is the conviction that one deserves more credit for an event or situation than objectively appropriate. It therefore directly jeopardizes the objective accuracy of an individual's belief system.

Also very salient to this paper's theses, both that information can frequently be hidden in plain sight by leveraging LO and that the effectiveness of cryptology itself must often be prone to misevaluation, is the phenomenon of confirmation bias (Nickerson, 1998). Again, natural selection must have favored, or at least allowed for, confirmation bias or it wouldn't exist. But it causes beliefs to color perception (Ibid.), such that, if one could predict to some degree what beliefs would be produced by a text in some readers (such as that it was meaningless or worthless), one might very well successfully eliminate these individuals as adversaries. Confirmation bias also ensures that those who believe in the irreversibility of the algorithms on which they base their encryption may fail to see existing weaknesses; however, by the same token, if this belief comes to be widely enough held it may also color the perceptions of cryptanalysts, thereby hindering their objective ability to solve.

Conditioning

Belief is the result of experience and conditioning. This paper argues that it may be possible to leverage this conditioning and these biases to hide communications in ways that are very easily deciphered but which, given these human limitations, very frequently will not be. Any visitor to a foreign country or other alien isogloss is likely to be familiar with the mental exhaustion that comes from hours of trying to comprehend a non-native language or dialect. From the neophyte who has no hope at all of understanding anything to those with a high level of competence who can even deal with complex constructs like idiom and metaphor, it has been noted that all non-native speakers will experience some level of cognitive strain, and after some period of time, fatigue. Written language isn't much different in that regard. While the time dependency is completely disparate in that spoken language often must be assimilated and responded to in real-time, the ubiquity of alternate alphabets introduces another obstacle when dealing with writing. And the problems of vocabulary, syntax, grammar and higher-level structures are identical. While one may have functionally infinite time to analyze a piece of writing, the basic inability to penetrate those impediments often remains. However much time one has to devote, the problem may be insoluble without the proper resources such as dictionaries or transliteration tables, and at the very least can produce some emotional reactions like frustration and boredom.

It is also certain that cognitive biases and social conditioning must affect cryptologists' faith in their systems as they do all human enterprise (Mascareño, 2008). Computer science and cryptology are exceedingly rigorous pursuits. This does not make them immune to bias (de Waal, 2003). How difficult an encryption method truly is to crack is generally itself an NP problem. Whether the algorithm relied upon is itself such a problem is actually often unproven. Mathematical proofs are usually rather straightforward and cryptology has developed a powerful and detailed vocabulary for creating proofs of its own (Rogaway & Shrimpton, 2004). However, these proofs cannot possibly completely, accurately and immediately reflect the changing environment in which encryption and attacks on it take place. Not only do novel technologies, like cloud-computing and light/quantum computing, continuously change the billions of instructions per second which are available for encryption and its attackers (Kilin et al, 2007), it is always possible that new algorithms may be developed that astronomically reduce the time required to crack an algorithmic encryption. It is, in fact, perfectly possible that some of this has already taken place and not been publicized (Bamford, 2012), especially given the likelihood that only a state security agency would have the resources to accomplish it and would therefore most likely be impelled not to make it public knowledge for reasons of “national security” or the like.

Emotion

Cognition is influenced by emotion (Cosmides & Tooby, 1992). There is some literature that suggests that it may even be formed, or at the very least heavily informed by it (Eich, 2000). By properly manipulating emotion it is highly probable that cognition can be affected. Producing communications which appear alien via LO of one sort or another is helpful in creating a mood of hopelessness or at least annoyance in a human adversary. Unless this adversary is a professional, or at least experienced, code-breaker it may well be possible to induce them to simply give up the effort. This will be especially productive if the communication is, or is at least perceived to be, low value. An indeterminate value may be adequate in fact; the equation of inconvenience to known value is a complicated one, as adversaries vary widely in competence and conviction and as a communication's value to any specific one of them will vary. It may be productive to simply drown some adversaries in noise. Certainly, defeating data-mining, statistical analysis and dictionary matching is a first, and often relatively easy, step.

This paper assumes that there may be many, many such communications. And given the ever-present and growing surveillance and analysis capabilities of states and corporations (ACLU, 2006) (Donnely, 2011) (Tollet, 2011) there is also a huge, and most probably rapidly increasing, number of adversaries. There are governments building datastores that may be capable of storing an appreciable amount of the world's communications (Bamford, 2012) for analysis. But, in order to analyze this volume of communications it is necessary to automate some of that process. If strong encryption is resource prohibitive to people, whether from available computing power, inconvenience or other reasons, LO can offer a quick and easy way to defeat trivial analyses and data-mining. And if cryptanalysis defeats some forms of cryptology, adding LO before encryption can help ameliorate that.

Perception of Writing

If geons are the fundamental form by which human minds recognize and classify geometric shapes (Biederman, 1987), then graphemes must be effectively constituted of them. How much or how little phonemic information is conveyed by them individually is actually not relevant to this observation. It is trivial to argue that numbers are composed of geons, letters are composed of geons, syllabic characters are composed of geons and logographs are composed of geons. Perceiving written language correctly so as to interpret what it signifies requires recognizing these elements. Individuals who learn different alphabets must recognize very different geon patterns which often represent very different graphemes which generally represent different phonemes which build different morphemes that obey different laws of phonotactics. Superficial physical resemblance is not uncommon; in fact, given the Western character sets' parallel evolution from Proto-Indo-European and Sanskrit (Coulmas, 1989), this similitude occasionally carries through all the way up to pronunciation. In general, however, this resemblance is only that and leads to incorrect interpretations of meaning. Still, that “incorrect” interpretation can be the correct one if the writing is a code system and both the writer and the reader know that.

Why Written LO

This dissertation focuses on written LO. The possibilities for obfuscating language as an audio signal are manifold, as are related tasks like authentication and steganography. However, these are largely algorithmic exercises given the rather straightforward heuristics for the digitization of signals, their verification or the addition of information non-destructive to their manifest content; and the intent of this paper is to explore fuzzier solutions. Code talking, streaming encryption and voice print recognition are all somewhat fuzzy audio applications related to LO. But written language, at this time, seems better suited to available computing power. Computers are continuously getting better at dealing with natural language (Simonite, 2012). But they are still not nearly as strong at it as most human beings, especially when it is presented as an audio signal. Focusing on writing will allow this paper to present solutions that may be useful today.

Even flagship commercial voice recognition technologies such as Dragon Dictate clearly demonstrate current computer inadequacies with spoken language. They must be extensively trained to cope with the idiosyncrasies of a single user's speech. They are also prone to errors when dealing with changing health and emotional conditions of speakers. For example, correct interpretation of speech is adversely affected if the user has a cold. Therefore, expecting a commodity computer system to further analyze such linguistically slippery usage as idiom, code-switching and/or dialects at this time is probably unrealistic. This is not to say that artificial intelligence technologies aren't making great strides in using and learning from natural language; they most certainly are (see Branavan et al, 2011, for example). It is just that combining these with voice recognition to achieve a method of disguising information without destroying it utterly represents a level of complexity beyond this paper.




  4. LO Based Solutions

Overview

Language offers functionally unlimited prospects for methods to conceal and authenticate information. As the limitations of human cognition have yet to be strictly delineated, nor those of language itself, there is no reason at this time to believe that systems which leverage them are very limited. Chapter 3 presented some evidence to the contrary. Because humans represent some information internally with words, words are a type of information. Arguments ranging from ontology to teleology may have some value, but no extreme interpretation matches the evidence. Assertions that language is discrete from human cognition are not easily supportable.

A large variety of work has been and continues to be done in LO. This paper focuses on written language because the author has the most experience with that and believes that a worthwhile contribution is most likely by focusing on that domain given the available time and resources. There is a great deal more to study, and by no means is this meant to suggest that even written LO can receive anywhere near a full treatment here. However, as some focus is required this is it.

This paper also concentrates on LO vis a vis the English language. Again, this is not the most commonly spoken language on Earth by most measures, nor necessarily any more well suited to obfuscation than any other. It is, however, the author's native language and arguably the lingua franca of global discourse at the present time. The fact that Shannon based most of his work on it also makes it very convenient for this paper's arguments about entropy.

Transliterator

One application that is somewhat amenable to automation of Linguistic Obfuscation is transliteration. While Oh et al point out that linguistic information is generally most accurately preserved by a hybrid or correspondence-based model (Oh et al, 2006), this paper isn't looking for the least destructive mode.

Often employed in the teaching of languages, transliteration is the process of representing phonemes (or graphemes) from one language with characters bearing the greatest phonemic resemblance in another alphabet. This is often done to help students with no familiarity with the taught language's alphabet to approximate the sounds using characters with which they are familiar and it is usually done following the phonotactics of their native language as closely as possible. It is currently done most often by human beings, because the rules of phonotactics in any given pair of languages present a challenging problem due to some arbitrariness in every system of semiotic sound representation: rules of letter order, silence, modification of other letter sounds in the same word, and tone (where it is phonemic) and/or stress (where it determines grammatical function) for example. Combined with the fact that very few phonemes are truly identical between languages (or dialects for that matter), this makes the exercise a rather inexact, best effort one. While great strides are being made in making machine-transliteration efforts of nearly equal quality (Ibid.) this paper demonstrates a brute-force, fairly destructive method for the reasons set forth in Chapter 3.

Transliteration is effective as LO for several reasons. First, the characters it produces may be chosen for obscurity with the goal of diminishing the possibility for easy recognition. Second, because the phonemic and graphemic mappings between alphabets are inexact, yet can bear that best effort resemblance, the process can produce irreversible noise and loss in and to a message while not rendering it unintelligible. Third, by requiring Unicode the process will produce information that is, at the hex and binary level, not in the same range with other, more standard code pages, which can inhibit cryptanalysis if communications are additionally encrypted.
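A minimal sketch of an alphabet-to-alphabet transliterator of this general kind follows. It is written in Python for illustration and is not the author's program. The tiny Latin-to-Greek TABLE, the function name transliterate and the interpretation of the "sequential" mode as cycling through the candidate letters in order are all assumptions.

```python
import random

# Hypothetical mapping: each Latin letter lists one or more Greek letters
# of roughly similar sound. Letters with several candidates are the
# "overloaded" cases.
TABLE = {
    "a": ["\u03b1"],            # alpha
    "e": ["\u03b5", "\u03b7"],  # epsilon, eta
    "n": ["\u03bd"],            # nu
    "o": ["\u03bf", "\u03c9"],  # omicron, omega
    "s": ["\u03c3"],            # sigma
    "t": ["\u03c4", "\u03b8"],  # tau, theta
}


def transliterate(text, mode="sequential", rng=None):
    """Map letters through TABLE; unmapped characters pass through.

    mode="sequential" cycles through a letter's candidates in order;
    mode="random" picks one candidate at random. Case is discarded,
    so the transformation is deliberately lossy.
    """
    rng = rng or random.Random()
    used = {}  # how many times each source letter has been seen
    out = []
    for ch in text:
        key = ch.lower()
        candidates = TABLE.get(key)
        if not candidates:
            out.append(ch)
        elif mode == "random":
            out.append(rng.choice(candidates))
        else:
            seen = used.get(key, 0)
            out.append(candidates[seen % len(candidates)])
            used[key] = seen + 1
    return "".join(out)


if __name__ == "__main__":
    result = transliterate("toot notes")
    print(result, result.encode("utf-8").hex())
```

Printing the UTF-8 bytes shows the third point made above: every transliterated letter is encoded outside the single-byte ASCII range, while the letter-to-letter resemblance remains loose enough that the original spelling cannot be recovered with certainty.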

Scrambler

Another possibly useful application of computer-assisted LO is letter scrambling. There is a growing body of literature exploring some of the ways in which information can be destroyed by letter interpolation and replacement (Rayner et al, 2009) yet still recognized. An interesting footnote is the “history” of this research, which was either documented incorrectly or before it occurred (Davis, 2003) (Rawlinson, 1976). While this anecdote is itself a fine demonstration of LO, or more generally, lying, and the psychological power of memes to be suggestive to behavior, it is only mentioned here in passing. What is more germane to the theses of this paper is that letter scrambling destroys information in that the affected words are no longer cogent, and yet preserves it in that they can still be read by human beings with some amount of effort. This amount depends on such variables as the involvement of first and last letter scrambling and the intra-word distance letters travel (Rayner et al, 2009). By restricting scrambling to the interior of words, that is by not moving first and last letters, and by allowing some control of how far letters are likely to move within a word, this demonstration attempts to make that effort both as minimal, potentially, and as tunable as possible while still ensuring the basic destruction of the words' validity.
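The interior scrambling described above might be sketched as follows. This is an illustrative Python sketch rather than the author's implementation. The names scramble_word and scramble_text, the max_distance parameter and the "jittered sort" used to limit how far a letter travels are all assumptions.

```python
import random
import re


def scramble_word(word, max_distance, rng):
    """Shuffle the interior letters of one word.

    Each interior letter gets a sort key equal to its position plus a
    random offset in [0, max_distance]; sorting on that key moves no
    letter more than max_distance places. First and last letters stay.
    """
    if len(word) < 4:
        return word  # nothing to scramble between first and last letter
    interior = list(word[1:-1])
    keyed = [(i + rng.uniform(0, max_distance), ch)
             for i, ch in enumerate(interior)]
    keyed.sort(key=lambda pair: pair[0])
    return word[0] + "".join(ch for _, ch in keyed) + word[-1]


def scramble_text(text, max_distance=2, seed=None):
    """Scramble every alphabetic word in text, leaving the rest intact."""
    rng = random.Random(seed)
    return re.sub(r"[A-Za-z]+",
                  lambda m: scramble_word(m.group(), max_distance, rng),
                  text)


if __name__ == "__main__":
    print(scramble_text("linguistic obfuscation techniques", max_distance=2))
```

A larger max_distance gives a more thorough scramble and, presumably, more effort for the reader, while a value of 0 leaves the text unchanged. The sketch does not guarantee that every word is altered; a short word may come back identical by chance.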

Substituter

A third possible application of LO would be a general substitution at the word level. The proof of concept program uses a lookup table of words and some associated crossword clues. Tables consisting of dictionary entries have also been tested. A table of synonym correspondences might also be used, which could additionally be tuned to produce steganography (Muhammad et al, 2009). The resulting texts from this method will be perfectly normal to statistical analysis at word and sentence level. There will be nothing to brute-force, evidently, because the language is perfectly comprehensible. However, the apparent contents will not be the actual communication.
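A word-level substitution of this sort can be sketched with a small lookup table. This Python sketch is illustrative only and is not the proof of concept program itself. The two sample entries, the tab-delimited layout of SAMPLE_STORE and the names load_table and substitute are hypothetical.

```python
import re

# Hypothetical tab-delimited store: one "word<TAB>clue" pair per line.
SAMPLE_STORE = "meeting\tscheduled gathering\nnoon\tmidday\n"


def load_table(tsv_text):
    """Parse tab-delimited text into a {word: clue} dictionary."""
    table = {}
    for line in tsv_text.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        word, clue = line.split("\t", 1)
        table[word.strip().lower()] = clue.strip()
    return table


def substitute(text, table):
    """Replace each word found in the table with its clue."""
    def swap(match):
        word = match.group()
        return table.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", swap, text)


if __name__ == "__main__":
    table = load_table(SAMPLE_STORE)
    print(substitute("Meeting at noon.", table))
```

The output is ordinary English at the word and sentence level, which is the property claimed above. Recovering the original requires the same table and some human inference, since a clue may plausibly point to more than one word.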

  1. Methods and Realization

Initial Conception

The first proposed implementation of LO when this paper was conceived was transliteration. Initially it was planned as a more accurate phonemic carrying over of letters from one alphabet, through an interim International Phonetic Alphabet (IPA) representation, into the destination alphabet. It was also under consideration whether a rotation through shared alphabets would enhance entropy sufficiently to justify the additional in-band or out-of-band keying effort this would have entailed. However, as there are many other transliterators currently under development, and as Dr. Vella pointed out (Vella, 2012) that even this one would still have been simply a form of substitution cypher, development on this method was halted at a simpler alphabet-to-alphabet model. This model offers a choice of random or sequential modes for selecting among the available near-phonemic matching letters when one side or the other is overloaded.
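The two selection modes can be illustrated with a short sketch. It is written in Python, not the dissertation's Perl, and several things in it are assumptions: the table entries (a handful of Thai characters chosen only as plausible near-phonemic matches), the function name transliterate, and the reading of "sequential" as cycling through the candidates in table order.

```python
import itertools
import random

# Hypothetical one-to-many table: each source letter lists its candidate targets.
TABLE = {
    "k": ["ก", "ข", "ค", "ฆ"],
    "s": ["ส", "ศ", "ษ"],
    "n": ["น"],
}


def transliterate(text, table, mode="sequential", rng=None):
    """Map each character through `table`, one character at a time.

    mode="sequential": take the candidates for a letter in turn, wrapping round.
    mode="random":     pick any candidate for the letter at random.
    Characters with no table entry are copied through unchanged.
    """
    rng = rng or random.Random()
    cyclers = {src: itertools.cycle(targets) for src, targets in table.items()}
    out = []
    for ch in text:
        key = ch.lower()
        if key not in table:
            out.append(ch)
        elif mode == "random":
            out.append(rng.choice(table[key]))
        else:
            out.append(next(cyclers[key]))
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("kkkkk", TABLE, mode="sequential"))
    print(transliterate("kkkkk", TABLE, mode="random"))
```

Sequential selection spreads usage evenly across the candidates, while random selection avoids a predictable rotation; which matters more would depend on the adversary assumed.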

Concept Refinement

As the dissertation widened its focus from transliteration to more general ideas of obfuscation, additional methods which literally obliterated information while leaving it still linguistically comprehensible were sought. Rawlinson's, Davis's and Rayner et al's work on letter scrambling was evaluated and found to fit the bill perfectly. Dr. Vella also expressed curiosity about the effect of replacing words with crossword clues (Vella, 2012), which spawned the substituter.

Additional Developments

Initially the transliterator was proposed to be written in C with data-stores in XML. Development began in Perl, a language that is quick to prototype in and easily ported to C, and performance on large texts was found to be quite acceptable. Therefore the other ideas were also coded in Perl. As the abandonment of an IPA middle state reduced the need to track individual character identities and functions, the simpler tab-delimited data store from the transliterator prototype was kept, and was also used for the substituter.

Working with a competent reader of Arabic elicited several phonemic and graphemic suggestions which have been incorporated into the Arabic/Roman data-stores. As a result the output has been confirmed to be comprehensible with some effort. Unfortunately the additional recommendation to reverse the direction of output (left to right versus right to left) has not been reliably accomplished at this time.

  1. Results and Evaluation

Transliterator

Transliterator output is legible with some effort if one knows the target alphabet and the transliterated language. This has been confirmed both by Thai/English and Arabic/English bilingual individuals reading the raw output, and by reading the output de-transliterated by re-inputting it against the opposite table.

Transliteration provably changes the entropy of texts. Shannon computed the entropy of an “average” English text to be ~2.3 bits per letter, given the 27 meaningful characters (letters and space), letter frequencies and likelihood of letter pairs (Shannon, 1950). Transliterating into any other alphabetical system will therefore raise or lower entropy according to the number of letters available, whether some letter pairs are available as single characters, and whether single characters require a pair. For example, the Thai alphabet contains over 100 characters. As, in the absence of phonemic tonality, four of these are roughly equivalent to the letter 'k', the entropy of this character is increased. In addition, as the letter pair 'ng' is represented by a single character, the likelihood of its occurrence will be the same as that of the pair after transliteration, and the likelihood of 'n' preceding 'g' is almost eliminated. No phonemic match exists for the character 'x', however. Its most common pronunciation might be represented with a 'k' sounding and an 's' sounding character (of which there are three available in Thai). In addition, the sounds an English speaker associates with the letters “v” and “w” are not phonemically distinguished in Thai speech, and there is therefore only one character available for them. Clearly the entropy of transliterated texts requires recomputing, but that computation will depend on the transliteration methodology implemented.
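The direction of these effects can be checked on a small scale with a first-order (single-symbol) entropy calculation. The Python sketch below is an illustration only: it ignores the letter-pair and longer-range statistics behind Shannon's figure, and the helper names, the sample sentence and the placeholder symbols k1 to k4 are assumptions of the sketch rather than anything taken from the transliterator.

```python
import math
from collections import Counter


def unigram_entropy(symbols):
    """First-order Shannon entropy, in bits per symbol, of a symbol sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def merge_pair(text, pair):
    """Split text into symbols, treating each occurrence of `pair` as one symbol."""
    symbols = []
    i = 0
    while i < len(text):
        if text.startswith(pair, i):
            symbols.append(pair)
            i += len(pair)
        else:
            symbols.append(text[i])
            i += 1
    return symbols


def split_letter(symbols, letter, variants):
    """Replace each occurrence of `letter` with the next of several variants in turn."""
    out = []
    seen = 0
    for symbol in symbols:
        if symbol == letter:
            out.append(variants[seen % len(variants)])
            seen += 1
        else:
            out.append(symbol)
    return out


if __name__ == "__main__":
    sample = "king kong is banging a gong"
    print("plain text:        ", unigram_entropy(sample))
    print("'ng' as one symbol:", unigram_entropy(merge_pair(sample, "ng")))
    print("'k' split four ways:",
          unigram_entropy(split_letter(list(sample), "k", ["k1", "k2", "k3", "k4"])))
```

Running the same three measurements over a large corpus, and with pair statistics included, would be needed before drawing any conclusion about a particular transliteration table.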

Scrambler

Scrambler output is readable with an amount of effort roughly proportional to the intensity (floating point weight 0-1) with which the program is run. This has been tested by reading the output.

Given that current estimates are that the English language contains over three-quarters of a million words (Oxford Dictionary, 2012) and thousands of new words are being added yearly, brute-force analysis of scrambled words is probably not a realistic proposition. Using this dissertation's algorithm a four-letter word has two possible permutations, a five-letter word six, a six-letter word twenty-four and so forth, following an (N-2)! curve for a word of N letters; multiplying the number of these variations by the English base yields over a billion “wrods” before one gets to ten letters.
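The arithmetic behind these figures can be reproduced with a few lines of Python. The sketch rests on stated assumptions: a lexicon of 750,000 words, every one of which is treated as having the same length N, which makes each product an upper bound. The second function is an addition of the sketch, showing that repeated interior letters reduce the number of distinct scrambles for a particular word.

```python
import math
from collections import Counter


def interior_orderings(n):
    """(N - 2)!: orderings of the interior of an N-letter word, original included."""
    return math.factorial(max(n - 2, 0))


def distinct_scrambles(word):
    """Distinct interior orderings of one word, allowing for repeated letters."""
    interior = word[1:-1]
    total = math.factorial(len(interior))
    for repeats in Counter(interior).values():
        total //= math.factorial(repeats)
    return total


if __name__ == "__main__":
    lexicon_size = 750_000  # assumed from "over three-quarters of a million words"
    for n in range(4, 10):
        orderings = interior_orderings(n)
        print(n, orderings, lexicon_size * orderings)
```

On these assumptions the product first exceeds one billion at nine letters, where 7! = 5040 orderings multiplied by 750,000 words gives about 3.8 billion.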

Substituter

Substituter output is perfectly comprehensible. Without the proper substitution table the actual communication is invisible, however.

Overall

None of these methods is terribly strong on its own. This is intentional, as one of the theses of this paper is that algorithmic and mathematical encryption are often stronger than they need to be for many purposes. But the transliterator does introduce a generalized unbalanced inconvenience between users and most adversaries, and entropy modification at the letter level; the scrambler does destroy standard dictionary matches and increases entropy at the word level; and the substituter hides messages in plain sight and will also support strategies like Muhammad et al's (2009), Chand & Orgun's (2009) and Topkara et al's (2005).

Evaluation of Adversaries

Because these methods are recommended for their ease of use and transparency, it is very important to understand both one's adversaries' capabilities and commitment, and the true sensitivity of the message to be treated with LO. The NSA is not unique among state agencies in having the mission of intercepting, archiving and analyzing all the information that it possibly can. It may be rather unique in the extent of its capabilities (EFF, 2010). All governments are in the business of coercion to some extent (Glass, 2010). Information gathering and analysis is an important part of maintaining control over a populace. In addition, as of the time of this writing, almost all known governments have somewhat adversarial relationships with at least some other governments. Therefore state security agencies must capture and analyze data on, from and in other nations as well. The NSA is such a superlative entity both because of its heritage and the amount of resources it is allotted (Bamford, 2002). But it is probably safe to say that any state intelligence agency will be capable of breaking LO. Mistakes like those of the Axis powers documented in Chapter 2 are unlikely to be repeated. Some effort is likely expended to decipher all communications collected; egocentric bias is probably not permitted to impede evaluation or solution to the degree that it once was; and the ease of access to a large proportion of human knowledge and thought that such technologies as data-warehousing, the Internet and the World Wide Web have enabled makes solution less problematic. If a Linguistically Obfuscated message is interesting to a state agency they almost certainly will decode it. At the same time, this decoding will require some effort. If one has any desire to challenge the primacy of states and their security and intelligence apparatuses it might be argued that one has an obligation to use methods at least as powerful as LO, to waste their time if nothing else.

Organized crime is also heavily involved in information gathering and analysis. It is also often in the coercion business. While criminals most obviously need to hide information from law enforcement to avoid detection, they also need to authenticate it to positively identify partners in crime and they must also expend some resources on decoding it if they think it contains information useful to them. Again, so long as criminals exist one might argue that one is obliged to use LO at minimum, if only to cause them inconvenience.

Attacks on LO

Transliteration is easily attacked if one recognizes it. Interestingly, preliminary experiments have found that it often is not recognized. The author has posted some examples on the Internet, and they have not been solved to date. In fact, to get input from a competent Arabic speaker it was necessary to explain the process of brute-force transliteration in some detail. It does seem unlikely from this evidence that such rough transliteration will be readily recognizable by a great many people without some explanation or context.

When it is recognized it still presents a couple of challenges. There are alphabets with geonically similar but graphemically/phonemically distinct characters. It will therefore be requisite that an attacker positively and accurately identify the alphabet in which the message is written. The remaining challenge is then to identify the underlying language being (mis)represented. The preponderance of English makes this trivial for many cases, but certainly not for all. Triglots and better may even use shifting underlying languages. However, once the representational system and the represented language(s) have been identified transliteration is as good as solved.

Scrambled words present one basic challenge. They are wrong. As such they do not exist in most dictionaries. One possible attack on scrambled text is, therefore, to build such a dictionary. Given the permutation counts above, this task is rather prohibitive.

  1. Conclusions

Lessons Learned

LO is generally not an extremely strong method of concealing or authenticating information. Given enough time, a sufficiently committed adversary will crack it. Of course, the same can probably be said of algorithmic and mathematical encryptions given the progress documented in the introduction. Still, it is generally not a suitable method to protect bank transactions or military secrets (anymore). However, in the absence of such commitment several examples do remain unsolved, more continue to be created, and those that have been solved did cost some effort. LO is therefore an adequate method for messages of low value, or which it is safe to assume will be assumed to be of low value, and for those with highly time-sensitive contents.

Human beings are, as of yet, relatively good at deciphering noisy and/or lossy linguistic information compared to computers, within limits circumscribed by knowledge and determination. We can leverage this innate skill to create a layer of obfuscation that will enhance entropy and resistance to statistical and brute-force analyses of added encryption, or which can provide adequate security and/or privacy in situations where information is time volatile or not truly sensitive. Such a layer takes advantage of adversaries' limitations: their unwillingness to bother with the deciphering, their inability to complete it before the information is obsolete, or their failure to recognize that deciphering is necessary.

The literature survey shows that LO can be very effective. It also correlates and evaluates literature on the psychological causes and implications of the need for obfuscation and some of the cognitive biases that often enhance its effects. That the same and similar biases, and the unending advance of technology, also demand careful and continuous evaluation and re-evaluation of information concealment methods is also supported. In fact, cognitive bias is so prevalent it may be an integral part of cognition and should therefore probably be studied in any analysis of the use of language (Chomsky, 1957). This paper evaluates this assertion both intrinsically, as to how cryptological problems may benefit from the addition of LO, and extrinsically, as to how we may miss advantageous applications and disadvantageous weaknesses, and associated attack vectors, due to preconceptions and/or other cognitive filters. The proof of concept programs show some ways in which LO might be implemented.

Future Activity

It is probably reasonable to expect that all forms of LO covered in the literature survey will continue to be used and improved. Coded books will continue to be created, if more for entertainment (LaFarge, 2007) than message concealment. Media bias and its attendant message manipulation will likely continue as corporations and governments continue to control large press organs. Source code obfuscation will presumably continue and evolve for all the reasons mentioned in Chapter 2, and transliteration will continue and improve for reasons like Knight and others' research, word borrowing, language teaching and transfer, and spam obfuscation. This dissertation is certainly not alone in looking for ways to obfuscate messages by manipulating language. Many more applications have been and are being found.

Prospects for Further Work

This dissertation has therefore not exhaustively explored all possibilities for written LO specifically. A great many more are possible and may be worth looking into. Synthetic alphabets like those used in the Voynich manuscript, the Rohonc Codex and others have particular promise as extremely difficult problems. However, computers' use of specific character sets makes implementation somewhat challenging.

In addition, voice recognition and natural language processing are coming into their own and may soon present fertile ground for LO. The opportunities this will present for real-time obfuscation and analysis make that a very exciting avenue to pursue. Computers can hold a reasonable facsimile of a conversation already. How long until they can lie and tell jokes?

REFERENCES CITED

ACLU (2006) Eavesdropping 101: What Can the NSA Do? [Online]. Available from: http://www.aclu.org/national-security/eavesdropping-101-what-can-nsa-do (Accessed: March 18, 2012)

Agata, T. & Agata, M. (2009) 'Determining the Possibility of Deciphering an Unintelligible Text by Text Clustering: The Case of the Voynich Manuscript', Library and Information Science, No. 61, Pp1-23 [Online]. Available from: http://mslis.jp/pdf/LIS061001.pdf (Accessed: January 21, 2012)

Anderson, R.J. (2008) Security Engineering – A Guide to Building Dependable Distributed Systems, 2nd ed. Wiley Publishing, Inc. Indianapolis, IN

Antilla, L. (2008) Self-censorship and Science: a geographical review of media coverage of climate points [Online]. Available from: http://pus.sagepub.com/cgi/content/abstract/19/2/240 (Accessed: April 6, 2010)

Bamford, J. (2012) The NSA Is Building the Country's Biggest Spy Center (Watch What You Say) [Online]. Available from: http://www.wired.com/threatlevel/2012/03/ff_nsadatacenter/all/1 (Accessed: 17 March, 2012)

Bamford, J. (2002) Body of Secrets: Anatomy of the Ultra-Secret National Security Agency from the Cold War Through the Dawn of a New Century [Online]. Available from: http://site.ebrary.com.ezproxy.liv.ac.uk/lib/liverpool/docDetail.action?docID=10137477 (Accessed: 11 August, 2011)

Bernstein, D.J. (2009) Grover vs. McEliece [Online]. Available from: http://cr.yp.to/grovercode-20091123.pdf (Accessed: 12 May, 2011)

Betti, V., Zappasodi, F., Rossini, P.M., Aglioti, S.A. & Tecchio, F. (2009) 'Synchronous with Your feelings: Sensorimotor γ Band Synchronization and Empathy for Pain', The Journal of Neuroscience, 29(40) [Online]. Available from: http://www.jneurosci.org/content/29/40/12384.full (Accessed: May 30, 2011)

Biederman, I. (1987) 'Recognition-by-Components: A theory of human image understanding', Psychological Review 94, Pp 115-147.

Bourdieu, P. & Wacquant, L. (2000) Neoliberal Newspeak: Notes on the New Planetary Vulgate [Online]. Available from: http://sociology.berkeley.edu/faculty/wacquant/wacquant_pdf/neoliberal.pdf (Accessed: April 23, 2010)

Branavan, S.R.K., Silver, D. & Barzilay, R. (2011) Learning to Win by Reading Manuals in a Monte-Carlo Framework [Online]. Available from: http://people.csail.mit.edu/regina/my_papers/civ11.pdf (Accessed: July 14, 2011)

Bolshakov, I. & Gelbukh, A. (2004) Computational Linguistics [Online]. Available from: http://www.gelbukh.com/clbook/Computational-Linguistics.pdf (Accessed: 24 May, 2012)

Caesar, Julius (~58) The Gallic Wars, translated by McDevitte, W.A. & Bohn, W.S. [Online]. Available from: http://classics.mit.edu/Caesar/gallic.mb.txt (Accessed: 28 January, 2012)

Carruthers, P. (2002) The cognitive functions of language [Online]. Available from: http://drum.lib.umd.edu/bitstream/1903/4339/3/Cognitive.Functions.of.Language.pdf (Accessed: 2 March, 2011)

Chand, V. & Orgun, C.O. (2009) 'Lexical Steganography: Design and Proof-of-Concept Implementation', Proceedings of the 39th Hawaiian Conference on System Sciences [Online]. Available from: http://www.computer.org/csdl/proceedings/hicss/2006/2507/06/250760126b-abs.html (Accessed: May 27, 2011)

Chow, S., Gu, Y., Johnson, H. & Zakharov, V.A. (2001) An Approach to the Obfuscation of Control-Flow of Sequential Computer Programs [Online]. Available from: http://oberson.postech.ac.kr/bibliography/ISC/2200/22000144.pdf (Accessed: 10 October, 2011)

Clark, A. (1993) Associative Engines: Connectionism, Concepts and Representational Change. London, Oxford University Press.

Clark, A. (2000) Mindware: An Introduction to the Philosophy of Cognitive Science. London, Oxford University Press.

Command Five Pty Ltd (2012) Command and Control in the Fifth Domain [Online]. Available from: http://www.commandfive.com/papers/C5_APT_C2InTheFifthDomain.pdf (Accessed: 4 April, 2012)

Cosmides, L. & Tooby, J. (1992) 'Cognitive adaptations for social exchange', The adapted mind: Evolutionary psychology and the generation of culture, Pp.163-228 [Online]. Available from: http://mudrac.ffzg.hr/~dpolsek/Pages_from_0195101073_The_Adapted_Mind2.pdf

Coulmas, F. (1989) The Writing Systems of the World. Oxford: Blackwell Publishers

Currier, P. (1976) New research on the Voynich Manuscript [Online]. Available from: http://voynich.nu/extra/curr_pdfs.html (Accessed 15 April, 2012)

Dagdelen, Ö (2010) Random Oracles in a Quantum World [Online]. Available from: http://arxiv.org/PS_cache/arxiv/pdf/1008/1008.0931v1.pdf (Accessed: May 12, 2011)

Davis, M. (2003) MRC cognition and brain sciences unit [Online]. Available from: http://www.mrc-cbu.cam.ac.uk/people/matt.davis/Cmabrigde/ (Accessed: 15 February, 2012)

D-Wave (2012) The Quantum Computing Company [Online]. Available from: http://www.dwavesys.com (Accessed 1 April, 2012)

Davis, T. (2003) RSA Encryption [Online]. Available from: http://www.geometer.org/mathcircles/RSA.pdf (Accessed: 14 April, 2012)

de Waal, F.B.M. (2003) 'Silent invasion: Imanishi's primatology and cultural bias in science', Animal Cognition, 6, Pp. 293-299 [Online]. Available from: http://www.psych.utoronto.ca/users/anna/deWaal.pdf (Accessed: 28 May, 2012)

Diffie, W. & Hellman, M.E. (1977) 'Exhaustive Cryptanalysis of the NBS Data Encryption Standard', Computer Magazine [Online]. Available from: www-ee.stanford.edu/~hellman/publications/27.pdf (Accessed: 27 May, 2012)

Dunning, D. (2011) 'The Dunning-Kruger Effect: On Being Ignorant of One's Own Ignorance', Advances in Experimental Social Psychology, 1, Pp247-296 [Online]. Available from: http://www.mendeley.com/research/dunning-kruger-effect-ignorant-ones-own-ignorance/ (Accessed: 25 August, 2011)

Dupoux, E., Mehler, J. et al (2001) Language, Brain and Cognitive Development. Boston, MIT Press.

EFF (2010) NSA Spying FAQ [Online]. Available from: http://www.eff.org/nsa/faq (Accessed: 10 April, 2010)

Eich, Eric (2000) Cognition and Emotion, Oxford University Press [Online]. Available from: http://site.ebrary.com.ezproxy.liv.ac.uk/lib/liverpool/docDetail.action?docID=10142258 (Accessed: 12 October, 2011)

Elitzur, A.C., Schlosshauer, M.A. & Silverman, M.P. (2009) Mind, Matter and Quantum Mechanics [Online]. Available from: http://site.ebrary.com.ezproxy.liv.ac.uk/lib/liverpool/docDetail.action?docID=10279648 (Accessed 14 April, 2010)

Encyclopedia Britannica (2012) Marco Polo [Online]. Available from: http://www.britannica.com/EBchecked/topic/468139/Marco-Polo (Accessed: 5 May, 2012)

Fisher, D. (2012) Survey Finds Secure Sites Not So Secure [Online]. Available from: https://threatpost.com/en_us/blogs/survey-finds-secure-sites-not-so-secure-042712 (Accessed: 29 April, 2012)

Focus Entertainment (n.d.) 'Greenberg', DVD

Fodor (1984) 'Observation Reconsidered', Readings in Philosophy and Cognitive Science, ed. Goldman, A.I., MIT Press, Boston.

Donnelly, J. (2011) Paradigm Shifts [Online]. Available from: http://wikileaks.org/spyfiles/files/0/55_201110-ISS-IAD-T1-GLIMMERGLASS.pdf (Accessed:

Gilovich, T., Medvec, V.H. & Savitsky, K. (2000) 'The Spotlight Effect in Social Judgement: An Egocentric Bias in Estimates of the Salience of One's Own Actions and Appearance', Journal of Personality and Social Psychology, Vol. 78 No. 2, Pp211-222 [Online]. Available from: https://www.msu.edu/course/psy/101/snapshot.afs/altmann/gilovich-medvec-savitsky-2000.pdf (Accessed: 11 May, 2012)

Glass, C. (2010) 'Chomsky's Inner Conservative', Taki's Magazine, August 3, 2010 [Online]. Available from: http://www.chomsky.info/onchomsky/20100803.htm (Accessed: 28 May, 2012)

Good, Peter (2001) Language for Those Who Have Nothing, Kluwer Academic Publishers [Online]. Available from: http://site.ebrary.com.ezproxy.liv.ac.uk/lib/liverpool/docDetail.action?docID=10046971 (Accessed: 12 October, 2011)

Google (2012) audio eavesdropping [Online]. Available from: https://www.google.com/search?q=audio+eavesdropping (Accessed 1 May, 2012)

Gopnik, A. (2009) 'Theories, language and culture: Whorf without wincing' in Language Acquisition and Conceptual Development, ed. Bowerman, M. & Levinson, S.C., Cambridge University Press, Pp. 45-69 [Online]. Available from: http://dx.doi.org/10.1017/CBO9780511620669.004 (Accessed: 1 May, 2011)

Gottesman, D. (2007) Quantum Cryptography: A Tale of Secrets Hidden and Revealed Through the Laws of Physics [Online]. Available from: http://streamer.perimeterinstitute.ca/mp3/5dceef4d-3a15-4530-93f5-674cbd228d82.mp3 (Accessed: 14 April, 2012)

Hofstadter, D. (2009) Analogy as the Core of Cognition [Online]. Available from: http://prelectur.stanford.edu/lecturers/hofstadter/analogy.html (Accessed: 27 January, 2012)

Jewish Virtual Library (2012) Shoftim – Judges [Online]. Available from: http://www.jewishvirtuallibrary.org/jsource/Bible/Judges12.html (Accessed: 22 January, 2012)

Kilin, S., Zokowski, M. & Kowalik, J. (2007) Quantum Communication and Security [Online]. Available from: http://site.ebrary.com.ezproxy.liv.ac.uk/lib/liverpool/docDetail.action?docID=10196608 (Accessed: 2 June, 2011)

Klaehn, J. (2002) 'A Critical Review and Assessment of Herman and Chomsky's Propaganda Model', European Journal of Communication [Online]. Available from: http://www.chomsky.info/onchomsky/2002----02.pdf (Accessed: 28 May, 2012)

Kleinjung, T., Aoki, K., Franke, J., Lenstra, A.K., Thomé, E., Bos, J.W., Gaudry, P., Kruppa, A., Montgomery, P.L., Osvik, D.A., te Riele, H., Timofeev, A. & Zimmermann, P. (2010) Factorization of a 768-bit RSA modulus [Online]. Available from: http://hal.archives-ouvertes.fr/docs/00/44/46/93/PDF/rsa768.pdf (Accessed: October 29, 2011)

Kupiec, J., Pedersen, J. & Chen, F. (1995) 'A Trainable Document Summarizer' [Online]. Available from: http://www.csie.ntnu.edu.tw/~g96470318/A_trainable_document_summarizer_.pdf (Accessed 1 May, 2012)

Knight, K., Megyesi, B. & Schaefer, C. (2011) The Copiale Cypher [Online]. Available from: http://www.aclweb.org/anthology-new/W/W11/W11-12.pdf#page=12 (Accessed 3 February, 2012)

Knight, K. (2009) The Voynich Manuscript – a mystery [Online]. Available from: http://www.isi.edu/natural-language/people/voynich.pdf (Accessed: 12 March, 2012)

LaFarge, A. (2007) Codex Seraphinianus | fictive art [Online]. Available from: http://fictive.arts.uci.edu/codex_seraphinianus (Accessed: 25 May, 2012)

LaFlamme, R. (2004) 'Harnessing The Quantum World', Perimeter Institute Public Lecture Series [Online]. Available from: http://www.perimeterinstitute.ca/Outreach/Public_Lectures/View_Past_Public_Lectures/ (Accessed: 22 April, 2012)

Lang, B. (2010) 'Why Don't We Decipher an Outdated Cypher System? The Codex of Rohonc', Cryptologia, Volume 34, Issue 2, Pp. 115-144 [Online]. Available from: http://mycite.omikk.bme.hu/doc/84510.pdf (Accessed: 7 January, 2011)

Lammers, J., Stapel, D.A., & Galinsky, A. (2010) Power Increases Hypocrisy: Moralizing in Reason, Immorality in Behavior [Online]. Available from: http://pss.sagepub.com/content/21/5/737 (Accessed: 1 March, 2012)

Lau, G.K. (2004) 'Chinese hostages to their writing system: A case for simplification and reform' China Daily [Online]. Available from: http://www.chinadaily.com.cn/english/doc/2004-02/07/content_304083.htm (Accessed 18 April, 2010)

Linn, C. & Debray, S. (2003) Obfuscation of executable code to improve resistance to static disassembly [Online]. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.9662&rep=rep1&type=pdf (Accessed: 12 October, 2011)

Liu, C. & Stamm, S. (2007) Fighting Unicode-obfuscated spam [Online]. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.9653&rep=rep1&type=pdf (Accessed: 8 October, 2011)

Lorem Ipsum (n.d.) All the facts - Lipsum generator [Online]. Available from: http://www.lipsum.com/ (Accessed: 25 May, 2012)

Lund, C., Fortnow, L, Karloff, H. & Nisan, N. (2007) Algebraic Methods for Interactive Proof Systems [Online]. Available from: portal.acm.org/citation.cfm?id=146605 (Accessed: 21 November, 2010)

Mascareño, A. (2008) Communication and Cognition: The Social Beyond Language, Interaction and Culture [Online]. Available from: http://www.sociologia.uahurtado.cl/publicaciones/12124_2007_9046_OnlinePDF.pdf (Accessed February 12, 2012)

Meadows, W.C. (2002) Comanche Code Talkers of World War II [Online]. Available from: http://site.ebrary.com.ezproxy.liv.ac.uk/lib/liverpool/docDetail.action?docID=10217891 (Accessed: 21 January, 2012)

Mearian, L. (2012) IBM touts quantum computing breakthrough [Online]. Available from: http://www.computerworld.com/s/article/9224670/IBM_touts_quantum_computing_breakthrough (Accessed: 29 February, 2012)

Merriam-Webster Dictionary (2012) Phoneme [Online]. Available from: http://www.merriam-webster.com/dictionary/phoneme (Accessed: 21 January, 2012)

Merriam-Webster Dictionary (2012) Morpheme [Online]. Available from: http://www.merriam-webster.com/dictionary/morpheme (Accessed: 21 January, 2012)

Merriam-Webster Dictionary (2012) Grapheme [Online]. Available from: http://www.merriam-webster.com/dictionary/grapheme (Accessed: 21 January, 2012)

Merriam-Webster Dictionary (2012) Glyph [Online]. Available from: http://www.merriam-webster.com/dictionary/glyph (Accessed: 21 January, 2012)

Merriam-Webster Dictionary (2012) Phonotactics [Online]. Available from: http://www.merriam-webster.com/dictionary/phonotactics (Accessed: 21 January, 2012)

Merriam-Webster Dictionary (2012) Semiotics [Online]. Available from: http://www.merriam-webster.com/dictionary/semiotics (Accessed: 21 January, 2012)

Merriam-Webster Dictionary (2012) Transliterate [Online]. Available from: http://www.merriam-webster.com/dictionary/transliteration (Accessed: 21 January, 2012)

Merriam-Webster Dictionary (2012) Alphabet [Online]. Available from: http://www.merriam-webster.com/dictionary/alphabet (Accessed: 20 May, 2012)

Merriam-Webster Dictionary (2012) Syllabary [Online]. Available from: http://www.merriam-webster.com/dictionary/syllabary (Accessed: 20 May, 2012)

Merriam-Webster Dictionary (2012) Spark [Online]. Available from: http://www.merriam-webster.com/dictionary/spark (Accessed: 20 May, 2012)

Moravec, H. (1997) When will computer hardware match the human brain? [Online]. Available from: http://web.archive.org/web/20060615031852/http://transhumanist.com/volume1/moravec.htm (Accessed: 1 May, 2012)

Muhammad, Z.H., Rahman, S.M.S.A.A. & Shakil, A. (2009) 'Synonym Based Malay Linguistic Text Steganography', CITISIA [Online]. Available from: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5224169 (Accessed: 31 December, 2011)

Newman, M.E. (2008) Power Laws, Pareto distributions and Zipf's law [Online]. Available from: http://arxiv.org/pdf/cond-mat/0412004 (Accessed: 20 September, 2011)

Nickerson, R. (1998) 'Confirmation Bias: A Ubiquitous Phenomenon in Many Guises', Review of General Psychology, Vol. 2, No. 2, Pp175-220 [Online]. Available from: http://psy2.ucsd.edu/~mckenzie/nickersonConfirmationBias.pdf (Accessed: 7 October, 2011)

Nilep, C. (2006) 'Code-switching' in Socio-Cultural Linguistics [Online]. Available from: http://colorado.edu/ling/CRIL/Volume19_Issue1/paper_NILEP.pdf (Accessed 4 March, 2011)

NSA (2007) TEMPEST: A Signal Problem [Online]. Available from: http://www.nsa.gov/public_info/_files/cryptologic_spectrum/tempest.pdf (Accessed: 3 February, 2010)

O'Reilly, R.C. & Munakata, Y. (2000) Computational Explorations in Cognitive Neuroscience, MIT Press, Boston.

Oh, J-H., Choi, K-S. & Isahara, H. (2006) 'A Comparison of Machine Transliteration Models', Journal of Artificial Intelligence Research 27, Pp119-151 [Online]. Available from: http://arxiv.org/pdf/1110.1391.pdf (Accessed: 21 May, 2012)

Oxford Dictionary (2012) How many words are there in the English language? [Online]. Available from: http://oxforddictionaries.com/words/how-many-words-are-there-in-the-english-language (Accessed: 28 May, 2012)

Oxford English Dictionary (2012) Abjad [Online]. Available from: http://oed.com/view/Entry/271930 (Accessed: 20 May, 2012)

Paul, R. (2007) The Miniature Guide to Critical Thinking Concepts and Tools [Online]. Available from: http://www.criticalthinking.org/files/Concepts_Tools.pdf (Accessed: 13 July, 2010)

Priest, W.C. (2004) 'Media Concentration: A Case of Power, Ego, and Greed Confronting our Sensibilities', American University Law Review, Vol. 53, Issue 6 [Online]. Available from: http://digitalcommons.wcl.american.edu/cgi/viewcontent.cgi?article=1118&context=aulr

Reddy, S. & Knight, K. (2011) What We Know About the Voynich Manuscript [Online]. Available from: http://www.isi.edu/natural-language/people/voynich-11.pdf (Accessed: 25 January, 2012)

Rayner, K., White, S.J., Johnson, R.L. & Liversedge, S.P. (2009) Raeding Wrods with Jubmled Lettres – There Is A Cost [Online]. Available from: http://www.cnbc.cmu.edu/~plaut/VisCog/papers/RaynerETAL06PsySci.jumbledLetters.pdf

Rawlinson, G. (1976) The Significance of Letter Position in Word Recognition [Online]. Available from: http://opentype.info/static/Letter-Position-in-Word-Recognition.html (Accessed: 8 March, 2012)

Rescorla, E. (2012) Stone Knives and Bear Skins: Why does the Internet still run on pre-historic cryptography? [Online]. Available from: http://2011.indocrypt.org/slides/rescorla.pdf (Accessed: 5 May, 2012)

Reimer, J. (2000) Tucson 2000: A Whirlwind Tour [Online]. Available from: http://www.findthatfile.com/search-22345629-hPDF/download-documents-tucson-reimer.pdf.htm (Accessed: 7 October, 2011)

Rjabchikov, S.V. (1998) Some Remarks on Rongorongo [Online]. Available from: http://rongorongo.chat.ru/artrr2.htm (Accessed: 18 May, 2012)

Rogaway, P. & Shrimpton, T. (2004) Cryptographic Hash-Function Basics: Definitions, Implications, and Separations for Pre-image Resistance, Second Pre-image Resistance and Collision Resistance [Online]. Available from: http://www.cs.ucdavis.edu/~rogaway/papers/relates.pdf (Accessed: 2 February, 2012)

Shannon, C.E. (1950) Prediction and Entropy of Written English [Online]. Available from: http://www.princeton.edu/~wbialek/rome/refs/shannon_51.pdf (Accessed: 2 May, 2012)

Schmeh, K. (2012) The Pathology of Cryptology [Online]. Available from: http://www.tandfonline.com/doi/abs/10.1080/01611194.2011.632803 (Accessed: 27 January, 2012)

Schneier, B. (2012) Liars and Outliers. John Wiley and Sons.

Shor, P.W. (1999) Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer [Online]. Available from: http://eecourses.technion.ac.il/046232/papers_for_posters/Shor_303.pdf (Accessed: 5 May, 2012)

Simonite, T. (2012) 'Software Translates Your Voice Into Another Language', Technology Review [Online]. Available from: http://www.technologyreview.com/computing/39885/page1/ (Accessed: 18 March, 2012)

Smith, D.L. (2005) 'Natural-born Liars', Scientific American Mind, Vol. 16 No. 2, Pp. 16-23.

Smith, K. (2011) 'Learning Bias, Cultural Evolution of Language, and the Biological Evolution of the Language Faculty', Human Biology, Vol. 83, No. 2, pp. 261–278 [Online]. Available from: http://ehis.ebscohost.com.ezproxy.liv.ac.uk/eds/pdfviewer/pdfviewer?vid=2&hid=103&sid=1c72d637-9b51-442f-b863-09f395f10b09%40sessionmgr104 (Accessed: 28 May, 2012)

Snyder, B., Barzilay, R. & Knight, K. (2009) A Statistical Model for Lost Language Decipherment [Online]. Available from: http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf (Accessed: 25 January, 2012)

Stojko, J. (2011) The Voynich Manuscript [Online]. Available from: http://en.wikibooks.org/wiki/The_Voynich_Manuscript/John_Stojko (Accessed: 25 January, 2012)

Takashima, H. (2009) Comparing Ease-of-Processing Values of the Same Set of Words for Native English Speakers and Japanese Learners of English [Online]. Available from: http://ehis.ebscohost.com.ezproxy.liv.ac.uk/ehost/detail?sid=a5a8b902-15ba-4ca1-abf0-01dd106eebde%40sessionmgr113&vid=1&hid=124&bdata=JnNpdGU9ZWhvc3QtbG12ZSZzY29wZT1zaXR1 (Accessed: 20 July, 2011)

The University of Texas, M.M. Bakhtin (1981) The Dialogic Imagination [Online]. Available from: http://www.public.iastate.edu/~carlos/607/readings/bakhtin.pdf (Accessed: 6 May, 2012)

Topkara, M., Taskiran, C.M. & Delp, E.J. (2005) Natural Language Watermarking [Online]. Available from: http://www.cerias.purdue.edu/tools_and_resources/bibtex_archive/archive/PSI000441.pdf (Accessed: 12 November, 2011)

Tosi, G., Christmann, G., Berloff, N.G., Tsotis, O., Gao, T., Hatzopoulos, Z., Savvidis, P.G. & Baumberg, J.J. (2011) Sculpting oscillators with light within a nonlinear quantum fluid [Online]. Available from: arxiv.org/pdf/1111.7133 (Accessed: 10 January, 2012)

Trivers, R. (2001) A Scientific Theory of Self-Deception [Online]. Available from: http://www.warsocialism.com/_Biology/AScientificTheoryOfSelfDeception.pdf (Accessed: April 15, 2012)

Tollet, J. (2011) Interception at 100 Gbps and more [Online]. Available from: http://wikileaks.org/spyfiles/docs/qosmos/58_interception-at-100-gbps-and-more.html

Tooby, J. & Cosmides, L. (1998) The Evolution of War and its Cognitive Foundations [Online]. Available from: http://www.psych.ucsb.edu/research/cep/papers/EvolutionofWar.pdf (Accessed: March 20, 2012)

Tooby, J. & Cosmides, L. (2005) On the Universality of Human Nature and the Uniqueness of the Individual: The Role of Genetics and Adaptation [Online]. Available from: http://citeseerx.ist.psu.edu/viewdoc/download/doi/10.1.1.158.433 (Accessed: 31 December, 2011)

Undorf, M. (2011) Judgements of Learning Reflect Encoding Fluency: Conclusive Evidence for the Ease-of-Processing Hypothesis [Online]. Available from: http://ehis.ebscohost.com.ezproxy.liv.ac.uk/ehost/detail?sid=c5332bf2-74eb-4bb8-90e9-44cdf8eab773%40sessionmgr115&vid=2&hid=124&bdata=JnNpdGU9ZWhvc3QtbGl2ZSZzY29wZT1zaXRl#db=pdh&AN=xlm-37-5-1264 (Accessed: 27 January, 2012)

USCC (2010) External Implications of China's Internet-related Activities [Online]. Available from: http://www.uscc.gov/annual_report/2010/Chapter5_Section_2(page236).pdf (Accessed: 11 April, 2012)

Van Eck, W. (1996) Electromagnetic Radiation from Video Display Units: An Eavesdropping Risk? [Online]. Available from: http://cryptome.org/emr.pdf (Accessed: February 4, 2010)

Vella, A. (2012) Group Discussion Board [Online]. Available from: https://elearning.uol.ohecampus.com/webapps/discussionboard/do/conference?action=list_forums&course_id=_208355_1&nav=group_forum&group_id=_180370_1 (Accessed: 29 May, 2012)

Vitiello, G. (2001) My Double Unveiled: The Dissipative Quantum Model of the Brain [Online]. Available from: http://site.ebrary.com.ezproxy.liv.ac.uk/lib/liverpool/docDetail.action?docID=5004969 (Accessed: 22 August 2010)

Walshe (1963) 'Wet Pint' German Life and Letters Vol: 16 Issue: 3-4 ISSN: 0016-8777 Date: 04/1963 Pages: 290 - 293 [Online]. Available from: http://liv.summon.serialssolutions.com.ezproxy.liv.ac.uk/search?s.q=linguistic+causation (Accessed 18 April, 2010)

Wang, J-H. & Hao, J. (2007) 'An Approach to Computing With Words Based on Canonical Characteristic Values of Linguistic Labels', IEEE Transactions on Fuzzy Systems, Vol. 15, No. 4, pp. 593-604 [Online]. Available from: http://ieeexplore.ieee.org/iel5/91/4286958/04286980.pdf (Accessed: December 21, 2011)

Wenzhong, L. (1993) China English and Chinglish [Online]. Available from: http://en.cnki.com.cn/Article_en/CJFDTOTAL-WJYY199304003.htm (Accessed: 20 May, 2012)

Wikipedia (2012) Obfuscated Perl Contest [Online]. Available from: http://en.wikipedia.org/wiki/Obfuscated_Perl_Contest (Accessed: 20 May, 2012)

Yang, S. (2005) Researchers recover typed text using audio recording of keystrokes [Online]. Available from: http://berkeley.edu/news/media/releases/2005/09/14_key.shtml (Accessed: 28 May, 2012)


APPENDICES


            1. Obfuscator Source Code and Data-Stores

***trans.pl***

#!/usr/bin/perl
# Transliterate STDIN character by character using a tab-separated table.

use strict;
use warnings;

binmode STDOUT, ":utf8";
binmode STDIN,  ":utf8";

# check for correct arguments
if (@ARGV < 1) {
    print "\nUsage: trans.pl table_name [mode]\n";
    exit;
}

my $table = $ARGV[0];
my $mode  = defined $ARGV[1] ? $ARGV[1] : '';

open my $tbl_fh, '<:encoding(utf8)', $table
    or die "Cannot open table '$table': $!\n";

# Parse the table once into [key, replacement] pairs.
my @tbl;
while (my $entry = <$tbl_fh>) {
    chomp $entry;
    my ($key, $str) = split /\t/, $entry;
    push @tbl, [$key, $str] if defined $str;
}
close $tbl_fh;

my @inp = <STDIN>;
chomp(@inp);

foreach my $line (@inp) {
    foreach my $char (split //, $line) {
        # Spaces produce no output.
        next if $char eq ' ';
        $char = uc($char);

        # Collect every replacement listed for this character. Matching is
        # per character, so multi-letter keys such as CH are never selected.
        my @out_chars = map { $_->[1] } grep { $_->[0] eq $char } @tbl;

        # Characters with no table entry produce no output.
        next unless @out_chars;

        # "random" mode picks any candidate; otherwise the last one listed is used.
        my $out_char = $mode eq 'random'
            ? $out_chars[rand @out_chars]
            : $out_chars[-1];
        print $out_char;
    }
}


***thai2rom.txt***

A


A


B


C


CH


CH


D


D


E


E


E


F


F


G


H


H


I


I


J


K


K


K


K


L


L


L


M


N


N


NG


O


P


P


P


R


R


S


S


S


T


T


T


T


TH


U


U


V


W


Y


***rom2thai.txt***

A


A


B


C


CH


CH


D


D


E


E


E


F


F


G


H


H


I


I


J


K


K


K


K


L


L


L


M


N


N


NG


O


P


P


P


R


R


S


S


S


T


T


T


T


TH


U


U


V


W


Y

***rom2cyr.txt***

A А


B Б


V В


G Г


D Д


E Е


YO Ё


Z Ж


Z З


I И


Y Й


K К


L Л


M М


N Н


O О


P П


R Р


S С


T Т


U У


F Ф


H Х


TS Ц


C Ч


SH Ш


SH Щ


YU Ю


YA Я


***rom2arb.txt***

A ا


B ب


T ت


TH ث


J ج


H ح


KH خ


D د


TH ذ


R ر


Z ز


S س


SH ش


S ص


D ض


DT ط


TH ظ


AH ع


RI غ


F ف


Q ق


C ك


K ك


L ل


M م


N ن


H ه


W و


Y ي


O ء


A َ


U ُ


I ِ



cyr2rom.txt and arb2rom.txt are omitted for brevity: each is identical to its counterpart (rom2cyr.txt and rom2arb.txt respectively) with the two columns reversed, as exemplified by rom2thai.txt vs. thai2rom.txt.
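
A reversed table of this kind can be generated rather than maintained by hand. The following is a minimal sketch, assuming the tables are tab-separated with one key and one replacement per line, as trans.pl expects. The subroutine name reverse_table is hypothetical and is not part of the original tool set.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the example data below contains literal Cyrillic characters

# Hypothetical helper: swap the two tab-separated columns of each table line.
# Lines without a tab are skipped.
sub reverse_table {
    my @lines = @_;
    my @reversed;
    foreach my $line (@lines) {
        chomp(my $entry = $line);
        my ($key, $str) = split /\t/, $entry;
        next unless defined $str;
        push @reversed, "$str\t$key";
    }
    return @reversed;
}

# Example with three rom2cyr.txt entries; prints the reversed entries.
binmode STDOUT, ":utf8";
my @rom2cyr = ("A\tА\n", "B\tБ\n", "V\tВ\n");
print "$_\n" for reverse_table(@rom2cyr);
```

To convert a whole file, the same subroutine could be applied to the lines read from the forward table. Where two keys share one replacement (as C and K do in rom2arb.txt), the reversed table simply lists that glyph twice, once for each Roman alternative.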

            2. Scrambler Source Code

#!/usr/bin/perl
# Swap adjacent interior letters of each word with probability $weight,
# leaving the first and last letters in place.

use strict;
use warnings;

# Optional argument: swap probability (defaults to 0.5 when absent or zero).
my $weight = defined $ARGV[0] ? $ARGV[0] : 0;
$weight = 0.5 if $weight == 0;

my @inp = <STDIN>;
chomp(@inp);

foreach my $line (@inp) {
    foreach my $word (split / /, $line) {
        my @chars = split //, $word;

        # Position $i may trade places with position $i + 1; stopping at
        # $#chars - 2 keeps the final letter fixed.
        for my $i (1 .. $#chars - 2) {
            if (rand(1) < $weight) {
                #swap
                @chars[$i, $i + 1] = @chars[$i + 1, $i];
            }
        }

        # Words of three letters or fewer have no interior pair to swap
        # and are printed unchanged.
        print @chars, " ";
    }
    print "\n";
}



            3. Substituter Source Code and Data-Store

***subst.pl***

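#!/usr/bin/perl
# Replace each input word with its entry from a tab-separated table.

use strict;
use warnings;

binmode STDOUT, ":utf8";
binmode STDIN,  ":utf8";

# check for correct arguments
if (@ARGV != 1) {
    print "\nUsage: subst.pl table_name\n";
    exit;
}

my $table = $ARGV[0];

open my $tbl_fh, '<:encoding(utf8)', $table
    or die "Cannot open table '$table': $!\n";

# Parse the table once into [key, replacement] pairs.
my @tbl;
while (my $entry = <$tbl_fh>) {
    chomp $entry;
    my ($key, $str) = split /\t/, $entry;
    push @tbl, [$key, $str] if defined $str;
}
close $tbl_fh;

my @inp = <STDIN>;
chomp(@inp);

foreach my $line (@inp) {
    foreach my $word (split / /, $line) {
        # Print every replacement whose key equals the word exactly;
        # words with no table entry produce no output.
        foreach my $pair (@tbl) {
            print $pair->[1], " " if $word eq $pair->[0];
        }
    }
    print "\n";
}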


I have not received permission to use any crossword dictionary at this time, so I have created a very small one of my own for the proof of concept. Some testing was also performed with the Wiktionary database, which is not included.

***cross.txt***

aardvark has a nose for ants,


abacus count your beads,


bee stinging letter,


carrot food for the eyes,


dog fleas best friend,


elephant three blind men see it differently,


frog adult tadpole,


golf game with caddies,


house has many mansions,


is depends on what the meaning of is,


joke is on you,


kangaroo baby's in its pocket,


lion simba-lic leader,


my dog has fleas,


name a rose by any other,


octopus eight shoes required,


penguin bird with a tux,


question begets an answer,


ranch the country dressing,


slalom a race between poles,


small not large,


target concentric rings,


umbrella parasol's alter-ego,


vampire leaves you with a fang-over,


walrus not an egg-man,


xenophobe X-rated racist,


yarn kitty loves a good story,


zamboni a real ice-raker,
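
For larger dictionaries, such as the Wiktionary data mentioned above, scanning the whole table for every input word becomes slow. The following is a minimal sketch of one alternative, assuming the same tab-separated layout that subst.pl reads. The names load_table and substitute are hypothetical. Unlike subst.pl, this sketch passes unknown words through unchanged, which is a design assumption and not the behavior of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a hash from each word to its list of clues.
# Lines without a tab are skipped.
sub load_table {
    my @lines = @_;
    my %clues;
    foreach my $line (@lines) {
        chomp(my $entry = $line);
        my ($key, $str) = split /\t/, $entry;
        next unless defined $str;
        push @{ $clues{$key} }, $str;
    }
    return \%clues;
}

# Hypothetical helper: replace each known word with its clues, in table
# order; words without an entry are kept as they are (an assumption).
sub substitute {
    my ($clues, $line) = @_;
    my @out;
    foreach my $word (split / /, $line) {
        push @out, exists($clues->{$word}) ? @{ $clues->{$word} } : $word;
    }
    return join ' ', @out;
}

# Example using two entries in the cross.txt layout.
my $clues = load_table("dog\tfleas best friend,\n", "frog\tadult tadpole,\n");
print substitute($clues, "the dog saw a frog"), "\n";
```

Each word is then resolved with a single hash lookup, so the cost per word no longer grows with the size of the dictionary.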