Steganography aims to hide the very presence of covert communications by imperceptibly
embedding secret information into innocent-looking carriers, such as digital
images, videos, texts, etc. With the increasing use of computer and the development
of Internet, text is becoming increasingly popular in daily life. This motivates
the development of linguistic steganography, which refers to employing the knowledge
of natural language processing for camouflaging information into texts (Atallah
et al., 2000).
Linguistic steganography has formed a new research field in steganography,
which has recently attracted the attention of many researchers (Bergmair,
2007). The contemporary linguistic steganography in the literatures can
be grouped into three categories in terms of the way of generating the stego
The first class is generalized to the approach that directly produces a new
natural-like stego text without the support of a given cover text. The primary
methods mainly based on mimicking technique to convert the secret information
into a text that obeys the statistical properties of another particular normal
natural language text. In order to improve the readability and quality of the
produced texts, probabilistic context-free grammars are combined with mimicking
technique to generate stego texts (Wayner, 2002). Despite
the fact that the generated texts have almost regular grammar and syntax, their
contents are always meaningless and implausible to human readers. The transmission
of these texts between different communication parties may raise suspicion by
people inspection. Moreover, the existence of communicated secret information
can be detected by using statistical characteristics of correlations between
the general service words (Chen et al., 2008).
The second class which converts the cover text into a complete different normal-like
stego text, can be considered as an extension to the first class mentioned earlier.
Combining text-mimicking technique with PPT documents, Liu
et al. (2008) converted secret information into innocuous sentences
by exploiting the resource from a cover PPT document. Then the generated sentences
are written into the note pages of the cover document to enhance the security.
The efficiency and security of this steganography is greatly improved compared
with the methods of the first class. Another typical method of this class is
based on machine translation (Grothoff et al., 2005;
Stutsman et al., 2006; Grothoff
et al., 2009). Its critical point is to create multiple translations
for each sentence in cover text and then select one of these translated sentences
to encode information. The generated stego texts with the help of basic information
provided by cover texts have better readability and quality than those produced
by steganography of the first class. However, they also have a disadvantage
of causing suspicious from text steganalysis (Meng et
al., 2010). Steganalysis is the art of discovering secret information
in digital media. Once a steganographic method is defeated by steganalysis,
the secret information is no longer secure.
The approaches of slightly modifying an existing text to obtain a stego text
are defined as the third class. This class of methods manipulates lexical, syntactic
or semantic properties of a given text while preserving the meaning as much
as possible to embed information (Topkara et al.,
2005). Previous research on linguistic steganography is mainly focused on
the approaches employing synonym substitution and syntactic transformation for
The synonym substitutions-based approach embeds information in the way of replacing
the word by its synonym to represent designated value. In theory, the modifications
preserve the meaning of cover texts. However, simple synonym substitutions may
cause potential wrong syntax and semantics since the synonyms with the identical
or similar meanings still have semantic and pragmatic differences. Some more
superior methods were proposed to overcome the vulnerability of synonym substitutions
(Bolshakov, 2004; Topkara et
al., 2006a). Bolshakov (2004) proposed to directly
substitute an absolute synonym for a word and previously test a relative synonym
whether it is semantic compatible with collocations containing the replaced
word. Topkara et al. (2006a) used a quantitative
resilience criterion to just choose one most appropriate alternative for every
being replaced word in a text. However, the above efforts were just focused
on improving the compatibility of the synonym substitutions with the context
and do not consider the changes of the inherent statistical characteristics
of synonyms sequence, such as occurrence frequency of the synonyms. The synonym
substitution-based steganography has low resistance of the existing text steganalysis
attack (Taskiran et al., 2006; Yu
et al., 2008).
Another extensively concerned steganographic approach of the third category
is based on syntactic transformations (Atallah et al.,
2001; Liu et al., 2005; Wang
et al., 2008; Topkara et al., 2006b;
Murphy and Vogel, 2007; Meral et
al., 2007; Meral et al., 2009; Kim,
2008). This approach expresses a sentence into a different semantically
equivalent sentence structure for hiding information. For instance, it transforms
an active sentence into a passive one without any serious meaning and grammaticality
change. Atallah et al. (2001) were the first ones
to employ transformations described in post-Chomskian generative syntax to design
a natural language watermarking scheme. Topkara et al.
(2006b) attempted to resolve the implementation challenges syntactic transformations-based
steganography and designed an objective evaluation metric to evaluate the quality
of the stego text. By taking the characteristics of the language into consideration,
several sophisticated steganographic methods have been proposed using relevant
particular syntactic transformations of different languages such as Chinese
(Liu et al., 2005; Wang et
al., 2008), Turkish (Meral et al., 2007;
Meral et al., 2009), Korean (Kim,
2008), etc. Comparing with synonym substitution-based method, the syntactic
transformation-based method has high robustness to resist attacks but it also
demands deep structure analysis and its embedding capacity is limited. In practical,
a perfect steganography tool is hard to be implemented and widely used for lacking
enough natural language knowledge and matured reliable tools of natural language
Linguistic steganography is gradually improved recently but much more efforts
are need to avoid linguistic flaws in stego texts. However, it is a difficult
task of building a complex natural language processing systems. Some linguistic
steganographic methods possess of low popularization and application value or
the stego texts in practical always have degradation of quality in theory. And
texts always provide few redundancies of linguistics for embedding information
leading to low embedding bit rate of linguistic steganography. Moreover, most
steganographic methods have been defeated by using the distortions of statistics
with the development of text steganalysis (Taskiran et
al., 2006; Chen et al., 2008; Meng
et al., 2010; Yu et al., 2008). These
vulnerabilities and concerns of the above linguistic steganography motivate
to develop a linguistic lossless steganographic method with high resistance
of steganalysis attacks and improved embedding bit rate as well.
Multiple choices are the most popular and frequently used form in educational
testing, professional certification examination and many other sorts of tests
and surveys. Employing MCQs to hide information would not arouse suspicious
from eavesdroppers. Considering the characteristics of MCQs, a secure steganography
is proposed by taking MCQs as the cover carriers in this study. Firstly, by
further investigation on MCQs, two inherent attributes are introduced for steganography,
because of their peculiar nature. Then each MCQ is assigned to two determinate
attribute values based on the character lengths and the order of its options.
Selecting MCQs from a prepared MCQs bank according to their first attribute
values will generate a stego text. And modifying the second attribute values
of MCQs via equivalent transformations would embed more information.
MCQS-BASED STEGANOGRAPHIC METHOD
Within todays society, various examinations or tests designed to measure peoples knowledge, skills, aptitude, etc., in many different fields such as the education, professional certification, psychology, the military, etc. MCQs are the basis of a significant portion of a test. In a test that has items formatted as MCQs, a candidate is given a number of set answers for each question and the candidate must choose which answer or group of answers is correct. With the convenience brought by Internet, many websites offers abundance MCQs for personal research and private study. Especially, some special MCQs banks of many subjects are provided. Before attending a formal examination, some exercises are always done by using these materials. Communication via a text including MCQs retrieved from a question bank is commonplace, thus it is convenient to transmit secret information by using the MCQs as the cover mediums and could not arouse suspicious from eavesdroppers.
Attributes of MCQs: A MCQ always consists of a stem and a series of options. A test requires a test taker to choose all appropriate answers or only one answer from a list of options. Two vocabulary MCQs of primary English are provided in the following as examples:
||MCQ 1. The match was _____ because of the heavy rain.
A. called in B. called for C. called off D. called out
||MCQ 2. The badminton players were trained under a well-known
A. judge B. referee C. umpire D. coach
It is worthwhile to note that the MCQs in a test always serve the same subject.
In many formal standardized tests, a test question is often retrievable from
a question bank. From the appearance point of view, the arrangement of MCQs
in a test paper is always automatically made; the order of MCQs can be adjusted
based on requirements. On the other hand, options of well-written MCQs are independent
of one another and consistent in logic and grammar to the stem. Therefore, the
order of options can also be arbitrarily changed. With these characteristics,
MCQs can be successfully used to hide information. Before describing the proposed
steganography, two attributes, namely Q-attribute and P-attribute are introduced
respectively for each MCQ.
Given a MCQ t = <a1,a2,...,an>, where aj denotes the jth option. Let L(aj) denote the character length of the option aj. After rearranging the options in an increasing lexicographic order, the options are denoted as a new vector <a'1,a'2,...,a'n> . According to the character lengths of the reordered options, a Q-attribute denoted as g(t) is defined and calculated by:
where, bj = L(a'j) mod 2.
Apparently, g(t) is a random n-bit integer, whose value is in range of 0 to 2n. For the example MCQ 1 t1 = <called in, called for, called off, called out>, after rearranging the options in an increasing lexicographic order, a new vector of options <called for, called in, called off, called out>, is obtained, then, L(a'1) = 10 L(a'2) = 9, L(a'3) = 10, L(a'4) = 10, thus its Q-attribute value is 4. In the same way, Q-attribute value of the example MCQ 2 figured out at 14. MCQs Q-attribute values would be approximately independent and identically distributed (i.i.d.) in a large MCQs bank, since the character length of options in a MCQ is random and all options are mutually independent. MCQs with appointed Q-attribute values can be successfully retrievable from a large MCQs bank for embedding information.
Based on above analysis, we can also find that the options of a MCQ can be arranged into a required layout. Each possible arrangement can be assign to a unique value. For a given MCQ, the arrangement of its options will map to a determined value. Given the MCQ t = <a1,a2,...,an>, its n different options have n! number of permutations. Arranging the n! permutations in a certain order (lexicographic order, for example), the position of <a1,a2,...,an> locating at is defined as the MCQs P-attribute, denoted as f(t). In this study, we define the position starts at 0. The P-attribute value is a integer in range of 0 to n!-1. Generally, the options in a MCQ are always randomly placed in practical tests. Thus, the options can be reordered to make the MCQs P-attribute equal to a required value in a certain range without influencing the meaning of the MCQ and bringing any errors.
Next we illustrate the procedure of obtaining the P-attribute value of a MCQ.
Set the n options of MCQ t = <a1,a2,...,an>
in increasing lexicographic order is a'1,a'2,...,a'n.
For convenience, we set n = 4 forexample.
|| Diagram of embedding process
The 24 permutations in increasing lexicographic order are listed as follows:
Each permutation has its own position in the ordered permutation set. From
<a'1,a'2, a'3,a'4> to <a'4,a'3,a'2a'1>,
they are numbered 0 to n!-1, respectively. For a MCQ t = <a1,a2,a3,a4>,
if it is happen to a1 = a'1, a2 = a'2,
a3 = a'3, a4 = a'4, then its P-attribute
f(t) = 0. If a1 = a'1, a2 = a'3,
a3 = a'2, a4 = a'4, then f(t) =
2. For example MCQ 1 t1 = <a1,a2,a3,a4>=<called
in, calledfor, called off, called out>, after rearranging the options in
an increasing lexicographic order is obtained <a'1,a'2,a'3,a'4>=<called
for, called in, called off, called out>, then a1 = a'2,
a2 = a'1, a3 = a'3, a4
= a'4, thus f(t1) = 6. In the same way, the P-attribute
of example MCQ 2 is calculated to 9.
The embedding process: With the help of the two attributes of MCQs,
secret information can be securely embedded into MCQs. Before embedding secret
information, we collected enormous MCQs to form different banks based on their
subjects. In the embedding process, according to the secret information and
the options number of a MCQ in the collected MCQs bank, two integer sequences
will be calculated and encoded into MCQs. MCQs whose Q-attribute values orderly
equal to the first integer sequence are selected to generate a stego text. Then,
the options of the selected MCQs are reordered to make their P-attribute values
accord with the second integer sequence. Finally, the generated stego text consisting
of selected and modified MCQs is output. Diagram of embedding process is illustrated
in Fig. 1.
Now, we describe the embedding process of our MCQs-based steganography in detail:
||Classify the MCQs in the prepared MCQs bank into s groups
in terms of the number of their options. Record the options number of a
MCQ in the kth group as nk, k = 1,2,...,s. For each MCQ t, set
a marker Mar(t) and make Mar(t) = 0 in default
||Encrypt the secret information using an encryption algorithm and convert
the encrypted information into a larger integer denoted as M
||Initial i = 1, k = s
||Set n = nk, calculate , qi = M mod 2n,
in other words, qi equals to the value of the least n bits of
||Select a MCQ t = <a1,a2,...,aln>
whose Q-attribute g(t) = qi and Mar(t) from the kth groups in
the MCQs bank and modify Mar(t). If there are not any satisfied MCQ can
be successfully found, then k = k-1 and return to step 4
||Let M = M', calculate , pi = M mod n!, M = M' and reorder the
options of MCQ t into a new arrangement to make its P-attribute f(t) = pi
||If M =0, finish the embedding process and output the stego text composed
of the selected MCQs or else i = i+1 and return to step 4
Extraction process: In the extraction process, a larger integer is first
obtained from the two attribute values of
MCQs in the stego text. Then translating the integer into decrypted bytes can
retrieve the secret information. The Extraction process involves the following
||For each MCQ ti = <ai1,ai2,...,ain>
in the stego text, calculate its Q-attribute g(ti) and P-attribute
f(ti), the number of options of ti is denoted as ni,
i = 1,...N, N is the total number of MCQs
||Let MN+1, repeatedly calculate
||Retrieve the embedding information by converting M1
into bytes and decrypting the decoded information
RESULTS AND DISCUSSION
The proposed method can avoid making transformations under requirements of natural language processing techniques, such as natural language generation, understanding, part-of-speech tagging, syntax or semantic analysis, etc. Its implementation just needs a shallow analysis of identifying the components MCQs and a collection of enormous MCQs. Thus, it is easy to be implemented. In the experiments, we implemented it and carried out the performance evaluation.
First, many MCQs of different subjects such as English vocabulary, grammar, PMP (Project Management Professional) exam, Computer Fundamentals, IBM certification were collected from Internet. But, in practice, some MCQs are not well designed which include overlapping alternatives when one or more options are a subset of another. For example, the options contain the content of all of the above, none of the above, etc. These poorly constructed and ineffective options are unsuitable for our steganography. After removing the inappropriate MCQs, we obtained 9561 MCQs for present experiments.
Now, we provide an example to further explain the embedding process and extraction
process of the proposed MCQs-based steganography. Assuming the secret information
is China, its ASCII value is 67 104 105 110 97. For simplicity,
the secret information is directly hidden into MCQs without being encrypted.
The larger integer converted from China is M = 289514548833. Suppose
the options number of each MCQ in the prepared MCQs bank is 4, i.e., n = 4.
According to the embedding process, we calculate , q1 = M mod 2n
to obtain q1 = 1 and calculate M = M', , p1 = M mod n!,
M = M' to obtain p1 = 14. Then, we select a MCQ whose Q-attribute
value is 1 and rearrange its options to make the P-attribute value equal to
14. In the same way, we figure out q2,q3,q4,q5
= 9,4,9,13, p2,p3,p4, p5 = 4,0,7,0
and select four different MCQs whose Q-attribute values are 9, 4, 9 and 13 in
order and modify their options order to make their P-attribute values
equal to 4, 0, 7, 0. Finally, the selected MCQs are written into a stego text.
Conversely, the information can be recovered by converting the integer, which
is obtained from Eq. 2 with the Q-attribute and P-attribute
values of MCQs in the stego text, into characters. A concrete stego text with
the MCQs from an English vocabulary MCQs bank is given in the Appendix A.
Security analysis: Transmitting a stego text generated by our methods is much reasonable and secure. As the used MCQs are derived from existing test banks, stego texts consisting of MCQs are meaningful and comprehensible. In addition, the linguistic characteristics of both a single MCQ and whole MCQs are preserved, as the contents of MCQs are not modified during the embedding process. Thus, present method can obtain good imperceptibility.
During the embedding process, secret information is successfully hidden into MCQs by utilizing their Q-attributes and P-attributes. The two attribute values are determined by the length and arrangement of options. As we know, the least significant bit of the option character length and arrangement of options in a MCQ is random. In a normal text composed of MCQs, the MCQs Q-attribute and P-attribute values are approximately i.i.d. In the rest of this study, the normal text is especially the one just contains MCQs. On the other hand, the Q-attribute values of MCQs in a stego text corresponding to random integers from the secret information pieces are also i.i.d. In addition, the P-attribute values were maintained and evenly distributed, as the modifications on P-attribute values made by the proposed method are random according to the embedded information. In one word, the statistical characteristics of stego texts with MCQs are coincided with that of normal texts in terms of the distributions of Q-attribute and P-attribute. This will make steganalysis attackers fail to distinguish the generated stego texts from normal texts.
To demonstrate the above theory analysis, we conduct experiments on the distributions
of Q-attribute and P-attribute in stego texts and normal texts. First, 50 stego
texts containing 50-200 MCQs from four prepared MCQs banks were generated. The
MCQs in a text is about the same subject. The subjects of the four banks are
English vocabulary, PMP, Computer Fundamentals and IBM certification, respectively.
The secret information embedded in each stego text is different and encrypted
by using RC4 encryption algorithm. For comparison, 50 normal texts consisting
of 50-200 number MCQs are prepared.
|| The distribution of Q-attribute and P-attribute in normal
and stego texts (a) The distribution of Q-attribute in normal texts (b)
The distribution of Q-attribute in stego texts (c) The distribution of P-attribute
in normal texts (d) The distribution of P-attribute in stego texts
Both the MCQs in normal texts and stego texts have 4 options. The Q-attribute
and P-attribute of a MCQ has 24 and 4! possible values, respectively.
For each text, 16 probabilities of all the potential Q-attribute values can
be obtained. Thus, 800 probabilities can be obtained from 50 sample normal texts.
We order these 800 probabilities according to the ascending order of the Q-attribute
values. And the probabilities of the same Q-attribute value in different sample
normal texts are placed together. For example, the first 50 probabilities are
the ones when the Q-attribute equals to 0 in 50 different sample texts while
the last 50 probabilities are the ones when the Q-attribute equals to 15 in
50 different sample texts. All 800 probabilities are exhibited in Fig.
2a. In the same way, the probabilities of all the potential Q-attribute
values in 50 sample stego texts are exhibited in Fig. 2b.
The probabilities of all the potential P-attribute values in 50 sample normal
texts and 50 stego texts are exhibited in Fig. 2c and d,
From Fig. 2a-d, it can be observed that
both of the distributions of Q-attribute and P-attribute in stego texts are
approximate to those in normal texts. It is much difficult of taking the probability
of Q-attribute and P-attribute as clue to detect the existence of secret information
hidden in MCQs. Consequently; the proposed method has high resistance of steganalysis
attack and can provide high-level security for secret information.
Embedding bit rate: In this study, embedding bit rate is quantified in terms of embedded bits per language unit (sentence, for example). There are three parameters which affect the embedding bit rate. The three parameters are the number of used MCQs in a stego text, the average number of sentences per MCQ and the options number in a MCQ.
For a stego text composed of N MCQs, each with n options, let the average sentences per MCQ is L, then embedding capacity of the stego text is given by :
The embedding bit rate is defined as:
|| Average embedding bit rate comparison of stego texts of different
It is obvious that with N increasing, the embedding bit rate r increases. After substituting Eq. 3 into Eq. 4, the transformation of Eq. 4 is:
The smaller L is and the larger n is, the larger r is. In theory,
. The minimum of L is Lmin = 1+n, when all the stem of each used
MCQ and its options just contain one sentence. The options consisting of fragmentary
sentence such as a word or a phrase are all treated as a sentence. Thus the
maximum embedding bit rate:
when n = 3,4,5, rmax≥1.396, 1.717, 1.984. For a stego text with
large numbers of MCQs and at least 5 options in each MCQ, its embedding bit
rate is possible to exceed 2 bit/sentence.
In experiments, we calculated the embedding bit rate of above-mentioned 50 stego texts. The MCQs including figures were removed and not used for embedding information. The practical average embedding bit rate of stego texts with MCQs from different subjects are listed in Table 1. The experimental results show that it is easy to achieve an embedding bit rate of more than 1 bit/sentence using the proposed method which is superior to results of most linguistic steganographic methods.
After investigating the previous linguistic steganographic methods, we can
find that their embedding bit rates attained from the literatures are almost
always less than one. The machine translation-based method proposed in Stutsman
et al. (2006) has just an embedding bit rate of 0.33 bit/sentence,
as more embedding space must be provided for concealing the extra head information
which is used for recovering the hidden information without the source text.
For the synonym substitutions-based method, the word should be substituted with
a synonym in the same sense to preserve the meaning of the sentence. However,
most words always have multiple senses and the synonyms under different senses
of the same word are always not agreed. Each synonym substitution must be tested
using natural language processing knowledge such as word sense disambiguation
to determine whether the substitution is suitable or not. Thus the number of
successful synonym substitutions for hiding information is relatively small
leading to a low embedding bit rate. For example, Topkara
et al. (2006a) approximately achieved an embedding bit rate of 0.67
bit/sentence. The embedding bit rate of syntactic transformations-based methods
are often very low, as the available syntactic transformations are limited and
not all transformations can be applied to each sentence in a text. The scheme
reported in Atallah et al. (2001), yielded a bit
rate of 0.52 bit/sentence based on the demonstration sample developed by the
authors. And referring the method in Murphy and Vogel (2007),
just a bit rate of 0.3 bit/sentence is claimed by the authors. The bit rate
in Wang et al. (2008) is calculated to be approximately
0.35 bit/sentence using Chinese syntactic transformations based on the data
in this literature. Finally, Meral et al. (2009)
used a especially suitable language Turkish, which has more syntactic transformations
to conceal information than English language and provided a higher embedding
bit rate of 0.81 bit/sentence. Therefore, compared the abovementioned methods
with ours, it is obvious that our method can be more effective for covert communication
with larger embedding bit rate.
This study is the first attempt at constructing covert communication based on MCQs. The proposed method can transmit secret information by automatically generating new texts with MCQs, meanwhile conducting modifications on the options order of MCQs for embedding more information. Regarding the imperceptibility, no linguistics distortions or errors are brought into a stego text generated by the proposed method, making the secret information greatly imperceptible. In addition, the statistical characteristics of MCQs are well preserved which have been proven by theoretical analysis and experimental results. Consequently, the existence of secret information would not be detected by potential steganalysis. The experimental results have also shown that the proposed method achieved higher embedding bit rate than most previous linguistic steganographic methods. In conclusion, the proposed MCQs-based steganography has the advantages of good imperceptibility, high resistance of steganalysis attacks, large embedding bit rate and ease of implementation. It is a practical, effective and secure approach for covert communication.
This study was supported in part by National Natural Science Foundation of China (No. 60736016, No. 60973128, No. 60973113, No. 61073191 and No. 61070196), in part by National Basic Research Program of China (No. 2009CB326202, 2010CB334706) and in part by Hunan Provinicial Natural Science Foundation of China (No. 10JJ4041).
A.1. An example of the Stego text with the secret information China.