Data hiding is a technique that imperceptibly inserts data into cover multimedia. In the last decade, data hiding techniques have been explored extensively for multimedia files (Topkara et al., 2006; Chen et al., 2010; Zeng and Wu, 2010; Chandra and Khan, 2010; Liang et al., 2011; Li et al., 2011). Documents are likewise among the most prevalent and indispensable forms of information and are often used as cover media (Yang et al., 2011; Borges et al., 2008). Text data hiding schemes fall mainly into three types: schemes using characteristics of the text (Brassil et al., 1999; Rabah, 2004), schemes using the structure of files (Cantrell and Dampier, 2004; Castiglione et al., 2007) and schemes using natural language processing (Topkara et al., 2005; Wang et al., 2008). The last type, called natural language data hiding, offers good security and robustness and has become the most commonly used scheme for text data hiding.
Most multimedia data hiding techniques modify and hence distort the host data in order to insert the additional information. Often, this embedding distortion is small but still irreversible, i.e., it cannot be removed to recover the cover host data. An intriguing feature of reversible data embedding is reversibility, that is, one can remove the embedded data to restore the cover data.
In some applications, such as medical diagnosis and law enforcement, it is critical to restore and archive the valuable original works after the hidden data are extracted, for legal reasons (Ni et al., 2006). Any change to the content will affect access to the real meaning of the document. Although some insertion distortion is admissible in some circumstances, permanent loss of content fidelity is undesirable. This highlights the requirement for reversible data hiding techniques.
Recently, a number of reversible data hiding schemes have been proposed using techniques such as circular interpretation of the bijective transform (De Vleeschouwer et al., 2003), lowest levels replacement (Celik et al., 2005), difference expansion (Tian, 2003), recovery visible package (Yang et al., 2009; Hong et al., 2009), histogram shifting (Hong et al., 2010) and prediction-based embedding (Coltuc, 2011). Nevertheless, these schemes are applicable only to images; few reversible data hiding schemes have addressed embedding data in text documents. Liu et al. (2010) proposed a reversible text watermarking scheme that uses an invertible transform to perform the embedding and extracting processes. However, owing to its low embedding rate and its strict requirements on the cover text, Liu's scheme is hard to put into practical application.
Since its launch, the Microsoft (MS) Office 2007 system has been accepted by more and more users owing to the advantages of its file format. Park et al. (2009) proposed a method to embed secret information in an MS Office 2007 document by creating unnoticeable content based on the concept of content relationships.
In this study, we propose a novel scheme that combines two common methods to achieve reversible data hiding. The cover text file is compressed and inserted into the MS Office 2007 document as unnoticeable content. Then, the secret information is embedded into the cover text by synonym substitution. The proposed technique could be applied not only to MS Office documents but also to other structure-based documents, such as PDF. The security, robustness and capacity of the proposed scheme are also discussed.
Our proposed scheme for reversible text data hiding combines two common data hiding methods: the synonym substitution approach and the MS Office 2007 document structural approach.
The synonym substitution approach: In order to embed data in natural
language text unnoticeably, a systematic method for modifying, or transforming
text should preserve the grammaticality of the sentences. Synonym substitution
is the most widely used linguistic transformation for information hiding systems
since it is a relatively straightforward linguistic data hiding method (Topkara
et al., 2005).
As early as 1996, Bender et al. (1996) gave a rough description of a method based on synonym substitution. First, they defined a synonym table from a synonym dictionary such as WordNet. For example, "0" was assigned to the word "big" and "1" to "large"; the encoder then replaced the selected words in the text with their synonyms.
A simplified example of synonym substitution is given in Fig. 1. The secret bit string "011" is to be embedded; it can be divided into two code words, "01" and "1", and the information carriers in the cover text are the words "like" and "big". According to the encoding dictionary, "love" represents "01" and "large" represents "1", so these words are chosen to replace the original words in the cover sentence. The embedded sentence "I love this large house" is finally sent to the receiver. The receiver only needs the encoding dictionary and the decoding algorithm to extract the secret information.
Fig. 1: An example of synonym substitution
Most synonym substitution methods alter the real meaning of the original text to some degree, which is not tolerable in some applications; in such cases, a reversible data hiding technique is needed.
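The substitution step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the encoding dictionary `SYNONYM_SETS`, the helper names and the zero-padding of a short final chunk are assumptions made for the example, not the dictionary or algorithm of any cited scheme. The receiver is also assumed to know the message length `n_bits`.

```python
# Hypothetical toy encoding dictionary: each synonym set maps a word to its code word.
SYNONYM_SETS = [
    {"like": "00", "love": "01", "adore": "10", "enjoy": "11"},
    {"big": "0", "large": "1"},
]

def find_set(word):
    """Return the synonym set containing `word`, or None if it is not a carrier."""
    for syn_set in SYNONYM_SETS:
        if word in syn_set:
            return syn_set
    return None

def embed(cover_words, bits):
    """Replace carrier words so that their code words spell out `bits`."""
    stego, pos = [], 0
    for word in cover_words:
        syn_set = find_set(word)
        if syn_set is None or pos >= len(bits):
            stego.append(word)          # not a carrier, or message already embedded
            continue
        width = len(syn_set[word])      # code-word length of this synonym set
        chunk = bits[pos:pos + width].ljust(width, "0")  # pad a short final chunk
        pos += width
        by_code = {code: w for w, code in syn_set.items()}
        stego.append(by_code[chunk])
    if pos < len(bits):
        raise ValueError("cover text has too few carrier words")
    return stego

def extract(stego_words, n_bits):
    """Read the code word of every carrier word and keep the first n_bits bits."""
    codes = [find_set(w)[w] for w in stego_words if find_set(w) is not None]
    return "".join(codes)[:n_bits]

stego = embed("I like this big house".split(), "011")
print(" ".join(stego))      # I love this large house
print(extract(stego, 3))    # 011
```

Note that nothing in the stego sentence records which words were originally "like" and "big", which is why plain synonym substitution is irreversible.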
The MS Office 2007 document structural approach: MS Office documents (e.g., Word, PowerPoint and Excel) are among the most widely used document types at present. In recent years, several methods for hiding data in MS Office documents have been proposed (Cantrell and Dampier, 2004; Castiglione et al., 2007). However, these methods are mainly aimed at MS Office 1997-2003 documents and not specifically at the new MS Office 2007 documents. Park et al. (2009) demonstrated how data concealment in MS Office 2007 documents is possible. They used OOXML files to define customized parts, relationships, or both within an MS Office 2007 document to store and conceal information.
It is well known that MS Office 1997-2003 documents are binary files, whereas MS Office 2007 documents use a new file format based on OOXML (Fu et al., 2011). An MS Office 2007 document consists of many compressed component parts stored in a ZIP format package, and each decompressed package conforms to the OOXML file format. A package is an ordinary ZIP file which contains a package content-type item, relationship items and parts (ECMA International, 2006). An OOXML file is based on the following:
Package: ZIP archive
Part: Files in the ZIP archive
Relationship: The relationships between the parts and the package or among parts
An OOXML file can contain many unknown parts and unknown relationships that do not affect its appearance or content. It is therefore possible to hide secret files by creating several unknown parts and their corresponding relationships. First, the encoder copies the secret files into the carrier archive. Then, the encoder inserts entries that define the secret files' extensions and their associated paths into [Content_Types].xml. Finally, the encoder modifies the relationship file, changing the secret files' types and IDs. After [Content_Types].xml has been modified, the secret files become unknown parts of the document. For example, suppose a secret file s.zip is to be embedded into the cover document (in this example, an MS Word 2007 file). The encoder adds a string (in bold and italic) to [Content_Types].xml and to the relationship file, as shown in Fig. 2 and 3.
Fig. 3: Relationship file
An MS Office 2007 document containing these hidden files can be opened normally. Thus, these secret files are hidden completely without attracting notice and cannot be detected by any of the functions supported by MS Office 2007 applications (Park et al., 2009).
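The three operations described above (copy the file into the archive, register its extension in [Content_Types].xml, add an entry to the relationship file) can be sketched with Python's standard `zipfile` module. This is a minimal sketch under stated assumptions, not the procedure of Park et al. (2009): the part path `word/media/s.zip`, the content type, the relationship ID `rId99` and the relationship type URI are all hypothetical values chosen for illustration, the XML is patched by plain string replacement, and the result has not been verified against MS Office itself.

```python
import io
import posixpath
import zipfile

CONTENT_TYPES = "[Content_Types].xml"
RELS = "word/_rels/document.xml.rels"   # relationship file of the main Word part

def hide_part(package_bytes, payload,
              part_name="word/media/s.zip",          # hypothetical location
              extension="zip",
              content_type="application/zip",        # hypothetical content type
              rel_id="rId99",                        # hypothetical, must be unused
              rel_type="http://example.org/relationships/unknown"):  # hypothetical
    """Return a copy of the package with `payload` stored as an extra part."""
    src = zipfile.ZipFile(io.BytesIO(package_bytes))
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CONTENT_TYPES:
                # Declare the extension of the hidden file.
                entry = f'<Default Extension="{extension}" ContentType="{content_type}"/>'
                text = data.decode("utf-8").replace("</Types>", entry + "</Types>")
                data = text.encode("utf-8")
            elif item.filename == RELS:
                # Relationship targets are relative to the "word" folder.
                target = posixpath.relpath(part_name, "word")
                entry = f'<Relationship Id="{rel_id}" Type="{rel_type}" Target="{target}"/>'
                text = data.decode("utf-8").replace("</Relationships>",
                                                    entry + "</Relationships>")
                data = text.encode("utf-8")
            dst.writestr(item.filename, data)
        dst.writestr(part_name, payload)     # copy the secret file into the archive
    return out.getvalue()
```

A call such as `hide_part(open("cover.docx", "rb").read(), secret_bytes)` (file name hypothetical) would return the bytes of the modified package. A production tool would parse the XML properly and check that the extension and relationship ID are not already in use.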
THE PROPOSED SCHEME
Here, a reversible text data hiding scheme is presented. The proposed scheme, described below, consists of an embedding process and an extraction process.
As stated earlier, reversible text data hiding here refers to natural language data hiding, and the most popular natural language data hiding scheme is based on synonym substitution. However, that method is irreversible, i.e., the cover text is permanently transformed into a modified text once synonyms are substituted. This shortcoming can be remedied by embedding the cover text information into the structure of MS Office documents.
The basic idea of the proposed scheme is to use two common embedding methods to embed the cover text T and the secret information M into a cover document D. As shown in Fig. 4, the cover document D consists of the text T and many compressed component parts S. In the data embedding process, the cover text T is first inserted into the component parts S as unknown content; then the cover text T is replaced with a stego text T' by synonym substitution.
Fig. 4: Data embedding process
Fig. 5: (a) A cover text and (b) its corresponding SIT
Synonym Index Table (SIT) generation: The original words that will be modified in the embedding process must be saved into the component parts of the document. Since all modified words are replaced by synonyms, the proposed scheme first generates a SIT. The SIT consists of the original words' synonym indices in the dictionary. For a word W that has synonyms in the dictionary, we obtain its index number t in its synonym set and save t into the SIT in binary format. To illustrate, an example of cover textual content and its corresponding SIT is provided in Fig. 5.
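A minimal Python sketch of SIT generation follows. The two synonym sets and the fixed-width binary encoding of the index t are assumptions made for illustration. The experiments reported later use WordNet 2.0 and Huffman codes instead.

```python
# Hypothetical ordered synonym sets; the position of a word in its set is its index t.
SYNONYM_SETS = [
    ["humans", "mankind", "humankind"],
    ["big", "large"],
]

def build_sit(cover_words, synonym_sets=SYNONYM_SETS):
    """Return the SIT: one binary index string per cover word that has synonyms."""
    sit = []
    for word in cover_words:
        for syn_set in synonym_sets:
            if word in syn_set:
                # Assumed encoding: just enough bits to index the whole synonym set.
                width = max(1, (len(syn_set) - 1).bit_length())
                t = syn_set.index(word)
                sit.append(format(t, f"0{width}b"))
                break
    return sit

print(build_sit("humans built a big house".split()))     # ['00', '0']
print(build_sit("mankind built a large house".split()))  # ['01', '1']
```

Because the table stores only small indices rather than the words themselves, it stays short even for long cover texts.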
Reversible data embedding process: As described above, the original words are saved into a SIT. However, not all words that have synonyms are substituted when the secret information is embedded. For the sake of imperceptibility, the SIT saves only the synonym index numbers of the words that are about to be substituted. Therefore, the substitution sets of the words to be substituted must be determined before generating the SIT.
Fig. 6: The embedding process
Figure 6 depicts the embedding process of the proposed scheme, which consists of the following steps:
Step 1: Encrypt the secret information with an encryption algorithm such as RC2, RC4 or DES
Step 2: Pre-embed the encrypted secret information using the synonym substitution-based embedding method, which yields the substitution sets of the words that would be replaced by their synonyms
Step 3: Check whether there is enough space to embed the secret information; if so, proceed to the next step, otherwise warn the user of insufficient embedding capacity
Step 4: Generate the Synonym Index Table (SIT), which consists of the original words' synonym indices in the dictionary, from the substitution sets obtained in step 2
Step 5: Embed the file containing the SIT into the MS Office document
Step 6: Embed the encrypted secret information by the synonym substitution method and output the embedded document
In step 4, the SIT file extension is often changed to a more common one, such as jpg or gif. In step 5, the encoder first decompresses the MS Office document and modifies the [Content_Types].xml file and the relationship file to make the SIT file an unknown part of the document. Finally, the encoder compresses all parts of the MS Office document to form an embedded document containing the SIT file.
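The text-side part of the embedding steps (capacity check, SIT generation for the substituted words only, and substitution) can be sketched as follows. This is an illustrative sketch, not the authors' Visual C++ implementation. It assumes that the bit string passed in is already encrypted, that the toy synonym sets have power-of-two sizes so every code word maps to a word, and that a short final chunk is zero-padded. Writing the SIT file into the document package (step 5) is left out.

```python
# Hypothetical ordered synonym sets with power-of-two sizes.
SYNONYM_SETS = [
    ["like", "love", "adore", "enjoy"],
    ["big", "large"],
]

def find_set(word):
    """Return the synonym set containing `word`, or None."""
    for syn_set in SYNONYM_SETS:
        if word in syn_set:
            return syn_set
    return None

def code_width(syn_set):
    """Number of bits one word of this synonym set can carry."""
    return max(1, (len(syn_set) - 1).bit_length())

def embed_reversible(cover_words, bits):
    """Return (stego_words, sit) for an already encrypted bit string."""
    # Steps 2-3: find the carrier words and check the capacity.
    carriers = [(i, find_set(w)) for i, w in enumerate(cover_words) if find_set(w)]
    capacity = sum(code_width(s) for _, s in carriers)
    if len(bits) > capacity:
        raise ValueError("insufficient embedding capacity")
    stego, sit, pos = list(cover_words), [], 0
    for i, syn_set in carriers:
        if pos >= len(bits):
            break                       # remaining carriers stay untouched
        width = code_width(syn_set)
        chunk = bits[pos:pos + width].ljust(width, "0")
        pos += width
        # Step 4: remember the original word's index before it is overwritten.
        sit.append(format(syn_set.index(cover_words[i]), f"0{width}b"))
        # Step 6: substitute the synonym whose index equals the message chunk.
        stego[i] = syn_set[int(chunk, 2)]
    return stego, sit

stego, sit = embed_reversible("I like this big house".split(), "011")
print(" ".join(stego), sit)   # I love this large house ['00', '0']
```

Only the words that actually carry message bits contribute an entry to the SIT, which matches the requirement that the table covers just the words about to be substituted.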
Secret information extraction and cover textual content retrieval process: The proposed extraction has two purposes. One is extracting the secret information from the embedded document and the other is restoring the cover text.
Fig. 7: The extracting process
Figure 7 depicts the extracting process of the proposed scheme, which mainly includes four steps:
Step 1: Unzip the MS Office document and get the SIT file from the component parts of the document
Step 2: Extract the encrypted secret information by the synonym substitution-based method and recover the embedded textual content to the cover textual content
Step 3: Decrypt the secret information that was encrypted during embedding
Step 4: Output the cover document and the secret information
In step 2, we have made some modifications to the synonym substitution-based method. First, for a word W' that has synonyms in the dictionary, the decoder finds its original index t in the SIT. Second, the decoder extracts the embedded information while reverting W' to the original word W.
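The modified step 2 can be sketched in Python as below. The sketch assumes the same hypothetical synonym sets on both sides, a SIT holding one fixed-width binary index per substituted word in text order, and a message length `n_bits` known to the decoder. None of these details is specified by the scheme description, and decryption is omitted.

```python
# Hypothetical ordered synonym sets shared by encoder and decoder.
SYNONYM_SETS = [
    ["like", "love", "adore", "enjoy"],
    ["big", "large"],
]

def find_set(word):
    """Return the synonym set containing `word`, or None."""
    for syn_set in SYNONYM_SETS:
        if word in syn_set:
            return syn_set
    return None

def code_width(syn_set):
    """Number of bits one word of this synonym set can carry."""
    return max(1, (len(syn_set) - 1).bit_length())

def extract_and_restore(stego_words, sit, n_bits):
    """Return (cover_words, bits): the restored text and the embedded bit string."""
    cover, bits = list(stego_words), []
    entries = iter(sit)
    for i, word in enumerate(stego_words):
        syn_set = find_set(word)
        if syn_set is None:
            continue                    # W' has no synonyms: not a carrier
        entry = next(entries, None)
        if entry is None:
            break                       # SIT exhausted: later words were never changed
        width = code_width(syn_set)
        # The index of W' in its synonym set is the embedded chunk.
        bits.append(format(syn_set.index(word), f"0{width}b"))
        # The SIT entry is the original index t, which gives back W.
        cover[i] = syn_set[int(entry, 2)]
    return cover, "".join(bits)[:n_bits]

cover, bits = extract_and_restore("I love this large house".split(), ["00", "0"], 3)
print(" ".join(cover), bits)   # I like this big house 011
```

A single pass over the stego text therefore yields both outputs of the extraction process: the (still encrypted) secret bits and the exact cover text.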
EXPERIMENTAL RESULTS AND DISCUSSION
The proposed scheme was implemented in Visual C++ 6.0, used the trash space of an MS Office Word 2007 document to save the cover text, and was run on a hardware platform with a 2.2 GHz Pentium Dual CPU and 1 GB of RAM. Since the synonym substitution method was used to embed the secret information, a dictionary was needed. The most widely known such dictionary is WordNet (Fellbaum, 1998). In WordNet, English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing an underlying lexical concept. The content of WordNet 2.0 is summarized in Table 1 (Jurafsky and Martin, 2000). In this scheme, the WordNet 2.0 lexical database was used to obtain synonym sets for substitution.
In the experiments, a short novel saved in MS Office Word 2007 format, shown in Fig. 5a, was used as the cover document. The secret information to be embedded was assumed to be "YB2011", and RC4 was used to encrypt it.
Table 1: WordNet 2.0 database statistics
After steps 1 and 2 of the embedding process, the substitution sets of the words to be substituted were obtained. To shorten the code length and improve coding efficiency, Huffman coding was used to encode the substitution sets. As an example, the word "humans" and its synonyms "mankind" and "humankind" were assigned the Huffman codes 0, 01 and 10. The encoder created the corresponding SIT, as can be seen in Fig. 5b, and embedded it into the document's trash space.
As the cover text was accurately preserved, the secret information could then be embedded into the text by synonym substitution.
Figure 8 shows the stego text generated by our proposed scheme and the cover text is shown in Fig. 5a. The words which are underlined were substituted in the embedding process.
Security, robustness and capacity: Three aspects of an information-hiding scheme contend with each other: capacity, security and robustness (Provos and Honeyman, 2003). In the proposed scheme, the cover text is modified by synonym substitution, i.e., replacing a word with a word of similar meaning; this may make the text anomalous at the document level, or anomalous with regard to the state of the world (Chang and Clark, 2010). The content hidden in the marked MS Office document produced by the proposed method is not shown on the screen display. Moreover, the embedded document can resist Format, Impersonation, Save As, Copy and other active attacks (Fu et al., 2011). Hence, by combining the above two methods, the proposed scheme achieves good security and robustness.
Capacity, one of the most important aspects of data hiding, refers to the number of secret information bits that can be hidden in the cover medium (Yang et al., 2011). In this paper, we discuss two aspects. First, we consider the capacity of data hiding in the MS Office 2007 document. In an MS Office 2007 document, a file can be compressed and concealed in the package by using unknown parts and unknown relationships (Park et al., 2009). The hidden file size is not restricted, so the size of the SIT file is theoretically unlimited. Furthermore, the SIT file, which has been reversibly compressed, is generally very small, so it is hard to detect once concealed in the MS Office 2007 document. Second, we consider the capacity of data hiding in the cover text by the synonym substitution method.
Fig. 8: A stego text generated by our proposed scheme
Fig. 9: Embedding bit rate in different strategies
The capacity depends directly on the cover text size and the synonym substitution strategy: the longer the cover text, the higher the capacity. Therefore, we consider only the embedding bit rate under different substitution strategies. In this study, two strategies were compared, one strict and the other relatively loose. For simplicity, we name them strategy 1 and strategy 2, respectively. The only mathematically formal type of linguistic synonymy is when the compared words can replace each other in any context without any change in meaning. These are absolute synonyms, e.g., agleam and nitid (Bolshakov and Gelbukh, 2004; Liu et al., 2008). Words that may change the meaning when replaced are named non-absolute synonyms. In strategy 1, the strict one, a word can be replaced only by an absolute synonym. In strategy 2, a word can also be replaced by a non-absolute synonym.
A set of experiments was conducted to measure the data embedding bit rate of the proposed method under the above two strategies. Ten theses, five journal papers and five newspapers were selected as the experimental cover texts. Figure 9 depicts the embedding bit rate under the different strategies. The embedding bit rate of strategy 2 is higher than that of strategy 1 because more words are substituted. However, a higher embedding bit rate generally leads to poorer imperceptibility. In the proposed scheme, strategy 1 is used to obtain the best imperceptibility, and strategy 2 is selected only when the capacity is insufficient.
One common drawback of virtually all data hiding methods is that the cover text is inevitably distorted by the embedding process. Although this distortion is often quite small, it may not be acceptable in military, legal and similar applications. In this study, a novel scheme is developed for reversible data hiding. It enables the exact recovery of the cover text upon extraction of the embedded information. The scheme is achieved by simultaneously using two common methods of information hiding. The cover text is transformed into the compressed SIT file, which is embedded as an unknown part with an unknown relationship in the OOXML file. Then, the secret information is embedded into the cover document by synonym substitution.
At this stage, the authors have implemented only the combination of the synonym substitution and file structure methods. Given the rapid improvements in natural language data hiding techniques, more experiments will be carried out in the future, and combinations of other methods may be used to achieve higher performance.
This study is supported by the National Natural Science Foundation of China (60736016, 60973128, 61173142, 61103215, 61070196, 61173141, 61172156 and 61173136), the National Basic Research Program 973 (2010CB334706 and 2011CB311808), the PAPD fund, the 3rd Guangdong Province 211 program for key subject development, the Scientific Research Fund of Xiangtan University (No. 11QDZ41), the Hunan Provincial Education Department (No. 11C1215) and the Hunan Science and Technology Department (No. 2011GK3205).