INTRODUCTION
In the 21st century, communications have expanded thanks to the development
of new technologies such as computers, the internet and mobile phones.
As these technologies are used in different areas of life and work, the issue
of information security has gained special significance. Hidden exchange
of information is one of the important areas of information security and
includes various methods such as cryptography, steganography and coding.
In steganography, the main objective is to hide the information in a cover
medium so that nobody notices the existence of the secret information.
This is the major distinction between steganography and the other methods
of hidden exchange of information. For example, in cryptography,
people become aware of the existence of information by observing the coded
information, although they are unable to comprehend it.
In steganography, however, nobody is aware that any information exists
in the cover medium.
Steganography has been applied to different media such as
images, videos and sounds (Hopper, 2004). Text steganography is the most
difficult kind of steganography because a text file contains very little
redundant information compared with a picture or a sound file (Bender et
al., 1996).
Today, the security of information has been considerably improved
by combining steganography with the other methods mentioned. In addition
to hidden exchange of information, steganography is used in other areas
such as copyright protection and preventing the forging of e-documents
(Maxemchuk and Low, 1997).
The structure of a text document is identical to what we observe, while
in other types of documents, such as pictures, the structure of the document
differs from what we observe. Therefore, in such documents, we can
hide information by introducing changes in the structure of the document
without making a notable change in the output.
Unlike other media such as sounds and video clips, text documents
have been in common use since ancient times. This continues today, and
text is still preferred over other media because texts occupy
less space, communicate more information and cost less to print,
among other advantages.
As both the use of text and hidden communication go back to antiquity,
steganography in texts has a long history. For example, it
was practiced by some classical Iranian poets.
Today, computer systems have facilitated information hiding in texts.
The applications of information hiding in text have also expanded from
hiding information in electronic texts and documents to hiding information
in web pages.
Most text steganography methods are designed for English texts
and there are only a few text steganography methods for other languages. In
this study, we propose a new text steganography method for Persian and
Arabic texts. This method hides data in Persian and Arabic texts that
are stored in Unicode format.
Our method is based on the fact that letters in the Persian and Arabic
languages have different shapes in different positions within a word. The
Unicode Standard supports this feature, and we use this support
to hide data in Persian and Arabic texts.
Only a few works have addressed hiding information in texts. The following
is a list of the methods reported
thus far for English text. After explaining these methods, we survey the
text steganography methods that are specifically designed for Persian and
Arabic texts.
Steganography of information in random character and word sequences
(Bennett, 2004): By generating a random sequence of characters or
words, specific information can be hidden in this sequence.
In this method, the sequence of characters or words is random; therefore
it is meaningless and attracts a great deal of attention. This method
appears to be a kind of encryption rather than steganography.
Syntactic methods (Bennett, 2004): By placing punctuation
marks such as the full stop (.) and comma (,) in proper places, one can hide
information in a text file.
This method requires identifying proper places for the punctuation
marks. The amount of information that can be hidden with this method is
very small.
Line shifting (Low et al., 1995; Alattar and Alattar,
2004): In this method, the lines of the text are vertically shifted
to some degree (for example, each line is shifted 1/300 inch up or down) and
information is hidden by creating a unique shape of the text. This method
is suitable for printed texts.
However, in this method, the distances can be observed with special
distance-measuring instruments and changes can then be introduced
to destroy the hidden information. Also, if the text is retyped or if
optical character recognition (OCR) programs are used, the hidden
information is destroyed.
Word shifting (Low et al., 1995; Kim et al., 2003):
In this method, information is hidden in the text by shifting words
horizontally and changing the distance between words.
This method is acceptable for texts in which the distance between words
varies. It is less likely to be noticed, because changing the distance
between words to fill a line is quite common. But if somebody is aware
of the algorithm used for the distances, he can compare the present text
with the algorithm and extract the hidden information from the difference.
The text image can also be closely studied to identify the changed distances.
Although this approach is very time consuming, there is a high probability
of finding the information hidden in the text. As with the Line Shifting
method, retyping the text or using OCR programs destroys the hidden
information.
Semantic methods (Bennett, 2004): In this method, certain words are
replaced by their synonyms, thereby hiding information in the text. A major
advantage of this method is that the information survives retyping
or the use of OCR programs (unlike the Line Shifting and Word Shifting methods).
However, this method may alter the meaning of the text.
Feature coding (Rabah, 2004): In this method, some of the features
of the text are altered. For example, the end part of some characters
such as h, d or b is elongated or shortened a little, thereby
hiding information in the text. In this method, a large volume of information
can be hidden in the text without making the reader aware of the existence
of such information.
If the characters are set in a fixed shape, the information is lost. Retyping
the text or using an OCR program (as in the Line Shifting and Word Shifting
methods) also destroys the hidden information.
Abbreviation (Bender et al., 1996):
Another method for hiding information is the use of abbreviations. In
this method, very little information can be hidden in the text. For example,
only a few bits can be hidden in a file of several kilobytes.
Open spaces (Bender et al., 1996; Huang and Yan, 2001):
In this method, information is hidden by adding extra white-spaces
to the text. These white-spaces can be placed at the end of each line,
at the end of each paragraph or between the words. This method can be
applied to any arbitrary text and does not attract the attention of the
reader.
However, the volume of information hidden with this method is very small.
Also, some text editors automatically delete extra white-spaces
and thus destroy the hidden information.
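As an illustration of the open spaces idea, the following minimal Python
sketch hides one bit per line by adding (bit 1) or omitting (bit 0) a single
trailing space. The function names and the one-bit-per-line convention are
assumptions made for this example and are not taken from the cited works.

```python
def embed_trailing_spaces(cover, bits):
    """Hide one bit per line: a trailing space encodes 1, no trailing space encodes 0."""
    lines = cover.split("\n")
    if len(bits) > len(lines):
        raise ValueError("cover text has too few lines for the message")
    out = []
    for i, line in enumerate(lines):
        line = line.rstrip(" ")  # normalise existing trailing spaces
        if i < len(bits) and bits[i] == 1:
            line += " "
        out.append(line)
    return "\n".join(out)


def extract_trailing_spaces(stego, n_bits):
    """Read the first n_bits lines back as bits."""
    return [1 if line.endswith(" ") else 0 for line in stego.split("\n")[:n_bits]]
```

The sketch also shows why the method is fragile: any editor that trims
trailing white-space turns every hidden 1 into a 0.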
To the best of the authors' knowledge, only four Persian and
Arabic text steganography methods have been reported in the literature.
The first two methods were developed by the authors of this article.
Dot steganography method (Shirali-Shahreza and Shirali-Shahreza, 2006a):
In this method, data is hidden in Persian and Arabic texts by using a special
characteristic of these languages. Because many Persian and Arabic
characters have dots, information is hidden in the text by vertical
displacement of the dots (Fig. 1).
This method does not attract attention and can hide a large volume of
information in text.
La steganography method (Shirali-Shahreza, 2007): This method uses
the special form of the word La for hiding
the data. This word is created by connecting the Lam and Alef characters. To hide
bit 0, we use the normal form of the word La (لـا) by inserting the Arabic extension
character between the Lam and Alef characters. To hide bit 1, we use the
special form of the word La (ﻻ), which has its own code in the Unicode Standard
(FEFB in hexadecimal notation). This method is not limited to electronic
documents (e-documents) and can also be used on printed documents.
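The La substitution described above can be sketched in a few lines of Python.
This is only an illustrative sketch: the function names are hypothetical, every
Lam-Alef pair is treated as a carrier regardless of its position in the word,
and pairs left over after the message ends are kept unchanged so that they
carry no bit.

```python
LAM, ALEF = "\u0644", "\u0627"
TATWEEL = "\u0640"        # Arabic extension character (Kashida)
LA_LIGATURE = "\ufefb"    # special form of La with its own code point


def embed_la(cover, bits):
    """Replace Lam-Alef pairs: Lam+extension+Alef hides 0, U+FEFB hides 1."""
    pending = list(bits)
    out, i = [], 0
    while i < len(cover):
        if pending and cover.startswith(LAM + ALEF, i):
            bit = pending.pop(0)
            out.append(LA_LIGATURE if bit == 1 else LAM + TATWEEL + ALEF)
            i += 2
        else:
            out.append(cover[i])
            i += 1
    if pending:
        raise ValueError("cover text has too few La occurrences")
    return "".join(out)


def extract_la(stego):
    """Scan the stego text and read each modified La back as a bit."""
    bits, i = [], 0
    while i < len(stego):
        if stego[i] == LA_LIGATURE:
            bits.append(1)
            i += 1
        elif stego.startswith(LAM + TATWEEL + ALEF, i):
            bits.append(0)
            i += 3
        else:
            i += 1
    return bits
```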
Pointed letters with extension method (Gutub and Fattani, 2007):
This method uses
pointed letters with an extension (Kashida in Arabic) to hold secret
bit one and un-pointed letters with an extension to hold secret bit zero.
Fig. 1: Vertical displacement of the dots of the Persian letter
NOON (Shirali-Shahreza and Shirali-Shahreza, 2006a)
Fig. 2: Example of the text steganography method proposed by
Gutub and Fattani (2007) (adding extensions after letters)
Note that the letter extension has no effect on the written content.
Its hexadecimal character code in the Unicode system is 0640.
The extension is added before (or after) a pointed letter that can
be extended to hide bit 1, and before (or
after) an un-pointed letter to hide bit 0. Figure 2
shows an example of this method.
Diacritic Arabic method (Aabed et al., 2007): In this method,
a diacritized Arabic text is
used for hidden exchange of information. There are eight diacritics in
Arabic text. The most frequent diacritic in Arabic text is Fatha, and
the probability of its occurrence is approximately equal to the combined
occurrence probability of the other seven diacritics (Aabed et al., 2007).
In this method, the cover text is first assumed to be fully
diacritized. To hide a bit 1, a Fatha is kept, and to hide a bit 0,
a non-Fatha diacritic is kept; the other diacritics are removed. So, in
the stego text each Fatha represents a 1 and each non-Fatha diacritic
represents a 0.
The main advantage of this method is its high capacity. Its main
disadvantage is that it attracts the attention of the reader.
This method also needs a fully diacritized text, but most Arabic texts
have no diacritics.
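The keep-or-remove rule of the diacritic method can be sketched as follows.
This is a simplified illustration, not the implementation of Aabed et al.
(2007): the function names are hypothetical, the eight diacritics are taken to
be the Unicode marks U+064B to U+0652, and all diacritics remaining after the
message ends are removed so that they are not read back as extra bits.

```python
FATHA = "\u064e"
# Fathatan, Dammatan, Kasratan, Fatha, Damma, Kasra, Shadda, Sukun
DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")


def embed_diacritics(cover, bits):
    """Keep a Fatha for bit 1 or a non-Fatha diacritic for bit 0; drop the rest."""
    pending = list(bits)
    out = []
    for ch in cover:
        if ch not in DIACRITICS:
            out.append(ch)
        elif pending and (ch == FATHA) == (pending[0] == 1):
            out.append(ch)      # this diacritic encodes the current bit
            pending.pop(0)
        # any other diacritic is dropped
    if pending:
        raise ValueError("cover text has too few suitable diacritics")
    return "".join(out)


def extract_diacritics(stego):
    """Every remaining diacritic is one bit: Fatha is 1, anything else is 0."""
    return [1 if ch == FATHA else 0 for ch in stego if ch in DIACRITICS]
```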
MATERIALS AND METHODS
In this study, we present a new method for text steganography in Persian
and Arabic Unicode texts.
Before explaining the method, we describe the main characteristics of
these two languages (Shirali-Shahreza and Shirali-Shahreza, 2006b). Then
we briefly explain the Unicode Standard and finally we explain our proposed
method in full detail.
The characteristics of Persian and Arabic: The Arabic alphabet has
28 letters. Persian has all the letters of Arabic and four more letters
(Unicode codes 06AF, 0698, 0686 and 067E). In these two languages, a letter
can have four different shapes. The shape of each letter is determined
by the position of that letter in a word. For example, the letter 0639
is written as FECB at the beginning of a word, as FECC in the middle,
as FECA at the end and as FEC9 in the isolated form. We use this characteristic
of the Arabic and Persian languages in our method.
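The relation between a positional shape and its letter can be checked with
Python's standard unicodedata module, whose decomposition data links each
shape code to the letter it belongs to. The helper name describe_shape below is
an assumption made for this illustration.

```python
import unicodedata


def describe_shape(code_point):
    """Return (position, letter code) for a positional shape such as U+FECB."""
    tag, base = unicodedata.decomposition(chr(code_point)).split()
    return tag.strip("<>"), int(base, 16)


# The four shapes of the letter 0639 mentioned in the text.
for cp in (0xFEC9, 0xFECA, 0xFECB, 0xFECC):
    position, letter = describe_shape(cp)
    print(f"U+{cp:04X} is the {position} form of U+{letter:04X}")
```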
In Persian and Arabic the letters are connected to each other in writing,
while in English the letters are written separately.
In English, the letters are written in a left-to-right format and in
some languages the letters are written in a top-to-bottom format, but
in Arabic and Persian the letters are written in a right-to-left format.
In the Arabic and Persian languages, dots are very important: 17 of the 32
Persian letters (and 14 of the 28 Arabic letters) have one or more dots. Among
these 17 letters, 2 letters have 2 dots, 5 letters have 3 dots and the
remaining 10 letters have a single dot, while in English only two lowercase
letters, i and j, have a dot.
In Persian and Arabic, some letters do not connect to each other. The
Zero Width Joiner (ZWJ) is a non-printing character: when it is placed
between two characters that would otherwise not be connected, it causes
them to be printed in their connected forms. The Unicode code of the ZWJ
is U+200D. We use this character in our method.
Unicode Standard: The Unicode Standard (The Unicode Consortium,
2006) is the international character-encoding standard used for representing
texts for computer processing. This standard is compatible with the
second edition of ISO/IEC 10646-1:2000 and has the same characters and
codes as ISO/IEC 10646.
The Unicode Standard enables us to encode the characters used in
writing the languages of the world. In its 16-bit encoding form, this standard
provides space for about 65000 characters. So, it is possible to specify
and define 65000 characters of different kinds, such as numbers, letters,
symbols and a great number of characters currently used in different languages
of the world.
The Unicode Standard has assigned codes to all the characters used
in the main languages of the world. Moreover, because of the size of the
code space, this standard also includes most of
the symbols necessary for high-quality typesetting. The languages whose
writing systems can be supported by this standard are Latin (covering
most of the European languages), Cyrillic (Russian and Serbian), Greek,
Arabic (including Arabic, Persian, Urdu, Kurdish), Hebrew, Indian, Armenian,
Assyrian, Chinese, Katakana, Hiragana (Japanese) and Hangeul (Korean).
Moreover there are a lot of mathematical and technical symbols, punctuation
marks, arrows and miscellaneous marks in this standard.
In the Unicode Standard, the Persian characters belong to the Arabic
block. This block has been developed to cover the characters of the languages
which use Arabic writing system. Among these languages we can mention
Persian, Urdu, Pashto, Sindhi and Kurdish.
This standard gives detailed and careful explanations of implementation
methods, including how letters are connected and how right-to-left
and bidirectional texts are displayed. In this way programmers do not have
to refer to local guides.
In the Unicode Standard, each Persian or Arabic letter has its own unique
code. Each shape of each letter also has its own code. For example,
the code of the letter Seen (س) in the Unicode Standard is 0633 and
the codes of its different forms are FEB1 for the isolated form (ﺱ),
FEB2 for the final form (ﺲ), FEB3 for the initial form (ﺳ)
and FEB4 for the medial form (ﺴ).
When a document is saved in the Unicode Standard, only the unique code
of each letter is saved, and the program that displays the text
shows the correct shape of each letter according to its position in the word.
Our method: As described earlier, each Persian or Arabic letter
can have four different shapes depending on its position in the word.
Each letter has one unique code that acts as the representative of the
letter. In addition, each of the four possible shapes
of the letter (the initial form, the medial form,
the final form and the isolated form) has its own separate code in the
Unicode Standard.
Normally, only the codes of the representative letters
are saved in the text file, and the program that displays the text
shows the correct shape of each letter according to its position in the word.
However, one can also save the text by inserting the code
of the correct shape of each letter (according to its position in the word)
instead of the representative letter code. In that case, the text viewer
(the program that displays the text) does not determine the shape
automatically and simply shows the shape that corresponds to the saved
code.
The method proposed in this study for hiding data in Persian and Arabic
Unicode texts uses this feature of the Unicode Standard.
Each letter in the text can be saved by using its representative
letter code. It can also be saved by using the code of the correct shape
of the letter (according to its position in the word).
To hide bit 0 in a letter, the first option is used for saving the
letter. To hide bit 1 in a letter, the second option is used.
However, when representative letter codes and shape codes are mixed
in one word, the text viewer does not
select the correct shape of the representative letters automatically and shows
them in isolated form.
To solve this problem, we insert the ZWJ character between the two
letters to connect them. Because this character is a non-printing
character, this method does not make any apparent change in
the original text and has perfect perceptual transparency. A sample
of the process of hiding data in a word is shown in Table
1.
To extract data from the stego text (the text that contains the hidden data),
the code of each letter is checked. If the code is a representative letter
code, we conclude that bit 0 is hidden in the letter; if the code is a shape
code (not the code of the representative letter), we conclude that bit 1 is
hidden in the letter. By putting all the 0 and 1 bits next to each
other we obtain the information hidden in the text.
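The hiding and extraction steps described above can be sketched in Python as
follows. This is a minimal sketch under stated assumptions, not the authors'
implementation: the names build_shape_table, contextual_form, embed and extract
are hypothetical; the table of shape codes is derived from the decomposition
data in Python's unicodedata module; a letter is assumed to connect to the next
letter only if it has initial and medial shapes; the cover text is assumed to
contain no diacritics and no shape codes of its own; and Lam-Alef ligatures are
not handled.

```python
import unicodedata

ZWJ = "\u200d"  # Zero Width Joiner
FORMS = ("isolated", "final", "initial", "medial")


def build_shape_table():
    """Map each representative letter to the codes of its positional shapes."""
    table = {}
    for cp in list(range(0xFB50, 0xFE00)) + list(range(0xFE70, 0xFF00)):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep entries such as "<initial> 0639"; skip multi-letter ligatures.
        if len(parts) == 2 and parts[0].strip("<>") in FORMS:
            base = chr(int(parts[1], 16))
            table.setdefault(base, {})[parts[0].strip("<>")] = chr(cp)
    return {b: f for b, f in table.items() if "isolated" in f}


SHAPES = build_shape_table()


def contextual_form(prev_ch, ch, next_ch, table):
    """Return (shape name, joins_prev) for ch between its two neighbours."""
    def dual(c):  # connects to the following letter
        return c in table and "initial" in table[c] and "medial" in table[c]

    def takes_final(c):  # can connect to the preceding letter
        return c in table and "final" in table[c]

    joins_prev = dual(prev_ch) and takes_final(ch)
    joins_next = dual(ch) and takes_final(next_ch)
    if joins_prev and joins_next:
        return "medial", True
    if joins_prev:
        return "final", True
    if joins_next:
        return "initial", False
    return "isolated", False


def embed(cover, bits, table=SHAPES):
    """Bit 0 keeps the representative code; bit 1 stores the shape code."""
    pending = list(bits)
    out = []
    prev_shaped = None  # encoding of the previous letter; None outside a word
    for i, ch in enumerate(cover):
        if ch not in table:
            out.append(ch)
            prev_shaped = None
            continue
        prev_ch = cover[i - 1] if i > 0 else ""
        next_ch = cover[i + 1] if i + 1 < len(cover) else ""
        form, joins_prev = contextual_form(prev_ch, ch, next_ch, table)
        shaped = bool(pending.pop(0)) if pending else False
        # Mixed encodings inside a word: ZWJ keeps the two letters connected.
        if joins_prev and prev_shaped is not None and prev_shaped != shaped:
            out.append(ZWJ)
        out.append(table[ch][form] if shaped else ch)
        prev_shaped = shaped
    if pending:
        raise ValueError("cover text has too few letters for the message")
    return "".join(out)


def extract(stego, table=SHAPES):
    """Representative code reads as 0, shape code as 1; other characters are skipped."""
    shape_codes = {s for forms in table.values() for s in forms.values()}
    bits = []
    for ch in stego:
        if ch in table:
            bits.append(0)
        elif ch in shape_codes:
            bits.append(1)
    return bits
```

In this sketch, letters that follow the end of the message keep their
representative codes and are therefore read back as zeros, so the receiver is
assumed to know the message length and to truncate the extracted bit list
accordingly.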
Our method has a very high hiding capacity, because we hide one bit in
each letter. We now estimate the amount of data that can be hidden in a Persian
text. Assume that there are k words in the document and each word has
α letters on average. After each word, there is a space or a punctuation
mark such as a comma. So, the size of the text is 2k(α+1) bytes, because
there are (α+1) characters for each word and each character needs
two bytes in Unicode.
Table 1: Hiding 101 in a word
In our method, we hide one bit in each letter of the word. So, we hide
α bits in each word on average and a total of kα bits in the
document. Therefore the hiding capacity of our method in bit/kilobyte
is:
\[ \text{Capacity} = \frac{k\alpha}{2k(\alpha+1)/1024} = \frac{512\,\alpha}{\alpha+1} \]
The average number of letters in each word (α) is 3.5 in Persian
(Shirali-Shahreza, 1996). So, we have:
\[ \text{Capacity} = \frac{512 \times 3.5}{3.5+1} \approx 398 \]
This means that our method can hide about 400 bits of information in
each kilobyte of a Persian text.
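The estimate can be reproduced with a short calculation. The function name
capacity_bits_per_kilobyte is an assumption made for this example; the formula
is the one that follows from the text size of 2k(α+1) bytes and the kα hidden
bits stated above.

```python
def capacity_bits_per_kilobyte(alpha):
    """Hidden bits per 1024 bytes: alpha bits per word of 2 * (alpha + 1) bytes."""
    return 1024 * alpha / (2 * (alpha + 1))


# Average Persian word length reported in the text.
print(round(capacity_bits_per_kilobyte(3.5)))
```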
RESULTS AND DISCUSSION
In our method, the information is hidden in Persian and Arabic texts
using the Unicode Standard.
We tested our method on some Persian text files. We selected the resources
that were used for our earlier Persian and Arabic text steganography methods
(Shirali-Shahreza and Shirali-Shahreza, 2006a; Shirali-Shahreza, 2007)
in order to compare these methods.
These resources, which are used to compute the data hiding capacity of the
methods, are the sport pages of some Iranian newspapers.
The Internet addresses of these newspapers and the capacity of each text
for hiding data are shown in Table 2. All of the pages
were retrieved on 20 August 2005.
Table 2 shows that we can hide about 400 bits in each
kilobyte of text.
As can be seen in Table 2, the capacity of our method is
very high, especially in comparison with the La steganography method
(Shirali-Shahreza, 2007). In our method we hide one bit of information in
each Persian or Arabic letter, whereas in the Dot steganography method one
bit of information is hidden only in each letter that has a dot.
Table 2: Comparing the capacity of our method with the Dot Steganography
and La Steganography methods
So, the capacity of the present method is on average four times
higher than that of the Dot steganography method (Shirali-Shahreza and
Shirali-Shahreza, 2006a).
Our method also has other advantages over these methods. For example, unlike
the Dot steganography method (Shirali-Shahreza and Shirali-Shahreza,
2006a), this method does not change the appearance of the text and does
not require a specific font.
CONCLUSIONS
In this study, a new method for Steganography in Persian and Arabic texts
has been presented. This method uses the Unicode Standard and the special
feature of Persian and Arabic languages that each letter has different
shapes.
This method is not dependent on any special format and we can save the
stego text in numerous formats such as HTML pages, Microsoft Word documents
or even plain text format. Because the stego Unicode texts will not change
during copy and paste between computer programs, the data hidden in texts
remains intact during these operations.
There are three important parameters in designing steganography methods:
perceptual transparency, robustness and hiding capacity. These requirements
are known as the magic triangle and conflict with one another (Cvejic, 2004).
Our method satisfies both the perceptual transparency and the hiding capacity
requirements. Hiding data does not make any apparent change in the original
text. So, even if the reader has the original text, it is impossible
for him to detect the hidden data merely by observing the appearance
of the text. Moreover, in text steganography the original text is usually
not available to the observer. Therefore, the main goal of text
steganography, namely that the presence of data cannot be detected,
has been achieved. Also, we hide one bit per letter in the text file,
so our method has a high capacity.
In some steganography methods, standard structure of the text will be
disarranged and spelling and grammatical errors will be created in the
text, but in this method the appearance of the text is not changed at
all and the text still remains standard.
The Unicode Standard supports different languages and can be used on
the many systems and devices that support it.
Moreover, Arabic is the official language of the Muslims and about
two billion Muslims live throughout the world. As a result, a wide range
of users can use our method.
Since Pashto (the official language of Afghanistan) and Urdu (the official
language of Pakistan) are similar to Arabic and Persian, we can also apply
this method to these two languages.
In addition, the Arabic and Persian languages have other specific
characteristics that can be used for text steganography.
This method can be used for secret communication and for the prevention
of the illegal reproduction and distribution of the texts, especially
e-documents as well (Brassil et al., 1994).
INTRODUCTION
In the 21st century, communications are expanded because of developing
new technologies such as computers, the internet, mobile phones, etc.
By using these technologies in different areas of life and work, the issue
of information security has gained special significance. Hidden exchange
of information is one of the important areas of information security which
includes various methods like cryptography, steganography and coding.
In steganography, the main objective is to hide the information in cover
media so that nobody notices the existence of the secret information.
This is the major distinction between steganography and other methods
of hidden exchange of information. For example, in cryptography method,
people become aware of the existence of information by observing coded
information, although they will be unable to comprehend the information.
However, in steganography, nobody will understand the existence of information
in the resources.
Steganography works have been carried out on different medium such as
images, videos and sounds (Hopper, 2004). Text steganography is the most
difficult kind of steganography because there is a little redundant information
in a text file as compared with a picture or a sound file (Bender et
al., 1996).
Of course today, the security of information has been considerably improved
by combination of steganography with other methods mentioned. In addition
to hidden exchange of information, steganography is used in other areas
such as copyright protection, preventing e-document forging and other
applications (Maxemchukand and Low, 1997).
The structure of text documents is identical with what we observe, while
in other types of documents such as in picture, the structure of document
is different from what we observe. Therefore, in such documents, we can
hide information by introducing changes in the structure of the document
without making a notable change in the concerned output.
Contrary to other media such as sounds and video clips, using text documents
has been common since very old times. This has extended until today and
still, using text is preferred over other media, because the texts occupy
lesser space, communicate more information and need less cost for printing
as well as some other advantages.
As the use of text and hidden communication goes back to antiquity, we
have witnessed to steganography in texts since past. For example, this
method has been done by some Iranian classic poets as well.
Today, the computer systems have facilitated information hiding in texts.
The applications of information hiding in text have also expanded from
hiding information in electronic texts and documents to hide information
in web pages.
Most of the text steganography methods are designed for English texts
and there are a few text steganography methods for other languages. In
this study, we propose a new text Steganography method for Persian and
Arabic texts. This method hides data in Persian and Arabic texts which
are stored in Unicode format.
Our method is based on the fact that some letters in Persian and Arabic
languages have different shapes in different places of words. This feature
of Persian and Arabic languages is supported in Unicode format. This feature
is used of Unicode standard for hiding data in Persian and Arabic texts.
A few works have been done on hiding information in texts. Following
is the list of nine different methods of the works carried out and reported
thus far for English text. After explaining these methods, the text Steganography
methods that are especially designed for Persian and Arabic texts are
surveyed.
Steganography of information in random character and word sequences
(Bennett, 2004): By generating a random sequence of characters or
words, specific information can be hidden in this sequence.
In this method, the characters or words sequence is random; therefore
it is meaningless and attracts the attentions too much. It seems to be
that this method is not steganography, but it is a kind of encryption.
Syntactic methods (Bennett, 2004): By placing some punctuation
signs such as full stop (.) and comma (,) in proper places, one can hide
information in a text file.
This method requires identifying proper places for putting punctuation
signs. The amount of information to hide in this method is trivial.
Line shifting (Low et al., 1995) and (Alattar and Alattar,
2004): In this method, the lines of the text are vertically shifted
to some degree (for example, each line shifts 1/300 inch up or down) and
information are hidden by creating a unique shape of the text. This method
is proper for printed texts.
However, in this method, the distances can be observed by using special
instruments of distance assessment and necessary changes can be introduced
to destroy the hidden information. Also if the text is retyped or if character
recognition programs (OCR) are used, the hidden information would get
destroyed.
Word shifting (Low et al., 1995) and (Kim et al., 2003):
In this method, by shifting words horizontally and by changing distance
between words, information are hidden in the text.
This method is acceptable for texts where the distance between words
is varying. This method can be identified less, because change of distance
between words to fill a line is quite common. But if somebody was aware
of the algorithm of distances, he can compare the present text with the
algorithm and extract the hidden information by using the difference.
The text image can be also closely studied to identify the changed distances.
Although this method is very time consuming, there is a high probability
of finding information hidden in the text. Similar to Line Shifting method,
retyping of the text or using OCR programs destroys the hidden information.
Semantic methods (Bennett, 2004): In this method, we use the synonym
of words for certain words thereby hiding information in the text. A major
advantage of this method is the protection of information in case of retyping
or using OCR programs (contrary to Line Shifting and Word Shifting methods).
However, this method may alter the meaning of the text.
Feature coding (Rabah, 2004): In this method, some of the features
of the text are altered. For example, the end part of some characters
such as h, d, b or so on, are elongated or shortened a little thereby
hiding information in the text. In this method, a large volume of information
can be hidden in the text without making the reader aware of the existence
of such information in the text.
By placing characters in a fixed shape, the information is lost. Retyping
the text or using OCR program (as in Line Shifting and Word Shifting methods)
destroys the hidden information.
Open spaces (Bender et al., 1996) and (Huang and Yan, 2001):
Another method for hiding information is the use of abbreviations. In
this method, very little information can be hidden in the text. For example,
only a few bits can be hidden in a file of several kilobytes.
Open spaces (Bender et al., 1996) and (Huang and Yan, 2001):
In this method, hiding information is done through adding extra white-spaces
in the text. These white-spaces can be placed at the end of each line,
at the end of each paragraph or between the words. This method can be
implemented on any arbitrary text and does not raise attention of the
reader.
However, the volume of information hidden under this method is very little.
Also, some text editor programs automatically delete extra white-spaces
and thus destroy the hidden information.
To the best knowledge of the authors, there are only four Persian and
Arabic text steganography methods that are reported in the literatures.
The first two methods were developed by the authors of this article.
Dot steganography method (Shirali-Shahreza and Shirali-Shahreza, 2006a):
In the Dot steganography method (Shirali-Shahreza and Shirali-Shahreza,
2006a), data is hidden in Persian and Arabic texts by using a special
characteristic of these languages. Considering the existence of too many
dots in Persian and Arabic characters, in this approach by vertical displacement
of the dots (Fig. 1), we hide information in the texts.
This method does not attract attention and can hide a large volume of
information in text.
La steganography method (Shirali-Shahreza, 2007): The La steganography
method (Shirali-Shahreza, 2007) uses the special form of La word for hiding
the data. This word is created by connecting Lam and Alef characters. For hiding
bit 0, we use the normal form of word La (ﻻ ) by inserting Arabic extension
character between Lam and Alef characters. But for hiding bit 1, we use the
special form of word La (ﻻ ) which has a unique code in the Unicode Standard
(its code is FEFB in Unicode hex notation). This method is not limited to electronic
documents (e-documents) and can be used on printed documents.
Pointed letters with extension method (Gutub and Fattani, 2007):
The pointed letters with extension method (Gutub and Fattani, 2007), uses
the pointed letters with extension (Kashida in Arabic) to hold secret
bit one and the un-pointed letters with extension to hold secret bit zero.
|
Fig. 1: |
Vertical displacement of the dots of the Persian letter
NOON (Shirali-Shahreza, 2006a) |
|
Fig. 2: |
Example of text steganography method proposed in by
Gutub and Fattani (2007) (adding extensions after letters) |
Note that the letter extension has no effect on the written content.
Its standard hexadecimal character code in the Unicode system is 0640.
To hide bit 1, the extension is added before (or after) a pointed letter
that can be extended with the extension character; to hide bit 0, it is
added before (or after) an un-pointed letter. Figure 2
shows an example of this method.
Diacritic Arabic method (Aabed et al., 2007): In the diacritic
Arabic method (Aabed et al., 2007), a diacritized Arabic text is
used for hidden exchange of information. There are eight diacritics in
Arabic text. The most frequent diacritic in Arabic text is Fatha, and
its probability of occurrence is about equal to that of the other
seven diacritics combined (Aabed et al., 2007).
In this method, the cover text is first assumed to be a fully
diacritized text. To hide a bit 1, a Fatha is kept, and to hide a bit 0,
a non-Fatha diacritic is kept; the other diacritics are removed. So, in
the stego text each Fatha represents a 1 and each non-Fatha diacritic
represents a 0.
The main advantage of this method is its high capacity. Its main
disadvantage is that it attracts the attention of the reader.
This method also needs a fully diacritized text, but most Arabic texts
have no diacritics.
MATERIALS AND METHODS
In this study, we present a new method for text steganography in Persian
and Arabic Unicode texts.
Before explaining the method, we describe the main characteristics of
these two languages (Shirali-Shahreza and Shirali-Shahreza, 2006b). Then
we briefly explain the Unicode Standard and finally we explain our
suggested method in full detail.
The characteristics of Persian and Arabic: The Arabic alphabet has
28 letters. Persian has all the letters of Arabic and four more letters
(Unicodes: 06AF, 0698, 0686, 067E). In these two languages, a letter
can have four different shapes. The shape of each letter is determined
by the position of that letter in a word. For example, the letter 0639
is written as FECB at the beginning of a word, as FECC in the middle,
as FECA at the end and as FEC9 in the isolated form. We use this characteristic
of the Arabic and Persian languages in our method.
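As an illustration of the four shape codes quoted above, the following
Python sketch looks them up in the Unicode character database through the
standard unicodedata module. The constant names are chosen here for
illustration only.

```python
import unicodedata

AIN = "\u0639"                                    # representative code
AIN_SHAPES = ["\uFECB", "\uFECC", "\uFECA", "\uFEC9"]

print(f"U+{ord(AIN):04X}", unicodedata.name(AIN))
for shape in AIN_SHAPES:
    # Compatibility normalization maps every shape code back to U+0639.
    base = unicodedata.normalize("NFKC", shape)
    print(f"U+{ord(shape):04X}", unicodedata.name(shape),
          "->", f"U+{ord(base):04X}")
```

The character names printed by the script state the position of each shape
(initial, medial, final or isolated), which matches the order given in the
text.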
In Persian and Arabic the letters are connected to each other in writing,
while in English the letters are written separately.
In English, the letters are written in a left-to-right format and in
some languages the letters are written in a top-to-bottom format, but
in Arabic and Persian the letters are written in a right-to-left format.
In the Arabic and Persian languages, dots are very important: 17 of the 32
Persian letters (and 14 of the 28 Arabic letters) have one or more dots.
Among these 17 letters, 2 letters have 2 dots, 5 letters have 3 dots and
the remaining 10 letters have a single dot, while in English only two
lowercase letters, i and j, have a dot.
In Persian and Arabic, some letters are not connected to each other. The
Zero Width Joiner (ZWJ) is a non-printing character: when it is placed
between two characters that would otherwise not be connected, it causes
them to be printed in their connected forms. The Unicode code of the ZWJ
is U+200D. We use this character in our method.
Unicode Standard: The Unicode Standard (The Unicode Consortium,
2006) is the international character-encoding standard used for representing
text for processing by computers. This standard is compatible with the
second version of ISO/IEC 10646-1:2000 and has the same characters and
codes as ISO/IEC 10646.
The Unicode Standard enables us to encode all the characters used in
writing the languages of the world. This standard uses a 16-bit encoding,
which provides space for about 65000 characters. So, it is possible to
specify and define 65000 characters in different categories such as
numbers, letters, symbols and a great number of current characters in
different languages of the world.
The Unicode Standard has assigned codes to all the characters used
in the main languages of the world. Moreover, because of the size of the
space dedicated to the characters, this standard also includes most of
the symbols necessary for high-quality typesetting. The languages whose
writing systems are supported by this standard include Latin (covering
most of the European languages), Cyrillic (Russian and Serbian), Greek,
Arabic (including Arabic, Persian, Urdu, Kurdish), Hebrew, Indian, Armenian,
Assyrian, Chinese, Katakana, Hiragana (Japanese) and Hangeul (Korean).
Moreover, there are many mathematical and technical symbols, punctuation
marks, arrows and miscellaneous marks in this standard.
In the Unicode Standard, the Persian characters belong to the Arabic
block. This block has been developed to cover the characters of the languages
which use Arabic writing system. Among these languages we can mention
Persian, Urdu, Pashto, Sindhi and Kurdish.
This standard gives detailed and careful explanations of the implementation
methods, including the method of connecting letters and the display of
right-to-left and bidirectional texts. This way, programmers do not have
to refer to local guides.
In the Unicode Standard, each Persian or Arabic letter has its own unique
code. Also, every shape of each letter has its own code. For example,
the code of the letter Seen (ﺱ) in the Unicode Standard is 0633 and
the codes of its different forms are FEB1 for the isolated form (ﺱ),
FEB2 for the final form (ﺲ), FEB3 for the initial form (ﺳ)
and FEB4 for the medial form (ﺴ).
When documents are saved in the Unicode Standard, only the unique code
of each character is saved, and the program that displays the text
shows the correct shape of each letter according to its position in the word.
Our method: As described earlier, each Persian or Arabic letter
can have four different shapes depending on its position in the word.
Each Persian or Arabic letter has one unique code, which shows the letter
in its isolated form and acts as the representative of the letter. In
addition, each of the four possible shapes of the letter (the initial form,
the medial form, the final form and the isolated form) has its own separate
code in the Unicode Standard.
In the Unicode Standard, normally only the codes of the representative
forms of the letters are saved in the text file, and the program that
displays the text shows the correct shape of each letter according to its
position in the word.
However, one can also save the text in the Unicode Standard by inserting
the code of the correct shape of each letter (according to its position in
the word) instead of its representative letter code. In that case, the text
viewer (the program that displays the text) does not determine the letter
shape automatically and only shows the letter shape that corresponds to
the code saved in the text.
The method proposed in this study for hiding data in Persian and Arabic
Unicode texts uses this feature of the Unicode Standard.
Each letter in the text can be saved in one of two ways: by using the code
of the representative letter, or by using the code of the correct shape
of the letter (according to its position in the word).
To hide bit 0 in a letter, the first option is used for saving it.
To hide bit 1, the second option is used.
However, when representative letter codes and letter shape codes are mixed
in one word, the text viewer does not select the correct shape of the
representative letters automatically and shows them in isolated form.
To solve this problem, we insert the ZWJ character between the two
letters to connect them. Because this character is a non-printing
character, this method does not make any apparent change in the original
text and has perfect perceptual transparency. A sample of the process of
hiding data in a word is shown in Table 1.
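The hiding step can be sketched in Python as follows. This is a minimal
sketch under stated assumptions, not the authors' implementation: the names
build_shape_table, joins_next, joins_prev and embed are hypothetical, the
shape codes are derived from the decomposition field of the Unicode
character database, a ZWJ is placed between two connected letters whenever
their bits differ (one reading of the rule above), the cover text is
assumed to contain only representative codes, and ligatures such as
Lam-Alef and diacritics are ignored.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER (non-printing)
FORMS = ("isolated", "final", "initial", "medial")


def build_shape_table():
    """Map (representative letter, form) -> shape code (hypothetical helper)."""
    table = {}
    # Arabic Presentation Forms-A and Forms-B blocks.
    for cp in list(range(0xFB50, 0xFE00)) + list(range(0xFE70, 0xFF00)):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter shapes such as "<initial> 0639"; skip ligatures.
        if len(parts) == 2 and parts[0].strip("<>") in FORMS:
            table[(chr(int(parts[1], 16)), parts[0].strip("<>"))] = chr(cp)
    return table


SHAPES = build_shape_table()


def joins_next(ch):
    """True if the letter can connect to the letter that follows it."""
    return (ch, "initial") in SHAPES


def joins_prev(ch):
    """True if the letter can connect to the letter that precedes it."""
    return (ch, "final") in SHAPES


def embed(cover, bits):
    """Hide one bit per letter; letters after the message carry bit 0."""
    out, remaining, last_bit = [], iter(bits), None
    for i, ch in enumerate(cover):
        if (ch, "isolated") not in SHAPES:  # space, punctuation, digit, ...
            out.append(ch)
            last_bit = None
            continue
        bit = next(remaining, "0")
        linked_prev = i > 0 and joins_next(cover[i - 1]) and joins_prev(ch)
        linked_next = (i + 1 < len(cover) and joins_next(ch)
                       and joins_prev(cover[i + 1]))
        if linked_prev and bit != last_bit:
            out.append(ZWJ)  # mixed codes: keep the two letters connected
        if bit == "1":
            form = {(False, False): "isolated", (True, False): "final",
                    (False, True): "initial", (True, True): "medial"}[
                        (linked_prev, linked_next)]
            out.append(SHAPES[(ch, form)])  # code of the positional shape
        else:
            out.append(ch)                  # representative code
        last_bit = bit
    return "".join(out)


stego = embed("\u0639\u0644\u0645", "101")  # the letters Ain, Lam, Meem
print([f"U+{ord(c):04X}" for c in stego])
# ['U+FECB', 'U+200D', 'U+0644', 'U+200D', 'U+FEE2']
```

Deriving the table from the character database avoids a hand-written list
of shape codes, at the cost of treating every character without an isolated
shape code as a non-letter.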
To extract data from the stego text (the text that contains hidden data),
the code of each letter is checked. If the code is that of a representative
letter, we conclude that bit 0 is hidden; if the code is a shape code
(not the code of the representative letter), we conclude that bit 1 is
hidden in the letter. By putting all the 0 and 1 bits next to each
other, we can extract the information hidden in the text.
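The extraction rule can be sketched in Python as follows. As before, this
is a hedged illustration rather than the authors' code: the names
shape_to_base, REPRESENTATIVES and extract are hypothetical, and shape
codes are recognised through the decomposition field of the Unicode
character database. Letters that follow the end of the message are read as
0 bits, so the receiver is assumed to know the message length.

```python
import unicodedata

FORM_TAGS = {"<isolated>", "<final>", "<initial>", "<medial>"}


def shape_to_base(ch):
    """Return the representative letter of a shape code, else None."""
    parts = unicodedata.decomposition(ch).split()
    if len(parts) == 2 and parts[0] in FORM_TAGS:
        return chr(int(parts[1], 16))
    return None


# Representative letters that have an isolated shape code of their own.
REPRESENTATIVES = set()
for cp in list(range(0xFB50, 0xFE00)) + list(range(0xFE70, 0xFF00)):
    parts = unicodedata.decomposition(chr(cp)).split()
    if len(parts) == 2 and parts[0] == "<isolated>":
        REPRESENTATIVES.add(chr(int(parts[1], 16)))


def extract(stego):
    """Read one bit per letter; ZWJ, spaces and punctuation are skipped."""
    bits = []
    for ch in stego:
        if ch in REPRESENTATIVES:
            bits.append("0")   # representative code
        elif shape_to_base(ch) is not None:
            bits.append("1")   # shape code
    return "".join(bits)


# Ain (initial shape), ZWJ, Lam (representative), ZWJ, Meem (final shape)
print(extract("\uFECB\u200D\u0644\u200D\uFEE2"))  # 101
```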
Our method has a very high hiding capacity, because we hide one bit in
each letter. We now estimate the amount of data that can be hidden in a
Persian text. Assume that there are k words in the document and each word
has α letters on average. After each word, there is a space or a punctuation
mark such as a comma. So, the size of the text is 2k(α+1) bytes, because
there are (α+1) characters for each word and each character needs
two bytes in Unicode.
Table 1: Hiding 101 in a sample word
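In our method, we hide one bit in each letter of the word. So, we hide
α bits in each word on average and a total of kα bits in the
document. Therefore, taking one kilobyte as 1024 bytes, the hiding capacity
of our method in bit/kilobyte is:
$$\text{Capacity} = \frac{k\alpha}{2k(\alpha+1)/1024} = \frac{512\,\alpha}{\alpha+1}$$
The average number of letters in each word (α) is 3.5 in Persian
(Shirali-Shahreza, 1996). So, we have:
$$\text{Capacity} = \frac{512 \times 3.5}{3.5+1} \approx 398 \text{ bit/kilobyte}$$
This means that our method can hide about 400 bits of information in
each kilobyte of a Persian text.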
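The estimate can be checked numerically with a few lines of Python. The
function name capacity_bits_per_kilobyte is hypothetical, and the sketch
assumes that one kilobyte is 1024 bytes and that every character takes two
bytes, as stated above.

```python
def capacity_bits_per_kilobyte(alpha, bytes_per_char=2, kilobyte=1024):
    """Bits hidden per kilobyte when each word has alpha letters on average."""
    bytes_per_word = bytes_per_char * (alpha + 1)  # letters plus one separator
    bits_per_word = alpha                          # one bit per letter
    return bits_per_word * kilobyte / bytes_per_word


print(round(capacity_bits_per_kilobyte(3.5), 1))  # 398.2
```

The number of words k cancels out, so the result depends only on the
average word length.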
RESULTS AND DISCUSSION
In our method, the information is hidden in Persian and Arabic texts
using the Unicode Standard.
We tested our method on some Persian text files. We selected the resources
that were used for our earlier Persian and Arabic text steganography methods
(Shirali-Shahreza and Shirali-Shahreza, 2006a; Shirali-Shahreza, 2007)
in order to compare these methods.
These resources, which were selected for computing the data hiding capacity
of the methods, include the sport pages of some Iranian newspapers.
The Internet addresses of these newspapers and the capacity of each text
for hiding data are shown in Table 2. All of the pages
were retrieved on 20 August 2005.
Table 2 shows that we can hide about 400 bits in each
kilobyte of text.
As can be seen in Table 2, the capacity of our method is
very high, especially in comparison with the La steganography method
(Shirali-Shahreza, 2007). In our method, we hide a bit of information in
each Persian and Arabic letter, but in the Dot steganography method a bit
of information is hidden only in each letter that has a dot.
Table 2: Comparing the capacity of our method with the Dot Steganography
and La Steganography methods
So, the capacity of the present method is, on average, four times
higher than that of the Dot steganography method (Shirali-Shahreza and
Shirali-Shahreza, 2006a).
Also, our method has other advantages over these methods. For example,
contrary to the Dot steganography method (Shirali-Shahreza and
Shirali-Shahreza, 2006a), this method does not change the appearance of
the text and does not require a specific font.
CONCLUSIONS
In this study, a new method for Steganography in Persian and Arabic texts
has been presented. This method uses the Unicode Standard and the special
feature of Persian and Arabic languages that each letter has different
shapes.
This method is not dependent on any special format and we can save the
stego text in numerous formats such as HTML pages, Microsoft Word documents
or even plain text format. Because the stego Unicode texts will not change
during copy and paste between computer programs, the data hidden in texts
remains intact during these operations.
There are three important parameters in designing steganography methods:
Perceptual transparency, robustness and hiding capacity. These requirements
are known as the magic triangle and are contradictory (Cvejic, 2004).
Our method satisfies both the perceptual transparency and the hiding
capacity requirements. It does not make any apparent change in the original
text by hiding data. So, even if the reader has the original text, it is
impossible for them to detect the hidden data by merely observing the
appearance of the text. Moreover, in text steganography the original text
is usually not available to the observer. Therefore, the main goal of text
steganography, which is that the presence of data cannot be detected, has
been achieved. Also, we hide one bit/letter in the text file,
so our method has high capacity.
In some steganography methods, standard structure of the text will be
disarranged and spelling and grammatical errors will be created in the
text, but in this method the appearance of the text is not changed at
all and the text still remains standard.
The Unicode Standard supports different languages, and the method can be
used on the different systems and devices that support the Unicode Standard.
Moreover, Arabic is the official language of the Muslims and about
two billion Muslims live throughout the world. As a result, a wide range
of users can use our method.
Since Pashto (an official language of Afghanistan) and Urdu (an official
language of Pakistan) use writing systems similar to Arabic and Persian,
we can also apply this method to these two languages.
In addition, the Arabic and Persian languages have other specific
characteristics which can be used for text steganography.
This method can be used for secret communication and also for preventing
the illegal reproduction and distribution of texts, especially
e-documents (Brassil et al., 1994).