Steganography is the art of hiding information within container files to conceal the existence of embedded information. Media files have been the most common containers for hidden data, so they attract a lot of scrutiny when they are transferred. Most DLP (Data Leak Prevention) systems focus on media files when checking for steganography. Word documents, on the other hand, are common enough that they can be used as containers for hidden information without raising flags.
In this paper we explore hiding secret data in a Word document by inserting multiple color tags into the file that alter the color for each character in the document to encode data without changing the visual look of the document.
Modern DLP systems can detect hidden information in media files such as images, videos or audio files by analyzing the files to detect modifications and potentially identify the hidden data. To send data without detection, a new method of hiding data needs to be found. In this paper we look at how to hide text in a Word document by modifying the color tags in the document. This allows us to exfiltrate data using Word files with a minimal risk of detection by existing tools.
Introduction and History
Steganography is the art of hiding data or a message inside another file or object. This object can be an image, text, audio or video file. The word has Greek roots, and is a combination of steganos (“concealed, protected”) and graphy (“writing”).
The first known use of steganography was in ancient Greece around 440 B.C., when the Greek ruler Histiaeus would shave the head of a slave and tattoo a secret message on the slave’s scalp. He would then wait for the hair to grow back and hide the message before sending the slave to the recipient, who would shave the head again to read it. (UK Essays, 2021) Another example from the same time period is when Demaratus sent a warning about a forthcoming attack to Greece by carving the message on the wood of a wax tablet before covering it with a fresh wax coat. This tablet, which looked blank, was delivered to Greece along with other blank tablets, where the Greeks removed the wax layer to read the hidden message. (Perera, 2011)
In more modern times, steganography was used during the Second World War by the Germans, who used microdots to reduce complete documents to the size of a dot, which was then placed on a normal-looking letter or document. Another technique used often was to encode messages in knitted scarves or sweaters sent to operatives. Every knitted garment is made of different combinations of just two stitches: a knit stitch, which is smooth and looks like a “v”, and a purl stitch, which looks like a horizontal line or a little bump. By making a specific combination of knits and purls in a predetermined pattern, spies could pass on a custom piece of fabric and read the secret message. (Zarrelli, 2021)
With the digital age, it became possible to encode messages in digital files, and steganography evolved to make use of the new medium.
How Digital Steganography works
Most digital files contain sections that can be altered without showing any obvious effects in the file. Modern techniques hide data in files by using one of the following approaches:
Adding bits to a file:
In this approach the hidden text is added to the “file header”, which usually contains information such as the file type or the resolution and color depth of a photo. This method is relatively easy to detect if we look at the file size difference. For example, if we add 1 MB of secret data to a 4 MB file, the output file size would increase by 1 MB, making the change easy to detect if the resulting file were compared with the original.
Changing the Least Significant Bit (LSB):
To avoid this change in file size, a technique was developed that makes use of the fact that the LSBs in a file can be altered without significantly altering the source, i.e. if the container is an image, the altered image looks the same to human eyes. As an example, in an image file each pixel is made up of three bytes of data corresponding to the colors red, green, and blue. LSB steganography changes the last bit of each of those bytes to hide one bit of data, which allows a user to hide data in the file without changing the file size. The same technique can be applied to other media files such as video or audio files as well.
The larger the container file, the more data can be encoded into it, which is why image, video and audio files are very popular with users of steganography: they allow large quantities of data to be hidden in a single file. The major limitation of using media files is that if the target doesn’t usually send or receive media files, it is a break in routine if they suddenly start sending or receiving such files.
Word documents and text files, on the other hand, are the bread and butter of all organizations, and every user sends and receives a lot of documents throughout the course of the day. So, if we are able to hide data in a Word file, it would be easier to exfiltrate the data.
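A minimal sketch of the LSB idea in Python, working on raw bytes instead of a real image format. The function names `embed_lsb` and `extract_lsb` and the sample pixel values are made up for illustration; a real tool would also have to parse the image container.

```python
def embed_lsb(carrier: bytes, bits: list[int]) -> bytes:
    """Overwrite the lowest bit of the first len(bits) carrier bytes."""
    if len(bits) > len(carrier):
        raise ValueError("carrier too small for payload")
    out = bytearray(carrier)
    for i, bit in enumerate(bits):
        # Clear the last bit, then set it to the payload bit.
        out[i] = (out[i] & 0xFE) | (bit & 1)
    return bytes(out)


def extract_lsb(carrier: bytes, count: int) -> list[int]:
    """Read back the lowest bit of the first `count` bytes."""
    return [byte & 1 for byte in carrier[:count]]


# Two hypothetical RGB pixels (6 bytes) carrying 6 payload bits.
pixels = bytes([200, 120, 33, 18, 255, 64])
payload = [1, 0, 1, 1, 0, 1]
stego = embed_lsb(pixels, payload)
```

Each byte changes by at most 1 and the length of the data stays the same, which is why the result is hard to spot by eye or by file size.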
How to hide data in a text file
There are a lot of options available to us for hiding information in a text file, and some of them have already been used historically for this purpose; digital text just gives us a new medium for the hidden text. Some of the options are described below:
Using patterns of letters within words
In this technique the user would send a normal-looking message or document to another user. They would hide a secret message in the file that can only be read by taking the i-th letter of each word in the message. The advantage is that you can send a lot of data using this technique, but the disadvantage is that the message can end up sounding very stilted because of the requirements of the steganography.
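As an illustration, reading such a message back could look like the following Python sketch. The function name `read_ith_letters` and the cover sentence are invented for this example, and the choice to ignore punctuation and skip words that are too short is an assumption.

```python
def read_ith_letters(text: str, i: int) -> str:
    """Collect the i-th letter (1-based) of every word that is long enough."""
    letters = []
    for word in text.split():
        # Ignore punctuation attached to the word.
        cleaned = "".join(ch for ch in word if ch.isalpha())
        if len(cleaned) >= i:
            letters.append(cleaned[i - 1])
    return "".join(letters)


cover = "Having every lunch prepared"
hidden = read_ith_letters(cover, 1)  # first letter of each word
```

Here `hidden` is "Help", and the slightly odd cover sentence shows how quickly the text starts to sound stilted.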
Using the Whitespace in the document to hide data
Another option is to use the spacing differences in the file to encode a message. One example is for the sender to put one space after a full stop to represent a 0 and two spaces after it to represent a 1. By looking at the spacing, the secret message can be spelled out. The main problem with this approach is that it does not allow a large quantity of data to be sent in a file, but the advantage is that it is harder to detect.
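The one-space/two-space scheme can be sketched in Python as below. The helper names and the sample sentences are hypothetical, and the sketch assumes every full stop in the cover text ends a sentence.

```python
import re


def encode_spaces(sentences: list[str], bits: list[int]) -> str:
    """Join sentences with one space for a 0 bit and two spaces for a 1 bit."""
    if len(bits) > len(sentences) - 1:
        raise ValueError("not enough sentence gaps for the payload")
    parts = []
    for idx, sentence in enumerate(sentences[:-1]):
        gap = "  " if idx < len(bits) and bits[idx] else " "
        parts.append(sentence + gap)
    parts.append(sentences[-1])
    return "".join(parts)


def decode_spaces(text: str) -> list[int]:
    """Read one bit per full stop that is followed by one or two spaces."""
    gaps = re.finditer(r"\.( {1,2})(?=\S)", text)
    return [1 if gap.group(1) == "  " else 0 for gap in gaps]


sentences = ["The report is attached.", "Please review it.", "Thanks.", "See you Monday."]
stego_text = encode_spaces(sentences, [1, 0, 1])
recovered = decode_spaces(stego_text)
```

Four sentences carry only three bits, which illustrates the low capacity of this approach.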
In this paper we are looking at a third way to hide data in a document by modifying the color tags in the document and we will look at this in more detail in the next section.
Hiding information using color tags in a Word Document
All versions of MS Office since 2007 save files using the Microsoft Office Open XML specification; the XML files are then zipped to create a file in the DOCX format. Word files allow a user to show text in multiple colors by inserting the corresponding color tag into the file. (Microsoft, 2021) When the color of the displayed text is changed, the system adds a tag like the following to the document.xml file located in the zip file: <w:color w:val="000000"/> to record the font color. The tag stores the color of the text as a six-digit hex value, with 000000 being black and FFFFFF being white.
Each pair of hex digits in the color value corresponds to the red, green or blue channel. In each pair, the second digit is the least significant one, and its value can be changed without the output color looking significantly different to the viewer. So, visually speaking, the font color represented by the hex value 000000 looks almost exactly the same as the color represented by the hex value 010101. By altering the second digit in each pair from 0 to 1 or vice versa, information can be encoded into the file without adding text or information that can be found by security systems or reviewers. Since the document is stored in XML format, the sender can hide data by inserting a color tag for each character. The process to hide the data would look like the following:
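To make the tag format concrete, the Python sketch below pulls the hex value out of a color tag and splits it into its red, green and blue components. The function name `parse_color` is made up for this example, and it only handles plain six-digit hex values.

```python
import re


def parse_color(tag: str) -> tuple[int, ...]:
    """Return the (red, green, blue) values of a <w:color> tag, each 0-255."""
    match = re.search(r'w:val="([0-9A-Fa-f]{6})"', tag)
    if match is None:
        # Anything that is not a six-digit hex value is rejected in this sketch.
        raise ValueError("no six-digit hex color found")
    value = match.group(1)
    # Two hex digits per channel: red, then green, then blue.
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


black = parse_color('<w:color w:val="000000"/>')
white = parse_color('<w:color w:val="FFFFFF"/>')
near_black = parse_color('<w:color w:val="010101"/>')
```

Here `black` is (0, 0, 0), `white` is (255, 255, 255) and `near_black` is (1, 1, 1), a difference of one step out of 255 in each channel.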
- The user provides a word file to be used as an input. The file would contain sufficient text to allow the sender to encode data.
- The system extracts the contents of the document from the file by unzipping it.
- The content of the document is stored in the ‘document.xml’ file under the word folder created in the previous step.
- The system extracts the text from the file by stripping the XML tags from the file.
- For each character in the text, it adds a color tag such as <w:color w:val="000000"/> or <w:color w:val="010101"/>. The second hex digit in each pair is set to 0 or 1 depending on the data being encoded.
- The original tags are restored to the file along with the new tags created.
- The resulting file is saved as document.xml in the word folder
- The folder is compressed as a ZIP file and renamed to .docx
The resulting file will contain the hidden data with little visual indication of the changes made to the document, and can be mailed out as usual with little chance of detection.
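The steps above can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not a tested tool: the cover text is assumed to use the default black color, only simple runs (an optional `<w:rPr>` followed by a single `<w:t>`) are rewritten, one payload bit is stored per color channel, and the names `embed`, `to_bits` and the regular expressions are invented for this example. The output has not been validated against Word itself.

```python
import io
import re
import zipfile

# Simple runs only: optional <w:rPr>, then a single <w:t>. Other runs are left alone.
RUN_RE = re.compile(
    r"<w:r(?: [^>]*)?>(?:<w:rPr>((?:(?!</w:rPr>).)*)</w:rPr>)?"
    r"<w:t(?: [^>]*)?>([^<]*)</w:t></w:r>",
    re.S,
)
COLOR_RE = re.compile(r"<w:color [^>]*/>")
CHAR_RE = re.compile(r"&[^;]+;|.", re.S)  # keep XML entities such as &amp; whole


def to_bits(data: bytes) -> list[int]:
    """Bytes to bits, most significant first, zero-padded to a multiple of 3."""
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    return bits + [0] * (-len(bits) % 3)


def embed(docx_bytes: bytes, secret: bytes) -> bytes:
    """Return a new .docx whose leading characters carry `secret` in their colors."""
    bits = to_bits(secret)
    pos = 0

    def recolor(match: re.Match) -> str:
        nonlocal pos
        if pos >= len(bits):
            return match.group(0)  # payload already written: leave the run untouched
        props = match.group(1) or ""
        runs = []
        for ch in CHAR_RE.findall(match.group(2)):
            run_props = props
            if pos < len(bits):
                r, g, b = bits[pos:pos + 3]
                pos += 3
                # One payload bit per channel: 00 or 01 in each hex pair.
                tag = f'<w:color w:val="{r:02X}{g:02X}{b:02X}"/>'
                run_props = COLOR_RE.sub("", props) + tag
            runs.append(
                f"<w:r><w:rPr>{run_props}</w:rPr>"
                f'<w:t xml:space="preserve">{ch}</w:t></w:r>'
            )
        return "".join(runs)

    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name == "word/document.xml":
                data = RUN_RE.sub(recolor, data.decode("utf-8")).encode("utf-8")
            dst.writestr(name, data)
    if pos < len(bits):
        raise ValueError("cover document has too little text for the payload")
    return out.getvalue()
```

With this layout each character carries three bits, so a cover document needs roughly three characters of text for every byte of payload. Working on the zip archive in memory replaces the manual unzip, edit and rename steps.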
The recipient would follow these steps to extract the hidden data from the file:
- Unzip the document to extract the content
- Extract all the font color tags in the file
- Read the second hex digit in every pair of each color code
- Save the values in a separate file that contains the secret information.
- Review the information at your leisure.
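A matching extraction sketch in Python is shown below. It assumes that every color tag in the document carries payload (one bit per channel, most significant bit of each byte first) and that tags have the plain form `<w:color w:val="RRGGBB"/>`; the function name `extract` is made up for this example.

```python
import io
import re
import zipfile

COLOR_VAL_RE = re.compile(r'<w:color w:val="([0-9A-Fa-f]{6})"/>')


def extract(docx_bytes: bytes) -> bytes:
    """Rebuild the hidden bytes from the color tags in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    bits = []
    for value in COLOR_VAL_RE.findall(xml):
        # Lowest bit of each channel: red, green, blue.
        bits.extend(int(value[i:i + 2], 16) & 1 for i in (0, 2, 4))
    usable = len(bits) - len(bits) % 8  # drop padding bits that do not fill a byte
    return bytes(
        int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, usable, 8)
    )
```

A call such as `extract(open("report.docx", "rb").read())` would return the hidden bytes, where `report.docx` is a hypothetical file name.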
This technique is fairly easy to implement, with minimal coding skills required. If the setup doesn’t allow users to send out Word documents, the same technique can also be used to hide data in the HTML source of a website that the recipient would then download and extract. The same can also be accomplished by encoding data in emails sent from the user’s account.
Detection Techniques for hidden data in documents
Like any technique for sending hidden data, the technique we just discussed has weaknesses that can be used to detect hidden messages encoded in the document. However, such detection is not easy, and most of the currently available tools will not be able to detect data hidden using this technique. This is because most commercial tools on the market focus their efforts on detecting hidden data in media files such as images, videos or audio files, as these have traditionally been the most common containers used to hide data. Some of the options available to detect the possibility of hidden data are as follows:
- Create a tool that examines all documents sent out and counts the number of font tags in use in each document. If the count of the tags is over a certain threshold, the file can be quarantined for review by a human.
- Use a tool that checks the size a given document is expected to be based on the amount of text in the document. If the size of the file is significantly higher (due to an anomalously high number of tags in the file), the file can be quarantined for review.
- We would need to take into account any images etc. embedded in the file when performing the analysis.
- Create a machine learning tool to detect files with hidden data.
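The first option can be sketched in Python as below. Instead of a raw count, this variant compares the number of color tags with the amount of visible text, so long documents are not flagged just for being long. The function names and the threshold of 0.2 are assumptions chosen for illustration and would need tuning against real documents.

```python
import re


def color_tag_ratio(document_xml: str) -> float:
    """Color tags per visible character in a document.xml string."""
    tags = len(re.findall(r"<w:color\b", document_xml))
    text = "".join(re.findall(r"<w:t(?: [^>]*)?>([^<]*)</w:t>", document_xml))
    return tags / max(len(text), 1)


def should_quarantine(document_xml: str, threshold: float = 0.2) -> bool:
    """Flag documents with an unusually high density of color tags."""
    return color_tag_ratio(document_xml) > threshold


# A normal run with one color tag, and a run-per-character document.
normal = (
    '<w:r><w:rPr><w:color w:val="FF0000"/></w:rPr>'
    "<w:t>A perfectly ordinary heading in red</w:t></w:r>"
)
suspicious = "".join(
    f'<w:r><w:rPr><w:color w:val="010001"/></w:rPr><w:t>{ch}</w:t></w:r>'
    for ch in "secret"
)
```

A document with one color tag per character has a ratio close to 1, while ordinary documents usually reuse one tag for a whole run of text.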
Conclusion
Any data or file being sent outside the organization’s network can be used to exfiltrate information from the network. The trick to detecting these attempts is to create a baseline of the activity and of the sizes of the files transferred during a regular day, and to create alerts that notify administrators when there is a significant variation from the baseline.
Done correctly, this will decrease the risk of data exfiltration, but no detection technique is perfect, so periodic reviews and audits are needed to ensure that the system is still secure.
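As a rough illustration of the baseline idea, the Python sketch below flags a day whose outbound volume is far from the historical mean. The function name, the sample figures and the limit of three standard deviations are assumptions for this example; a production system would use a richer model.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], today: float, z_limit: float = 3.0) -> bool:
    """True when today's value is more than z_limit standard deviations from the mean."""
    if len(history) < 2:
        raise ValueError("need at least two baseline observations")
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return today != history[0]
    return abs(today - mean(history)) / spread > z_limit


# Hypothetical daily outbound document volume in MB.
baseline_mb = [120.0, 135.0, 128.0, 140.0, 131.0]
quiet_day = is_anomalous(baseline_mb, 133.0)
busy_day = is_anomalous(baseline_mb, 400.0)
```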
References
Microsoft. (2021, August 25). File format reference for Word, Excel, and PowerPoint. Deploy Office | Microsoft Docs. Retrieved September 19, 2021, from https://docs.microsoft.com/en-us/deployoffice/compat/office-file-format-reference.
Perera, H. L. (2011, February 4). History of steganography. hareenlaks. Retrieved September 19, 2021, from http://hareenlaks.blogspot.com/2011/04/history-of-steganography.html.
UK Essays. (2021, August 12). The history & background of steganography. UK Essays. Retrieved September 19, 2021, from https://www.ukessays.com/essays/english-language/background-of-steganography.php.
Zarrelli, N. (2021, June 10). The wartime spies who used knitting as an espionage tool. Atlas Obscura. Retrieved September 19, 2021, from https://www.atlasobscura.com/articles/knitting-spies-wwi-wwii.
Note: This was originally written as a paper for one of my classes at EC-Council University in Q3 2021.
– Suramya