I have a bunch of PDF files that I need to modify to remove text from them. Initially I was using LibreOffice Draw, but that was a manual task, so I thought I should automate it with a script. Little did I know that programmatically editing PDFs is not that simple. I tried a bunch of libraries such as PyPDF4 and pikepdf, but the only one that worked was borb, a library by Joris Schellekens. The project has a great collection of examples, and using those I got my first script that searched for and replaced text in a PDF working.
However, when I ran the script against my PDF file, it failed with the following error:
Traceback (most recent call last):
  File "/home/suramya/Temp/BorbReplace.py", line 26, in <module>
    main()
  File "/home/suramya/Temp/BorbReplace.py", line 18, in main
    doc = SimpleFindReplace.sub("Manual", "", doc)
  File "/usr/local/lib/python3.10/dist-packages/borb/toolkit/text/simple_find_replace.py", line 80, in sub
    page.apply_redact_annotations()
  File "/usr/local/lib/python3.10/dist-packages/borb/pdf/page/page.py", line 271, in apply_redact_annotations
    .read(io.BytesIO(self["Contents"]["DecodedBytes"]), [])
  File "/usr/local/lib/python3.10/dist-packages/borb/pdf/canvas/canvas_stream_processor.py", line 290, in read
    raise e
  File "/usr/local/lib/python3.10/dist-packages/borb/pdf/canvas/canvas_stream_processor.py", line 284, in read
    operator.invoke(self, operands, event_listeners)
  File "/usr/local/lib/python3.10/dist-packages/borb/pdf/canvas/redacted_canvas_stream_processor.py", line 271, in invoke
    self._write_chunk_of_text(
  File "/usr/local/lib/python3.10/dist-packages/borb/pdf/canvas/redacted_canvas_stream_processor.py", line 203, in _write_chunk_of_text
    )._write_text_bytes()
  File "/usr/local/lib/python3.10/dist-packages/borb/pdf/canvas/layout/text/chunk_of_text.py", line 145, in _write_text_bytes
    return self._write_text_bytes_in_hex()
  File "/usr/local/lib/python3.10/dist-packages/borb/pdf/canvas/layout/text/chunk_of_text.py", line 160, in _write_text_bytes_in_hex
    assert cid is not None, "Font %s can not represent '%s'" % (
AssertionError: Font Arial,Bold can not represent 'E'
Process finished with exit code 1
I tried a couple of different files; the font name in the message changed, but the error remained.
The script I was using is:
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleFindReplace

import typing


def main():

    # attempt to read a PDF
    doc: typing.Optional[Document] = None
    with open("/home/suramya/Downloads/t/MAA1.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)

    # check whether we actually read a PDF
    assert doc is not None

    # find/replace
    doc = SimpleFindReplace.sub("PRIVATE", "XXXX", doc)

    # store
    with open("/home/suramya/Downloads/t/MAABLR_out.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)


if __name__ == "__main__":
    main()
I searched the web and didn’t find any solutions, so I reached out to the project owner, who responded with the following message: “Not every font can represent every possible character in every language. you are trying to insert a piece of text that contains a character that Arial can not represent. Maybe some weird kind of ‘E’ (since uppercase E should not be a problem).” The problem was that I wasn’t trying to replace any strange characters, just a normal uppercase E.
To help troubleshoot, they asked me for a copy of the file. While I was masking the data in the PDF so that I could share it, the script suddenly started working. It turns out that there was an extra space after the word PRIVATE in the file, and when I removed it things started working (even on the unmasked file). So it looks like the error is triggered by an encoding issue in the PDF file. Opening the file in LibreOffice Draw and exporting it as a new PDF seems to resolve the issue.
Now we are a step closer to the solution. I just need to figure out how to convert the file from the command line and I will be home free. Something to work on when I have had some sleep.
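One possible starting point for that command-line conversion is LibreOffice's headless mode. The sketch below is only a sketch under assumptions: it assumes a soffice (or libreoffice) binary is on the PATH, and the filter names draw_pdf_import and draw_pdf_Export are assumptions that should be checked against the installed LibreOffice version. The helper names build_convert_command and reexport_pdf are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_convert_command(src: Path, out_dir: Path, soffice: str = "soffice") -> List[str]:
    """Build a headless LibreOffice command that re-exports a PDF through Draw."""
    return [
        soffice,
        "--headless",
        # Assumed filter name: ask LibreOffice to open the PDF with Draw.
        "--infilter=draw_pdf_import",
        # Assumed filter name: export the Draw document as a new PDF.
        "--convert-to", "pdf:draw_pdf_Export",
        "--outdir", str(out_dir),
        str(src),
    ]


def reexport_pdf(src: Path, out_dir: Path) -> Optional[Path]:
    """Re-export src into out_dir; return the new path, or None if LibreOffice is not found."""
    soffice = shutil.which("soffice") or shutil.which("libreoffice")
    if soffice is None:
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(src, out_dir, soffice), check=True)
    # LibreOffice writes the result into out_dir using the source file's base name.
    return out_dir / (src.stem + ".pdf")
```

The output directory should be different from the source directory, otherwise the re-exported file would overwrite the original. The re-exported PDF could then be passed to the find/replace script above in place of the original file.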
– Suramya