{"id":5483,"date":"2023-01-21T00:47:02","date_gmt":"2023-01-20T19:17:02","guid":{"rendered":"https:\/\/www.suramya.com\/blog\/?p=5483"},"modified":"2023-01-21T00:47:02","modified_gmt":"2023-01-20T19:17:02","slug":"fixing-assertionerror-font-arialbold-can-not-represent-e-when-using-borb-to-modify-pdf-files","status":"publish","type":"post","link":"https:\/\/www.suramya.com\/blog\/2023\/01\/fixing-assertionerror-font-arialbold-can-not-represent-e-when-using-borb-to-modify-pdf-files\/","title":{"rendered":"Fixing AssertionError: Font Arial,Bold can not represent &#8216;E&#8217; when using Borb to modify PDF Files"},"content":{"rendered":"<p>I have a bunch of PDF files that I need to modify to remove text from them. Initially I was using LibreOffice Draw, but that was a manual task, so I decided to script it. Little did I know that programmatically editing PDFs is not that simple. I tried a bunch of libraries such as PyPDF4 and pikepdf, but the only one that worked was <a href=\"https:\/\/github.com\/jorisschellekens\/borb\">borb<\/a>, a library by Joris Schellekens. The project has a great collection of examples, and using those I got my first script that searches for and replaces text in a PDF working. 
<\/p>\n<p>However, when I ran the script against my PDF file, it failed with the following error:<\/p>\n<pre class='code'>\r\nTraceback (most recent call last):\r\n  File \"\/home\/suramya\/Temp\/BorbReplace.py\", line 26, in &lt;module&gt;\r\n    main()\r\n  File \"\/home\/suramya\/Temp\/BorbReplace.py\", line 18, in main\r\n    doc = SimpleFindReplace.sub(\"Manual\", \"\", doc)\r\n  File \"\/usr\/local\/lib\/python3.10\/dist-packages\/borb\/toolkit\/text\/simple_find_replace.py\", line 80, in sub\r\n    page.apply_redact_annotations()\r\n  File \"\/usr\/local\/lib\/python3.10\/dist-packages\/borb\/pdf\/page\/page.py\", line 271, in apply_redact_annotations\r\n    .read(io.BytesIO(self[\"Contents\"][\"DecodedBytes\"]), [])\r\n  File \"\/usr\/local\/lib\/python3.10\/dist-packages\/borb\/pdf\/canvas\/canvas_stream_processor.py\", line 290, in read\r\n    raise e\r\n  File \"\/usr\/local\/lib\/python3.10\/dist-packages\/borb\/pdf\/canvas\/canvas_stream_processor.py\", line 284, in read\r\n    operator.invoke(self, operands, event_listeners)\r\n  File \"\/usr\/local\/lib\/python3.10\/dist-packages\/borb\/pdf\/canvas\/redacted_canvas_stream_processor.py\", line 271, in invoke\r\n    self._write_chunk_of_text(\r\n  File \"\/usr\/local\/lib\/python3.10\/dist-packages\/borb\/pdf\/canvas\/redacted_canvas_stream_processor.py\", line 203, in _write_chunk_of_text\r\n    )._write_text_bytes()\r\n  File \"\/usr\/local\/lib\/python3.10\/dist-packages\/borb\/pdf\/canvas\/layout\/text\/chunk_of_text.py\", line 145, in _write_text_bytes\r\n    return self._write_text_bytes_in_hex()\r\n  File \"\/usr\/local\/lib\/python3.10\/dist-packages\/borb\/pdf\/canvas\/layout\/text\/chunk_of_text.py\", line 160, in _write_text_bytes_in_hex\r\n    assert cid is not None, \"Font %s can not represent '%s'\" % (\r\nAssertionError: Font Arial,Bold can not represent 'E'\r\n\r\nProcess finished with exit code 1<\/pre>\n<p>I tried a couple of different files and the font name 
changes but the error remains.<\/p>\n<p>The script I was using is: <\/p>\n<pre class='code'>from borb.pdf import Document\r\nfrom borb.pdf import PDF\r\nfrom borb.toolkit import SimpleFindReplace\r\n\r\nimport typing\r\n\r\ndef main():\r\n\r\n    # attempt to read a PDF\r\n    doc: typing.Optional[Document] = None\r\n    with open(\"\/home\/suramya\/Downloads\/t\/MAA1.pdf\", \"rb\") as pdf_file_handle:\r\n        doc = PDF.loads(pdf_file_handle)\r\n\r\n    # check whether we actually read a PDF\r\n    assert doc is not None\r\n\r\n    # find\/replace\r\n    doc = SimpleFindReplace.sub(\"PRIVATE\", \"XXXX\", doc)\r\n\r\n    # store\r\n    with open(\"\/home\/suramya\/Downloads\/t\/MAABLR_out.pdf\", \"wb\") as pdf_file_handle:\r\n        PDF.dumps(pdf_file_handle, doc)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    main()<\/pre>\n<p>I searched the web and didn&#8217;t find any solutions, so I reached out to the project owner, who responded with the following message: <em>&#8220;Not every font can represent every possible character in every language. you are trying to insert a piece of text that contains a character that Arial can not represent. Maybe some weird kind of &#8220;E&#8221; (since uppercase E should not be a problem).&#8221;<\/em> The problem was that I wasn&#8217;t trying to replace any strange characters, just a normal uppercase E. <\/p>\n<p>To help troubleshoot, they asked me for a copy of the file. While I was masking the data in the PDF file so that I could share it, the script suddenly started working. It turns out that there was an extra space after the word PRIVATE in the file, and when I removed it things started working (even on the unmasked file). So it looks like the error is caused by an encoding issue in the PDF file. Opening the file in LibreOffice Draw and exporting it as a new PDF file seems to resolve the issue. 
<\/p>\n<p>Now we are a step closer to the solution. I just need to figure out how to convert the file from the command line and I will be home free. Something to work on once I have had some sleep. <\/p>\n<p>&#8211; Suramya<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have a bunch of PDF files that I need to modify to remove text from them. Initially I was using LibreDraw but that was a manual task so I thought that I should script it\/Automate it. Little did I know that programmatically editing PDF&#8217;s is not that simple. I tried a bunch of libraries [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[18,24,2],"tags":[],"class_list":["post-5483","post","type-post","status-publish","format-standard","hentry","category-computer-software","category-knowledgebase","category-techie-stuff"],"_links":{"self":[{"href":"https:\/\/www.suramya.com\/blog\/wp-json\/wp\/v2\/posts\/5483","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.suramya.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.suramya.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.suramya.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.suramya.com\/blog\/wp-json\/wp\/v2\/comments?post=5483"}],"version-history":[{"count":2,"href":"https:\/\/www.suramya.com\/blog\/wp-json\/wp\/v2\/posts\/5483\/revisions"}],"predecessor-version":[{"id":5485,"href":"https:\/\/www.suramya.com\/blog\/wp-json\/wp\/v2\/posts\/5483\/revisions\/5485"}],"wp:attachment":[{"href":"https:\/\/www.suramya.com\/blog\/wp-json\/wp\/v2\/media?parent=5483"}],"wp:term":[{"taxonomy":"categ
ory","embeddable":true,"href":"https:\/\/www.suramya.com\/blog\/wp-json\/wp\/v2\/categories?post=5483"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.suramya.com\/blog\/wp-json\/wp\/v2\/tags?post=5483"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}