This could just be a really stupid format, put out by a specific application for creating PDFs, because the original authors didn’t want to pay Adobe (never attribute to malice, that which can be sufficiently explained with stupidity).
Does pdfinfo give any indication of the application used to create the document? If it chokes on the Java bit up front, can you extract just the PDF from the file and look at that? You might also dig through the PDF a bit using Dider Stevens 's Tools, looking for JavaScript or other indicators of PDF fuckery.
Does the file contain any other Java bytecode? If so, can you pass that through a decompiler?
would love it if attempts to reach the cloud could be trapped and recorded to a log file in the course of neutering the PDF.
This is possible, but it takes a bit of setup. In my own lab, I have PolarProxy running in one Virtual Machine (VM), using QEMU/KVM. That acts as a gateway between an isolated network and a network with internet access. It runs transparent TLS break and inspect on port 443/tcp and tcpdump capturing port 80/tcp. It also serves DNS using Bind.
There is then the “victim” VM which is running bog standard Windows 10. The PolarProxy root cert has been added to the Trusted Roots certificate store. The Default Gateway and DNS servers are hard coded to the PolarProxy VM. Suspicious stuff is tested on this system and all network traffic is recorded on the PolarProxy system in standard pcap format for analysis.
Hey now. Those Clydesdales are doing their absolute best to piss out Bud.