Parse PDF Documents

How should I parse PDF documents? When I give the API the URL of a PDF, every attribute comes back null, whereas if I pass an HTML document, the API successfully retrieves the attributes.

Code:

    import http.client
    import json
    import urllib.parse

    headers = {
        # Request headers
        'Content-Type': 'application/json',
        'Ocp-Apim-Subscription-Key': '----KEY----',
    }

    params = urllib.parse.urlencode({})
    print("########", params)

    data = {
        "summary_length": 5,
        "articles": [
            {
                "type": "article",
                "url": "https://www.craig.csufresno.edu/sif/Reports/Amazon-Thomas%20Eble.pdf"
            }
        ]
    }

    try:
        conn = http.client.HTTPSConnection("api.agolo.com")
        conn.request("POST", "/nlp/v0.2/summarize?%s" % params,
                     body=json.dumps(data), headers=headers)
        response = conn.getresponse()
        body = response.read()
        print(body)
        conn.close()
    except Exception as e:
        # Not every exception carries errno/strerror, so print the exception itself
        print("Error: {0}".format(e))

Comments

  •  
    Hi Ayan, we have temporarily stopped the PDF summarization feature because the underlying PDF library that we were using failed to parse some documents. We are currently working on our own PDF parsing technology, and we will announce it once it is production ready.
    Posted by Hidden Tue, 20 Mar 2018 13:36:24 GMT
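
Since PDF summarization is disabled for now, one possible client-side precaution is to separate PDF links from the rest before building the request, so that only non-PDF URLs are submitted. The following is a minimal sketch under stated assumptions, not part of the API: the helper names `looks_like_pdf`, `split_by_pdf`, and `build_payload` are hypothetical, the PDF check is only a file-extension heuristic, and the payload shape is copied from the `data` dictionary in the question.

```python
from urllib.parse import urlparse, unquote


def looks_like_pdf(url):
    """Heuristic (assumption): treat a URL as a PDF if its path ends in .pdf."""
    path = unquote(urlparse(url).path)
    return path.lower().endswith(".pdf")


def split_by_pdf(urls):
    """Return (non_pdf_urls, pdf_urls), preserving the input order."""
    non_pdf, pdf = [], []
    for url in urls:
        (pdf if looks_like_pdf(url) else non_pdf).append(url)
    return non_pdf, pdf


def build_payload(urls, summary_length=5):
    """Build a request body shaped like the `data` dict in the question."""
    return {
        "summary_length": summary_length,
        "articles": [{"type": "article", "url": u} for u in urls],
    }


if __name__ == "__main__":
    candidates = [
        "https://www.craig.csufresno.edu/sif/Reports/Amazon-Thomas%20Eble.pdf",
        "https://example.com/some-article.html",  # hypothetical HTML page
    ]
    supported, skipped = split_by_pdf(candidates)
    print("Skipping PDF URLs:", skipped)
    print(build_payload(supported))
```

The extension check will miss PDFs served from URLs that do not end in `.pdf`; inspecting the `Content-Type` header of the document would be a more reliable test, at the cost of an extra request per URL.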

