Convert PDF to Text in SAP UI5 using PDF.js

Jul 30, 2023 | SAP, UI5, UI5 Integrations


Introduction

This guide shows how to turn PDF files into plain text inside an SAP UI5 application using PDF.js. PDF.js is a JavaScript library that parses and renders PDF files in the browser, so the text of an uploaded file can be extracted on the client, directly in your SAP UI5 application.

Whether you’re a seasoned SAP UI5 developer or just beginning your journey, the guide walks through the process step by step, from including the library to reading the text of an uploaded PDF and showing it in the UI.

What is PDF.js

PDF.js is a robust, open-source JavaScript library developed by Mozilla. It is used to parse and render PDF files directly in a web browser. One of its primary features is its ability to maintain the high fidelity of the original PDF file, including its fonts, vector graphics, and images.

CDN for PDF.js: https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.10.377/pdf.min.js

The project offers more than page rendering. Its bundled viewer, for instance, supports text selection, text search, and zooming in or out of the document. The library's API can also extract the text of a page or render a page to a canvas image, which is especially useful when you need to read or process the content of PDF files programmatically.

Because PDF.js is built on web standards such as HTML5 and JavaScript, it runs without plugins across devices and operating systems. It is used in many applications and projects, including the Firefox browser, where it is the built-in PDF viewer.

How to convert PDF to Text with PDF.js in JavaScript

To convert a PDF to text with PDF.js in JavaScript, you can follow these steps:

1. First, include the PDF.js library in your project. You can load it from a CDN, or download it from the official GitHub repository and reference it in your HTML file.

2. Load your PDF file using the ‘getDocument’ function. This function returns a loading task whose ‘promise’ property is resolved with a PDFDocumentProxy object.

3. Once you have the PDFDocumentProxy object, call the ‘getPage’ function with a page number to get a particular page (page numbers start at 1, so 1 is the first page). This function returns a promise that is resolved with a PDFPageProxy object.

4. Now call the ‘getTextContent’ function on the PDFPageProxy object. This function returns a promise that is resolved with a TextContent object.

5. The TextContent object has an array called ‘items’ which contains the individual text items of the page. Each item holds its text in the ‘str’ property. Loop through the array and concatenate these strings to get the full text of the page.

6. Repeat steps 3 to 5 for every page, from 1 up to the ‘numPages’ property of the PDFDocumentProxy object.
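
The steps above can be sketched as a single function. This is a minimal sketch rather than a definitive implementation: `pageToText` and `extractAllText` are hypothetical names, and the function expects an already loaded PDFDocumentProxy (the value that `pdfjsLib.getDocument(...).promise` resolves with), so it only relies on `numPages`, `getPage`, and `getTextContent`.

```javascript
// Hypothetical helper: joins the 'str' values of one page's text items.
function pageToText(textContent) {
    return textContent.items.map(function (item) {
        return item.str;
    }).join(" ");
}

// Hypothetical helper: resolves with the text of every page, in page order.
// 'pdf' is assumed to be a PDFDocumentProxy.
function extractAllText(pdf) {
    var pagePromises = [];

    // PDF.js page numbers start at 1 and end at pdf.numPages
    for (var pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
        pagePromises.push(
            pdf.getPage(pageNumber)
                .then(function (page) { return page.getTextContent(); })
                .then(pageToText)
        );
    }

    // Promise.all keeps the results in the order the promises were added
    return Promise.all(pagePromises).then(function (pageTexts) {
        return pageTexts.join("\n\n");
    });
}
```

In the browser it could be called as `pdfjsLib.getDocument({ data: pdfData }).promise.then(extractAllText)`. Requesting all pages at once keeps the code short but may use a lot of memory on very large documents. Processing the pages one after another is an alternative.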

Please note that these are the general steps. Depending on your specific project and setup, there might be some variations in how you apply them. You will need to have a good understanding of JavaScript and Promises to work with PDF.js effectively.

Also note that text extraction with PDF.js might not always be perfect. Due to the way PDF files are structured, sometimes the order of the text can get mixed up, or some characters might not get extracted correctly.
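
One way to reduce ordering problems is to sort the text items by their position on the page before joining them. The sketch below assumes that each item exposes a `transform` array whose fifth and sixth entries are the x and y position of the text, as PDF.js text items do, and that y grows upward on the page. `itemsToLines` and its `tolerance` parameter are hypothetical, and the approach is only a heuristic: it will not handle multi-column layouts or rotated text.

```javascript
// Hypothetical helper: rebuilds lines from PDF.js text items by position.
// Assumes item.transform[4] is x and item.transform[5] is y (origin at bottom-left).
function itemsToLines(items, tolerance) {
    var maxGap = tolerance === undefined ? 2 : tolerance;

    // Sort top to bottom (larger y first) without mutating the input
    var sorted = items.slice().sort(function (a, b) {
        return b.transform[5] - a.transform[5];
    });

    var lines = [];
    sorted.forEach(function (item) {
        var last = lines[lines.length - 1];
        // Items whose y differs by at most maxGap are treated as the same line
        if (last && Math.abs(last.y - item.transform[5]) <= maxGap) {
            last.items.push(item);
        } else {
            lines.push({ y: item.transform[5], items: [item] });
        }
    });

    // Within each line, order the items left to right and join their text
    return lines.map(function (line) {
        return line.items
            .sort(function (a, b) { return a.transform[4] - b.transform[4]; })
            .map(function (item) { return item.str; })
            .join(" ");
    });
}
```

The result is an array with one string per detected line, which can be joined with line breaks, for example `itemsToLines(textContent.items).join("\n")`.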

Convert PDF to Text in SAP UI5 using PDF.js

Index.html

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>My Project Ideas</title>
        <link rel="icon" type="image/x-icon" href="https://myprojectideas.com/wp-content/uploads/2021/08/cropped-Screenshot-2021-07-26-at-1.39.04-PM.png"/>
        <script id="sap-ui-bootstrap"
            src="resources/sap-ui-core.js"
            data-sap-ui-theme="sap_fiori_3"
            data-sap-ui-resourceroots='{"pdfUploader.pdfUploader": "./"}'
            data-sap-ui-compatVersion="edge"
            data-sap-ui-oninit="module:sap/ui/core/ComponentSupport"
            data-sap-ui-async="true"
            data-sap-ui-frameOptions="trusted">
        </script>
        <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.10.377/pdf.min.js"></script>
    </head>
    <body class="sapUiBody">
        <div data-sap-ui-component data-name="pdfUploader.pdfUploader" data-id="container" data-settings='{"id" : "pdfUploader"}'></div>
    </body>
</html>
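
PDF.js parses documents in a separate worker script. When the library is loaded from a CDN as in the page above, it may help to point it at the matching worker file explicitly through `pdfjsLib.GlobalWorkerOptions.workerSrc`. The following is a sketch under assumptions: `configurePdfWorker` is a hypothetical helper, and it assumes that cdnjs serves `pdf.worker.min.js` next to `pdf.min.js` under the same version path.

```javascript
// Hypothetical helper: points PDF.js at the worker script for a given version.
// Assumes cdnjs serves pdf.worker.min.js next to pdf.min.js for that version.
function configurePdfWorker(pdfjs, version) {
    pdfjs.GlobalWorkerOptions.workerSrc =
        "https://cdnjs.cloudflare.com/ajax/libs/pdf.js/" + version + "/pdf.worker.min.js";
    return pdfjs.GlobalWorkerOptions.workerSrc;
}

// In the browser, run this once after pdf.min.js has loaded
if (typeof pdfjsLib !== "undefined") {
    configurePdfWorker(pdfjsLib, "2.10.377");
}
```

In a UI5 application this call could live in the controller's `onInit`, so that it runs before the first PDF is loaded. The worker version should match the version of the main library.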

View

<mvc:View controllerName="pdfUploader.pdfUploader.controller.Main" xmlns:mvc="sap.ui.core.mvc" xmlns:u="sap.ui.unified" displayBlock="true"
    xmlns="sap.m">
    <Shell id="shell">
        <App id="app">
            <pages>
                <Page id="page" title="My Project Ideas: Convert PDF to Text in SAP UI5 using PDF.js">
                    <content>
                        <VBox alignItems="Center" justifyContent="Center">
                            <u:FileUploader name="pdfUploader" buttonText="Upload PDF" fileType="pdf" change="onFileUpload"/>
                            <TextArea id="textArea" rows="10" width="100%" editable="false"/>
                        </VBox>
                    </content>
                </Page>
            </pages>
        </App>
    </Shell>
</mvc:View>

 

Controller

sap.ui.define([
    "sap/ui/core/mvc/Controller"
], function (Controller) {
    "use strict";

    return Controller.extend("pdfUploader.pdfUploader.controller.Main", {
        onFileUpload: function (event) {
            var files = event.getParameter("files");
            var file = files && files[0];
            var textArea = this.getView().byId("textArea");

            // Nothing to do if no file was selected
            if (!file) {
                return;
            }

            var reader = new FileReader();

            reader.onload = function (loadEvent) {
                var pdfData = new Uint8Array(loadEvent.target.result);

                // Load the PDF data using PDF.js
                pdfjsLib.getDocument({ data: pdfData }).promise.then(function (pdf) {
                    // Access the first page of the PDF (page numbers start at 1)
                    return pdf.getPage(1);
                }).then(function (page) {
                    // Extract the text content from the page
                    return page.getTextContent();
                }).then(function (textContent) {
                    // Join the text of all items, separated by spaces
                    var extractedText = textContent.items.map(function (item) {
                        return item.str;
                    }).join(" ");

                    // Show the extracted text in the TextArea control
                    textArea.setValue(extractedText);
                }).catch(function (error) {
                    // Report loading or parsing failures instead of failing silently
                    textArea.setValue("Could not read the PDF: " + error.message);
                });
            };

            reader.readAsArrayBuffer(file);
        }
    });
});
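
The `fileType` setting of the FileUploader is based on the file extension, so a renamed file can still reach PDF.js. A cheap extra check is to look at the first bytes of the file, since PDF files normally begin with the marker `%PDF-`. The sketch below is an optional guard, not part of the original controller: `looksLikePdf` is a hypothetical name, and the check is deliberately strict, so it would reject the occasional PDF that has stray bytes in front of its header.

```javascript
// Hypothetical helper: checks whether a byte array starts with the "%PDF-" marker.
function looksLikePdf(bytes) {
    var marker = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-" in ASCII

    if (!bytes || bytes.length < marker.length) {
        return false;
    }

    // Every marker byte must match the byte at the same position
    return marker.every(function (value, index) {
        return bytes[index] === value;
    });
}
```

Inside `reader.onload` it could be applied to the `Uint8Array` before calling `getDocument`, for example `if (!looksLikePdf(pdfData)) { return; }`.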

 

Output

Sample PDF:

[Image: Sample PDF]

After Upload:

[Image: Convert PDF to Text in SAP UI5 using PDF.js]

 
