3.25.  Text

Overview

The most common test case for PDF documents is probably to check for the presence of expected text. That can be done with the tag <hasText />, which accepts nested tags and attributes.

<!-- Tags to verify content: -->

<hasText />

<!-- Nested tags of <hasText />:  -->

<hasText >
  <inClippingArea />     (optional)

  <!-- Comparing content: -->
  <containing />         (optional)
  <endingWith />         (optional)
  <matchingComplete />   (optional)
  <matchingRegex />      (optional)
  <startingWith />       (optional)
  
  <!-- Prove the absence of text: -->
  <notContaining />      (optional)
  <notEndingWith />      (optional)
  <!-- <notMatchingComplete /> is intentionally not provided -->
  <notMatchingRegex />   (optional)
  <notStartingWith />    (optional)
</hasText>

<!-- Attributes of <hasText /> to select pages.  -->
<!-- One of these attributes has to be used:  -->

<hasText on=".."                />
<hasText onPage=".."            />
<hasText onEveryPageAfter=".."  />
<hasText onEveryPageBefore=".." />
<hasText onAnyPageAfter=".."    />
<hasText onAnyPageBefore=".."   />

<!-- 
Whitespace processing, see:  13.4: “Whitespace Processing”  
--> 

Text on Individual Pages

If you are looking for a text on the first page of a letter, test it this way:

<testcase name="hasText_OnFirstPage_Containing">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="FIRST_PAGE">
      <containing>Content on first page.</containing>
    </hasText>
  </assertThat>
</testcase>

You can declare specific pages using the attribute on=".." which provides several constants, e.g. on="FIRST_PAGE", on="EVERY_PAGE", on="ODD_PAGES" etc. Chapter 13.2: “Page Selection” describes more constants and how to use them.

The next example searches for text on the last page:

<testcase name="hasText_OnLastPage">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="LAST_PAGE">
      <containing>Content on last page.</containing>
    </hasText>
  </assertThat>
</testcase>

Also, you can test individual pages using the attribute onPage="..".

<testcase name="hasText_OnIndividualPages">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText onPage="2, 3">
      <containing>Content on</containing>
    </hasText>
  </assertThat>
</testcase>

Page numbers in the attribute onPage=".." must be separated by commas.

Chapter 13.2: “Page Selection” describes page selection in detail.

Text on All Pages

There are three constants for the attribute on=".." to search text on multiple pages: on="EACH_PAGE", on="EVERY_PAGE" and on="ANY_PAGE".

<testcase name="hasText_OnEveryPage">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="EVERY_PAGE" >
      <startingWith>PDFUnit</startingWith>
    </hasText>
  </assertThat>
</testcase>
<testcase name="hasText_OnAnyPage">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="ANY_PAGE">
      <containing>Page # 3</containing>
    </hasText>
  </assertThat>
</testcase>

The constants on="EVERY_PAGE" and on="EACH_PAGE" require that the text really exists on every page. When you use the constant on="ANY_PAGE", a test is successful if the expected text exists on one or more of the pages.
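
As a minimal sketch, the following variant of the first test uses on="EACH_PAGE" instead of on="EVERY_PAGE". Based on the description above, both constants are expected to behave identically. The test name is made up for illustration.

```xml
<!-- Sketch: EACH_PAGE used as a synonym for EVERY_PAGE. -->
<!-- The test name is hypothetical.                      -->
<testcase name="hasText_OnEachPage">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="EACH_PAGE">
      <startingWith>PDFUnit</startingWith>
    </hasText>
  </assertThat>
</testcase>
```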

Negated Search

The logic of the EVERY_PAGE and ANY_PAGE examples above is clear. It becomes unclear when both statements are negated. In everyday speech, the difference between “Every page does not contain the expected text” and “Any page does not contain the expected text” is hard to pin down, and the latter sentence is ambiguous on its own.

To avoid mistakes, PDFUnit does not allow negated tests with the constant on="ANY_PAGE". The following test is not allowed and throws an exception:

<testcase name="hasText_NotMatchingRegex">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="ANY_PAGE">
      <notEndingWith>wrongValueIntended</notEndingWith>
    </hasText>
  </assertThat>
</testcase>

The error message is:

Searching text 'ON_ANY_PAGE' in combination with negated methods is not supported.

Instead of asserting that any page does NOT contain an expected text, it is better to assert that every page contains the expected text and to catch the exception.
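
A sketch of the positive form, reusing the search string from the rejected test above: it asserts that every page ends with the text, so it fails as soon as one page does not. The test name is hypothetical. How an expected failure is declared in a test suite is not covered in this section and is therefore left out.

```xml
<!-- Sketch: positive counterpart of the rejected negated ANY_PAGE test. -->
<!-- The test fails when at least one page does not end with the text.  -->
<testcase name="hasText_OnEveryPage_EndingWith">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="EVERY_PAGE">
      <endingWith>wrongValueIntended</endingWith>
    </hasText>
  </assertThat>
</testcase>
```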

Line Breaks in Text

When searching for text, line breaks and other whitespace are ignored in the expected text as well as in the text being tested. In the following example, the text to be searched for belongs to the document “Digital Signatures for PDF Documents” by Bruno Lowagie (iText). The first chapter has some line breaks:

The following tests for the marked text use different line breaks. They both succeed:

<!-- 
  The PDF document has a (visible) line break after the word "The".
  The search string does not contain a line break. 
-->

<testcase name="hasText_ContainingLineBreaks_LineBreakInPDF">
  <assertThat testDocument="digitalsignatures20121017.pdf">
    <hasText on="FIRST_PAGE">
      <containing>The technology was conceived</containing>
    </hasText>
  </assertThat>
</testcase>
<!-- 
  The expected search string intentionally contains other line breaks.
-->

<testcase name="hasText_ContainingLineBreaks_LineBreakInExpectedString">
  <assertThat testDocument="digitalsignatures20121017.pdf">
    <hasText on="FIRST_PAGE">
      <containing>
        The 
        technology 
        was 
        conceived
      </containing>
    </hasText>
  </assertThat>
</testcase>
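
The same idea can be sketched for whitespace other than line breaks. The example below assumes that runs of spaces between words are treated like the line breaks above. The exact rules are those of chapter 13.4, and the test name is invented for illustration.

```xml
<!-- Sketch: the expected string contains extra spaces between the words. -->
<!-- Assumption: they are ignored in the same way as line breaks.         -->
<testcase name="hasText_ContainingLineBreaks_ExtraSpacesInExpectedString">
  <assertThat testDocument="digitalsignatures20121017.pdf">
    <hasText on="FIRST_PAGE">
      <containing>The    technology   was    conceived</containing>
    </hasText>
  </assertThat>
</testcase>
```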

Text in Parts of a Page

Text can be searched not only on whole pages, but also in a section of a page. Chapter 13.6: “Defining Page Areas” describes that topic.
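
As a rough sketch, a text search restricted to a page section could nest <inClippingArea /> inside <hasText />, as listed in the overview. The attribute names are taken from the <hasNoText /> example later in this chapter. The coordinates, their unit and the test name are assumptions, and the area has not been checked against the sample document.

```xml
<!-- Sketch: search text only inside a rectangular area of the first page. -->
<!-- The coordinates are placeholders, not measured values.                -->
<testcase name="hasText_InClippingArea_Containing">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="FIRST_PAGE">
      <inClippingArea upperLeftX="20" upperLeftY="20" width="160" height="40" />
      <containing>Content on first page.</containing>
    </hasText>
  </assertThat>
</testcase>
```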

Empty Pages

You can verify that your PDF document does not have empty pages:

<testcase name="hasText_AnyPageEmpty">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="EVERY_PAGE" />
  </assertThat>
</testcase>

If you want to verify that a page or a section of a page does not contain text, you can use the tag <hasNoText />:

<testcase name="hasNoTextInClippingArea" >
  <assertThat testDocument="&pdfdir;/emptyPages/pagesPartiallyEmpty.pdf">
    <hasNoText on="FIRST_PAGE" >
      <inClippingArea upperLeftX="70" upperLeftY="80" width="90" height="60" />
    </hasNoText>
  </assertThat>
</testcase>

Multiple Search Tokens

Writing a separate test for every expected text on a page is tedious, so some tags can be used more than once:

<testcase name="hasText_Containing_MultipleTokens">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="ODD_PAGES">
      <containing>on</containing>
      <containing>page</containing>
      <containing>odd pagenumber</containing>
    </hasText>
  </assertThat>
</testcase>
<testcase name="hasText_NotContaining_MultipleTokens">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="FIRST_PAGE">
      <notContaining>even pagenumber</notContaining>
      <notContaining>Page #2</notContaining>
    </hasText>
  </assertThat>
</testcase>

In the first example the test is successful when all expected tokens are found, and the second test is successful when none of the expected tokens are found.

The tags <startingWith /> and <endingWith /> can each be used only once.

Multiple text comparisons all relate to the pages specified in the outer tag <hasText />:

<testcase name="hasText_MultipleInvocation">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="ANY_PAGE">
      <startingWith>PDFUnit</startingWith>
      <containing>Content on last page.</containing>
      <matchingRegex>.*[Cc]ontent.*</matchingRegex>
      <endingWith>of 4</endingWith>
    </hasText>
  </assertThat>
</testcase>

The tag <hasText /> must be used multiple times if the validations refer to different pages:

<!-- 
  Different pages and different comparisons in one concatenated statement.
  This test works, but it is not recommended.
  When the test fails, the error analysis is more complicated than
  if you had 3 individual tests.
-->
<testcase name="hasText_ComplexSearchOverDifferentPages">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="ANY_PAGE">
      <startingWith>PDFUnit - Automated PDF Tests</startingWith>
    </hasText>
    <hasText on="EVEN_PAGES">
      <containing>Content</containing>
      <containing>even pagenumber</containing>
    </hasText>
    <hasText on="ODD_PAGES">
      <containing>odd pagenumber</containing>
    </hasText>
  </assertThat>
</testcase>

Another weakness of this test is its name, which does not say clearly what is being verified.

Individual Pages with Upper and Lower Limit

Do you need to verify that an expected text can be found somewhere after the first page? Such a test looks like this:

<testcase name="hasText_OnAnyPageAfter">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText onAnyPageAfter="1">
      <containing>Content on</containing>
    </hasText>
  </assertThat>
</testcase>

Page numbers start from 1.
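
To require the text on every page after the first, rather than on at least one of them, the attribute onEveryPageAfter=".." from the overview could be used. This is a sketch: the test name is hypothetical, and it assumes that page 1 itself is excluded from the search.

```xml
<!-- Sketch: the text must appear on each page after page 1. -->
<!-- Assumption: the lower limit itself is not searched.     -->
<testcase name="hasText_OnEveryPageAfter">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText onEveryPageAfter="1">
      <containing>Content on</containing>
    </hasText>
  </assertThat>
</testcase>
```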

Page limits beyond the end of the document are not necessarily an error. In the following example, the text is searched for on all pages between 1 and 99 (exclusive). Although the document has only 4 pages, the test succeeds because the expected string is found on page 1:

<!--
  Attention: the document has the search token on page 1. 
  And '1' is before '99'. So this test ends successfully.
-->
<testcase name="hasText_OnAllPagesBefore_WrongPageNumber">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText onAnyPageBefore="99">
      <containing>Content on</containing>
    </hasText>
  </assertThat>
</testcase>
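
The attribute onEveryPageBefore=".." from the overview can be sketched the same way. Assuming the upper limit is exclusive, the following test would require the text on pages 1 and 2. The test name is hypothetical.

```xml
<!-- Sketch: the text must appear on each page before page 3. -->
<!-- Assumption: the upper limit itself is not searched.      -->
<testcase name="hasText_OnEveryPageBefore">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText onEveryPageBefore="3">
      <containing>Content on</containing>
    </hasText>
  </assertThat>
</testcase>
```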

Visible Text Order - Potential Problem

The visible sequence of text on a PDF page does not necessarily correspond to the text sequence within the PDF document. This could cause PDFUnit to miss text sequences. To counter that, PDFUnit uses iText's text extraction, which assembles text objects based on their positions on a page.

Although the text in the next example is a separate text object in a frame, a test for the text sequence "the beginning. This is content" succeeds:

<!-- 
  The PDF document does not store the text in the visible order.
-->  
<testcase name="hasText_TextNotInVisibleOrder">
  <assertThat testDocument="content/contentNotInVisibleOrder.pdf">
    <hasText on="FIRST_PAGE">
      <containing>
        Content at the beginning.
        This is content, placed in a frame by OpenOffice.
        Content at the end.
      </containing>
    </hasText>
  </assertThat>
</testcase>