Retrieve fulltext and metadata for all editions of one project

Task:

  • We want to analyse the fulltext content of all editions for one project.

  • We want to have access to the edition metadata.

  • As there may be many editions, we split the job by paging through the search results.
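
The paging idea from the last point can be sketched on its own, without any TextGrid dependency. The example below is a minimal sketch under stated assumptions: `iter_pages`, `fake_fetch`, and the `FAKE_EDITIONS` URIs are hypothetical names invented for illustration and are not part of tgclients. The fetch function stands in for a search call that returns one page of items plus the total hit count.

```python
from typing import Callable, Iterator, List, Tuple

# A fetch function takes (start, limit) and returns (items_on_this_page, total_hits).
Fetch = Callable[[int, int], Tuple[List[str], int]]


def iter_pages(fetch: Fetch, limit: int = 10) -> Iterator[List[str]]:
    """Yield one page of results at a time until all hits are consumed."""
    start = 0
    while True:
        items, hits = fetch(start, limit)
        if items:
            yield items
        start += limit
        # stop once the next offset lies beyond the last hit,
        # or if the backend unexpectedly returned an empty page
        if start >= hits or not items:
            break


# Hypothetical stand-in for a search backend: 23 made-up edition URIs.
FAKE_EDITIONS = [f"textgrid:fake{i}.0" for i in range(23)]


def fake_fetch(start: int, limit: int) -> Tuple[List[str], int]:
    return FAKE_EDITIONS[start:start + limit], len(FAKE_EDITIONS)


pages = list(iter_pages(fake_fetch, limit=10))
page_sizes = [len(page) for page in pages]
print(page_sizes)  # [10, 10, 3]
```

With 23 results and a limit of 10, three requests are needed; the last page is simply shorter than the others.
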

from tgclients import TextgridSearch, TextgridSearchRequest, Aggregator, TextgridConfig
from tgclients.config import DEV_SERVER

###
# prepare textgrid clients which are configured to use the dev instance, 
# because aggregating fulltext of editions may put high load on the aggregator
###
config = TextgridConfig(DEV_SERVER)
tgsearch = TextgridSearch(config)
aggregator = Aggregator(config)


###
# choose a project ID, look at https://sandbox.dev.textgridrep.org/projects for inspiration
###

# project_id = 'TGPR-1789bd93-99c5-58e4-e100-619e27ec1119'  # Keine Wahlwerbung (1 Ed.)
project_id = 'TGPR-f3e628ae-74b8-2ebb-9fee-614c59c9b522'  # Distant Reading – 2021-09-23 (8 Ed.)
# project_id = 'TGPR-ca80b39a-5487-27ee-1289-6294a25f975a'  # Goethes Farbenlehre (51 Ed.)
# project_id = 'TGPR-44684af6-1d30-b6d0-3665-62a87b5380b7'  # CoNSSA (219 Ed.)
# project_id = 'TGPR-372fe6dc-57f2-6cd4-01b5-2c4bbefcfd3c'  # Digitale Bibliothek (93462 Ed.)

###
# start is the offset of the first search result to retrieve; it begins at 0
# and is incremented by limit after each page
# limit is the number of search results to retrieve per request
###
start = 0
limit = 10

nextpage = True

while nextpage:

    ###
    # filter for all editions in the chosen project
    ###
    results = tgsearch.search(
        filters=[
            f'project.id:{project_id}',
            'format:text/tg.edition+tg.aggregation+xml',
        ],
        start=start,
        limit=limit,
    )

    for result in results.result:
        edition_uri = result.object_value.generic.generated.textgrid_uri.value
        edition_agent = result.object_value.edition.agent[0].value
        edition_title = result.object_value.generic.provided.title[0]

        # edition metadata
        print(edition_agent + ' - ' + edition_title + '\n')

        ###
        # aggregate all text content of all children of this edition as plaintext
        ###
        fulltext = aggregator.text(edition_uri).text
        print(fulltext[0:100])
        print("---\n")

    # increment the start offset for the next page
    start = start + limit
    if start >= int(results.hits):
        # stop once the offset has reached the total number of hits;
        # using >= avoids one extra, empty request when hits is a multiple of limit
        nextpage = False

print('\n+------+\n| DONE |\n+------+')

Frances Trollope - The Life and Adventures of Michael Armstrong
PREFACE.
When the author of "Michael Armstrong" first determined on attempting to draw the attention
---

Otto, Louise - Nürnberg. Zweiter Band
Erstes Capitel
Gobelins
Die kalten Strahlen einer halbverschleierten Wintersonne brachen sich auf de
---

Christ, Lena - Mathias Bichler
Im Weidhof
Meine Kostmutter hat mir gesagt, daß ich am vierten Sonntag nach der Erscheinung des Herr
---

Carroll, Lewis, 1832-1898 - Alice's Adventures in Wonderland
ALICE’S ADVENTURES IN WONDERLAND
By Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down
---

Nesbit, E. (Edith), 1858-1924 - The Red House
THE RED HOUSE A Novel
BY E. NESBIT AUTHOR OF “THE TREASURE “THE WOULDBEGOODS,”
ILLUSTRATED BY A.I. K
---

Fontane, Theodor - Unterm Birnbaum
Erstes Kapitel
Vor dem in dem großen und reichen Oderbruchdorfe Tschechin um Michaeli 20 eröffneten 
---

 - The Mayor of Casterbridge: The Life and Death of a Man of Character
THE MAYOR OF CASTERBRIDGE
by Thomas Hardy
1.
One evening of late summer, before the nineteenth centu
---

Willkomm, Ernst Adolf - Weisse Sclaven oder die Leiden des Volkes
Erster Theil
Erstes Buch
Erstes Kapitel.
Der Haidekretscham.
Ein ansehnlicher Theil der beiden Lausi
---


+------+
| DONE |
+------+
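
One entry in the output above ("The Mayor of Casterbridge") has an empty agent, so its line starts with a bare " - ". A small helper can make such cases visible. This is a minimal sketch, assuming the agent arrives as a plain (possibly empty) string; `edition_label` and the placeholder text are hypothetical and not part of tgclients.

```python
def edition_label(agent: str, title: str, unknown: str = "[unknown agent]") -> str:
    """Build the 'agent - title' line, substituting a placeholder for a missing agent."""
    agent = (agent or "").strip()
    return f"{agent or unknown} - {title}"


labels = [
    edition_label("Fontane, Theodor", "Unterm Birnbaum"),
    edition_label("", "The Mayor of Casterbridge: The Life and Death of a Man of Character"),
]
for label in labels:
    print(label)
```

In the loop above, the helper could replace the string concatenation in the `print` call for the edition metadata.
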